maybe put some SSE or AVX in there somehow
https://en.cppreference.com/w/cpp/experimental/simd
A zero-overhead abstraction in the high-level language you are already using is so much nicer than spending a bunch of time writing your own architecture-specific routines in assembly or compiler intrinsics (std::simd itself being a simple template library implemented using intrinsics). Generally speaking, at least.
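For a rough idea of what that looks like in practice, here is a minimal sketch of a float sum written against the experimental header from the link. It assumes a standard library that ships `<experimental/simd>` (libstdc++ in GCC 11 or later does) and C++17 or newer. The `simd_sum` name is made up for illustration and is not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical helper: sums n floats using whatever vector width the
// target supports (SSE, AVX, NEON, ...) without naming any of them.
float simd_sum(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);

    using vec = stdx::native_simd<float>;
    vec acc(0.0f);  // broadcast 0 into every lane

    std::size_t i = 0;
    // Main loop: process one full vector of elements per iteration.
    for (; i + vec::size() <= n; i += vec::size()) {
        vec chunk(data + i, stdx::element_aligned);  // unaligned-safe load
        acc += chunk;
    }

    // Horizontal add of the lanes, then a scalar loop for the leftovers.
    float total = stdx::reduce(acc);
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The same source should compile to SSE, AVX, or plain scalar code depending on the target flags (for example `-march=native` with GCC), which is the whole appeal over hand-written intrinsics. Note that the lane-wise accumulation adds the floats in a different order than a plain loop, so results can differ slightly in the last bits.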