thanks pooler, I'm thinking you may want to add an 'option' or document the flags mentioned, I think we'd leave the challenge of hand optimizing part of that code to some other time or if someone may want to take up the challenge.
By just using those flags mentioned, gcc builds binaries with NEON SIMD using that -ftree-vectorize flag, along with the other flags as otherwise it doesn't turn on SIMD codes.
This is a 'quick and easy' way to at least get some NEON SIMD on aarch64 and it isn't too bad as i've described.