Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

Quote from: ag1233 on Today at 02:59:40 PM

thanks pooler, I'm thinking you may want to add an 'option' or document the flags mentioned, I think we'd leave the challenge of hand optimizing part of that code to some other time or if someone may want to take up the challenge.

By just using the flags mentioned, gcc builds binaries with NEON SIMD. -ftree-vectorize does the work, but it needs the other flags too, as otherwise it doesn't turn on the SIMD code.
This is a 'quick and easy' way to at least get some NEON SIMD on aarch64, and it isn't too bad, as I've described.

-ftree-vectorize isn't that useful for hash code. Most of the SIMD gains come from hashing multiple nonces in parallel within a single CPU thread. The compiler can't do that;
it must be hand coded. See the sha256d ASM for an example. All the SIMD code there is in support of hashing nonces in parallel, which is outside the scope of any compiler optimization.
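
To make the idea concrete, here's a rough sketch in C of hashing several nonces in parallel inside one thread. It isn't code from cpuminer or cpuminer-opt: toy_mix_1way, toy_mix_4way and toy_4way_matches_scalar are made-up names, the mixing loop is a toy and not real sha256 (it only borrows the sha256 Sigma1 and Ch functions), and it assumes the GCC/Clang vector_size extension instead of hand written ASM so the same source builds on x86 and aarch64. Each of the 4 vector lanes carries the state for a different nonce.

```c
#include <stdint.h>

/* Four 32-bit lanes in one 128-bit vector (GCC/Clang extension, an assumption). */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Scalar helpers: the sha256 Sigma1 and Ch functions. */
static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}
static inline uint32_t big_sigma1(uint32_t e) {
    return rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
}
static inline uint32_t ch(uint32_t e, uint32_t f, uint32_t g) {
    return (e & f) ^ (~e & g);
}

/* Toy mixing loop for ONE nonce. Hypothetical, NOT real sha256. */
uint32_t toy_mix_1way(uint32_t nonce) {
    uint32_t e = nonce, f = 0x9b05688cu, g = 0x1f83d9abu, h = 0x5be0cd19u;
    for (int i = 0; i < 8; i++) {
        uint32_t t = h + big_sigma1(e) + ch(e, f, g);
        h = g; g = f; f = e; e = t;
    }
    return e;
}

/* The same helpers, operating on 4 lanes at once. */
static inline v4u32 rotr32x4(v4u32 x, uint32_t n) {
    v4u32 r = {n, n, n, n};
    v4u32 l = {32 - n, 32 - n, 32 - n, 32 - n};
    return (x >> r) | (x << l);
}
static inline v4u32 big_sigma1x4(v4u32 e) {
    return rotr32x4(e, 6) ^ rotr32x4(e, 11) ^ rotr32x4(e, 25);
}
static inline v4u32 chx4(v4u32 e, v4u32 f, v4u32 g) {
    return (e & f) ^ (~e & g);
}

/* Same toy loop, but lane i works on nonce base_nonce + i. */
void toy_mix_4way(uint32_t base_nonce, uint32_t out[4]) {
    v4u32 e = {base_nonce, base_nonce + 1, base_nonce + 2, base_nonce + 3};
    v4u32 f = {0x9b05688cu, 0x9b05688cu, 0x9b05688cu, 0x9b05688cu};
    v4u32 g = {0x1f83d9abu, 0x1f83d9abu, 0x1f83d9abu, 0x1f83d9abu};
    v4u32 h = {0x5be0cd19u, 0x5be0cd19u, 0x5be0cd19u, 0x5be0cd19u};
    for (int i = 0; i < 8; i++) {
        v4u32 t = h + big_sigma1x4(e) + chx4(e, f, g);
        h = g; g = f; f = e; e = t;
    }
    for (int i = 0; i < 4; i++)
        out[i] = e[i];
}

/* Returns 1 if every lane agrees with a separate scalar call, else 0. */
int toy_4way_matches_scalar(uint32_t base_nonce) {
    uint32_t out[4];
    toy_mix_4way(base_nonce, out);
    for (uint32_t i = 0; i < 4; i++)
        if (out[i] != toy_mix_1way(base_nonce + i))
            return 0;
    return 1;
}
```

The point is that the 4-way version has to be structured around the nonce loop itself: the caller feeds in a base nonce and gets 4 results back. A real miner would do the same with the full sha256d over the block header, and with more lanes on wider SIMD.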

Scrypt Salsa could theoretically be optimized by the compiler but only if the compiler was written to recognize the code as Salsa. That would be a stretch. The hand
written ASM does what the compiler can't.
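
As an illustration of the kind of rearrangement involved, here's a rough sketch of one Salsa20 column round done two ways. It is not the cpuminer ASM, the function names are made up for this post, and it again assumes the GCC/Clang vector_size extension. The scalar version is written the way the Salsa20 spec reads. The SIMD version first gathers the 16 state words into 4 vectors so that the four quarter-rounds line up lane by lane, which is a layout decision made by whoever writes the code.

```c
#include <stdint.h>

/* 4 x 32-bit lanes (GCC/Clang vector extension, an assumption). */
typedef uint32_t salsa_v4 __attribute__((vector_size(16)));

static inline uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}
static inline salsa_v4 rotl32x4(salsa_v4 x, uint32_t n) {
    salsa_v4 l = {n, n, n, n};
    salsa_v4 r = {32 - n, 32 - n, 32 - n, 32 - n};
    return (x << l) | (x >> r);
}

/* One Salsa20 quarter-round on four state words. */
static void quarterround(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *b ^= rotl32(*a + *d, 7);
    *c ^= rotl32(*b + *a, 9);
    *d ^= rotl32(*c + *b, 13);
    *a ^= rotl32(*d + *c, 18);
}

/* Column round as four separate scalar quarter-rounds. */
void salsa_columnround_scalar(uint32_t x[16]) {
    quarterround(&x[0],  &x[4],  &x[8],  &x[12]);
    quarterround(&x[5],  &x[9],  &x[13], &x[1]);
    quarterround(&x[10], &x[14], &x[2],  &x[6]);
    quarterround(&x[15], &x[3],  &x[7],  &x[11]);
}

/* Same column round, all four quarter-rounds at once, one per lane. */
void salsa_columnround_simd(uint32_t x[16]) {
    static const int ia[4] = {0, 5, 10, 15}, ib[4] = {4, 9, 14, 3};
    static const int ic[4] = {8, 13, 2, 7},  id[4] = {12, 1, 6, 11};
    salsa_v4 a, b, c, d;
    for (int i = 0; i < 4; i++) {          /* gather into lane order */
        a[i] = x[ia[i]]; b[i] = x[ib[i]];
        c[i] = x[ic[i]]; d[i] = x[id[i]];
    }
    b ^= rotl32x4(a + d, 7);
    c ^= rotl32x4(b + a, 9);
    d ^= rotl32x4(c + b, 13);
    a ^= rotl32x4(d + c, 18);
    for (int i = 0; i < 4; i++) {          /* scatter back */
        x[ia[i]] = a[i]; x[ib[i]] = b[i];
        x[ic[i]] = c[i]; x[id[i]] = d[i];
    }
}

/* Returns 1 if both versions give the same state, else 0. */
int salsa_simd_matches_scalar(void) {
    uint32_t s[16], v[16];
    for (uint32_t i = 0; i < 16; i++)
        s[i] = v[i] = 0x01234567u * (i + 1);   /* arbitrary test state */
    salsa_columnround_scalar(s);
    salsa_columnround_simd(v);
    for (int i = 0; i < 16; i++)
        if (s[i] != v[i])
            return 0;
    return 1;
}
```

In real code the state would be kept in the vector layout across rounds instead of being gathered and scattered every time, otherwise the shuffling eats the gain.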

What's really needed is hand written NEON ASM to do the parallel hashing. However, that's hardly worth the effort now that most CPUs support HW accelerated sha256.
Writing ARM 64 bit SIMD for sha256 would be nothing more than an academic exercise.

cpuminer-opt has good examples of parallel sha256 using 128, 256 & 512 bit x86 SIMD, as well as a HW accelerated version using intrinsics, which are a little more readable
than ASM.