I did one more experiment: Salsa20 with NEON intrinsics (https://arm-software.github.io/acle/neon_intrinsics/advsimd.html), 'hand optimized' but naive.
I ran Salsa20 for a million loops, and it turns out gcc's
-ftree-vectorize did 1 million loops of the original C code in 11 ms.
Then I rearranged the arrays into 'lanes'; with
-ftree-vectorize that version did 1 million loops in 80 ms (though this varies depending on the cache; the 2nd run tends to be faster).
This version writes the arrays out to memory during the permutation, so there are lots of wait states and stalls.
And the naive 'hand optimized' Salsa20 with NEON takes 59 ms for 1 million loops.
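A sketch of what a NEON quarter-round over lane-arranged data could look like. This is my assumption about the layout described above (word w of block b at y[w][b], four blocks processed per instruction), not the post's actual code; the NEON rotate uses the usual shift-left plus shift-right-insert (vsri) idiom, and a scalar fallback is included so it compiles off-ARM too:

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* One Salsa20 quarter-round over 4 independent blocks at once.
 * y[w][b] = word w of block b -- the data "arranged in lanes". */
static void quarterround4(uint32_t y[4][4]) {
#if defined(__ARM_NEON)
    uint32x4_t y0 = vld1q_u32(y[0]), y1 = vld1q_u32(y[1]);
    uint32x4_t y2 = vld1q_u32(y[2]), y3 = vld1q_u32(y[3]);
    uint32x4_t t;
    /* rotl(t, n) = vsri(t << n, t, 32 - n) in each 32-bit lane */
    t = vaddq_u32(y0, y3); y1 = veorq_u32(y1, vsriq_n_u32(vshlq_n_u32(t, 7),  t, 25));
    t = vaddq_u32(y1, y0); y2 = veorq_u32(y2, vsriq_n_u32(vshlq_n_u32(t, 9),  t, 23));
    t = vaddq_u32(y2, y1); y3 = veorq_u32(y3, vsriq_n_u32(vshlq_n_u32(t, 13), t, 19));
    t = vaddq_u32(y3, y2); y0 = veorq_u32(y0, vsriq_n_u32(vshlq_n_u32(t, 18), t, 14));
    vst1q_u32(y[0], y0); vst1q_u32(y[1], y1);
    vst1q_u32(y[2], y2); vst1q_u32(y[3], y3);
#else
    /* scalar fallback, same lane layout, so the sketch also builds on x86 */
    for (int b = 0; b < 4; b++) {
        uint32_t t;
        t = y[0][b] + y[3][b]; y[1][b] ^= (t << 7)  | (t >> 25);
        t = y[1][b] + y[0][b]; y[2][b] ^= (t << 9)  | (t >> 23);
        t = y[2][b] + y[1][b]; y[3][b] ^= (t << 13) | (t >> 19);
        t = y[3][b] + y[2][b]; y[0][b] ^= (t << 18) | (t >> 14);
    }
#endif
}
```

If the state is loaded once, permuted in registers for all 20 rounds, and only stored at the end, the memory-traffic stalls mentioned above largely disappear.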
This kind of means it isn't true that
-ftree-vectorize is slow, in that it sometimes beats 'naive' hand-optimized code. It is probably close to the best-optimized code, but the generated assembly is practically unreadable: generated by machines, for machines.
The original code was faster than the parallel version, no surprise.
It looks like you're comparing the original code with -ftree-vectorize to "hand coded" with -ftree-vectorize. That doesn't prove anything about -ftree-vectorize.
You need to test the same code with and without vectorization. Did your hand-coded version actually run parallel SIMD Salsa20 on the data "arranged in lanes"?
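To isolate the flag, compile the identical source twice and toggle only the vectorizer (at -O3 it is already on by default, so pin the rest of the optimization level with -O2). The bench.c here is a stand-in for your actual Salsa20 benchmark source:

```shell
# Stand-in benchmark source; substitute your real salsa20 file.
cat > bench.c <<'EOF'
#include <stdint.h>
#include <stdio.h>
int main(void) {
    uint32_t a[1024], s = 0;
    for (int i = 0; i < 1024; i++) a[i] = i * 2654435761u;
    for (int i = 0; i < 1024; i++) s += a[i] ^ (a[i] >> 3); /* vectorizable loop */
    printf("%u\n", s);
    return 0;
}
EOF

# Same source both times; only the vectorizer flag differs.
gcc -O2 -fno-tree-vectorize bench.c -o bench_scalar
gcc -O2 -ftree-vectorize    bench.c -o bench_vec

# -fopt-info-vec reports which loops the vectorizer actually transformed,
# so you can confirm it did something before comparing timings.
gcc -O2 -ftree-vectorize -fopt-info-vec -c bench.c -o /dev/null

# Both binaries must produce identical output; time them separately.
./bench_scalar
./bench_vec
```

Checking the -fopt-info-vec report matters: if no loop was vectorized, the two timings measure the same machine code and say nothing about the flag.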