I did one more experiment: Salsa20 with NEON intrinsics
https://arm-software.github.io/acle/neon_intrinsics/advsimd.html
'hand optimized' but naive.
Each test runs Salsa20 for a million loops.
It turns out that with gcc's -ftree-vectorize, the original C code did 1 million loops in 11 ms.
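A minimal sketch of what such a plain C test might look like. This is not the exact code that was timed: the names rotl32, QR, salsa20_core and bench_salsa20_ms are made up for illustration, and clock() is used as a portable stand-in for whatever timer was used on the target. The core follows the standard 20-round Salsa20 column/row structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* Salsa20 quarter-round on four words of the state. */
#define QR(a, b, c, d)          \
    do {                        \
        b ^= rotl32(a + d, 7);  \
        c ^= rotl32(b + a, 9);  \
        d ^= rotl32(c + b, 13); \
        a ^= rotl32(d + c, 18); \
    } while (0)

/* Plain C Salsa20 core: 20 rounds, then add the input back in. */
void salsa20_core(uint32_t out[16], const uint32_t in[16]) {
    uint32_t x[16];
    memcpy(x, in, sizeof x);
    for (int i = 0; i < 20; i += 2) {
        /* column round */
        QR(x[0], x[4], x[8], x[12]);
        QR(x[5], x[9], x[13], x[1]);
        QR(x[10], x[14], x[2], x[6]);
        QR(x[15], x[3], x[7], x[11]);
        /* row round */
        QR(x[0], x[1], x[2], x[3]);
        QR(x[5], x[6], x[7], x[4]);
        QR(x[10], x[11], x[8], x[9]);
        QR(x[15], x[12], x[13], x[14]);
    }
    for (int i = 0; i < 16; i++) out[i] = x[i] + in[i];
}

/* Hypothetical harness: run the core `loops` times and return the
   elapsed CPU time in milliseconds. The output is fed back into the
   input and a checksum is returned so the compiler cannot drop the loop. */
double bench_salsa20_ms(long loops, uint32_t *checksum) {
    assert(loops > 0);
    uint32_t state[16] = {0}, out[16];
    state[0] = 1;
    clock_t t0 = clock();
    for (long i = 0; i < loops; i++) {
        salsa20_core(out, state);
        memcpy(state, out, sizeof state);
    }
    clock_t t1 = clock();
    *checksum = state[0];
    return (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC;
}
```

Built with something like gcc -O2 -ftree-vectorize and called as bench_salsa20_ms(1000000, &sum), this gives a number comparable in spirit to the timings here, though the absolute values depend heavily on the CPU and compiler version.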
The version with rearranged arrays, also built with -ftree-vectorize, did 1 million loops in 80 ms (this varies depending on the cache; the 2nd run tends to be faster).
This version writes the arrays out to memory during permutation, so there are lots of wait states and stalls.
And the naive 'hand optimized' Salsa20 with NEON intrinsics takes 59 ms for 1 million loops.
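To give an idea of what a naive NEON version can look like, here is a rough sketch of four Salsa20 quarter-rounds done in parallel, one per vector lane. It is not the code that was timed: salsa20_qr_x4 and ROTL_Q are hypothetical names, and the scalar fallback is only there so the snippet also builds on non-ARM machines. The intrinsics used (vld1q_u32, vst1q_u32, vaddq_u32, veorq_u32, vorrq_u32, vshlq_n_u32, vshrq_n_u32) are from the ACLE list linked above.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
/* Rotate each 32-bit lane left by the constant n (0 < n < 32). */
#define ROTL_Q(v, n) vorrq_u32(vshlq_n_u32((v), (n)), vshrq_n_u32((v), 32 - (n)))
#endif

/* Four quarter-rounds at once: lane i of a, b, c, d holds the four
   words of quarter-round i. Results are written back in place. */
void salsa20_qr_x4(uint32_t a[4], uint32_t b[4], uint32_t c[4], uint32_t d[4]) {
    assert(a && b && c && d);
#if defined(__ARM_NEON)
    uint32x4_t va = vld1q_u32(a), vb = vld1q_u32(b);
    uint32x4_t vc = vld1q_u32(c), vd = vld1q_u32(d);
    uint32x4_t t;

    t = vaddq_u32(va, vd);              /* b ^= rotl(a + d, 7)  */
    vb = veorq_u32(vb, ROTL_Q(t, 7));
    t = vaddq_u32(vb, va);              /* c ^= rotl(b + a, 9)  */
    vc = veorq_u32(vc, ROTL_Q(t, 9));
    t = vaddq_u32(vc, vb);              /* d ^= rotl(c + b, 13) */
    vd = veorq_u32(vd, ROTL_Q(t, 13));
    t = vaddq_u32(vd, vc);              /* a ^= rotl(d + c, 18) */
    va = veorq_u32(va, ROTL_Q(t, 18));

    /* Naive part: every call stores the vectors back to memory. */
    vst1q_u32(a, va);
    vst1q_u32(b, vb);
    vst1q_u32(c, vc);
    vst1q_u32(d, vd);
#else
    /* Scalar fallback with the same lane-by-lane semantics. */
    for (int i = 0; i < 4; i++) {
        uint32_t t;
        t = a[i] + d[i]; b[i] ^= (t << 7)  | (t >> 25);
        t = b[i] + a[i]; c[i] ^= (t << 9)  | (t >> 23);
        t = c[i] + b[i]; d[i] ^= (t << 13) | (t >> 19);
        t = d[i] + c[i]; a[i] ^= (t << 18) | (t >> 14);
    }
#endif
}
```

The load and store on every call is exactly the kind of memory traffic that hurts; a less naive version would presumably keep the four vectors in registers across all rounds and shuffle lanes between column and row rounds with something like vextq_u32 instead of going through arrays.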
This kind of means it isn't true that -ftree-vectorize is slow: it sometimes beats 'naive' hand-optimized code. It is probably close to the best-optimized code, but the generated assembly is practically unreadable. Generated by machines, for machines.