I did one more experiment: Salsa20 with NEON intrinsics (https://arm-software.github.io/acle/neon_intrinsics/advsimd.html), 'hand optimized' but naive.
I ran Salsa20 for a million loops, and it turns out gcc's
-ftree-vectorize did 1 million loops of the original C code in 11 ms.
Then I rearranged the arrays into 'lanes'; with
-ftree-vectorize that version did 1 million loops in 80 ms (though this varies depending on the cache; the 2nd run tends to be faster).
This version writes the arrays out to memory during the permutation, so there are lots of wait states and stalls.
And the naive 'hand optimized' Salsa20 with NEON takes 59 ms for 1 million loops.
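A sketch of what a NEON quarter-round over lane-arranged data could look like. This is my assumption about the layout described above (word w of block b at y[w][b], four blocks processed per instruction), not the post's actual code; the NEON rotate uses the usual shift-left plus shift-right-insert (vsri) idiom, and a scalar fallback is included so it compiles off-ARM too:

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* One Salsa20 quarter-round over 4 independent blocks at once.
 * y[w][b] = word w of block b -- the data "arranged in lanes". */
static void quarterround4(uint32_t y[4][4]) {
#if defined(__ARM_NEON)
    uint32x4_t y0 = vld1q_u32(y[0]), y1 = vld1q_u32(y[1]);
    uint32x4_t y2 = vld1q_u32(y[2]), y3 = vld1q_u32(y[3]);
    uint32x4_t t;
    /* rotl(t, n) = vsri(t << n, t, 32 - n) in each 32-bit lane */
    t = vaddq_u32(y0, y3); y1 = veorq_u32(y1, vsriq_n_u32(vshlq_n_u32(t, 7),  t, 25));
    t = vaddq_u32(y1, y0); y2 = veorq_u32(y2, vsriq_n_u32(vshlq_n_u32(t, 9),  t, 23));
    t = vaddq_u32(y2, y1); y3 = veorq_u32(y3, vsriq_n_u32(vshlq_n_u32(t, 13), t, 19));
    t = vaddq_u32(y3, y2); y0 = veorq_u32(y0, vsriq_n_u32(vshlq_n_u32(t, 18), t, 14));
    vst1q_u32(y[0], y0); vst1q_u32(y[1], y1);
    vst1q_u32(y[2], y2); vst1q_u32(y[3], y3);
#else
    /* scalar fallback, same lane layout, so the sketch also builds on x86 */
    for (int b = 0; b < 4; b++) {
        uint32_t t;
        t = y[0][b] + y[3][b]; y[1][b] ^= (t << 7)  | (t >> 25);
        t = y[1][b] + y[0][b]; y[2][b] ^= (t << 9)  | (t >> 23);
        t = y[2][b] + y[1][b]; y[3][b] ^= (t << 13) | (t >> 19);
        t = y[3][b] + y[2][b]; y[0][b] ^= (t << 18) | (t >> 14);
    }
#endif
}
```

If the state is loaded once, permuted in registers for all 20 rounds, and only stored at the end, the memory-traffic stalls mentioned above largely disappear.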
This kind of means it isn't true that
-ftree-vectorize is slow, in that it sometimes beats 'naive' hand-optimized code. It is probably close to the best-optimized code, but the generated assembly is practically unreadable: generated by machines, for machines.
The original code was faster than the parallel version, no surprise.
It looks like you're comparing the original code with -ftree-vectorize to "hand coded" with -ftree-vectorize. That doesn't prove anything about -ftree-vectorize.
You need to test the same code with and without vectorization. Did your hand-coded version actually run parallel SIMD Salsa20 on the data "arranged in lanes"?
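To isolate the flag, compile the identical source twice and toggle only the vectorizer (at -O3 it is already on by default, so pin the rest of the optimization level with -O2). The bench.c here is a stand-in for your actual Salsa20 benchmark source:

```shell
# Stand-in benchmark source; substitute your real salsa20 file.
cat > bench.c <<'EOF'
#include <stdint.h>
#include <stdio.h>
int main(void) {
    uint32_t a[1024], s = 0;
    for (int i = 0; i < 1024; i++) a[i] = i * 2654435761u;
    for (int i = 0; i < 1024; i++) s += a[i] ^ (a[i] >> 3); /* vectorizable loop */
    printf("%u\n", s);
    return 0;
}
EOF

# Same source both times; only the vectorizer flag differs.
gcc -O2 -fno-tree-vectorize bench.c -o bench_scalar
gcc -O2 -ftree-vectorize    bench.c -o bench_vec

# -fopt-info-vec reports which loops the vectorizer actually transformed,
# so you can confirm it did something before comparing timings.
gcc -O2 -ftree-vectorize -fopt-info-vec -c bench.c -o /dev/null

# Both binaries must produce identical output; time them separately.
./bench_scalar
./bench_vec
```

Checking the -fopt-info-vec report matters: if no loop was vectorized, the two timings measure the same machine code and say nothing about the flag.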