blurb:
My guess is that GPUs, and maybe Intel and AMD CPUs, have a different hardware architecture that can drastically speed up loads and stores to and from the vector registers.
The Raspberry Pi's Broadcom SoC with its A72 CPU uses an external DRAM that has a 16-bit bus; there is little that can be done about that, as the board needs to optimise for cheapness, space, and ease of manufacturing.
I think some GPUs and 'high end' CPUs have more like 64-bit or even 128-bit or wider data buses, which can possibly load or store 2x64-bit words or more in a single cycle.
With just a 16-bit DRAM data bus, even a single 64-bit word takes 4 bus transfers (64 / 16 = 4) just to fetch or store, and one needs to add all the memory wait states on top of that, which means more cycles.
I avoided that in my 'hand coded assembly' by simply keeping everything in CPU registers, but that did not overcome the stalls when writing to / reading from the NEON registers.
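Roughly what I mean, as a minimal sketch (this assumes an AArch64 compiler with arm_neon.h; the function names are illustrative, not from my actual code):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Reading a lane out of a NEON register forces a NEON -> general-purpose
 * register transfer; on the A72 each such crossing typically costs several
 * cycles of latency, which is the stall described above. */
uint32_t cross_register_files(uint32x4_t v)
{
    return vgetq_lane_u32(v, 0) + vgetq_lane_u32(v, 1);
}

/* Keeping the whole computation inside NEON registers avoids the
 * round trip between register files entirely. */
uint32x4_t stay_in_neon(uint32x4_t a, uint32x4_t b)
{
    return vaddq_u32(a, b);
}
```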
For comparison: without optimisation, i.e. no -O2 and/or -ftree-vectorize flags, and simply using that 'rearranged arrays so salsa looks like lanes' version, so that it writes to and reads from memory on the Raspberry Pi, it takes more than 1 second for the same 1048576 rounds. All those optimisations are then hundreds of times faster than the unoptimised build, and that is true even compared with the naive NEON SIMD.
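For reference, this is roughly what the 'rearranged arrays so salsa looks like lanes' layout means, as a minimal sketch (LANES and qr_lanes are illustrative names, not from my actual code): four independent Salsa blocks stored side by side, so each inner loop runs the same operation across all four lanes.

```c
#include <stdint.h>

#define LANES 4  /* four independent Salsa20 blocks, one per lane */

static inline uint32_t rotl32(uint32_t v, int c)
{
    return (v << c) | (v >> (32 - c));
}

/* One Salsa20 quarter-round applied across all LANES blocks at once.
 * With -O2 -ftree-vectorize the compiler can map each loop to a single
 * NEON instruction; without those flags every element goes through
 * memory, which is where the > 1 second figure comes from. */
static void qr_lanes(uint32_t a[LANES], uint32_t b[LANES],
                     uint32_t c[LANES], uint32_t d[LANES])
{
    for (int i = 0; i < LANES; i++) b[i] ^= rotl32(a[i] + d[i], 7);
    for (int i = 0; i < LANES; i++) c[i] ^= rotl32(b[i] + a[i], 9);
    for (int i = 0; i < LANES; i++) d[i] ^= rotl32(c[i] + b[i], 13);
    for (int i = 0; i < LANES; i++) a[i] ^= rotl32(d[i] + c[i], 18);
}
```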
Good stuff.
If you're just playing around to learn, I suggest you take a look at the Blake family of algorithms; it doesn't have the memory issues that Salsa does. It also supports
both linear and parallel vector coding optimizations, even using both together. Linear typically requires some cross-laning but doesn't increase memory usage; parallel
doesn't require cross-laning, but its memory requirements scale with the number of parallel data lanes and it is very sensitive to data-dependent addressing.
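As a minimal sketch of the parallel style (plain C, names illustrative; the sigma message schedule is omitted for brevity), here is the BLAKE2s mixing function G applied across four independent hashes, one per lane. No cross-laning is needed, but the working state is LANES times larger, which is the memory scaling I mentioned:

```c
#include <stdint.h>

#define LANES 4  /* four independent BLAKE2s states, one per lane */

static inline uint32_t rotr32(uint32_t v, int c)
{
    return (v >> c) | (v << (32 - c));
}

/* BLAKE2s G function across LANES independent states; mx/my are the
 * two message words selected by the (omitted) sigma schedule. */
static void g_lanes(uint32_t a[LANES], uint32_t b[LANES],
                    uint32_t c[LANES], uint32_t d[LANES],
                    const uint32_t mx[LANES], const uint32_t my[LANES])
{
    for (int i = 0; i < LANES; i++) {
        a[i] = a[i] + b[i] + mx[i];
        d[i] = rotr32(d[i] ^ a[i], 16);
        c[i] = c[i] + d[i];
        b[i] = rotr32(b[i] ^ c[i], 12);
        a[i] = a[i] + b[i] + my[i];
        d[i] = rotr32(d[i] ^ a[i], 8);
        c[i] = c[i] + d[i];
        b[i] = rotr32(b[i] ^ c[i], 7);
    }
}
```

The linear style instead keeps a single 16-word state in vector registers and shuffles the columns into diagonals between rounds; that shuffle is the cross-laning cost, but it needs no extra memory.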