Board: Mining (Altcoins)
Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)
by JayDDee on 12/09/2023, 17:00:32 UTC
Hence, for now the 'easy' way is to simply pass -ftree-vectorize along with the other flags so that at least some form of NEON SIMD is achieved.
There is a decent gain of around 20% (for Neoscrypt) with the compiler-generated SIMD code compared to without it.

Very few of those gains are in the hashing code. There are SIMD versions of Salsa20 available, and they are used in most implementations of scrypt and neoscrypt,
but they are all hand coded. Most of the gains from compiler optimization come from loops with no dependencies between iterations.
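
To illustrate the difference, here is a minimal sketch of the two kinds of loop. The function names are made up for this example and are not taken from cpuminer. The first loop has independent iterations, the kind a compiler can usually auto-vectorize when given -O3 or -ftree-vectorize. The second has a loop-carried dependency, loosely like the chained add/rotate/xor steps in a hash round, and that usually stays scalar unless someone hand codes it.

```c
#include <stddef.h>
#include <stdint.h>

/* Independent iterations: out[i] depends only on a[i] and b[i].
   restrict tells the compiler the arrays do not overlap, which
   makes auto-vectorization easier to prove safe. */
void xor_blocks(uint32_t *restrict out, const uint32_t *restrict a,
                const uint32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] ^ b[i];
}

/* 32-bit rotate left; r must be in 1..31. */
static inline uint32_t rotl32(uint32_t v, unsigned r)
{
    return (v << r) | (v >> (32 - r));
}

/* Loop-carried dependency: every step needs the previous value of x,
   so the iterations cannot simply be spread across SIMD lanes. */
uint32_t chained_mix(const uint32_t *in, size_t n)
{
    uint32_t x = 0;
    for (size_t i = 0; i < n; i++)
        x = rotl32(x + in[i], 7) ^ in[i];
    return x;
}
```

With GCC, building with -fopt-info-vec should report which loops were vectorized, so you can check where the gain actually comes from on your own build.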

Parallel SIMD requires data locality to be efficient, as you observed. CPU memory doesn't do well with scattered memory access.
Neoscrypt, if that is your real interest, doesn't do well with parallel hashing due to Salsa20, so it would be futile to try it.
Pooler's scrypt code uses multiple buffering for Salsa20, but it's not parallel. The Salsa20 SIMD is all single stream.
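
A toy sketch of the scattered access problem, assuming a scrypt-style lookup where the next table index is derived from the current state. This is not the real scrypt or Neoscrypt mixing, and all names here (LANES, next_index, mix_single, mix_parallel) are hypothetical. The point is only that when several hashes run side by side, each lane picks its own row, so the reads for one step land in different places.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4   /* hypothetical number of hashes run side by side */

/* Toy stand-in for a data-dependent lookup: the index comes from the
   current state, so it is not known ahead of time.
   n must be a power of two. */
static size_t next_index(uint32_t state, size_t n)
{
    return state & (n - 1);
}

/* One stream: a single data-dependent read per step. */
uint32_t mix_single(const uint32_t *v, size_t n, uint32_t state, int steps)
{
    for (int s = 0; s < steps; s++)
        state = (state * 2654435761u) ^ v[next_index(state, n)];
    return state;
}

/* LANES streams over an interleaved table: row r holds one word per lane.
   Each lane computes its own row, so the LANES reads of one step are
   generally not adjacent and cannot be fetched with one vector load. */
void mix_parallel(const uint32_t *v_interleaved, size_t n,
                  uint32_t state[LANES], int steps)
{
    for (int s = 0; s < steps; s++) {
        for (int l = 0; l < LANES; l++) {
            size_t row = next_index(state[l], n);
            state[l] = (state[l] * 2654435761u)
                     ^ v_interleaved[row * LANES + l];
        }
    }
}
```

In the single stream version the working set is one table. In the parallel version it is LANES times larger and each step touches up to LANES different rows, which is the locality cost described above.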

I tried various forms of parallel scrypt with poor results. The remains of those efforts still exist in my code.