Post
Topic
Board Mining (Altcoins)
Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)
by
ag1233
on 12/09/2023, 16:04:22 UTC
thank JayDDee, there is a minor gain with -ftree-vectorize for Neoscrypt as it spend a large number of loops in salsa20, 1000 x some 200 rounds?
hence, NEON SIMD could potentially speed that up significantly, the trouble is that salsa20
https://en.wikipedia.org/wiki/Salsa20
permutates, the arrays between the quarter rounds in each loop. I did a naive attempt by simply re-arrange the arrays in C codes so that they looked like they fall into 'lanes' (common to SIMD).
that oversimplified approach don't cut it with -ftree-vectorize, the registers get streamed out into memory (lots of wait states and cpu stalls for the small Raspberry Pi type boards and cpus).
but that hand optimized assembly won't be easy to write and that they'd take quite a lot of effort.
and the thing is this won't be the only thing that needs to be optimized.

Hence, for now the 'easy' way is to simply -ftree-vectorize with the other flags in bundle so that at least some form of NEON SIMD is achieved.