Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

hi pooler,
are you still monitoring this thread?

hi all,
recently I tried getting cpuminer to run on a Raspberry Pi 4 (aarch64)
I'm using somewhat older codes (version 2.4) from
https://github.com/pooler/cpuminer

I couldn't figure out how to get it to build with the ARM assembly codes, and apparently scrypt-arm.S seemed to be written for armhf (32 bit ARM microprocessors).
Hence, there could possibly with issues compiling in aarch64 (ARM 64bit instructions and OS)

Among the things I tried, I added the following flags

Code:

-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations

I checked and try compiling with "-S" option which makes it generate assembly codes, apparently among the suite of flags used above, it causes GCC to generate assembly codes with NEON SIMD.
This is without specific hand optimized assembly. It may possibly still make NEON assembly with a few less flags (e.g. possibly less -ftree-slp-vectorize, but that I think this is useful even without NEON), but that when missing some of the above flags, NEON assembly isn't generated.

I tried with and without the above flags e.g. just -O2, there is at least a slight difference in hash rates, from about 4 khash per sec on all 4 cores doing Neoscrypt - mining Feathercoin to about 5+ khash per sec with NEON optimised codes by GCC's -ftree-vectorize vectorizer, some 20-30% improvements. And the cpu runs hotter during mining which indicates an improvement in efficiency. This is probably a useful thing to have around as manually writing hand optimized assembly e.g. for scrypt-arm.S would likely take a lot of effort and is likely less portable. granted, -ftree-vectorize won't make the fastest codes, but that the improvement is decent with much less manual efforts needed to make optimized assembly codes.

It used to be that Raspberry Pis are deemed 'too slow' to do mining but Raspberry Pi4 with A72 ARM cores are just borderline and 'punch above its weight' to mine alongside the big Mhash per seconds gpus, the differences is easily 1:1000 though.