@nomachine
On my local 9950X with 16 threads I am getting ~100 Mkeys/s.
On a vast.ai server with an AMD EPYC 7642 48-Core, running 186 threads (up to 192 available), I am getting 222 Mkeys/s.
It seems the code does not scale properly to a larger number of threads.
I would expect far more keys/s.
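To put numbers on the scaling gap, here is a rough sketch that divides the reported totals by the thread counts. The function names are made up for illustration, and the two machines are different CPUs, so treat the ratio as a ballpark figure only, not a proper benchmark.

```cpp
// Hypothetical helpers, not part of the project: they only do the
// arithmetic on the figures quoted in this thread.

// Throughput per thread in Mkeys/s.
double per_thread_mkeys(double total_mkeys, int threads) {
    return total_mkeys / threads;
}

// Per-thread throughput of the big run relative to the baseline run.
// 1.0 would mean each thread is as fast as on the baseline machine.
double scaling_ratio(double base_total, int base_threads,
                     double big_total, int big_threads) {
    return per_thread_mkeys(big_total, big_threads) /
           per_thread_mkeys(base_total, base_threads);
}
```

With the figures above, `per_thread_mkeys(100.0, 16)` gives 6.25 Mkeys/s per thread on the 9950X, while `per_thread_mkeys(222.0, 186)` gives roughly 1.19 on the EPYC, so each EPYC thread delivers about a fifth of what a 9950X thread does.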
I am currently using and testing AVX2 (AVX-256), and it runs at about half the speed of AVX-512. A heavy AVX2 workload can also make the CPU throttle and downclock significantly, which may negate the performance gains from vectorization.
On EPYC you should switch to AVX-512, provided the CPU actually supports it.
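Before switching, it is worth checking what the CPU reports at runtime. As far as I know the EPYC 7642 is a Zen 2 part and AVX-512 only arrived on EPYC with Zen 4, so the check may well come back with AVX2 only on that server. Below is a minimal sketch using the GCC/Clang builtin `__builtin_cpu_supports`. The function name and the returned labels are my own placeholders, not anything from the repository.

```cpp
#include <string>

// Hypothetical helper: reports the widest SIMD path the running CPU
// advertises. The labels "avx512", "avx2" and "scalar" are illustrative.
std::string pick_simd_path() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // populate the CPU feature data before querying
    if (__builtin_cpu_supports("avx512f")) return "avx512";
    if (__builtin_cpu_supports("avx2"))    return "avx2";
#endif
    // Non-x86 targets, other compilers, or no AVX2/AVX-512 support.
    return "scalar";
}
```

Printing the result of `pick_simd_path()` once at startup shows which build is worth running on a given machine. On Linux, `grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u` gives the same information without compiling anything.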
I have uploaded an optimized ripemd160_avx2.cpp (AVX-256) to Git.