Which version and from where? Use the latest version from
GitHub. It doesn’t go beyond 4.80 Mkeys/s per core, which is the same as in the original version.
I am getting 4.61 Mkeys/s on Linux and 4.70 Mkeys/s (CPU Threads: 1) on Windows on the same machine with the latest version.

The compiler used on Windows might be different from the one used on Linux. Different compilers can produce code with varying levels of optimization, which can impact performance.
I am achieving 6 Mkeys/s per core with /opt/AMD/aocc-compiler-5.0.0/bin/clang++.
AOCC is specifically optimized for AMD hardware and can leverage AMD-specific features and instructions to deliver better performance.
