In this benchmark, you get an average of 4 Mkeys/s per core, possibly because this version uses a fully random approach, which impacts the speed.
The biggest performance gain comes from generating eight random numbers simultaneously using AVX-512.
__m512i randVec = _mm512_set_epi64(rng1, rng2, rng3, rng4, rng5, rng6, rng7, rng8);
Of course, I won't publicly disclose the exact details of how this works.

So wait, by that logic, AVX2 (256-bit registers) can generate four 64-bit random numbers at a time per core?
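
For what it's worth, the lane arithmetic behind that question can be sketched in portable C++: a 512-bit register holds 512 / 64 = 8 lanes of 64 bits, and a 256-bit register holds 256 / 64 = 4. This is only an illustration and not the method from the quoted post, which was not disclosed. The names lanes64 and LaneRng are made up for this example, and xorshift64 is a stand-in generator that is not cryptographically secure.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Number of 64-bit lanes that fit in a SIMD register of the given width.
constexpr std::size_t lanes64(std::size_t register_bits) {
    return register_bits / 64;
}

static_assert(lanes64(512) == 8, "512-bit register: eight 64-bit lanes");
static_assert(lanes64(256) == 4, "256-bit register: four 64-bit lanes");

// Hypothetical generator: N independent xorshift64 streams stepped in lockstep.
// The loop body has no cross-lane dependency, so it maps naturally onto one
// SIMD register with N lanes. NOT cryptographically secure.
template <std::size_t N>
struct LaneRng {
    std::array<std::uint64_t, N> state;

    explicit LaneRng(std::uint64_t seed) {
        // Derive a distinct starting state per lane with a SplitMix64 step.
        for (auto& s : state) {
            seed += 0x9E3779B97F4A7C15ULL;
            std::uint64_t z = seed;
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
            s = z ^ (z >> 31);
            if (s == 0) s = 1;  // xorshift must never start from zero
        }
    }

    // Advance every lane by one xorshift64 step and return all N values.
    std::array<std::uint64_t, N> next() {
        for (auto& s : state) {
            s ^= s << 13;
            s ^= s >> 7;
            s ^= s << 17;
        }
        return state;
    }
};
```

With this sketch, LaneRng<lanes64(512)> yields eight values per call to next() and LaneRng<lanes64(256)> yields four, so halving the register width halves the numbers produced per step. Whether that also halves overall throughput depends on the rest of the pipeline.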