This benchmark averages about 4 Mkeys/s per core. The fully random approach used in this version probably limits the speed.
The biggest performance gain comes from generating eight random numbers simultaneously using AVX-512.
__m512i randVec = _mm512_set_epi64(rng1, rng2, rng3, rng4, rng5, rng6, rng7, rng8);
Of course, I won't publicly disclose the exact details of how this works.
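For readers who want a starting point, here is a minimal sketch of the general idea of advancing eight 64-bit random states in one AVX-512 register. It is not the method used in this benchmark. It assumes a plain xorshift64 generator (shifts 13/7/17), which is not cryptographically secure, and the helper names `xorshift64` and `next8` are hypothetical. A scalar fallback is compiled when AVX-512F is not enabled.

```cpp
#include <array>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Scalar xorshift64 step (assumed generator, not the author's).
// The state must be nonzero, otherwise it stays zero forever.
inline std::uint64_t xorshift64(std::uint64_t x) {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return x;
}

// Hypothetical helper: advance eight independent xorshift64 states by one step.
// After the call, each element of `state` holds a fresh 64-bit value.
inline void next8(std::array<std::uint64_t, 8>& state) {
#if defined(__AVX512F__)
    // All eight lanes are updated with the same three shift/xor steps.
    __m512i x = _mm512_loadu_si512(state.data());
    x = _mm512_xor_si512(x, _mm512_slli_epi64(x, 13));
    x = _mm512_xor_si512(x, _mm512_srli_epi64(x, 7));
    x = _mm512_xor_si512(x, _mm512_slli_epi64(x, 17));
    _mm512_storeu_si512(state.data(), x);
#else
    // Portable fallback: same result, one lane at a time.
    for (auto& s : state) s = xorshift64(s);
#endif
}
```

Seed the eight lanes with distinct nonzero values, then call `next8` once per batch of eight candidates. Compile with `-mavx512f` (GCC/Clang) to get the vector path. Keeping the states in a register across iterations, instead of loading and storing each time, would likely be faster, but that is left out here for clarity.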
