You say:
I can currently squeeze out 6.2 Gk/s on an RTX 4090, but some users here claim they can obtain 8 Gk/s or more
RetiredCoder says:
Note that I have not included all possible optimizations because it's public code and I want to keep it as simple/readable as possible.
Git RCKangaroo says:
about 8GKeys/s on RTX 4090.
You say:
NB regarding RCKangaroo - it runs 1.5x slower than my kernel.
How can this be understood? You were able to reach 12GKeys/s on RTX 4090?
Apples and oranges... I use a lame RTX 3050 Laptop GPU for development, and only test once in a while on high-grade GPUs.
JLP code: 280 MK/s
RCKang: 441 MK/s
My kernel: 690 MK/s (with much lower memory usage and far fewer total kangaroos)
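To make the "1.5x" remark concrete, here is a quick sanity check of the three figures above, assuming they were all measured on the same RTX 3050 Laptop GPU. The last step shows the naive extrapolation that leads to the 12 GKeys/s guess; it is only an illustration of why that scaling is not valid across different GPUs, and the function name is made up for this sketch.

```python
# Throughput figures quoted above, in MK/s (assumed: same RTX 3050 Laptop GPU).
rates_mkeys = {"JLP": 280, "RCKangaroo": 441, "my kernel": 690}

baseline = rates_mkeys["RCKangaroo"]
for name, rate in rates_mkeys.items():
    # Relative speed of each implementation versus RCKangaroo.
    print(f"{name:>10}: {rate} MK/s = {rate / baseline:.2f}x RCKangaroo")


def naive_scaled_rate(reference_rate, speedup):
    """Hypothetical helper: scale a rate from one GPU by a speedup measured on another."""
    return reference_rate * speedup


# Speedup of my kernel over RCKangaroo on the low-end GPU.
speedup = rates_mkeys["my kernel"] / baseline

# Naive extrapolation from RCKangaroo's advertised ~8 GKeys/s on an RTX 4090.
# This ignores cache sizes and memory behaviour, so it is NOT a prediction.
naive_4090 = naive_scaled_rate(8.0, speedup)
print(f"speedup = {speedup:.2f}x, naive 4090 estimate = {naive_4090:.1f} GKeys/s")
```

The ratio holds on the 3050 only; the measured 6.2 Gk/s on the 4090 shows that it does not carry over unchanged.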
My focus, however, has always been on optimizing the computations. The RTX 3050 has only 1 MB of L2 cache, while the RTX 4090 has 72 MB, so his L2 tweaks don't do wonders for lower-end GPUs.
That said, I have not yet added any L2 memory tweaks like the ones in RCKang, so I will try to do that when I have some time and test again on an RTX 4090 to see whether there is an improvement. FYI, my use case is squeezing in as many kangaroos as possible WITHOUT having to spill any variables out to global memory, which means relying much more on using the L1 cache as efficiently as possible.
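A rough back-of-envelope sketch of why cache size matters here. Everything in it is an assumption for illustration: the 96-byte state (32 B each for x, y and the travelled distance) is not the layout of any of the kernels mentioned, and the two cache sizes are round example values rather than specs of a particular GPU.

```python
MIB = 1024 * 1024

# Assumed per-kangaroo state: x (32 B) + y (32 B) + distance (32 B).
STATE_BYTES = 96


def kangaroos_fitting(cache_bytes, state_bytes=STATE_BYTES):
    """How many kangaroo states fit entirely inside a cache of the given size."""
    return cache_bytes // state_bytes


small = kangaroos_fitting(1 * MIB)    # a cache on the order of 1 MB
large = kangaroos_fitting(64 * MIB)   # a cache tens of times larger

print(f" 1 MiB cache: {small} kangaroos resident")
print(f"64 MiB cache: {large} kangaroos resident")
```

With a small cache only about ten thousand states stay resident, so a herd sized to keep a GPU busy mostly lives in DRAM regardless of the access pattern; with a cache tens of times larger, cache-aware tweaks have far more room to pay off.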