No, I don't mean the speed itself; the problem is in the DPs.
To demonstrate it, I need to know the best speed parameters for a 4090 with JLP's app.
When I start Kangaroo.exe with default parameters it shows "Grid(256x128)" and the speed is about 1.8GH. I remember that the speed can be better with a bigger grid, but I don't remember the best values. I doubt anybody uses the default grid size if there are better values.
If you know them, I will explain the problem with exact numbers.
Finding the optimal grid size in JLP's implementation is a game of hide and seek, because the theoretically optimal size results in a severe slowdown. There's also a huge number of kangaroos computed up front (gridSize x 512), so you can end up with resource overload.
Since his kernel compiles to 96 registers (ccap 8.9 for RTX 4090), maybe try a grid like (128 * 2) x 512, so that each SM fits 2 blocks of 512 threads each. That's close to 70 million kangaroos.
What was the problem with DPs, other than the fact that there's every chance there will be too many of them and some will get lost?
So 128*2 * 512 = 131K threads and each thread processes a group of 128 kangs, right? It's 16.8M kangs, correct? And speed is about 3GH?
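To make the arithmetic explicit, here is a small sketch of the counts for that grid. It only multiplies the numbers mentioned above. The names are mine, and the two group sizes (128 and 512 kangaroos per thread) are simply the two figures that came up in this thread, not values checked against JLP's source.

```python
# Grid suggested above: (128 * 2) x 512.
# Assumption: 128 SMs on the RTX 4090, 2 blocks per SM, 512 threads per block.
SM_COUNT = 128
BLOCKS_PER_SM = 2
THREADS_PER_BLOCK = 512


def total_threads(sm_count: int, blocks_per_sm: int, threads_per_block: int) -> int:
    """Number of GPU threads launched for the given grid."""
    return sm_count * blocks_per_sm * threads_per_block


def total_kangaroos(threads: int, group_size: int) -> int:
    """Number of kangaroos if every thread handles `group_size` of them."""
    return threads * group_size


threads = total_threads(SM_COUNT, BLOCKS_PER_SM, THREADS_PER_BLOCK)
print(f"threads: {threads:,}")

# The two per-thread group sizes mentioned in this thread (both are assumptions here).
for group_size in (128, 512):
    kangaroos = total_kangaroos(threads, group_size)
    print(f"group size {group_size}: {kangaroos:,} kangaroos")
```

With 128 kangaroos per thread this gives the 16.8M figure; with 512 per thread it gives roughly 67M, which is where a number "close to 70 million" would come from.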