Board: Development & Technical Discussion
Topic: Re: 5-7 kangaroo method
by kTimesG on 20/10/2024, 16:33:45 UTC
No, I don't mean the speed itself, the problem is with DPs.
To demonstrate it, I need to know the best speed parameters for a 4090 with JLP's app.
When I start Kangaroo.exe with default parameters, it shows "Grid(256x128)" and the speed is about 1.8GH. I remember that the speed can be better with a bigger grid, but I don't remember the best values. I doubt anyone uses the default grid size if there are better values.
If you know them, I will explain the problem with exact numbers.

Finding the optimal grid size in JLP's implementation is a game of hide and seek. That's because the theoretically optimal size results in much degraded performance. There's also a huge number of kangaroos computed first (gridSize x 512), and you can end up with resource overload.

Since his kernel compiles to 96 registers (ccap 8.9 for RTX 4090), maybe try a grid like (128 * 2) x 512, so each SM fits 2 blocks of 512 threads each. That's close to 70 million kangaroos.

What was the problem with DPs, other than the very real chance that there are too many of them and some get lost?

So 128*2 * 512 = 131K threads and each thread processes a group of 128 kangs, right? It's 16.8M kangs, correct? And speed is about 3GH?

Yeah, sorry, I forgot I had changed the group size to 512, though it only impacts the initial time to compute the kangaroos, not the kernel performance. Some people here claim to reach 4 Gk/s on an RTX 4090. I did not investigate much what the best grid size would actually be; I gave up trying as soon as the GPU -check flag was filling the stdout with "warning: xxxxx DPs were lost"...
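
For reference, the thread and kangaroo counts discussed above can be checked with a quick Python sketch. It assumes the grid is simply blocks x threads per block and that every thread handles one group of kangaroos. `kangaroo_count` is a made-up helper name for illustration, not something from JLP's code, and the 128 is taken from the "128 * 2" above on the assumption that it is the SM count.

```python
# Hypothetical helper (not part of JLP's Kangaroo): counts threads and
# kangaroos for a grid of `blocks` x `threads_per_block`, assuming each
# thread processes one group of `group_size` kangaroos.
def kangaroo_count(blocks, threads_per_block, group_size):
    threads = blocks * threads_per_block
    return threads, threads * group_size


SM_COUNT = 128  # assumption: the "128" in "(128 * 2) x 512" is the SM count

# Compare the group size of 128 from the question with the modified 512.
for group_size in (128, 512):
    threads, kangaroos = kangaroo_count(SM_COUNT * 2, 512, group_size)
    print(f"group size {group_size}: {threads:,} threads, "
          f"{kangaroos / 1e6:.1f}M kangaroos")
```

With a group size of 128 this gives 131,072 threads and about 16.8M kangaroos; with 512 it gives the same thread count and about 67.1M kangaroos, which is the "close to 70 million" figure.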