Something clicked today... and it's so freaking simple. The "math" also works out, but I'll need some time to implement the end-to-end concept.
All I can say for now is that the kernel's processing 11 GKeys/s on an RTX 3090 and 22.8 GKeys/s on an RTX 4090.
The main idea is that computing public keys, already fast, gets even faster when the points aren't moving around blindly, like they do in Kangaroo. As far as I can tell, no one has tried to use this effectively. I think that's because on a CPU there's basically zero performance difference, but on a GPU... the speed rockets (not that it wasn't already freaking fast) because the unknown dynamics disappear. Hence a speed 3 to 4 times higher.
The beautiful part, though, is answering this question: what the hell can one do with such computations? They definitely can't be extracted, right?
But no, this is not about magic splitting methods or reducing ranges. It still requires sqrt(N) total ops, but they run much, much faster, and it's also fully deterministic (guaranteed upper bound). Oh, and no worries, it doesn't require amounts of storage the size of our galaxy.
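For anyone wondering what "sqrt(N) total ops with a guaranteed upper bound" looks like in general, here's a toy baby-step giant-step (BSGS) sketch on secp256k1 in plain Python (3.8+). This is only a generic, well-known example of a deterministic sqrt(N) search, not necessarily the approach hinted at here: plain BSGS keeps about sqrt(N) points in a table, so it doesn't have the storage property described above. The helper names (`add`, `mul`, `neg`, `bsgs`) and the 2^16 range are made up for illustration.

```python
# Toy baby-step giant-step on secp256k1. Illustration only: a generic
# deterministic sqrt(N) search, NOT the method hinted at in this post.
from math import isqrt

P = 2**256 - 2**32 - 977  # secp256k1 field prime
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def add(a, b):
    """Affine point addition; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a + (-a)
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1, -1, P) % P   # tangent slope (curve a = 0)
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P    # chord slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def mul(k, pt):
    """Double-and-add scalar multiplication."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def neg(pt):
    return None if pt is None else (pt[0], (-pt[1]) % P)

def bsgs(target, n):
    """Search k in [0, n) with k*G == target.

    Returns (k, giant_steps_used), or (None, m) if no such k exists.
    Cost is bounded by m baby steps plus m giant steps, m = ceil(sqrt(n)).
    """
    m = isqrt(n - 1) + 1          # ceil(sqrt(n)) for n >= 1
    table, pt = {}, None
    for j in range(m):            # baby steps: store j*G for j in [0, m)
        table[pt] = j
        pt = add(pt, G)
    step = neg(mul(m, G))         # each giant step subtracts m*G
    cur = target
    for i in range(m):            # hard cap: never more than m giant steps
        if cur in table:
            return i * m + table[cur], i
        cur = add(cur, step)
    return None, m

if __name__ == "__main__":
    secret = 0xBEEF               # hypothetical key inside a 16-bit range
    k, giant = bsgs(mul(secret, G), 2**16)
    print(k == secret, giant)
```

With a 2^16 range, m is 256, so the search never does more than 256 baby steps plus 256 giant steps, no matter where the key sits. That hard cap is the "deterministic" part, as opposed to a random-walk method where sqrt(N) is only the expected cost.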
Now, please don't scratch your head, call it BS, or whatever. If you want some hints, try digging into papers on alternatives to DPs (distinguished points), and move that idea to another algorithm.
Give her to me and I'll try it on my 5090 for you