The main idea is that computing public keys as fast as possible gets even faster when the points aren't moving around blindly, which is what happens in Kangaroo. But no one seems to have tried to use this effectively. I think that's because on a CPU there's basically zero performance difference, but on a GPU the speed rockets (not that it wasn't already freaking fast) because the unpredictable dynamics disappear. Hence a speed 3 to 4 times higher.
Very nice! I think the biggest bottleneck in computing public keys, or better said in adding two public keys (which is usually all we need), is the modular multiplicative inverse. If we can find some low-level optimizations there, they could speed up the computation a lot as well.
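
To make the inverse bottleneck concrete, here is a rough Python sketch of affine point addition on secp256k1. Each addition needs exactly one modular inverse, for the slope. The second half shows one well-known way to cut that cost, batch inversion (Montgomery's trick), which shares a single inverse across many additions. That technique is not something proposed in this thread, and the helper names (`mod_inv`, `point_add`, `point_double`, `batch_inverse`, `add_many`) are made up for illustration. The point at infinity and equal-x cases are deliberately not handled.

```python
# Minimal sketch, not production code. Requires Python 3.8+ for pow(a, -1, P).
# secp256k1 field prime and generator point.
P = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def mod_inv(a):
    """Modular multiplicative inverse mod P: the expensive step."""
    return pow(a % P, -1, P)

def point_add(p1, p2):
    """Affine addition of two points with different x (one inverse)."""
    x1, y1 = p1
    x2, y2 = p2
    lam = (y2 - y1) * mod_inv(x2 - x1) % P   # slope needs the inverse
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return x3, y3

def point_double(p):
    """Affine doubling (also one inverse); secp256k1 has a = 0."""
    x, y = p
    lam = 3 * x * x * mod_inv(2 * y) % P
    x3 = (lam * lam - 2 * x) % P
    y3 = (lam * (x - x3) - y) % P
    return x3, y3

def batch_inverse(values):
    """Montgomery's trick: invert n nonzero values with ONE mod_inv call
    plus roughly 3n multiplications."""
    prefix = []          # prefix[i] = product of values[0..i-1]
    acc = 1
    for v in values:
        prefix.append(acc)
        acc = acc * v % P
    inv = mod_inv(acc)   # inverse of the product of all values
    out = [0] * len(values)
    for i in range(len(values) - 1, -1, -1):
        out[i] = inv * prefix[i] % P   # inverse of values[i]
        inv = inv * values[i] % P      # drop values[i] from the product
    return out

def add_many(points, q):
    """Add the same point q to every point, sharing a single inverse."""
    invs = batch_inverse([(q[0] - x) % P for x, _ in points])
    out = []
    for (x1, y1), inv in zip(points, invs):
        lam = (q[1] - y1) * inv % P
        x3 = (lam * lam - x1 - q[0]) % P
        y3 = (lam * (x1 - x3) - y1) % P
        out.append((x3, y3))
    return out
```

For example, `add_many([P2, P3], G)` with `P2 = point_double(G)` and `P3 = point_add(P2, G)` gives the same points as two separate `point_add` calls, but with one inversion instead of two. On a GPU the same idea is usually applied per thread over a batch of points, though how much it helps depends on the batch size and memory layout.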