On-chip registers are orders of magnitude faster than reading and writing GPU global memory, which is what the JLP kernel spends its time doing. The loops are also in the wrong order if what we care about is the jumps, not the kangaroos.
Maybe you won't agree with this, but it is a really bad idea to dump millions of kangaroos into memory and read/jump/store them just to reach the same efficiency as fewer kangaroos with a correct average jump size. Why? Because fewer, faster kangaroos give you the same results, only much sooner.
Let's say that instead of 128 kangaroos, we launch 4 kangaroos per thread. In that case each kangaroo makes, say, 32 jumps.
To get the real coordinates after the batch inversion, you need to keep the z coordinates of all 32 points for each kangaroo somewhere.
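To show why all the z values have to stay around, here is a minimal Python sketch of batch inversion (Montgomery's trick) over the secp256k1 field. This is only an illustration, not the JLP kernel code; `batch_inverse` is a made-up name for this post.

```python
# secp256k1 field prime
P = 2**256 - 2**32 - 977

def batch_inverse(values, p=P):
    """Invert every element of `values` mod p using one modular inversion.

    Hypothetical helper for illustration. Note that both the inputs and the
    running prefix products must stay in memory until the single inversion
    has been done.
    """
    n = len(values)
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % p      # prefix[i+1] = v0*v1*...*vi

    inv = pow(prefix[n], p - 2, p)             # the only inversion (Fermat)

    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = inv * prefix[i] % p           # 1/vi
        inv = inv * values[i] % p              # now 1/(v0*...*v(i-1))
    return out

# 4 kangaroos x 32 jumps = 128 z values to invert in one batch
zs = list(range(2, 130))
inverses = batch_inverse(zs)
```

The backward pass walks over every original value again, so none of the 128 z coordinates can be thrown away before the inversion is finished.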
I don't understand how you can keep at least 128 z coordinates in registers. You'll have to use global memory, because shared memory won't fit all of this either.
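A rough back-of-the-envelope check of that, assuming 256-bit z coordinates, 32-bit registers, the CUDA limit of 255 registers per thread (compute capability 3.5 and later), and the default 48 KB of shared memory per block. Those hardware numbers are my assumptions and vary by GPU.

```python
KANGAROOS_PER_THREAD = 4
JUMPS_PER_KANGAROO = 32
BYTES_PER_Z = 32                      # one 256-bit field element
BYTES_PER_REGISTER = 4                # 32-bit registers
MAX_REGISTERS_PER_THREAD = 255        # assumed CUDA limit (CC >= 3.5)
SHARED_BYTES_PER_BLOCK = 48 * 1024    # assumed default; some GPUs allow more

z_count = KANGAROOS_PER_THREAD * JUMPS_PER_KANGAROO          # 128
bytes_per_thread = z_count * BYTES_PER_Z                     # 4096
registers_needed = bytes_per_thread // BYTES_PER_REGISTER    # 1024

# How many threads could keep their z values in shared memory at once
threads_fitting_in_shared = SHARED_BYTES_PER_BLOCK // bytes_per_thread  # 12

print(registers_needed, MAX_REGISTERS_PER_THREAD, threads_fitting_in_shared)
```

Under these assumptions the z values alone need about four times the per-thread register limit, and shared memory would cover only about a dozen threads per block.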
Can you explain a little?