LLE. Damn, you made me go back to the drawing board, just thinking about what you might have used to get to such speeds. For some of the explanations I came up with, there are literally just a couple of results on the entire web, but if those work as advertised, I may get a 3x speedup as well, just by freeing up a lot of registers that hold useless information. I don't see any other way, since the CUDA profiler already shows me at 100% compute throughput, yet I'm nowhere near your speed. Great, now I also have a headache.
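
If register pressure really is the suspect, the standard nvcc/ptxas tooling is enough for a sanity check. Nothing below is solver-specific, and jump_kernel is just a placeholder name:

Code:
// Print per-kernel register usage, or cap it outright:
//
//   nvcc -O3 --ptxas-options=-v kernel.cu   (reports registers/thread)
//   nvcc -O3 -maxrregcount=64 kernel.cu     (hard cap; too low causes local-memory spills)
//
// The same budget can be hinted per kernel: 256 threads per block and at
// least 2 resident blocks per SM force ptxas to economize on registers.
__global__ void __launch_bounds__(256, 2)
jump_kernel(unsigned long long* state)
{
    // ...jump math elided...
}

Watching the ptxas report while trimming state tells you whether each freed register actually raises occupancy or removes a spill.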
Don't work so hard; there's no reason for that.

One more tip for you. As far as I remember, there is one more problem in the original JLP GPU code: every GPU thread processes A LOT of kangs. This causes an issue with DPs for high-range puzzles like #130. It's obvious, but it seems nobody understands it. Don't you see it too?
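
For anyone who doesn't see it: the standard parallel-kangaroo estimate puts total work at roughly 2*sqrt(range) + K*2^dp jumps, where K is the total kangaroo count and dp is the DP mask width. A quick host-side calculation with made-up but plausible #130 numbers (illustrative only, not anyone's real config):

Code:
#include <cstdio>
#include <cmath>

int main() {
    const double range   = std::exp2(129.0); // #130 interval width (2^129)
    const double kangs   = 1.0e9;            // K: huge, due to many kangs/thread (assumed)
    const double dp_bits = 36.0;             // dp mask kept wide so the DP table fits in RAM (assumed)

    const double useful = 2.0 * std::sqrt(range);     // ~2*sqrt(N) jumps to solve
    const double wasted = kangs * std::exp2(dp_bits); // ~K*2^dp jumps lost to DP overhead

    std::printf("useful ~ 2^%.1f jumps\n", std::log2(useful));
    std::printf("wasted ~ 2^%.1f jumps (%.0f%% overhead)\n",
                std::log2(wasted), 100.0 * wasted / useful);
    return 0;
}

With those numbers the K*2^dp term already exceeds the useful work itself; shrink K (fewer kangs) or the dp mask and the overhead collapses.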
I'm not using his kernel... yes, it's faster NOT to have many kangaroos per thread; in fact, once you go past a single kangaroo per thread, the jumping speed per kangaroo starts to decrease (I have benchmark proof of this, clearly showing the behaviour). This happens because the execution units become more and more contended. There's a very clear borderline where the number of kangaroos multiplied by the jumping speed peaks, and beyond it throughput suddenly drops because the kangaroo state no longer fits in registers and has to live in GPU memory. I don't think people realize how slow GPU memory really is compared to GPU execution speed. I also think JLP either wasn't aware of this, or intentionally went the way of "let's fill GPU memory, because why not", even though this actually slows the whole thing down.
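
A toy contrast makes the borderline visible (the multiply-add is a stand-in for the real jump math; nothing here is JLP's or anyone's actual kernel):

Code:
#include <cstdint>

// Many kangs per thread: the state is a dynamically indexed global array,
// so every jump pays a global-memory load and store.
__global__ void many_kangs_per_thread(uint64_t* g_x, int kangs_per_thread, int steps)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * kangs_per_thread;
    for (int s = 0; s < steps; ++s)
        for (int k = 0; k < kangs_per_thread; ++k) {
            uint64_t x = g_x[base + k];                 // load from slow memory
            x = x * 6364136223846793005ull + 1ull;      // stand-in for jump math
            g_x[base + k] = x;                          // store back to slow memory
        }
}

// One kang per thread: the state is a local scalar the compiler keeps in a
// register, so the inner loop never touches memory at all.
__global__ void one_kang_per_thread(uint64_t* g_x, int steps)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    uint64_t x = g_x[tid];                              // single load
    for (int s = 0; s < steps; ++s)
        x = x * 6364136223846793005ull + 1ull;          // register-only jumps
    g_x[tid] = x;                                       // single store
}

The first shape is what makes jumps-per-kangaroo fall off as kangs_per_thread grows; the second keeps everything register-resident.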
However, even after doing all these optimization tricks, I can't figure out how you get something like double or triple the speed of my results (which are already a few times faster than JLP or any of its clones)...
Or did you simply get lucky with #120 and find it faster?