Fruitless cycles is the main problem of using equivalence, but there is few good ways to deal with them. By the way, I was very surprised when I saw how quickly kangaroos get into a primitive two-point cycle.
How does "good ways" look like? Without decreasing the overall speed more than ~35% (that would be the point at which dealing with the cycles has no benefits over not having them at all, e.g going from 1.36 to 1.72).
Of course when you're running on a CPU, dealing with the cycles is almost un-noticeable, so it's worth it, and can indeed bring down the complexity to 1.36 or even lower (1.25 at the best).But this benefit is quickly offset by the much more performant GPU devices.
Problem is on dealing with cycles on a GPU, oh well, let me just say that having even a couple of registers used for something else than performing some parts of some jump of some kangaroo, can decrease the speed by well over 100%. Not to mention the extra required access to memory when the registers are all full (to "deal with the cycles" one way or the other) bringing the speed down even further.
Not sure of RetiredCoder's method, technically if he solved 120 as he said ("crazy fast kangaroos") then he had no cycles to deal with at all. Or he found some clever way to deal with them in a way that doesn't greatly bring a GPU to a halt... but I'm not counting on that, it would be a waste of resources IMHO.