Like what are y'all even talking about here? You prefer faster kangaroos versus more, slower kangaroos...ok, what's the sweet spot? What bit range? What DP is used?
All of these are factors...I don't think you can say x, y, or z is always better.
We are talking about the fact that a single kangaroo that jumps at speed N*x is better than N kangaroos jumping at speed x. Same throughput, but fewer kangaroos = lower DP overhead = fewer operations. So neither the DP nor the range is a factor here.
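To put rough numbers on that, here is a minimal sketch. It assumes the commonly quoted estimate that the expected work is about k*sqrt(range) group operations plus a DP overhead of roughly (number of kangaroos) * 2^(DP bits). The constant k = 2.0 and the function names are illustrative assumptions, not taken from any particular program.

```python
import math

def expected_ops(range_bits, dp_bits, num_kangaroos, k=2.0):
    """Rough estimate of expected group operations (assumed model).

    collision_cost: k * sqrt(range size), independent of kangaroo count.
    dp_overhead:    each kangaroo walks about 2**dp_bits extra steps
                    before it reaches its next distinguished point.
    """
    collision_cost = k * math.sqrt(2.0 ** range_bits)
    dp_overhead = num_kangaroos * (2 ** dp_bits)
    return collision_cost + dp_overhead

def overhead_ratio(range_bits, dp_bits, num_kangaroos, k=2.0):
    """Expected ops divided by the overhead-free k * sqrt(range) cost."""
    base = k * math.sqrt(2.0 ** range_bits)
    return expected_ops(range_bits, dp_bits, num_kangaroos, k) / base

if __name__ == "__main__":
    # Same total throughput assumed in both cases; only the herd size differs.
    for kangs in (1, 2 ** 10, 2 ** 20):
        print(kangs, overhead_ratio(80, 16, kangs))
```

Under this model the overhead term scales linearly with the number of kangaroos, so at equal throughput the smaller herd never does worse, whatever DP or range you plug in.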
Isn't this super easy to test? Give me the same program, and I will run it with a GPU and then with a CPU, and let's see which solves the key first. Let's make it an 80-bit range. 1 GPU versus a single core, or do you want to use as many cores as the CPU has? Any bets on which one finds the key first?
Also, RetiredCoder, make mods to the program to create fewer "kangs" when using a GPU, if it's too crazy for you...it's super easy to do. And another question: how does the speed of the "kangs" affect finding points with high DP bits? Does a CPU (where the individual kangs are faster) find high-DP-bit points faster? Or do the GPU's slow, but many, kangs find more of them, faster?
And the last question: which "high puzzles" have you solved, and what did you use to solve them (CPU, GPU, DP, etc.)?
No one said that GPU kangaroos have to be "slow". I think you misunderstand why they are slow in JLP. Imagine a 10-lane highway, and you have 1000 cars, and they all need to go from point A to point B before moving from point B to point C. So you need a very large parking lot and a lot of waiting time to move the cars on and off the highway.
If you think that having more kangaroos is faster because the inverse can be batched (lots of kangaroos on each core), I can assure you that is not the case, and also that the inverse can still be batched.
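For reference, "batching the inverse" usually means Montgomery's trick: n modular inversions are replaced by one inversion plus about 3(n-1) multiplications. Here is a minimal sketch, assuming the secp256k1 field prime and a hypothetical helper name `batch_inverse`. It only illustrates the trick itself and is not how any specific kangaroo program lays it out.

```python
# secp256k1 field prime (assumed here, since the thread is about Bitcoin keys)
P = 2**256 - 2**32 - 977

def batch_inverse(values, p=P):
    """Invert every element of `values` mod p using a single inversion.

    All values must be nonzero mod p. Requires Python 3.8+ for pow(x, -1, p).
    """
    n = len(values)
    if n == 0:
        return []
    # prefix[i] = values[0] * ... * values[i-1] mod p
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % p
    # The only real inversion: inverse of the product of all values.
    inv_running = pow(prefix[n], -1, p)
    out = [0] * n
    # Walk backwards, peeling one factor off the running inverse each step.
    for i in range(n - 1, -1, -1):
        out[i] = prefix[i] * inv_running % p
        inv_running = inv_running * values[i] % p
    return out

if __name__ == "__main__":
    xs = [3, 7, 123456789, P - 1]
    invs = batch_inverse(xs)
    print(all(x * y % P == 1 for x, y in zip(xs, invs)))
```

The trick only needs a list of denominators to invert together; it does not care where they came from.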