Current algorithms like Kangaroo don't give you real keys/s information, hence your surprise. The speed of these algorithms is often related more to statistical performance than to direct metrics like keys per second.
You are confusing the exakeys/s shown by some BSGS programs with the real speed (4000+ Mkeys/s) actually computed and analyzed by any real Kangaroo program.
That is, there are indeed 4 billion keys (public keys, and hence by extension private keys) computed per second, and each of them is a complete 256-bit key that is processed, checked, and then used as the starting point for the next jump.
No statistical BS there. Just a direct metric.
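To make the distinction concrete, here is a minimal Python sketch of a kangaroo-style walk. It is an illustration only, not code from any real Kangaroo program: it uses a toy multiplicative group modulo a small prime instead of secp256k1, and the names `kangaroo_walk`, `P`, `G` and `jump_count` are made up for this example. It keeps two separate counters, the number of elements actually computed (the keys/s numerator) and the total distance travelled (the sum of jump lengths), to show that they are different quantities.

```python
import time

P = 1000003   # small prime, toy stand-in for the curve group (assumption)
G = 2         # toy base element (assumption)

def kangaroo_walk(start, steps, p=P, g=G, jump_count=16):
    """Walk `steps` deterministic jumps from `start`.

    Returns (final element, total distance travelled, elements computed).
    """
    # Jump table: distances 2^i and the matching group elements g^(2^i).
    dists = [1 << i for i in range(jump_count)]
    jumps = [pow(g, d, p) for d in dists]

    x = start
    travelled = 0   # sum of jump lengths
    computed = 0    # number of new elements computed: one per jump
    for _ in range(steps):
        i = x % jump_count        # jump choice depends only on the current element
        x = (x * jumps[i]) % p    # one group operation yields one new element
        travelled += dists[i]
        computed += 1
    return x, travelled, computed

if __name__ == "__main__":
    start = pow(G, 12345, P)
    t0 = time.perf_counter()
    x, travelled, computed = kangaroo_walk(start, 200_000)
    elapsed = time.perf_counter() - t0
    print(f"elements computed: {computed}, distance travelled: {travelled}")
    print(f"rate: {computed / elapsed:.0f} elements/s")
```

The rate printed at the end is simply elements computed divided by elapsed time. The distance travelled is much larger than the element count and says nothing about throughput on its own.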
I looked at the Kangaroo code, and it uses the length of the jumps as a reference for speed. That is neither true nor exact.
See the check.h file.
What's the check.h file? Is it part of the Kangaroo algorithm?
RTX 4090 specs: FP32 (float) 82.58 TFLOPS
That's 82580 billion raw operations/s on floating-point numbers.
Once you divide by the number of instructions needed to do a single kangaroo jump (e.g. one point addition over the curve's modular field, P + Q = R), you're left with a good N billion keys/s (where N is 4 or larger, depending on the implementation).
You can do 5,600,000,000 keys/s (that's 5.6 billion) on an RTX 4090, so the 4000 Mkeys/s figure is actually below what the hardware can accomplish.
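The division above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the quoted FP32 peak as a stand-in for raw operation throughput (an assumption, since the point arithmetic is done on integers, not floats), and the function names are made up for this example. It works backwards from the quoted key rates to the operation budget per jump that each rate would imply.

```python
# Quoted RTX 4090 FP32 peak, used here as a rough proxy for raw ops/s (assumption).
PEAK_OPS_PER_S = 82.58e12

def implied_ops_per_jump(peak_ops_per_s, keys_per_s):
    """Operation budget per jump that a given key rate would imply."""
    return peak_ops_per_s / keys_per_s

def keys_per_second(peak_ops_per_s, ops_per_jump):
    """Key rate obtained if every jump costs `ops_per_jump` raw operations."""
    return peak_ops_per_s / ops_per_jump

if __name__ == "__main__":
    for rate in (4.0e9, 5.6e9):   # the two key rates mentioned in this thread
        budget = implied_ops_per_jump(PEAK_OPS_PER_S, rate)
        print(f"{rate / 1e9:.1f} billion keys/s -> about {budget:,.0f} ops per jump")
```

So 4 billion keys/s corresponds to a budget of roughly 20,600 raw operations per jump, and 5.6 billion keys/s to roughly 14,700. Real throughput also depends on memory access and instruction mix, which this estimate ignores.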
Stop spreading false information.