What grid size are you using with the "original JLP Kangaroo version 1.7" in order to see it at 700 Mk/s on a GTX 1660S? Are you sure about that speed being real?
Because with no code changes and using the 1.7 release tag, I can't get beyond 300 Mk/s (or rather ~250 in reality, since the stats display is kinda broken) on a card that is both newer and superior to that model. And looking at the nvcc compile stats, I doubt it would even be capable of running at triple that speed on an inferior card with 25% fewer CUDA cores.
You probably configured the grid incorrectly. Nsight shows that 4 blocks are working simultaneously. That's why I have a grid of 88*128. And yes, the speed is roughly correct, since it matches the number of DPs accumulated over a certain period of time.
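
For anyone who wants to redo both checks, here is a rough Python sketch, not anything taken from the Kangaroo source. It assumes the 88 comes from 22 SMs on the GTX 1660 Super times the 4 simultaneous blocks (the SM count and the per-SM reading are my assumptions), and that a distinguished point with `dp_bits` leading zero bits turns up on average once per 2^dp_bits group operations. The function names and the sample numbers (20 DP bits, 40000 DPs, 60 seconds) are made up for illustration.

```python
def grid_x(sm_count: int, blocks_per_sm: int) -> int:
    """First grid dimension: one slot per block that can run at the same time."""
    return sm_count * blocks_per_sm


def speed_from_dp(dp_count: int, dp_bits: int, elapsed_s: float) -> float:
    """Estimate group operations per second from collected distinguished points.

    On average one point in 2**dp_bits is distinguished, so dp_count DPs
    correspond to roughly dp_count * 2**dp_bits operations. This is a
    statistical estimate, so it only gets reliable over longer runs.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return dp_count * (1 << dp_bits) / elapsed_s


if __name__ == "__main__":
    # Assumed: 22 SMs, 4 blocks at once on each.
    print("grid x:", grid_x(22, 4))

    # Hypothetical sample: 40000 DPs with 20 DP bits collected in 60 s.
    rate = speed_from_dp(dp_count=40000, dp_bits=20, elapsed_s=60.0)
    print(f"estimated speed: {rate / 1e6:.1f} Mk/s")
```

If the number you get from the DP count is far below what the stats line prints, the stats line is the one to distrust.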