What grid size are you using with the "original JLP Kangaroo version 1.7" to get 700 Mk/s on a GTX 1660S? Are you sure that speed is real?
Because with no code changes and using the 1.7 release tag, I can't go beyond 300 Mk/s (or rather, ~250 in reality, since the stats display is kind of broken) on a card that is both newer and superior to that model. And looking at the nvcc compile stats, I doubt it would even be capable of running at triple that speed on an inferior card with 25% fewer CUDA cores.
You probably configured the grid incorrectly. Nsight shows that 4 blocks are running simultaneously per SM (22 SMs x 4 blocks = 88), which is why I use a grid of 88*128.
And yes, the speed is roughly correct, since it matches the number of DPs accumulated over a given period of time. Or you used a small DP value, in which case the speed will of course be much lower. For example, with DP 13 the speed is only 950 Mk/s (on the patched version), versus 1100 Mk/s with DP 20.
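The DP cross-check can be sketched roughly as follows. With a DP mask of `dp_bits` bits, about one point in 2^dp_bits is distinguished, so the number of jumps is approximately the number of DPs collected times 2^dp_bits. This is only a minimal Python sketch: the function name and the sample numbers are illustrative assumptions, not taken from Kangaroo's source or from a real run.

```python
def estimate_speed_from_dps(dp_count: int, dp_bits: int, elapsed_s: float) -> float:
    """Estimate jumps per second from the number of distinguished points found.

    On average one jump in 2**dp_bits lands on a distinguished point,
    so total jumps ~= dp_count * 2**dp_bits.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    estimated_jumps = dp_count * (1 << dp_bits)
    return estimated_jumps / elapsed_s


# Hypothetical numbers: 57,000 DPs collected in 60 seconds at DP 20.
speed = estimate_speed_from_dps(57_000, 20, 60.0)
print(f"{speed / 1e6:.1f} MK/s")
```

This is a statistical estimate, so it is noisy when only a few DPs have been collected; it gets more reliable as the DP count grows.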
P.S. But you are right, there is a glitch in the speed calculation. JLP forgot to reset the average values before calculating the speed (thread.cpp).
After making the changes, the speed became 950 Mk/s:
GPU: GPU #0 NVIDIA GeForce GTX 1660 SUPER (22x64 cores) Grid(88x128) (141.0 MB used)
SolveKeyGPU Thread GPU#0: creating kangaroos...
SolveKeyGPU Thread GPU#0: 2^20.46 kangaroos [7.6s]
[952.59 MK/s][GPU 952.59 MK/s][Count 2^35.49][Dead 0][50s (Avg 01:06:23)][3.4/10.6MB]
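The figures in this log can be sanity-checked with some simple arithmetic. The Python sketch below rebuilds the grid size from the SM count and the 4 resident blocks per SM, then the kangaroo count, then the average rate implied by the `Count` field. `GPU_GRP_SIZE = 128` (kangaroos handled per GPU thread) is an assumption here, chosen because it reproduces the reported kangaroo count; reading the second grid dimension as threads per block is likewise an assumption.

```python
import math

SM_COUNT = 22            # from "(22x64 cores)" in the log
BLOCKS_PER_SM = 4        # resident blocks per SM, as observed in Nsight
THREADS_PER_BLOCK = 128  # second dimension of "Grid(88x128)" (assumed meaning)
GPU_GRP_SIZE = 128       # assumption: kangaroos handled per GPU thread

# First grid dimension: one block slot for every resident block on every SM.
grid_x = SM_COUNT * BLOCKS_PER_SM

# Total kangaroos = blocks * threads per block * kangaroos per thread.
kangaroos = grid_x * THREADS_PER_BLOCK * GPU_GRP_SIZE

# Average rate implied by "[Count 2^35.49]" after "50s".
avg_rate = 2 ** 35.49 / 50

print(f"Grid({grid_x}x{THREADS_PER_BLOCK})")
print(f"2^{math.log2(kangaroos):.2f} kangaroos")
print(f"{avg_rate / 1e6:.0f} MK/s average from Count")
```

Under these assumptions the grid and kangaroo count match the log, and the rate derived from `Count` lands within a couple of percent of the displayed 952.59 MK/s, which is about what the rounding of the exponent and the elapsed time allows.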
@kTimesG, you have written more than once that your program is several times faster than JLP's and other clones. Can you really run a 1660 Super at 2 Gk/s?
I would like to test this version with a 3060. Can you share it?
The release is available; you can try it.
This is a pain in the *ss to install on Windows 11 with a 3060 (Visual Studio 2022 + CUDA 12).
I've been trying for several hours without success. I managed to compile it for the CPU.