This is my implementation of the Baby-Step Giant-Step (BSGS) algorithm for Nvidia cards (CUDA, Windows x64 only): https://github.com/Etayson/BSGS-cuda Let me know your speed results.
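For anyone new to the algorithm, here is a minimal sketch of the baby-step giant-step idea. It is not the code from the repository: it works in a toy group of integers modulo a prime rather than on an elliptic curve, and the function name `bsgs` and its parameters are made up for this illustration. It assumes Python 3.8+ (for `pow` with a negative exponent) and that `g` is invertible modulo `p`.

```python
from math import isqrt

def bsgs(g, h, p, n):
    """Search for x in [0, n) with pow(g, x, p) == h; return None if not found.

    Toy illustration in the multiplicative group mod p (hypothetical helper,
    not the elliptic-curve implementation discussed in this thread).
    """
    m = isqrt(n - 1) + 1          # m = ceil(sqrt(n)) for n >= 1

    # Baby steps: store g^j -> j for j in [0, m).
    baby = {}
    e = 1
    for j in range(m):
        baby.setdefault(e, j)     # keep the smallest j for each value
        e = e * g % p

    # Giant steps: compare h * (g^-m)^i against the table for i in [0, m).
    factor = pow(g, -m, p)        # modular inverse of g^m (Python 3.8+)
    gamma = h % p
    for i in range(m):
        if gamma in baby:
            return i * m + baby[gamma]
        gamma = gamma * factor % p
    return None

if __name__ == "__main__":
    p, g = 1000003, 2
    h = pow(g, 123456, p)
    x = bsgs(g, h, p, p - 1)
    print(x, pow(g, x, p) == h)
```

The table holds about sqrt(n) entries and the search takes at most about sqrt(n) further steps, which is the usual memory-for-time trade-off of this method.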
I tested your BSGS on a GTX 1660s, and the speed was significantly lower than JeanLucPons' Kangaroo: BSGS-cuda => 330 Mkey/s, Kangaroo 2.2 => 450 Mkey/s.
We need real tests of how much time each program needs to find an example private key, to see which code finds it faster.