This is my implementation of the Baby-step Giant-step (BSGS) algorithm for Nvidia cards (CUDA, Windows x64 only): https://github.com/Etayson/BSGS-cuda Let me know your speed results.
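For anyone unfamiliar with the algorithm, here is a minimal sketch of the baby-step giant-step idea in Python. It is not taken from the BSGS-cuda code: it assumes the multiplicative group modulo a small prime instead of the secp256k1 curve, and the function name `bsgs` and its parameters are hypothetical, chosen only for illustration.

```python
from math import isqrt


def bsgs(g, h, p, n):
    """Return x in [0, n) with pow(g, x, p) == h % p, or None if there is none.

    Illustrative assumptions: p is prime, g is not divisible by p, n >= 1.
    """
    m = isqrt(n - 1) + 1  # ceil(sqrt(n)), so that m * m >= n

    # Baby steps: store g^j -> j for j in [0, m), keeping the smallest j.
    table = {}
    e = 1
    for j in range(m):
        table.setdefault(e, j)
        e = e * g % p

    # Giant steps: multiply h by g^(-m) repeatedly and look it up in the table.
    factor = pow(g, -m, p)  # modular inverse power, needs Python 3.8+
    gamma = h % p
    for i in range(m):
        if gamma in table:
            # h * g^(-i*m) == g^j  implies  h == g^(i*m + j)
            return i * m + table[gamma]
        gamma = gamma * factor % p
    return None
```

For example, with `p = 101` and `g = 2`, calling `bsgs(2, pow(2, 57, 101), 101, 100)` recovers the exponent by building a table of only 10 baby steps and taking at most 10 giant steps, instead of trying all 100 exponents. The trade-off is memory: the table grows with the square root of the search range.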
I tested your BSGS on a GTX 1660s, and it was much slower than JeanLucPons' Kangaroo: BSGS-cuda => 330 Mkey/s, Kangaroo 2.2 => 450 Mkey/s.