One more piece of science! 
It is well known that when we calculate "NextPoint = PreviousPoint + JumpPoint" in affine coordinates, we can also quickly calculate "PreviousPoint - JumpPoint", because both need the same modular inversion: negating JumpPoint only flips its y, so the x difference being inverted does not change.
Therefore, if the inversion is the expensive part of a point addition, this second point is almost free for us, and we can use it to improve K.
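To make the shared inversion concrete, here is a minimal Python sketch. The function name add_and_sub and the toy curve y^2 = x^3 + 7 over F_17 are my own assumptions for illustration and are not taken from anyone's code in this thread. It only handles two affine points with different x.

```python
def add_and_sub(p, q, prime):
    """Return (p + q, p - q) for affine points with distinct x coordinates.

    Hypothetical helper for illustration: both results reuse one inversion.
    """
    (x1, y1), (x2, y2) = p, q
    # The single modular inversion (needs Python 3.8+ for pow(..., -1, m)).
    inv = pow((x2 - x1) % prime, -1, prime)

    # p + q: slope of the line through p and q.
    lam = (y2 - y1) * inv % prime
    x3 = (lam * lam - x1 - x2) % prime
    y3 = (lam * (x1 - x3) - y1) % prime

    # p - q = p + (x2, -y2): same x difference, so the same inverse is reused.
    mu = (-y2 - y1) * inv % prime
    x4 = (mu * mu - x1 - x2) % prime
    y4 = (mu * (x1 - x4) - y1) % prime

    return (x3, y3), (x4, y4)


def on_curve(pt, prime):
    """Check y^2 = x^3 + 7 (the toy curve assumed in this sketch)."""
    x, y = pt
    return (y * y - (x * x * x + 7)) % prime == 0


# Toy example over F_17 (assumed parameters, chosen to keep numbers small).
PRIME = 17
prev_point = (1, 5)
jump_point = (2, 7)
plus_point, minus_point = add_and_sub(prev_point, jump_point, PRIME)
```

The second point costs only a few extra multiplications on top of the first one, which is why it is cheap whenever the inversion dominates.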
I updated Part #1:
Added "SOTA+" method with K = 1.02. I am still studying your code, especially the GPU part, which gives me a lot to learn.
I think JLP applied this trick in his BSGS/kangaroo.