I made a few optimizations (committed to GitHub) and reached 13.6M giant steps of 2^30 per second on my Xeon X5647 2.93GHz with 12GB of RAM.
It solves the 16 keys of odolvlobo's set in 3:35:53 from scratch, that is, without any precomputation.
Converted to Etar's unit, that is 14.6 PKey/s.
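For anyone who wants to check the conversion, here is a minimal sketch of the arithmetic. It assumes that each giant step covers 2^30 keys (the baby-step table size) and that Etar's unit simply counts keys covered per second; the function name is only illustrative and is not taken from either program.

```python
# Sketch of the unit conversion, assuming one giant step covers
# 2^30 keys (the size of the baby-step table).

def keys_per_second(giant_steps_per_second, keys_per_giant_step):
    """Effective key rate = giant steps per second * keys covered per giant step."""
    return giant_steps_per_second * keys_per_giant_step

# Figures from the post: 13.6M giant steps/s, each of 2^30 keys.
rate = keys_per_second(13.6e6, 2**30)

# 1 PKey/s = 10^15 keys per second.
print(f"{rate / 1e15:.1f} PKey/s")
```

This counts the range covered per second, not the number of elliptic curve operations actually performed.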
I posted the result in the readme of my program:
https://github.com/JeanLucPons/BSGS/blob/master/README.md
Thanks to odolvlobo for providing this challenge.
Best wishes to Etar in optimizing his program and fixing his GPU issue.