Anyway, the table of time-to-solve estimations posted by j2002ba2 implies that GPUs are three to four orders of magnitude more efficient than the code presented.
j2002ba2, if you are here, how many hops/sec did you reach on a single Tesla V100? Is my estimate of 2 GHops/s correct? At least 1 GHops/s is required if automorphisms are taken into account, but I can't reach even half of that value.
For me, 4x Tesla V100 on AWS were running at 6515 Mj/s, which gives about 1629 Mj/s per single V100.
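For what it's worth, here is a minimal Python sketch of the arithmetic behind that per-GPU figure. The only inputs are the 6515 Mj/s aggregate and the 4-GPU setup quoted above; the conversion to GHops/s assumes 1 jump = 1 hop (my assumption, for comparing against the 2 GHops/s estimate):

```python
# Sketch: per-GPU jump rate from the aggregate AWS figure.
# Assumes the 6515 Mj/s total is split evenly across the 4 cards
# and that 1 jump == 1 hop when comparing with the GHops/s estimate.

TOTAL_RATE_MJS = 6515   # aggregate rate reported for 4x Tesla V100, in Mj/s
NUM_GPUS = 4

per_gpu_mjs = TOTAL_RATE_MJS / NUM_GPUS   # ~1628.75 Mj/s per V100
per_gpu_ghops = per_gpu_mjs / 1000        # ~1.63 GHops/s under the 1 jump = 1 hop assumption

print(f"Per-GPU rate: {per_gpu_mjs:.0f} Mj/s (~{per_gpu_ghops:.2f} GHops/s)")
```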
This is really cool! j2002ba2, thank you very much for posting this reference point.