Try playing with the grid size, e.g. -g 640,256,640,256,...
In DaveF's benchmark, a Tesla V100-SXM2-16GB gets 1.815 GK/s per board.
You have 40 GK/s, so 2.5 GK/s per board, not so bad...
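For anyone following along, here is a rough sketch of the arithmetic above. The board count of 16 is an assumption inferred from 40 / 2.5 (it is not stated in the thread), and `grid_arg` is a hypothetical helper that just repeats one x,y pair per board, assuming that is how the -g list is laid out.

```python
# Sketch only: BOARDS is inferred from 40 GK/s / 2.5 GK/s, not stated in the thread.
BOARDS = 16
V100_REFERENCE_GKS = 1.815  # per-board figure from DaveF's benchmark


def per_board_rate(total_gks: float, boards: int) -> float:
    """Average throughput per board in GK/s."""
    return total_gks / boards


def grid_arg(boards: int, x: int = 640, y: int = 256) -> str:
    """Hypothetical helper: repeat one 'x,y' grid pair per board."""
    return ",".join(f"{x},{y}" for _ in range(boards))


rate = per_board_rate(40.0, BOARDS)
print(f"per board: {rate} GK/s")
print(f"vs. benchmark: {rate / V100_REFERENCE_GKS:.2f}x")
print("-g " + grid_arg(2))
```

Which grid values work best will depend on the hardware, so treat 640,256 only as the starting point suggested above.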
OMG, I changed the grid size as you suggested and now I am getting 51 GK/s, amazing...
Is there any chance of using regex with the GPU?