The TOTAL counts aren't comparable between use_gpu_fermat_test true and false; you should look at the 2-chain counts instead. I'll remove the TOTAL count in the next version.
Good to know, thanks! Looking at the 2- and 3-chain counts, the gpu_fermat_test is faster by ~1.4x (I do have a weak CPU in there). Is the difference in candidates (1/9 as many) a red herring, or is it just the way it is reported?