...
I only looked briefly, but you can add a random initial value for the points.
There is an accelerated version of GPUMath.h in which the values _0[], _1[] and _P[] from __constant__ memory are not used. MM64 is declared as a macro instead: #define MM64 0xD838091DD2253531ULL. The modular inversion function is also faster.
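For anyone wondering where the MM64 value comes from, here is a minimal host-side sketch (C++14 or later). It assumes that _P holds the secp256k1 field prime p = 2^256 - 2^32 - 977, which is not stated above; under that assumption MM64 is the negated inverse of the lowest 64-bit limb of p modulo 2^64, the constant typically used in Montgomery-style reduction. The names P0, neg_inv64 and self_check are made up for this example and are not taken from GPUMath.h.

```cpp
#include <cassert>
#include <cstdint>

// Compile-time constant as quoted in the post: the compiler can embed it
// as an immediate, so no __constant__ memory read is needed to use it.
#define MM64 0xD838091DD2253531ULL

// ASSUMPTION: lowest 64-bit limb of the secp256k1 field prime
// p = 2^256 - 2^32 - 977. Hypothetical name, not from GPUMath.h.
constexpr uint64_t P0 = 0xFFFFFFFEFFFFFC2FULL;

// Returns -x^{-1} mod 2^64 for odd x (hypothetical helper).
// Newton iteration: each step doubles the number of correct low bits.
constexpr uint64_t neg_inv64(uint64_t x) {
    uint64_t inv = x;               // x * x == 1 (mod 8) for odd x: 3 bits correct
    for (int i = 0; i < 5; ++i) {   // 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits
        inv *= 2 - x * inv;         // unsigned arithmetic wraps mod 2^64
    }
    return 0 - inv;                 // negate mod 2^64
}

// Compile-time check that the macro matches the derived value.
static_assert(neg_inv64(P0) == MM64, "MM64 is not -P0^-1 mod 2^64");

// Runtime check of the defining property: MM64 * P0 == -1 (mod 2^64).
inline void self_check() {
    assert(MM64 * P0 == UINT64_MAX);
}
```

If the static_assert compiles, the macro agrees with the value derived from P0. Calling self_check() from a test confirms the same property at run time.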