The GPU part of the code is way too complex to be efficient.
Consider simplifying it (a lot!)
That is the point: the binary transformations involved produce unique values (no repeats) and reach parts of the range that random or sequential generation may never reach.
Yes, it's not that fast, but it's unique in its own way.
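The "unique values" property holds whenever every transformation step is a bijection on a fixed-width integer. The GPU code itself isn't shown here, so this is a hypothetical sketch that assumes 16-bit values and a made-up `scramble` built from standard invertible steps: XOR with a constant, multiplication by an odd constant modulo 2^16, bit rotation, and an xorshift. The constants and the width are illustrative only.

```python
# Hypothetical sketch: the actual transformations in the GPU code are not shown,
# so the width and constants below are assumptions chosen for illustration.
BITS = 16
MASK = (1 << BITS) - 1

def rotl(x: int, r: int) -> int:
    """Rotate a BITS-wide integer left by r bits (only permutes bit positions)."""
    return ((x << r) | (x >> (BITS - r))) & MASK

def scramble(x: int) -> int:
    """Compose invertible steps, so distinct inputs always give distinct outputs."""
    x ^= 0x5A5A               # XOR with a constant is its own inverse
    x = (x * 0x9E37) & MASK   # multiplying by an odd constant mod 2**BITS is invertible
    x = rotl(x, 5)            # rotation is a permutation of bits
    x ^= x >> 7               # xorshift with a nonzero shift is invertible
    return x

if __name__ == "__main__":
    outputs = [scramble(i) for i in range(1 << BITS)]
    # Every input in [0, 2**BITS) maps to a different output: no repeats.
    print(len(set(outputs)) == len(outputs))  # True
    print([hex(v) for v in outputs[:4]])
```

Feeding the indices 0, 1, 2, … through such a function visits every value in the range exactly once, but in a scrambled order. Drawing values at random with replacement will eventually repeat, and a plain counter visits them in order. The cost is a few extra operations per value, which matches the trade-off discussed above.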