How feasible are you talking? So you are working on searching a 256-bit interval using a DP mask of 128+ bits too?
I forgot about this thread - sorry about that.
Yeah, I am. So far I have the actual program running on the GPU. It does ~260 MKeys/s with the expanded dpmask on my T4, but it's producing quite a large number of same-herd collisions and dead kangaroos. I had expected it to be much faster, around ~1500 MKeys/s, given that I saw someone's V100 do ~1100 MKeys/s.
I doubt that checking three more uint64s for equality within the main loop is what's causing this speed drop, but it's a good opportunity to peek into the CUDA-accelerated Int class and see what else can be sped up.
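For anyone following along, here is roughly what I mean by the wider check. This is only a minimal host-side C++ sketch, not the actual kernel code: the `U256` type, its little-endian limb layout, and the names `makeDpMask` and `isDistinguished` are all made up for illustration, and I'm assuming the mask selects the leading bits of the x-coordinate. A point counts as distinguished when every masked bit is zero.

```cpp
#include <array>
#include <cstdint>

// Hypothetical 256-bit value: four 64-bit limbs, limb 0 is least significant.
using U256 = std::array<uint64_t, 4>;

// Build a mask with the top `bits` bits set (0 <= bits <= 256).
// Assumption: the DP test looks at the leading bits of x.
inline U256 makeDpMask(unsigned bits) {
    U256 m{0, 0, 0, 0};
    for (int i = 3; i >= 0 && bits > 0; --i) {
        if (bits >= 64) {
            m[i] = ~0ULL;                 // whole limb is masked
            bits -= 64;
        } else {
            m[i] = ~0ULL << (64 - bits);  // only the top `bits` bits of this limb
            bits = 0;
        }
    }
    return m;
}

// True when all masked bits of x are zero.
// Folding the four limbs into one accumulator keeps the test to a single
// comparison, so the extra limbs cost three ANDs and three ORs, no branches.
inline bool isDistinguished(const U256& x, const U256& mask) {
    uint64_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc |= x[i] & mask[i];
    }
    return acc == 0;
}
```

With a 130-bit mask, for example, limbs 3 and 2 are fully masked and only the top two bits of limb 1 are checked. That is a handful of integer ops per step, which is why I suspect the slowdown is coming from somewhere else.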