Christian, are you communicating with your Nvidia Friend about CUDA 6? Will it give any performance enhancements for our old Fermi cards?
So far the communication has been limited to a kernel submission from Nvidia.
It's a high-register-count Compute 3.5 kernel (1 hash per thread) that gives a marginal improvement over Dave Andersen's work. Unfortunately, it's not well suited to implementing a LOOKUP_GAP.
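
For anyone unfamiliar with the term, LOOKUP_GAP is the scrypt time-memory trade-off: only every Nth scratchpad entry is stored, and the skipped ones are recomputed on demand. Below is a minimal Python sketch of the idea, not the actual kernel code. The names `mix`, `index_of` and `romix` are made up for illustration, and SHA-256 stands in for scrypt's real Salsa20/8 block mix.

```python
import hashlib

def mix(block: bytes) -> bytes:
    # Assumption: SHA-256 as a stand-in for scrypt's Salsa20/8 BlockMix.
    return hashlib.sha256(block).digest()

def index_of(block: bytes, n: int) -> int:
    # Toy version of scrypt's Integerify: derive a scratchpad index from the block.
    return int.from_bytes(block[:4], "little") % n

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))

def romix(seed: bytes, n: int, lookup_gap: int = 1) -> bytes:
    # Fill phase: walk the chain of n entries, keeping only every lookup_gap-th one.
    stored = []
    x = seed
    for i in range(n):
        if i % lookup_gap == 0:
            stored.append(x)
        x = mix(x)
    # Read phase: fetch the nearest stored entry at or below j,
    # then recompute the (j % lookup_gap) skipped steps.
    for _step in range(n):
        j = index_of(x, n)
        v = stored[j // lookup_gap]
        for _redo in range(j % lookup_gap):
            v = mix(v)
        x = mix(xor(x, v))
    return x

seed = hashlib.sha256(b"example header").digest()
full = romix(seed, 64, lookup_gap=1)   # stores all 64 entries
half = romix(seed, 64, lookup_gap=2)   # stores 32 entries, recomputes the rest
print(full == half)
```

With a gap of 2 the scratchpad needs half the memory, and the result is unchanged, at the cost of extra mixing work in the read phase.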
Christian