Thanks again for the fast update. The jackpot algo works great on Kepler and Maxwell cards, but unfortunately not on Fermi cards.
Here is what I got:
780 - 6 MH/s
680 - 3.7 MH/s
750 Ti - 2.9 MH/s
Sorry, our stream compaction code currently uses the __shfl intrinsics (warp shuffle), which require compute capability 3.0 or higher. Fermi cards are compute capability 2.x, so they cannot run it.
I might try to add some emulation code later. It will be a bit slower though.
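To illustrate what such an emulation could look like, here is a minimal host-side C++ sketch that models a single 32-lane warp. It is not the miner's actual code: the names emulated_shfl_up, warp_inclusive_scan and compact_warp are hypothetical, and the array passed between lanes stands in for the shared-memory buffer a Compute 2.x kernel would presumably need in place of the shuffle intrinsic.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kWarpSize = 32;  // lanes per warp on NVIDIA GPUs

using WarpRegs = std::array<uint32_t, kWarpSize>;

// Hypothetical stand-in for a warp shuffle-up: lane i reads the value held by
// lane (i - delta). Lanes with i < delta keep their own value.
// On Fermi this exchange would have to go through shared memory instead.
WarpRegs emulated_shfl_up(const WarpRegs& shared, int delta) {
    WarpRegs out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = (lane >= delta) ? shared[lane - delta] : shared[lane];
    }
    return out;
}

// Inclusive prefix sum across the warp in log2(32) = 5 shuffle steps.
WarpRegs warp_inclusive_scan(WarpRegs v) {
    for (int delta = 1; delta < kWarpSize; delta *= 2) {
        const WarpRegs up = emulated_shfl_up(v, delta);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if (lane >= delta) v[lane] += up[lane];
        }
    }
    return v;
}

// Stream compaction: pack the values of all lanes whose keep flag is set
// into a dense output, preserving lane order.
std::vector<uint32_t> compact_warp(const WarpRegs& values,
                                   const std::array<bool, kWarpSize>& keep) {
    WarpRegs flags{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        flags[lane] = keep[lane] ? 1u : 0u;
    }
    const WarpRegs scan = warp_inclusive_scan(flags);

    // The last lane's scan value is the total number of kept elements.
    std::vector<uint32_t> out(scan[kWarpSize - 1]);
    for (int lane = 0; lane < kWarpSize; ++lane) {
        // scan[lane] - 1 is the output slot for a kept lane.
        if (keep[lane]) out[scan[lane] - 1] = values[lane];
    }
    return out;
}
```

Calling compact_warp with a keep flag set on every third lane returns just those lanes' values, packed in lane order. On real hardware each emulated shuffle step would cost a shared-memory write, a read and the associated synchronization, which is why an emulated path would likely run a bit slower than the native intrinsic.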