on Kepler, around 15% better.
on my Kepler card (12x48) it's ~10 H/s worse...
What hash rate are you getting with that card? I have a GTX 660 SC 2GB that's getting ~220 H/s using 16x40. That's on Tsiv's latest; Wolf0's latest gets ~230 H/s.
Good to hear some good news, finally.

EDIT: That new middle loop seems to perform about the same, sadly.
Just a quick question, Wolf: what version of the CUDA Toolkit are you using? I have it compiled with 5.5.