on Kepler, around 15% better.
On my Kepler card (12x48) it's ~10 H/s worse...
What hash rate are you getting with that card? I have a GTX 660 SC 2GB that's getting ~220 H/s using 16x40. That's on Tsiv's latest; Wolf0's latest gets ~230 H/s.
Good to hear some good news, finally.

EDIT: That new middle loop seems to perform about the same, sadly.
Just a quick question, Wolf: what version of the CUDA Toolkit are you using? I compiled it with 5.5.
6.0. Also, I made the scratchpad pointer in the second loop restricted; it seems to provide a small hashrate bump.
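For anyone wondering what "restricted" means here: in CUDA the qualifier is spelled `__restrict__`, and plain C99 has the same idea as `restrict`. Below is a rough sketch in plain C, not code from either miner. The function name `mix_scratchpad`, its parameters, and the mixing steps are all made up for illustration. The qualifier is a promise to the compiler that the two pointers never refer to overlapping memory, so it does not have to reload scratchpad values after every write to the state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a scratchpad loop (illustration only).
 * `restrict` tells the compiler that `state` and `scratchpad` do not
 * alias, so writes through `state` cannot change what `scratchpad`
 * points at. Breaking that promise is undefined behavior.
 */
void mix_scratchpad(uint64_t *restrict state,
                    const uint64_t *restrict scratchpad,
                    size_t n_words,
                    size_t iters)
{
    assert(n_words > 0); /* guard the modulo below */

    for (size_t i = 0; i < iters; ++i) {
        /* Pick a data-dependent index into the scratchpad. */
        size_t idx = (size_t)(state[0] % n_words);

        /* Both reads use the index computed before state changes. */
        state[0] ^= scratchpad[idx];
        state[1] += scratchpad[(idx + 1) % n_words];
    }
}
```

Whether this gives a measurable gain depends on the compiler and the card, so treat any speedup as something to benchmark rather than assume.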
Okay, I will pull it and give it a test on my card.
Update: The new one did give me a boost to an avg. of 235 H/s in the miner. Average at the pool is 250 H/s.
Update 2: I decided to compile a new version of Tsiv's. I am getting 255 H/s on the miner and 280 H/s at the pool.