Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, and the result is spectacularly unimpressive. The best I've come up with breaks even with the current single-thread-per-hash implementation. Well, almost. It's actually a percent slower AND loses compute 2.0 compatibility due to using shuffle. On the other hand, it performs a lot more reasonably across various launch configurations: 15 blocks of 32 threads works out just as well as the original 8x60 magic bullet for the 750 Ti.
At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious as to how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti in them would be willing to take it for a spin, I'd appreciate it. I've added the number of SMX/SMM/whateverthingmabobs to the miner thread start-up info. You'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2. 4/8/16/32/64 are the best bets.
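For anyone who wants to sweep settings systematically, here is a rough sketch of that rule of thumb: block counts that are multiples of the multiprocessor count, paired with 4/8/16/32/64 threads. The names `LaunchConfig`, `candidate_configs`, `sm_count` and `max_multiplier` are made up for illustration and are not part of ccminer; how a pair maps onto the miner's launch option is not covered here.

```cpp
#include <vector>

// One candidate launch configuration: thread blocks and threads per block.
struct LaunchConfig {
    int blocks;
    int threads;
};

// Enumerate candidates following the rule of thumb from the post:
// block count = multiple of the multiprocessor count,
// threads per block = power of two from 4 to 64.
// sm_count and max_multiplier are hypothetical parameters for this sketch.
std::vector<LaunchConfig> candidate_configs(int sm_count, int max_multiplier) {
    std::vector<LaunchConfig> out;
    for (int m = 1; m <= max_multiplier; ++m) {
        for (int threads = 4; threads <= 64; threads *= 2) {
            out.push_back({sm_count * m, threads});
        }
    }
    return out;
}
```

With `candidate_configs(5, 3)` (a 750 Ti has 5 multiprocessors) the list contains the 15 blocks of 32 threads combination mentioned above. For other cards, plug in the count the miner prints at thread start-up and benchmark each pair.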
https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip

Also, any chance of this code getting released already? Or are you competing against Wolf0?

It works like a charm: 220 H/s on a GTX 760, up from 190 before. GTX 750 Tis seem unchanged.
I get 270 H/s (peaks of 297 H/s with -l 8x50) with this release (v0.15-rc1, ccminer-cryptonight_20140723) on an overclocked GTX 760.
Thanks for that launch setting

306 H/s (MSI Gaming, +180 core, +500 mem). Still have to test what's the most stable, but thanks for giving me a start.

Ooh damn, you released that a looong time ago, tsiv. Should've noticed ^^"
EDIT: 320 H/s with +222 core, +666 mem

I'm waiting anxiously for a driver crash
