I'll just leave this here.
Damn, that's slow. Seems to scale almost perfectly with hardware memory bandwidth when comparing with Claymore's AMD miner. The R9 290X has 3.7x the theoretical memory bandwidth of the 750 Ti and does around 600 H/s. Surprise, 600 / 3.7 comes to around 162. Same story with the 270X and its roughly 2x mem bandwidth. Guess that's not entirely unexpected since there's a whole lot of global memory access going on with the cryptonight algo. Still poking at it, but I doubt it'll improve much without C&C level voodoo magic, and that's well beyond my skillset.
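The bandwidth comparison above is simple enough to write out. This is only a rough sketch in Python: it assumes hashrate scales linearly with theoretical memory bandwidth, uses only the 600 H/s and 3.7x figures quoted in the post, and the function name is made up for illustration.

```python
def scale_by_bandwidth(reference_hashrate, bandwidth_ratio):
    """Estimate the hashrate of a card that has 1/bandwidth_ratio of the
    reference card's memory bandwidth, assuming purely linear scaling."""
    return reference_hashrate / bandwidth_ratio


# Figures quoted in the post: R9 290X at ~600 H/s with ~3.7x the
# theoretical memory bandwidth of the 750 Ti.
r9_290x_hashrate = 600.0
ratio_290x_to_750ti = 3.7

estimate = scale_by_bandwidth(r9_290x_hashrate, ratio_290x_to_750ti)
print(f"Bandwidth-scaled estimate for the 750 Ti: {estimate:.0f} H/s")
```

If a measured rate lands close to that estimate, the kernel is most likely limited by memory bandwidth rather than by compute.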

Nice work! What sort of power usage do you see?
Ah, right. Forgot to mention that. That particular rig is pulling around 270W from the wall when running cryptonight. I guess that qualifies as mini-good-news.
Are you open sourcing it? (Otherwise I don't see the point of telling us it exists when we can't use it... this is somehow becoming a peculiarity of this thread.)
That's the idea, still needs some work though. Just added a command line switch to set how many blocks and threads per block to use in the kernel launch as it was hard coded for what my 750 Ti seems to like best. Odds are that it wouldn't run at its best on other cards with the same settings. Next up is looking at building on Windows, didn't look too hot on first try.
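For anyone wondering what such a switch might look like, here is a minimal sketch in Python of parsing a launch configuration. It is not the miner's actual code: the `--launch` flag name, the BLOCKSxTHREADS format and the default value are all hypothetical. The only hard fact used is that CUDA devices of compute capability 2.0 and later allow at most 1024 threads per block.

```python
import argparse

MAX_THREADS_PER_BLOCK = 1024  # CUDA limit on compute capability 2.0 and later


def parse_launch(value):
    """Parse a hypothetical 'BLOCKSxTHREADS' string into (blocks, threads)."""
    blocks_text, sep, threads_text = value.lower().partition("x")
    if not sep:
        raise argparse.ArgumentTypeError("expected BLOCKSxTHREADS, e.g. 8x32")
    try:
        blocks, threads = int(blocks_text), int(threads_text)
    except ValueError:
        raise argparse.ArgumentTypeError("blocks and threads must be integers")
    if blocks < 1 or not 1 <= threads <= MAX_THREADS_PER_BLOCK:
        raise argparse.ArgumentTypeError("blocks or threads out of range")
    return blocks, threads


def build_parser():
    parser = argparse.ArgumentParser(description="launch config sketch")
    # The default is a placeholder, not the value tuned for any real card.
    parser.add_argument("--launch", type=parse_launch, default=(8, 32),
                        help="kernel launch config as BLOCKSxTHREADS")
    return parser


args = build_parser().parse_args(["--launch", "16x64"])
blocks, threads = args.launch
print(f"{blocks} blocks x {threads} threads = {blocks * threads} total threads")
```

The parsed pair would then be handed to the kernel launch in place of the hard-coded values, so each card can be tuned without rebuilding.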
Is it possible to include GPU / CPU / RAM hybrid coding to improve speed? Maybe certain parts of the hash can be run by the CPU / RAM to boost speed. Perhaps cbuchner1 can give some pointers on increasing speed.

It seems certain that speed can be improved, since your rig is using very little power running it.