This new patch has been tested and is generating (at least on a test network). Moving the scratch space into thread-local memory has improved performance by roughly 60%, from ~1880 kh/s to ~3000 kh/s on my system.
Has anyone managed to build and test it on other platforms or GPUs?