What many of you are seeing is high CPU utilization caused by the Nvidia OpenCL drivers. Expect a fix for this in the next two to three days. This is also the reason large rigs crash: 6 instances on Nvidia currently require a CPU with 8 threads... I will definitely handle this next.
CL Local Size is currently fixed to 256 because the OpenCL part requires it. Selecting a lower value might save some time in barriers, but I know that this is not currently the bottleneck to tackle for more speed. It may be a good idea for the future, though.
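One practical consequence of a fixed local size, sketched here under assumptions: in OpenCL 1.x the global work size passed to clEnqueueNDRangeKernel has to be a multiple of the local size, so a host program would typically round the work-item count up to the next multiple of 256. The helper name and the constant below are hypothetical illustrations, not taken from the miner's code.

```python
# Hypothetical helper, not from the miner's source: pads a work-item count
# so it is evenly divisible by the fixed local (work-group) size.
LOCAL_SIZE = 256  # the fixed CL Local Size mentioned in the post


def padded_global_size(work_items: int, local_size: int = LOCAL_SIZE) -> int:
    """Round work_items up to the next multiple of local_size."""
    if work_items <= 0 or local_size <= 0:
        raise ValueError("work_items and local_size must be positive")
    # Integer ceiling division, then scale back up to a multiple.
    groups = (work_items + local_size - 1) // local_size
    return groups * local_size
```

With this padding, the kernel itself would need to ignore the extra work-items beyond the real count, which is the usual pattern when the local size cannot be changed.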
I am surprised how many Windows users are here. I definitely have to find a Windows test system to get the Windows port stable.
@qwep: What GPU and driver version do you have?