You have to change this line in kernel.cl (tested with the poclbm kernel only):
u W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14, W15;
to
__local u W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14, W15;
Basically, add the __local keyword to this line and it should increase your performance.
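If you'd rather not edit the file by hand, here is a rough Python sketch that makes the same one-line change. The names add_local_qualifier and patch_file are made up for this example and are not part of poclbm. It assumes the declaration appears in your kernel.cl exactly as quoted above; if your copy is formatted differently it will just raise an error and leave the file alone.

```python
from pathlib import Path

# The declaration as quoted in this post, and the patched version.
OLD = "u W0, W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12, W13, W14, W15;"
NEW = "__local " + OLD


def add_local_qualifier(source: str) -> str:
    """Return the kernel source with the W0..W15 declaration marked __local.

    Hypothetical helper: it only does an exact text match on the line above.
    """
    if NEW in source:
        # Already patched, so return the text unchanged.
        return source
    if OLD not in source:
        raise ValueError("W0..W15 declaration not found; this kernel may differ")
    # Replace only the first occurrence of the declaration.
    return source.replace(OLD, NEW, 1)


def patch_file(path: str) -> None:
    """Rewrite the file at `path` in place (keep a backup of the original)."""
    kernel = Path(path)
    kernel.write_text(add_local_qualifier(kernel.read_text()))
```

Calling patch_file("kernel.cl") from the folder that holds the kernel applies the change. Running it a second time does nothing, since the patched line is detected first.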