cgminer must be parsing the output buffer differently from poclbm (see the SETFOUND macro in its kernels). Otherwise it looks good at a glance. Please also note that the kernel only returns diff 1 shares; for pool mining you will need to either hardcode the target or pass it to the kernel.
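To illustrate the difference between the diff 1 shortcut and a real pool target, here is a rough host-side sketch in C. It assumes the hash and target are 256-bit little-endian numbers held as eight 32-bit words with index 7 the most significant; the function names are made up for this example and are not taken from cgminer or poclbm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: hash[] and target[] are 256-bit little-endian values stored
 * as eight 32-bit words, with index 7 the most significant word. */
static bool hash_meets_target(const uint32_t hash[8], const uint32_t target[8])
{
    assert(hash != NULL && target != NULL);
    for (int i = 7; i >= 0; i--) {
        if (hash[i] < target[i]) return true;   /* strictly below target */
        if (hash[i] > target[i]) return false;  /* strictly above target */
    }
    return true; /* equal to the target counts as meeting it */
}

/* The diff 1 shortcut: only the most significant word is checked
 * (a slight approximation of the true diff 1 target). */
static bool is_diff1_share(const uint32_t hash[8])
{
    assert(hash != NULL);
    return hash[7] == 0;
}
```

A nonce that passes is_diff1_share() can still fail hash_meets_target() when the pool target is harder than diff 1, which is why the target has to reach the check somehow, either hardcoded or passed in.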
Yes, cgminer parses it differently. I added this:
// output[OUTPUT_SIZE] = output[nonce & OUTPUT_MASK] = nonce;
#define FOUND (0x0F)
#define SETFOUND(Xnonce) output[output[FOUND]++] = Xnonce
SETFOUND(nonce);
to your kernel, and it should work.
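For reference, this is roughly the buffer layout the macro implies, emulated in plain C on the host. It is only a sketch: the 16-word buffer size and the helper names set_found() and collect_nonces() are assumptions for illustration, not cgminer's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FOUND (0x0F)              /* slot holding the number of stored nonces */
#define OUTPUT_WORDS (FOUND + 1)  /* assumed buffer size: 16 words */

/* What SETFOUND does on the device, emulated on the host:
 * store the nonce at the current count, then bump the count. */
static void set_found(uint32_t *output, uint32_t nonce)
{
    output[output[FOUND]++] = nonce;
}

/* Copy the stored nonces into `nonces` (room for FOUND entries) and return
 * how many there were. The count is clamped to FOUND so that a corrupt
 * counter cannot index past the nonce slots. */
static size_t collect_nonces(const uint32_t *output, uint32_t *nonces)
{
    assert(output != NULL && nonces != NULL);
    size_t count = output[FOUND];
    if (count > FOUND)
        count = FOUND;
    for (size_t i = 0; i < count; i++)
        nonces[i] = output[i];
    return count;
}
```

Two things follow from this layout: the host has to zero the buffer (at least the FOUND slot) before each launch, and a 16th hit in one launch would overwrite the counter itself, so the scheme relies on hits being rare.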
Unfortunately, it never reaches this code (on a GeForce GT 650), and I don't know why...
You can try a lower diff (uncomment the '& 0xc0ffffff', for example) and see what happens. You can also use printf() in the kernel for debugging: it is built in from OpenCL 1.2, and on 1.1 it needs a vendor extension such as cl_amd_printf.
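To see why the mask lowers the difficulty, here is a small C sketch. It assumes the kernel's final test has the form "(hash word & mask) == 0", which is a guess at the structure rather than a copy of the actual kernel code; passes() and required_zero_bits() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical form of the kernel's final test: the hash word must be
 * zero in every bit position selected by `mask`. */
static bool passes(uint32_t hash_word, uint32_t mask)
{
    return (hash_word & mask) == 0;
}

/* Number of bits that must be zero under a given mask (its popcount). */
static int required_zero_bits(uint32_t mask)
{
    int bits = 0;
    for (; mask != 0; mask >>= 1)
        bits += (int)(mask & 1u);
    assert(bits >= 0 && bits <= 32);
    return bits;
}
```

With no mask all 32 bits must be zero; with 0xc0ffffff only 26 must be, so under this assumption about 2^6 = 64 times as many nonces should reach the SETFOUND line, which makes it much quicker to tell whether that code path is ever hit.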