I tried GTX970, GTX660, GTX750ti... I think I saw every CL error between -1000 and 1000.

But I don't have the skills for this.
Try -w 32768 or any ridiculously huge value ^^ It will be your best result.
Feel free to come on Discord. Good luck with that; it's definitely possible, with the appropriate skills.
Be aware that only one person claims to have run an NVIDIA miner, and I don't believe this person.
[2014-11-12 17:34:40] pps: 0 / 0.0000 10g/h 0.0000 / 0.0000 15g/h 0.0000 / 0.0000
We found our gap...
1460023432727399844421086333295273985776598611867186573289828851669587467837029 674331
[2014-11-12 17:34:41] curl_easy_perform() failed: Failed initialization
[2014-11-12 17:34:41] waiting for gapcoind ...
[2014-11-12 17:34:41] Found Share: 22.8490689436 => accepted
[2014-11-12 17:34:46] Got new target: 22.5826074437
I have just put in two hours of thinking on this, but I'm sure the problem lies in the chunk of tdata being too big to fit into the device memory; it should be halved. Your approach can possibly, in some cases, fit the end into memory and create something like a valid test.
Graham's solution seems to be the best, because if my solution works it may only be valid for the graphics chip I have, and it also depends on which OpenCL version is on the computer.
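To make the halving idea concrete, here is a rough sketch of what I mean, not code from the miner. The function names are made up for illustration, and I'm assuming the limit is a single number such as the value OpenCL reports for CL_DEVICE_MAX_MEM_ALLOC_SIZE via clGetDeviceInfo. It only does the size arithmetic; it doesn't touch the GPU.

```python
from typing import List, Tuple


def fit_chunk_size(total_bytes: int, max_alloc_bytes: int) -> int:
    """Halve the chunk size until one chunk fits under max_alloc_bytes.

    max_alloc_bytes is an assumed per-buffer limit (hypothetical input,
    e.g. whatever the device reports as its maximum allocation size).
    """
    if total_bytes <= 0 or max_alloc_bytes <= 0:
        raise ValueError("sizes must be positive")
    chunk = total_bytes
    while chunk > max_alloc_bytes:
        # Halve, rounding up, so the chunk never collapses to zero.
        chunk = (chunk + 1) // 2
    return chunk


def split_into_chunks(total_bytes: int, chunk_bytes: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering the data; the last may be shorter."""
    return [
        (offset, min(chunk_bytes, total_bytes - offset))
        for offset in range(0, total_bytes, chunk_bytes)
    ]
```

The chunk size depends on the limit the device reports, so a value that works on one card isn't guaranteed to fit on another, which is why I think a hard-coded halving may only be valid for my chip.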