It's not Diapolo's last changes I want in cgminer, but rather phateus's latest (phatk 2.2).
phatk 2.2 runs at 371 MH/s on my 5970s.
And now that I look at the code ... the phatk kernel expects an interface that is not
compatible with how cgminer calls it (the phatk OpenCL code requires pre-computed values
that cgminer doesn't provide).

I guess the approach phoenix took is the best: via the __init__.py module it lets a
kernel precompute whatever it wants before it calls the OpenCL code ... we would
need something similar for cgminer (i.e. a combination of a .cl kernel and some
init code). Unfortunately, that's kind of hard to pull off in C (short of adding a Lua
interpreter to cgminer). Tough.
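
For what it's worth, here is a rough sketch of how a ".cl kernel plus init code" pairing might look in plain C without an interpreter: each kernel is registered with a host-side precompute hook. Everything in it is made up for illustration (kernel_args, kernel_desc, find_kernel, the file names); none of it is cgminer's or phoenix's actual API, and the arithmetic in precalc_example is a placeholder, not the real phatk 2.2 precomputation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: values a kernel wants computed on the host before launch. */
struct kernel_args {
    uint32_t words[8];
    size_t count;
};

/* Hypothetical per-kernel hook, playing the role of phoenix's __init__.py. */
typedef void (*precalc_fn)(const uint32_t *state, struct kernel_args *out);

/* Hypothetical pairing of an OpenCL source file with its init code. */
struct kernel_desc {
    const char *name;    /* name the user would select */
    const char *cl_file; /* OpenCL source to build */
    precalc_fn precalc;  /* host-side precomputation */
};

/* A kernel that needs nothing extra: pass the 8 state words through. */
static void precalc_plain(const uint32_t *state, struct kernel_args *out)
{
    memcpy(out->words, state, 8 * sizeof(uint32_t));
    out->count = 8;
}

/* Placeholder arithmetic only, NOT what phatk 2.2 really precomputes. */
static void precalc_example(const uint32_t *state, struct kernel_args *out)
{
    out->words[0] = state[0] + state[1];
    out->words[1] = state[2] ^ state[3];
    out->count = 2;
}

/* Illustrative registry; the file names are invented. */
static const struct kernel_desc kernels[] = {
    { "plain",   "plain.cl",   precalc_plain   },
    { "example", "example.cl", precalc_example },
};

/* Look a kernel up by name; returns NULL if it is not registered. */
static const struct kernel_desc *find_kernel(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof(kernels) / sizeof(kernels[0]); i++) {
        if (strcmp(kernels[i].name, name) == 0)
            return &kernels[i];
    }
    return NULL;
}
```

The miner would call find_kernel() once at startup and then k->precalc() for each new unit of work before setting the OpenCL kernel arguments. The catch is that the hooks are compiled in, so adding a kernel means rebuilding the miner, which is exactly the flexibility an embedded interpreter would buy.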
I was mistakenly under the impression that your changes incorporated everything from the latest kernels. I may have to roll them all back and port it all myself. Darn.