I tried it on one of my 3 x 6990 mining rigs. Normally, using poclbm with the phatk kernel, I get 408 MH/s per GPU; with your kernel I get 407 MH/s per GPU.
This isn't exactly a fair comparison since phatk was specifically targeted at VLIW5 GPUs on SDK 2.4. A better comparison would be against phatk without this modification.