New version is ready, DL here:
http://www.mediafire.com/?f8b8q3w5u5p0ln0Updated first post with changelog and performance info. This one should be a bit faster on 69XX cards than the original phatk, faster than all other phatk versions I did on 58XX and faster on non BFI_INT cards because of a change user 1MLyg5WVFSMifFjkrZiyGW2nw suggested!
Try, have fun, comment and donate

.
Thanks,
Dia
Thanks, best version yet

Still not reached the 40 MHash/sec the wiki says my card could do

Did you notice that Ma(x, y, z) is defined exactly the same now whether BFI_INT is enabled or not? Seems more elegant to me if moved out of the #ifdef. Also I tried to replace some #define's with functions, guessing that it would make it easier for a somewhat smart compiler to find repeatedly used terms and put them into registers. No performance improvement, but didn't hurt it either.
Also, OpenCL has a builtin Ch function, not faster for me but maybe for someone else:
#define Ch(x, y, z) bitselect(z, y, x)