I've done a bit of reading on CUDA/Fermi and bank conflicts, and a bit of experimentation. I found that the reason I was getting higher performance from my 64-bit builds was the extra memory alignment offset the 64-bit build gets in the Fermi kernel. Adding the same offset to the 32-bit kernel brought it up to speed with the 64-bit one. Some reading suggested that 36-byte alignments are optimal for avoiding bank conflicts. I have managed to get up to 234 khash/s in Linux. I have tested it on a fresh Linux install with CUDA 5.0 and confirmed that it's not just a quirk of my CUDA 5.5 Windows setup.
Christian, you say you get 228 on your 560 Ti 448 in 32-bit. I only get about 222 with 32-bit in Windows or Linux, but this change has pushed it to ~234. I could see yours doing 240, provided my card isn't just unusually slow for some reason and all I've found is a way to work around that.
An easy way to test is to set:
#define _64BIT_ALIGN 3
This applies whether you build for 64-bit or 32-bit. It can easily be set back to 0 or 1 afterwards if it doesn't really do anything.