You need to either change the assembly instructions that require aligned input to their unaligned equivalents (see http://bitcointalk.org/index.php?topic=453.msg5774#msg5774), or make sure the blocks being hashed are aligned. I haven't tried it yet, but this assembly code combined with the state caching modification should make this blazing fast.
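
For the second option, here is a rough sketch in C of what "making the blocks aligned" could look like. I'm assuming the assembly uses SSE2 loads that need a 16-byte boundary (movdqa faults on misaligned addresses, movdqu does not); I haven't checked that against the actual code. The names aligned_block, is_aligned16 and load_block are made up for illustration and are not from the Bitcoin source.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical block type: 64 bytes (one SHA-256 input block),
 * forced onto a 16-byte boundary. The 16 is an assumption about
 * what the aligned instructions in the assembly require. */
typedef struct {
    alignas(16) unsigned char bytes[64];
} aligned_block;

/* Compile-time check that the compiler honours the alignment request. */
static_assert(alignof(aligned_block) == 16, "block must be 16-byte aligned");

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Copies possibly unaligned input into an aligned block, so the
 * hashing routine only ever sees aligned memory. */
static void load_block(aligned_block *dst, const unsigned char *src) {
    memcpy(dst->bytes, src, sizeof dst->bytes);
}
```

The extra memcpy costs something, so if the data is copied for every hash it may be cheaper to just switch the instructions to the unaligned forms. Declaring the buffers aligned in the first place avoids the copy entirely.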