I was able to integrate the SHA256 functionality from Crypto++ 5.6.0 into Bitcoin. It uses the SSE2 assembly code and is the fastest SHA256 yet. Since Bitcoin was sending unaligned data to the block hash function, I had to change the MOVDQA instruction to MOVDQU (MOVDQA faults when its memory operand is not 16-byte aligned, while MOVDQU accepts any address).
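For anyone unsure what the MOVDQA/MOVDQU difference looks like in practice, here is a minimal sketch using the SSE2 intrinsics rather than the Crypto++ assembly itself. The helper names (IsAligned16, Load16Aligned, Load16Unaligned) are made up for illustration and are not part of Crypto++ or Bitcoin. On builds without SSE2 the sketch falls back to memcpy so it still compiles.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: true if p sits on a 16-byte boundary,
// which is what MOVDQA requires of a memory operand.
inline bool IsAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
}

// Hypothetical helper: aligned 16-byte load. _mm_load_si128 is the
// aligned form (typically MOVDQA), so src must be 16-byte aligned.
inline void Load16Aligned(const unsigned char* src, unsigned char* dst) {
    assert(IsAligned16(src));
#ifdef SKETCH_HAVE_SSE2
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
#else
    std::memcpy(dst, src, 16);
#endif
}

// Hypothetical helper: unaligned 16-byte load. _mm_loadu_si128 is the
// unaligned form (typically MOVDQU) and accepts any source address.
inline void Load16Unaligned(const unsigned char* src, unsigned char* dst) {
#ifdef SKETCH_HAVE_SSE2
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
#else
    std::memcpy(dst, src, 16);
#endif
}
```

The alternative to switching instructions would be to make sure the caller always hands over a 16-byte-aligned buffer (for example with alignas(16)), in which case the aligned load could stay.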
I think using the SHA256 functionality from Crypto++ 5.6.0 is the way forward right now.
http://www.filedropper.com/bitcoin-033
Is this the x86 asm? I dumped out the x64 asm and integrated it and performance has proved to be nothing short of blistering.