I added the cached SHA256 state idea to the SVN, rev 113. The speedup is about 70%. I credited it to tcatm based on your post in the x64 thread.
I can compile the Crypto++ 5.6.0 ASM SHA code with MinGW but as soon as it runs it crashes. It says its for MASM (Microsoft's assembler) and the sample command line they give looks like Visual C++. Does it only work with the MSVC and Intel compilers?