There was an issue with hashing multiple blocks in the binary above. I've corrected the source for now, but I won't be able to compile a new binary until tomorrow. Here's the corrected source for anyone who is interested.
http://www.filedropper.com/bitcoin-032_2

I gave it a try using the x64 Intel compiler with full optimization. Performance is practically identical to the stock algorithm; in fact, the new algo seems marginally worse.