1 thread: ~4200 khash/s
2 threads: ~7700 khash/s
Suffice it to say that I leverage SSE instructions to calculate four hashes at once, per thread.
I tried to implement your idea and my SSE code is almost exactly 4x as fast as the vanilla code (~2000 khash/s with one thread, up from ~500). However, when running two threads I only get ~3000 khash/s. There is room to optimize my code, but still, the improvement is way lower that yours.
I don't think pure SSE can be exactly four times as fast as a well optimized C code, mostly because SSE lacks rotate instructions. Did you implement SHA completely in SSE or are you mixing SSE and C?