Hey, I've been working on the hashing asm, as I said before: removing redundant functions and register moves, using logic to modify sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to. Here's what I've done so far. It's not much, but it works. Don't go changing the GitHub source just yet, though. For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file. For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.
I pasted your ASM into sha256_xmm_amd64.asm and changed "movntdqa" to "movdqa" like you said for sse2. But I get a linker error.
...
cgminer-sha256_sse2_amd64.o: In function `scanhash_sse2_64':
sha256_sse2_amd64.c:(.text+0x4fb): undefined reference to `CalcSha256_x64'
sha256_sse2_amd64.c:(.text+0x50b): undefined reference to `CalcSha256_x64'
collect2: ld returned 1 exit status
...
I had to change "CalcSha256_x64_sse4" to "CalcSha256_x64" in two spots. Then the compile went just fine. I'm running it now to see if it's any faster and if any work actually gets accepted, but hopefully it's bug free.
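For anyone else doing the SSE2 port, the two text edits described above (movntdqa to movdqa, and CalcSha256_x64_sse4 to CalcSha256_x64) can be scripted instead of done by hand. This is only a rough sketch in Python: the function name port_sse4_asm_to_sse2 is made up for illustration, and it assumes the pasted asm needs nothing beyond these two plain substitutions.

```python
def port_sse4_asm_to_sse2(asm_text: str) -> str:
    """Apply the two search-replace edits from the thread to SSE4 asm source.

    Hypothetical helper, not part of cgminer.
    """
    # movntdqa needs SSE4.1; swap in the plain SSE2 aligned move movdqa.
    text = asm_text.replace("movntdqa", "movdqa")
    # Rename the exported symbol so it matches the name that
    # sha256_sse2_amd64.c references (CalcSha256_x64).
    text = text.replace("CalcSha256_x64_sse4", "CalcSha256_x64")
    return text


if __name__ == "__main__":
    # Tiny made-up sample, not the real hashing source.
    sample = (
        "global CalcSha256_x64_sse4\n"
        "CalcSha256_x64_sse4:\n"
        "    movntdqa xmm0, [rdi]\n"
    )
    print(port_sse4_asm_to_sse2(sample))
```

Reading the pasted asm from disk and writing the result into your sse2 asm file is left to the caller, so the script never overwrites anything on its own.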
btw, doesn't the assembler do basic inline math before assembling?
P.S. Hashrate looks really close to the same but I did get a work unit accepted just now.
EDIT: so the increase in speed, if any, is around 1%, maybe slightly more. I only have two cores at 3.5 Mh/s each, so it's hard to see the difference on the scale of Mhash/s.