I made a couple of changes to the SSE2 source that sped it up by about 5 to 6%.
I changed:
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
to
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(x0, x1),_mm_add_epi32( x2,x3))
It just reorders the adds. In the original there is a data dependency: each add depends on the result of the one before it, so they have to run one after another. The way I reordered it, the two inner adds are independent of each other, so the CPU can work on them at the same time. This macro is used a lot, so that little change can add up. (On an older machine it made no difference, so YMMV.)
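Here is a small sketch for checking that the reordering does not change the result. 32-bit integer adds wrap around and are associative, so both forms should give the same value in every lane. The names add4_chain, add4_tree and add4_forms_agree are made up for this example and are not in the miner source.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Original form: a serial chain, each add waits for the previous result. */
#define add4_chain(x0, x1, x2, x3) \
    _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)

/* Reordered form: (x0 + x1) and (x2 + x3) do not depend on each other. */
#define add4_tree(x0, x1, x2, x3) \
    _mm_add_epi32(_mm_add_epi32(x0, x1), _mm_add_epi32(x2, x3))

/* Hypothetical helper: returns 1 if both forms agree in all four lanes. */
static int add4_forms_agree(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    __m128i x0 = _mm_set1_epi32((int)a);
    __m128i x1 = _mm_set1_epi32((int)b);
    __m128i x2 = _mm_set1_epi32((int)c);
    __m128i x3 = _mm_set1_epi32((int)d);

    __m128i chain = add4_chain(x0, x1, x2, x3);
    __m128i tree  = add4_tree(x0, x1, x2, x3);

    /* cmpeq sets every bit of a lane that matches; movemask collects
       one bit per byte, so 0xFFFF means all 16 bytes are equal. */
    return _mm_movemask_epi8(_mm_cmpeq_epi32(chain, tree)) == 0xFFFF;
}
```

This only shows the two macros are equivalent. Whether the second one is actually faster depends on the CPU and the compiler, as noted above.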
A portion of the nonce calculation is repeated over and over, even though the result is the same. I moved
nonce = _mm_set1_epi32(In[3]);
nonce = _mm_add_epi32(nonce, offset);
out of the "for(k = 0; k
Here's a diff (the lines marked > are the new code, the lines marked < are the original):
153c153
> __m128i nonce,preNonce;
---
< __m128i nonce;
157d156
> preNonce = _mm_add_epi32(_mm_set1_epi32(In[3]),offset);
179,182c178,180
> //nonce = _mm_set1_epi32(In[3]);
> //nonce = _mm_add_epi32(nonce, offset);
> //nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
> nonce = _mm_add_epi32(preNonce,_mm_set1_epi32(k));
---
< nonce = _mm_set1_epi32(In[3]);
< nonce = _mm_add_epi32(nonce, offset);
< nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
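The idea in the diff can be checked in isolation with a sketch like the one below. It computes the nonce vector both ways for a range of k and compares them. The offset values, the loop bound and the step of 4 are assumptions for illustration only (the real ones come from the miner source), and hoisted_nonce_matches is a made-up name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical check: returns 1 if the hoisted computation matches the
   original one for every k tried. */
static int hoisted_nonce_matches(uint32_t in3, uint32_t iterations)
{
    /* Assumed per-lane offsets; the real values are set elsewhere. */
    const __m128i offset = _mm_set_epi32(3, 2, 1, 0);
    /* Loop-invariant part, computed once before the loop. */
    const __m128i preNonce = _mm_add_epi32(_mm_set1_epi32((int)in3), offset);
    uint32_t k;

    for (k = 0; k < iterations; k += 4) {  /* assumed bound and step */
        /* Original: three operations on every pass. */
        __m128i nonce_old = _mm_set1_epi32((int)in3);
        nonce_old = _mm_add_epi32(nonce_old, offset);
        nonce_old = _mm_add_epi32(nonce_old, _mm_set1_epi32((int)k));

        /* Hoisted: only the k-dependent add stays in the loop. */
        __m128i nonce_new = _mm_add_epi32(preNonce, _mm_set1_epi32((int)k));

        if (_mm_movemask_epi8(_mm_cmpeq_epi32(nonce_old, nonce_new)) != 0xFFFF)
            return 0;
    }
    return 1;
}
```

An optimizing compiler may already hoist this on its own, so the gain from doing it by hand can vary.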
I have been running this for a couple of days on the mining pool and have generated shares.