Board: Development & Technical Discussion
Topic: Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10
by BeeCee1 on 21/01/2011, 02:11:10 UTC
I made a couple of changes to the SSE2 source that sped it up by about 5 or 6%:

I changed:
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
to
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(x0, x1), _mm_add_epi32(x2, x3))

This just reorders the adds. In the original there is a data dependency: each add depends on the result of the one before it. In the reordered version the two inner adds are independent of each other, so the CPU can overlap them, and only the final add has to wait. The macro is used many times, so that little change can add up. (On an older machine it made no difference, so YMMV.)
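
A minimal sketch to check that the two forms give the same answer, since 32-bit adds wrap around and can be regrouped freely. The names add4_chain, add4_tree and add4_forms_agree are made up for this example and are not in the miner source; the test values are arbitrary.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Original form: each add waits on the result of the previous one. */
#define add4_chain(x0, x1, x2, x3) \
    _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)

/* Reordered form: the two inner adds do not depend on each other. */
#define add4_tree(x0, x1, x2, x3) \
    _mm_add_epi32(_mm_add_epi32(x0, x1), _mm_add_epi32(x2, x3))

/* Hypothetical helper: returns 1 if both forms agree on all four lanes. */
static int add4_forms_agree(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    /* Rotate the inputs so every lane holds a different mix of values. */
    __m128i x0 = _mm_set_epi32((int)a, (int)b, (int)c, (int)d);
    __m128i x1 = _mm_set_epi32((int)b, (int)c, (int)d, (int)a);
    __m128i x2 = _mm_set_epi32((int)c, (int)d, (int)a, (int)b);
    __m128i x3 = _mm_set_epi32((int)d, (int)a, (int)b, (int)c);

    __m128i r_chain = add4_chain(x0, x1, x2, x3);
    __m128i r_tree  = add4_tree(x0, x1, x2, x3);

    /* cmpeq sets a lane to all ones where equal; movemask collects 16 byte flags. */
    return _mm_movemask_epi8(_mm_cmpeq_epi32(r_chain, r_tree)) == 0xFFFF;
}
```

Calling add4_forms_agree(0xFFFFFFFFu, 0x80000000u, 0x7FFFFFFFu, 5u) exercises the wrap-around case. This only checks correctness; whether the reordering is actually faster depends on the CPU and on what the compiler already does.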

A portion of the nonce calculation is repeated over and over, even though the result is the same.  I moved
nonce = _mm_set1_epi32(In[3]);
nonce = _mm_add_epi32(nonce, offset);
out of the "for(k = 0; k
Here's a diff:
153c153
>     __m128i nonce,preNonce;
---
<     __m128i nonce;
157d156
>     preNonce = _mm_add_epi32(_mm_set1_epi32(In[3]),offset);
179,182c178,180
>         //nonce = _mm_set1_epi32(In[3]);
>         //nonce = _mm_add_epi32(nonce, offset);
>         //nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
>         nonce = _mm_add_epi32(preNonce,_mm_set1_epi32(k));
---
<         nonce = _mm_set1_epi32(In[3]);
<         nonce = _mm_add_epi32(nonce, offset);
<         nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
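
The hoisting in the diff above can be sketched on its own as follows. This is only an illustration: the offset lanes (0, 1, 2, 3), the loop bound and the step of 4 are assumptions made for the example, and nonce_in_loop and hoisted_matches are hypothetical names, not functions from the miner.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Per-iteration form: rebuilds In[3] + offset on every pass. */
static __m128i nonce_in_loop(uint32_t in3, __m128i offset, uint32_t k)
{
    __m128i nonce = _mm_set1_epi32((int)in3);
    nonce = _mm_add_epi32(nonce, offset);
    nonce = _mm_add_epi32(nonce, _mm_set1_epi32((int)k));
    return nonce;
}

/* Hoisted form: In[3] + offset is computed once, before the loop.
 * Returns 1 if it matches the per-iteration form for every k tried. */
static int hoisted_matches(uint32_t in3, uint32_t iterations)
{
    /* Assumed lane offsets; the real values come from the miner source. */
    __m128i offset = _mm_set_epi32(3, 2, 1, 0);

    /* Loop-invariant part, computed a single time. */
    __m128i preNonce = _mm_add_epi32(_mm_set1_epi32((int)in3), offset);

    /* Assumed loop shape: k advances by 4, one step per group of 4 lanes. */
    for (uint32_t k = 0; k < iterations; k += 4) {
        __m128i nonce = _mm_add_epi32(preNonce, _mm_set1_epi32((int)k));
        __m128i ref   = nonce_in_loop(in3, offset, k);
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(nonce, ref)) != 0xFFFF)
            return 0;  /* some lane differs */
    }
    return 1;
}
```

The change saves one broadcast and one add per iteration, which is small, but the loop runs a very large number of times.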


I have been running this for a couple of days on the mining pool and have generated shares.