Re: further improved phatk OpenCL Kernel (> 2% increase) for Phoenix

So here are some more changes:

I introduced const uint W17_2, containing P1(19) + 0x11002000. That's 3 shifts, 2 XORs and 1 add traded against one extra kernel parameter, which is well worth it.

extended self.f:
self.f = np.zeros(5, np.uint32)
to
self.f = np.zeros(6, np.uint32)

and added this just after the W17 calculation in calculateF:
   #W17_2
   self.f[5] = np.uint32(0x11002000+(
   rot(self.f[2], 32-13) ^
   rot(self.f[2], 32-15) ^
   (self.f[2] >> 10)
   ))

Then I added the new parameter (right after W17) to both the kernel call and the kernel function.

=> Effectively 3 ops saved.
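
To sanity-check the W17_2 value outside the miner, here is a minimal plain-Python sketch. It assumes that rot() in calculateF is a 32-bit right rotation, so rot(x, 32-13) and rot(x, 32-15) are right rotations by 19 and 17. The names rotr, sigma1 and w17_2 are made up for this example and are not part of Phoenix or phatk, and masking with 0xFFFFFFFF stands in for np.uint32.

```python
MASK = 0xFFFFFFFF  # keep every result within 32 bits, like np.uint32


def rotr(x, n):
    """Rotate the 32-bit value x right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma1(x):
    """The rotate/shift/XOR term from the snippet above (two rotates, one shift)."""
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def w17_2(w17):
    """Host-side precalculation: constant 0x11002000 plus sigma1(W17)."""
    return (0x11002000 + sigma1(w17)) & MASK


if __name__ == "__main__":
    # Hypothetical W17 value, only to show the call.
    print(hex(w17_2(0x00000001)))
```

Since W17 does not depend on the nonce, this value only has to be computed once per work unit on the host rather than by every work item.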

next change:
You can cut out W[0] to W[14] entirely! Most of them are zero anyway, so I only needed to hardcode the first ones.
Also, with some small changes W[73] to W[78] are no longer used, so there is no need to initialize them.

=> less memory use, but the same speed for me

Next one:
Round 3

#ifdef VECTORS
   Vals[4] = (W_3 = ((base + get_global_id(0)) << 1) + (uint2)(0, 1)) + PreVal4;
#else
   Vals[4] = (W_3 = base + get_global_id(0)) + PreVal4;
#endif

--
   // Round 3
   Vals[0] = state0 + Vals[4];
   Vals[4] += T1;

--

W[64 - O] = state0 + Vals[0];

you can reorganize and shorten round 3 to:
   Vals[0] = T1 + Vals[4];

needed changes in precalculation:
PreVal4 += T1
T1 = state0 - T1

=> another addition saved, almost effortlessly
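
To see why the reorganized round 3 gives the same results, here is a small plain-Python sketch that models both variants with 32-bit wraparound. The function names and test values are made up for this example and w3 stands for W_3; it only covers the three statements quoted above, not the rest of the kernel.

```python
MASK = 0xFFFFFFFF  # emulate uint wraparound


def round3_original(w3, preval4, t1, state0):
    """Round 3 as quoted above: two additions after Vals[4] is set."""
    v4 = (w3 + preval4) & MASK
    v0 = (state0 + v4) & MASK
    v4 = (v4 + t1) & MASK
    return v0, v4


def round3_reorganized(w3, preval4, t1, state0):
    """Round 3 with the adjusted precalculation: one addition in the kernel."""
    # Host side, once per work unit (the "needed changes in precalculation").
    preval4_new = (preval4 + t1) & MASK
    t1_new = (state0 - t1) & MASK
    # Kernel side, per work item.
    v4 = (w3 + preval4_new) & MASK  # equals W_3 + PreVal4 + T1
    v0 = (t1_new + v4) & MASK       # T1 cancels: state0 + W_3 + PreVal4
    return v0, v4


if __name__ == "__main__":
    # Arbitrary values, chosen only to exercise wraparound.
    args = (0xDEADBEEF, 0xFFFFFFF0, 0x12345678, 0x6A09E667)
    print(round3_original(*args) == round3_reorganized(*args))
```

The subtraction moves to the host and T1 cancels modulo 2^32, so both Vals[0] and Vals[4] end up unchanged while each work item does one addition less.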

Here are the files with these changes:
http://www.filesonic.com/file/1423103594

still some more to come!