So here are some more changes:
I introduced a const uint W17_2 containing P1(19) + 0x11002000. That's 3 shifts, 2 XORs and 1 add traded against one extra parameter, well worth it.
extended self.f:
self.f = np.zeros(5, np.uint32)
to
self.f = np.zeros(6, np.uint32)
just after W17 calculation in calculateF:
#W17_2
self.f[5] = np.uint32(0x11002000+(
rot(self.f[2], 32-13) ^
rot(self.f[2], 32-15) ^
(self.f[2] >> 10)
))
added the parameter (right after W17) in both the call and the function definition
=> Effectively 3 ops saved.
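For anyone who wants to check the W17_2 value outside the miner, here is a minimal standalone sketch in plain Python. It assumes that rot() in the host code is a 32-bit rotate right, so that rot(x, 32-13) and rot(x, 32-15) are rotations right by 19 and 17, i.e. the usual SHA-256 sigma1. The helper names rotr, sigma1 and w17_2 are made up for this example and are not part of the miner.

```python
MASK = 0xFFFFFFFF  # keep everything in 32 bits, like np.uint32 does


def rotr(x, n):
    """Rotate a 32-bit value right by n bits (assumed meaning of rot())."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma1(x):
    """SHA-256 small sigma1: rotr 17 ^ rotr 19 ^ shr 10."""
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def w17_2(w17):
    """Host-side precalculation: the constant 0x11002000 plus sigma1(W17)."""
    return (0x11002000 + sigma1(w17)) & MASK
```

The result would be computed once per work unit on the host and passed to the kernel as the extra parameter, instead of being recomputed by every work item.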
next change:
You can cut out all W0 to W14! Most of them are zero anyway, just needed to hardcode the first ones.
With some small changes, W[73] to W[78] are no longer used either, so there is no need to initialize them.
=> less memory use, but has the same speed for me
Next one:
Round 3
#ifdef VECTORS
Vals[4] = (W_3 = ((base + get_global_id(0)) << 1) + (uint2)(0, 1)) + PreVal4;
#else
Vals[4] = (W_3 = base + get_global_id(0)) + PreVal4;
#endif
--
// Round 3
Vals[0] = state0 + Vals[4];
Vals[4] += T1;
--
W[64 - O] = state0 + Vals[0];
you can reorganize and shorten round 3 to:
Vals[0] = T1 + Vals[4];
needed changes in the precalculation (in this order):
PreVal4 += T1
T1 = state0 - T1
=> another addition saved, almost effortlessly
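If you want to convince yourself that the shortened round 3 gives the same Vals[0] and Vals[4], here is a small sketch in plain Python that models both versions with 32-bit wraparound. The function names are made up for the example, and w3 stands in for base + get_global_id(0) (non-vector case only).

```python
MASK = 0xFFFFFFFF  # uint arithmetic wraps modulo 2**32


def round3_original(w3, preval4, t1, state0):
    """Original kernel code: two additions after Vals[4] is set up."""
    v4 = (w3 + preval4) & MASK
    v0 = (state0 + v4) & MASK
    v4 = (v4 + t1) & MASK
    return v0, v4


def precalc(preval4, t1, state0):
    """Changed host-side precalculation; the order of the two lines matters."""
    preval4 = (preval4 + t1) & MASK
    t1 = (state0 - t1) & MASK
    return preval4, t1


def round3_short(w3, preval4, t1):
    """Shortened kernel code: a single addition after Vals[4] is set up."""
    v4 = (w3 + preval4) & MASK
    v0 = (t1 + v4) & MASK
    return v0, v4
```

The T1 that was folded into PreVal4 cancels against the -T1 in the new T1, so Vals[0] still ends up as state0 + W_3 + the old PreVal4, and Vals[4] already contains the old T1.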
Here are the files with these changes:
http://www.filesonic.com/file/1423103594
Still some more to come!