I tried to figure out why version 2.2 does not work well with VECTORS4.
I could not find the cause, as I do not have enough knowledge.
Here is what I found so far:
Replacing this block of code in version 2.1 with the corresponding block from version 2.2 makes VECTORS4 much slower:
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define P3(x) W[x-7]
#define P4(x) W[x-16]
//Partial Calcs for constant W values
#define P1C(n) ((rotate(ConstW[(n)-2],15)^rotate(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotate(ConstW[(n)-15],25)^rotate(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x) ConstW[x-7]
#define P4C(x) ConstW[x-16]
//SHA round with built in W calc
#define sharoundW(n) Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n);
//SHA round without W calc
#define sharound(n) Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);
//SHA round for constant W values
#define sharoundC(n) Barrier(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);
//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier(n) t1 = t1C((n) % 64)
This block is not the only thing that causes the problem, though.
My guess is that the rotC function has something to do with it (but that is only a guess).
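
For anyone else comparing the two versions, here is a small host-side C sketch of what the macros above compute, so the arithmetic can be checked outside the kernel. It is not code from either version. It assumes that rot() behaves like OpenCL's rotate(), i.e. a left rotation on 32-bit words. The names rotl32, rotr32, p1_word, p2_word, sigma0, sigma1, idx_add and idx_set are made up for this example. Under that assumption P1 and P2 match the standard SHA-256 sigma1 and sigma0 functions, and the Vals[...] index expressions simply walk backwards through the 8 working variables as n increases.

```c
#include <stdint.h>

/* Rotate left by r bits (0 < r < 32). Assumed to match what rot()/rotate()
   do on a single 32-bit lane in the kernel. */
static uint32_t rotl32(uint32_t x, unsigned r) {
    return (x << r) | (x >> (32u - r));
}

/* Rotate right by r bits (0 < r < 32), as written in the SHA-256 standard. */
static uint32_t rotr32(uint32_t x, unsigned r) {
    return (x >> r) | (x << (32u - r));
}

/* Same expression as P1 / P1C, applied to one word instead of W[(n)-2]. */
static uint32_t p1_word(uint32_t w) {
    return rotl32(w, 15u) ^ rotl32(w, 13u) ^ (w >> 10);
}

/* Same expression as P2 / P2C, applied to one word instead of W[(n)-15]. */
static uint32_t p2_word(uint32_t w) {
    return rotl32(w, 25u) ^ rotl32(w, 14u) ^ (w >> 3);
}

/* SHA-256 small sigma functions in their usual right-rotate form. */
static uint32_t sigma1(uint32_t w) {
    return rotr32(w, 17u) ^ rotr32(w, 19u) ^ (w >> 10);
}

static uint32_t sigma0(uint32_t w) {
    return rotr32(w, 7u) ^ rotr32(w, 18u) ^ (w >> 3);
}

/* Index of the Vals[] entry that a sharound macro updates with "+=".
   Only meaningful for n <= 131, otherwise the unsigned subtraction wraps. */
static unsigned idx_add(unsigned n) {
    return (3u + 128u - n) % 8u;
}

/* Index of the Vals[] entry that a sharound macro overwrites with "=". */
static unsigned idx_set(unsigned n) {
    return (7u + 128u - n) % 8u;
}
```

For example, round 0 touches Vals[3] and Vals[7], round 1 touches Vals[2] and Vals[6], and the pattern repeats every 8 rounds. This only checks the arithmetic and says nothing about why one version is slower than the other with VECTORS4.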