Re: Phatk2 GOFFSET Mod

I'm kind of surprised that predication worked better than select(); usually it's just the opposite. If you can send me both ISA dumps, I can see what can be done to improve it further.

For the second part:

Quote

u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

Can we simplify these, since they both contain (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)? It would certainly reduce calculations a bit. The only problem I see is that Vals[1] and Vals[2] are inside the parentheses. Now, I'm not familiar with the comma notation here, but if the parenthesis can be moved inside, next to the ch(123), it's as easy as dividing by ((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) to remove it and make the math simpler for the GPU.

I don't think it's worth trying.
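
For anyone wondering about the commas in the quoted lines: in SHA-256, Ch(x, y, z) is a three-argument bitwise function, and the commas just separate its arguments, so there is no factor that could be divided out. The shared sum can still be computed once and reused. Below is a minimal sketch in plain C (not the actual phatk2 kernel) that assumes the quoted Ch and S1 are the standard SHA-256 functions. The names t_a, t_b, rest and k are hypothetical stand-ins for the terms in the quoted lines. A compiler will often do this common-subexpression elimination on its own, which may be one reason the manual rewrite gains little.

```c
#include <assert.h>
#include <stdint.h>

/* Standard SHA-256 helpers (FIPS 180-4). */
static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32u - n)); }
/* Ch is a bitwise select: each bit of x picks the matching bit of y (1) or z (0). */
static uint32_t Ch(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (~x & z); }
static uint32_t S1(uint32_t x) { return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25); }

typedef struct { uint32_t v, g; } Pair;

/* Hypothetical stand-ins: t_a + t_b plays the role of the shared sum,
   rest the other summands of v, k the constant part of g. */
static Pair written_out(uint32_t t_a, uint32_t t_b, uint32_t y, uint32_t z,
                        uint32_t rest, uint32_t k) {
    Pair p;
    p.v = rest + Ch(t_a + t_b, y, z);   /* shared sum spelled out once ... */
    p.g = 0u - k - S1(t_a + t_b);       /* ... and spelled out again */
    return p;
}

static Pair hoisted(uint32_t t_a, uint32_t t_b, uint32_t y, uint32_t z,
                    uint32_t rest, uint32_t k) {
    Pair p;
    uint32_t t = t_a + t_b;             /* shared sum computed once */
    p.v = rest + Ch(t, y, z);
    p.g = 0u - k - S1(t);
    return p;
}

/* Both forms must agree bit for bit (all arithmetic is modulo 2^32). */
static void self_check(void) {
    Pair p = written_out(0x12345678u, 0x9abcdef0u, 0x0f0f0f0fu, 0xf0f0f0f0u, 7u, 0xdeadbeefu);
    Pair q = hoisted(0x12345678u, 0x9abcdef0u, 0x0f0f0f0fu, 0xf0f0f0f0u, 7u, 0xdeadbeefu);
    assert(p.v == q.v && p.g == q.g);
}
```

Whether the hoisted form is actually faster on a GPU is a separate question: the temporary occupies a register, so it has to be measured rather than assumed.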

P.S. I don't think the ALU op count is a good performance metric on its own. Of course it's important, but there are other factors: GPR usage and the number of clauses matter a lot too, so you have to profile the overall result. I've often seen two kernels where one has slightly fewer ALU ops and the other has just one more clause, and the second one performs much worse. GPR usage is a similar story.

I am currently working on a RAR password cracking kernel, and it poses some baffling paradoxes. I have several kernels: one keeps everything in __private memory with large GPR usage, another shifts some of the data to __local memory, and a third keeps a small lookup table in __global memory. The paradox is that the first one is the slowest: GPR usage is ~90 and performance is disgusting. The one that keeps part of the data in __local memory behaves much better, with 36 GPRs used and much better occupancy, but performance is still not what I expected. The kernel that uses an intermediate __global memory buffer is currently the fastest, mostly because of the cached global memory in SDK 2.6. It's twice as fast as the second one and several times faster than the first. I would never have expected that.
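
To put rough numbers on the GPR point, here is a back-of-the-envelope sketch in plain C. It assumes register pressure is the only limit on how many wavefronts a SIMD can keep in flight, and that each work-item lane has a fixed register budget. The budget of 256 used in the note below is an assumption for illustration, not a figure from the kernels above, and real occupancy is also capped by __local memory use and other hardware limits.

```c
#include <assert.h>

/* Upper bound on resident wavefronts when registers are the only constraint.
   budget_per_lane is an ASSUMED per-work-item register budget (hypothetical
   parameter); look up the real figure for your hardware. */
static unsigned max_wavefronts(unsigned gprs_per_work_item, unsigned budget_per_lane) {
    assert(gprs_per_work_item > 0);
    /* Integer division: a partially fitting wavefront does not count. */
    return budget_per_lane / gprs_per_work_item;
}
```

Under that assumed budget, max_wavefronts(90, 256) gives 2 and max_wavefronts(36, 256) gives 7, which fits the much better occupancy seen at 36 GPRs. It says nothing about why the __global variant wins; that still takes profiling.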