Also, you can simplify the last few sharound() calls. There is no need to calculate the second variable in them; only the += t1() update is needed:
Change
W(120);
sharound(120);
W(121);
sharound(121);
W(122);
sharound(122);
W(123);
sharound(123);
To
W(120);
sharound(120);
W(121);
Vals[2] += t1(121);
W(122);
Vals[1] += t1(122);
W(123);
Vals[0] += t1(123);
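For anyone who wants to check why the reduced rounds are safe, here is a rough standalone C sketch. It is not the kernel code. It assumes the standard SHA-256 round function, folds K + W into a single arbitrary value kw (the property does not depend on the actual constants), and assumes that whatever runs after round 123 only reads the d..h side of the state, which the change above implies but does not show. The names sha_round, tail_shortcut_matches and require_tail_shortcut are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

/* SHA-256 t1 = h + Sigma1(e) + Ch(e,f,g) + (K + W); kw stands in for K + W. */
static uint32_t t1(const uint32_t s[8], uint32_t kw) {
    uint32_t e = s[4], f = s[5], g = s[6], h = s[7];
    return h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (~e & g)) + kw;
}

/* SHA-256 t2 = Sigma0(a) + Maj(a,b,c). */
static uint32_t t2(const uint32_t s[8]) {
    uint32_t a = s[0], b = s[1], c = s[2];
    return (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c));
}

/* One round on s = {a,b,c,d,e,f,g,h}.
 * When full == 0 the new a (t1 + t2) is skipped, like "Vals[x] += t1(n)". */
static void sha_round(uint32_t s[8], uint32_t kw, int full) {
    uint32_t x1 = t1(s, kw);
    uint32_t new_a = full ? x1 + t2(s) : 0u;   /* 0 is only a placeholder */
    uint32_t new_e = s[3] + x1;                /* e = d + t1 */
    memmove(&s[1], &s[0], 7 * sizeof s[0]);    /* shift a..g into b..h */
    s[0] = new_a;
    s[4] = new_e;
}

/* Runs four rounds twice: all full, and one full round followed by three
 * reduced ones. Returns 1 if d, e, f, g, h agree afterwards. */
int tail_shortcut_matches(void) {
    /* Start from the SHA-256 initial hash values; any state would do. */
    uint32_t full[8] = {0x6a09e667u, 0xbb67ae85u, 0x3c6ef372u, 0xa54ff53au,
                        0x510e527fu, 0x9b05688cu, 0x1f83d9abu, 0x5be0cd19u};
    uint32_t cut[8];
    memcpy(cut, full, sizeof full);
    for (int i = 0; i < 4; i++) {
        uint32_t kw = 0x9e3779b9u * (uint32_t)(i + 1);  /* arbitrary K + W */
        sha_round(full, kw, 1);
        sha_round(cut, kw, i == 0);  /* only the first round stays full */
    }
    return memcmp(&full[3], &cut[3], 5 * sizeof full[0]) == 0;
}

/* Convenience wrapper for callers that want a hard failure instead. */
void require_tail_shortcut(void) { assert(tail_shortcut_matches()); }
```

The reason it works: each new e is d + t1, and d is the a from four rounds earlier, so the a values produced in the last three rounds never reach the e side. This only shows that the results match; it says nothing about how many ALU operations the compiler ends up emitting.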
Seems like a good idea, but I checked with KernelAnalyzer and it doesn't reduce the number of ALU operations needed ... perhaps it will help once the instructions are reordered. I'm looking into it, and thanks for your post!
Dia