Branching has ended up being the best option. It can evaluate those branches in parallel, and you can't easily optimize away branches around memory writes (apparently there are two or three good tricks for getting rid of branch waste, it's just that none of them work on memory writes).
That's interesting. Branches always seemed to be portrayed as anathema to well-performing kernels. I guess it all depends on how much work is being done inside them. For small vector sizes there are few ifs, but with uint16 there are quite a few, so it might be worth investigating there.
I should look at shuffle. Your way doesn't quite work though: vstore would output H != 0 hashes, which would trigger HW error alerts (and rightfully so) in the host code, and I consider the HW error tracking important. At least, assuming I'm reading that code right, anyway.
Yes, I'm still getting HW alerts and haven't quite worked them out yet. I posted the snippet earlier from memory and missed a couple of steps. The latest (broken) code I'm working with looks like this:
int16 selection = XG2 == (x)(0x136032ED);
if (any(selection))
{
x mask = Xnonce & 0xF;
x temp = shuffle(select(Xnonce, 0, selection), mask);
vstore16(temp, 0, output);
}
That "if" might be totally unnecessary, and I still don't quite understand how the output array works, but it might give you a better idea of what I was trying to do to avoid all those branches.
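For anyone following along, here is a rough plain-C model (not OpenCL, and not the actual kernel) of why an unconditional 16-wide store trips the HW error check while a per-lane guarded write does not. It assumes the host treats every non-zero output entry as a reported nonce and that the slot index is nonce & 0xF, borrowed from the mask line in the snippet above. The function names and the test values in demo_reported_counts are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 16
#define TARGET 0x136032EDu /* constant compared against XG2 in the snippet above */

/* Models vstore16: every lane is written, whether it matched or not. */
static void store_all_lanes(const uint32_t *nonce, uint32_t *output) {
    memcpy(output, nonce, LANES * sizeof(uint32_t));
}

/* Per-lane guarded write: only lanes whose check value equals TARGET
 * are stored. Returns the number of lanes written. */
static int store_matching_lanes(const uint32_t *g, const uint32_t *nonce,
                                uint32_t *output) {
    int written = 0;
    for (int i = 0; i < LANES; i++) {
        if (g[i] == TARGET) {
            output[nonce[i] & 0xF] = nonce[i]; /* assumed slot convention */
            written++;
        }
    }
    return written;
}

/* Counts non-zero entries, as a host that treats 0 as "empty" might. */
static int count_reported(const uint32_t *output) {
    int n = 0;
    for (int i = 0; i < LANES; i++) {
        n += (output[i] != 0);
    }
    return n;
}

/* Hypothetical test case: 16 non-zero nonces, only lane 5 matches. */
static void demo_reported_counts(int *all, int *guarded) {
    uint32_t g[LANES], nonce[LANES];
    uint32_t out_all[LANES] = {0}, out_guarded[LANES] = {0};
    for (int i = 0; i < LANES; i++) {
        nonce[i] = 0x1000u + (uint32_t)i;
        assert(nonce[i] != 0); /* 0 is reserved for "empty slot" */
        g[i] = (i == 5) ? TARGET : 0xDEADBEEFu;
    }
    store_all_lanes(nonce, out_all);
    store_matching_lanes(g, nonce, out_guarded);
    *all = count_reported(out_all);
    *guarded = count_reported(out_guarded);
}
```

With those made-up inputs the unconditional store leaves 16 entries for the host to check, 15 of which fail, while the guarded version leaves only the one real match. The guarded loop is of course the branchy version again, so this only shows what the store has to achieve, not how to do it without branches.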
I'll go add official 8- and 16-wide support in a bit. It should be useful on, say, AVX if you manually enable CPU mining in the code. From what I've heard, SDK 2.6's CPU compiler has gotten a lot better.
I'll be watching the repository, then.

It should almost certainly help with more modern CPUs and Larrabee/Intel MIC.