Yes, this is without any branching (similar to alternative 2 from my previous post), except that I had W[3] wrong.
Basically the best approach is to profile both and choose the faster one. With branching and without divergence, you pay for an additional clause (with divergence the penalty is worse, as both "paths" would be serialized). Without branching, however, you introduce 7 dependent additions, which can't be packed into two VLIW bundles because each addition depends on the result of the previous one. I am not sure which would be faster.
BTW for the scalar case, you don't need that:
#else
v = select(W[3],(u)0,(v==g));
uint nonce = (v);
as a direct comparison might be faster, especially with predication. E.g.:
nonce = (v==g) ? W[3] : 0;
Unfortunately, this is not useful in the vector case. Of course you could try:
nonce = (v.s0==g.s0) ? W[3].s0 : nonce;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
...
But that would generate much less efficient code than the select() version does.
So, will having partial matches in a vector cause any problems?
The only problem is when more than one component pair (v.sX and g.sX) matches, for example v.s0==g.s0 and v.s3==g.s3. The version with branches would eventually have one of the two nonces written correctly in the output buffer (namely W[3].s3), while the version with select() would have a wrong nonce written in the output buffer (W[3].s0+W[3].s3).
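
To make the multi-match case concrete, here is a rough plain-C model of the two variants. It is only a sketch, not the actual kernel code: the 4-wide arrays stand in for OpenCL vectors, and the names WIDTH, nonce_branchless and nonce_branching are made up for illustration. The branchless variant is assumed to zero the non-matching lanes and then add all lanes together, as described above.

```c
#include <stdint.h>

#define WIDTH 4 /* assumed vector width for this sketch */

/* Branchless model: keep W[3].sX in matching lanes, 0 elsewhere,
 * then add all lanes into a single nonce. */
uint32_t nonce_branchless(const uint32_t v[WIDTH],
                          const uint32_t g[WIDTH],
                          const uint32_t w3[WIDTH])
{
    uint32_t sum = 0;
    for (int i = 0; i < WIDTH; i++) {
        sum += (v[i] == g[i]) ? w3[i] : 0u;
    }
    return sum;
}

/* Branching model: every matching lane overwrites the output,
 * so the last matching lane wins. */
uint32_t nonce_branching(const uint32_t v[WIDTH],
                         const uint32_t g[WIDTH],
                         const uint32_t w3[WIDTH])
{
    uint32_t out = 0;
    for (int i = 0; i < WIDTH; i++) {
        if (v[i] == g[i]) {
            out = w3[i];
        }
    }
    return out;
}
```

With g = {7,7,7,7} and W[3] = {100,101,102,103}, a single match such as v = {1,2,7,4} gives 102 from both functions. With two matches, v = {7,2,3,7}, the branching model returns 103 (a valid nonce), while the branchless model returns 100+103 = 203, which is not a nonce that was actually tested.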