One thing I definitely want from Diapolo's most recent work is the new output method.
It's interesting that it should be faster, since
a = (b == c);
is the same as
if (b == c) a = 1; else a = 0;
as far as the code is concerned, and both end up being compiled into the same thing in C.
Perhaps it's a weakness in the OpenCL compiler, or maybe the jump path is just inefficient since there are multiple statements in the if {} block.
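For reference, here are the two forms side by side as plain C. This is only a minimal sketch: the function names flag_direct and flag_branch are made up for illustration and are not from the kernel. On scalars both give 1 or 0. One caveat: in OpenCL C the same comparison on vector types yields -1 (all bits set) for true rather than 1, so the two are only interchangeable as written for scalars.

```c
#include <stdint.h>

/* Direct form: store the comparison result (1 or 0 in C) as-is. */
int flag_direct(uint32_t b, uint32_t c)
{
    return (b == c);
}

/* Explicit if/else form: same result, written with control flow. */
int flag_branch(uint32_t b, uint32_t c)
{
    int a;
    if (b == c)
        a = 1;
    else
        a = 0;
    return a;
}
```

Whether either form actually ends up as a jump is up to the compiler and the target, so timing both on the real hardware is the only way to tell.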
Anyway I've merged it for what it's worth, and updated poclbm accordingly as well.
(EDIT: By the way, I'm not seeing any improvement).
I tried to avoid control flow (which is very costly on GPUs), and this method was mentioned in an AMD APP SDK document.
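To illustrate the general idea of avoiding control flow, here is a rough sketch in plain C of picking a value with a bit mask instead of an if. This is an assumption for illustration only: it is not the actual kernel code and not the method from the AMD document, and select_nonce, hash_word, nonce and prev are hypothetical names.

```c
#include <stdint.h>

/* Returns nonce when hash_word == 0, otherwise keeps prev; no if/else. */
uint32_t select_nonce(uint32_t hash_word, uint32_t nonce, uint32_t prev)
{
    /* (hash_word == 0) is 1 or 0, so mask becomes 0xFFFFFFFF or 0. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(hash_word == 0);

    /* Keep the bits of nonce where mask is set, the bits of prev elsewhere. */
    return (nonce & mask) | (prev & ~mask);
}
```

For example, select_nonce(0, n, p) gives n, while any non-zero hash_word gives p back unchanged.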