Don't give up! There's still more to optimize; I'm at 1694 ALU OPs (HD6970) at the moment.
I can't read or edit Python, so yes, there is room if someone could alter or add some more kernel arguments.
The strange thing is that I saw some additions of known values, which I tried to eliminate via constants, but this led to lower kernel performance. I played around with this today and saw no further improvement ... too bad, it was real fun the last few days!
If you would like to share your work, we will all be happy. What is your kernel doing for 58XX cards? I thought it makes no sense to optimize one over the other, so I tried to reduce the ALU OP count for both platforms.
Dia