After running it for a while on my 6970s, I find that even this change is of no benefit.

It appears to be ever so slightly slower. I'm not sure what to do about all this now.
I'll just be quiet and go back in my hole for a bit. :/
Don't be disheartened. I'm forever hacking up code and then deleting it. You wouldn't believe how much code I've thrown out so far working on cgminer. Not every experiment is successful.
I'll attest to that. I've fiddled around with the SSE4 code for what seemed like endless nights, increasing and decreasing hash rates, and came to realize that my second-to-last post had the best hash rate and that my last one was slower than the original. But I'm curious to see if we'll ever come up with something that uses the CPU and GPU together in the same calculations via the math library I posted earlier. It'll be interesting to see what comes out of it.
I'm still trying to figure out how to specify compiler options for the CL code. I was going to test the unsafe math optimization and see what comes out of it. There are actually quite a few optimizations to play around with: speeding up memory transfers via non-temporal writes, increasing the data size, prefetching, using the differences between the initial constants instead of the actual values to shrink the table, etc.
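To give an idea of what I mean by non-temporal writes on the CPU side, here is a rough sketch in C. It assumes an x86 CPU with SSE2 and a 16-byte-aligned destination; `stream_copy` is just a name I made up for illustration, not anything from cgminer. Whether it actually helps depends on the access pattern, so treat it as something to benchmark rather than a guaranteed win.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `bytes` bytes from src to dst using
 * non-temporal (streaming) stores, which hint the CPU to bypass the
 * cache for the written data. Assumes dst is 16-byte aligned. */
void stream_copy(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = bytes / 16;

    for (size_t i = 0; i < blocks; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* unaligned load is fine */
        _mm_stream_si128(d + i, v);          /* non-temporal store */
    }

    /* Make the streaming stores globally visible before continuing. */
    _mm_sfence();

    /* Copy any tail that does not fill a whole 16-byte block. */
    size_t done = blocks * 16;
    memcpy((unsigned char *)dst + done,
           (const unsigned char *)src + done,
           bytes - done);
}
```

The idea is that data you write once and don't read back soon shouldn't evict the working set from cache. If the buffer gets read again right away, ordinary stores are likely the better choice.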
It's really like trying to solve a puzzle with many different answers. Only, you're trying to solve it better than anyone else has already.