I rewrote the code so that it performs the double SHA256 first, constructing an in-memory tree of all the operations and variables (without actually evaluating them, arithmetically or symbolically.)
I did similar thing for GPU miner, but on 32bit words level. First compiled with my own class instead of int, thus stripping off all the preprocessing and inlining stuff, then generated C code, compiled it with -O3 -S -masm=intel, lifted assembly code back to C, gave a final test run on CPU and fed resulting mess to OpenCL compiler
