Just got off a long day of work and my brain hurts, so I haven't really gone over what you've written here so far, but from skimming it, it definitely seems like it will help me understand everything. When I get some free time I'll try to analyze it and work it all out, and maybe even draw a single-cell diagram for anyone else trying to wrap their heads around it.
On another note: the 2-engine design I left running finally finished, and failed. It ran out of placement sites. It said it was short about 1k FFs and 3k LUTs to implement the design, and this was before routing, so it may not even have been able to attain 100 MHz. I am running it again, which may take another 2-3 days, but this time I've enabled some more aggressive optimizations in Xilinx ISE.
We might have to settle for 1 fully unrolled engine plus 1 LOOP_LOG2=1 engine, and that should hopefully get to ~150MHz pretty easily.
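For anyone wanting to compare the two configurations, here is a rough back-of-the-envelope sketch. It assumes (my assumption, not something measured here) that an engine built with LOOP_LOG2=n finishes one hash every 2**n clock cycles, so a fully unrolled engine (LOOP_LOG2=0) does one hash per clock. The function names are made up for illustration.

```python
def engine_rate_mhs(clock_mhz, loop_log2):
    """Estimated MH/s for one engine.

    Assumption: an engine with LOOP_LOG2 = n produces one hash
    every 2**n clock cycles (n = 0 means fully unrolled).
    """
    return clock_mhz / (1 << loop_log2)


def total_rate_mhs(clock_mhz, loop_log2_per_engine):
    """Sum the estimated MH/s of all engines sharing one clock."""
    return sum(engine_rate_mhs(clock_mhz, n) for n in loop_log2_per_engine)


if __name__ == "__main__":
    clock_mhz = 100  # the target clock discussed above
    configs = {
        "2 fully unrolled engines": [0, 0],
        "1 fully unrolled + 1 LOOP_LOG2=1": [0, 1],
    }
    for label, engines in configs.items():
        print(f"{label}: {total_rate_mhs(clock_mhz, engines):.0f} MH/s")
```

Under that assumption, the mixed configuration at a 100 MHz clock would land at the equivalent of about 150 MH/s, versus 200 MH/s for the 2-engine design that did not fit. This says nothing about whether either one actually meets timing.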
Has anyone else gotten a 100MHz version working? Should I compile a bit file for someone to test? I hate flying blind without a target device to test on.