No kidding. It broke my brain for a while, until I realized it was just a delay chain, so you can add an offset to cnt to get what cnt "looks like" at each stage in the chain.
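A minimal sketch of the delay-chain idea, in Python rather than Verilog. It is a behavioural model, not the project's code: the function names, the 2-bit counter width, and the four-stage chain are assumptions made for illustration. It shows that the value sitting in stage i is just cnt plus a fixed offset, taken modulo the counter width.

```python
def simulate_delay_chain(stages, bits, cycles):
    """Clock a free-running counter through a chain of registers.

    Hypothetical model: chain[i] holds the value cnt had i+1 clocks ago.
    """
    mask = (1 << bits) - 1
    cnt = 0
    chain = [0] * stages
    for _ in range(cycles):
        # On each clock edge every register captures its predecessor's
        # old value, and the first register captures the old cnt.
        chain = [cnt] + chain[:-1]
        cnt = (cnt + 1) & mask
    return cnt, chain


def cnt_at_stage(cnt, stage, bits):
    """Predict what cnt 'looks like' at a given stage without the chain.

    Subtracting the delay modulo 2**bits is the same as adding
    (2**bits - delay), so this is a plain addition to cnt.
    """
    delay = stage + 1
    return (cnt + (1 << bits) - delay) & ((1 << bits) - 1)


if __name__ == "__main__":
    cnt, chain = simulate_delay_chain(stages=4, bits=2, cycles=10)
    for i, value in enumerate(chain):
        print(i, value, cnt_at_stage(cnt, i, bits=2))
```

Once the chain has filled (at least as many clocks as stages), the simulated registers and the computed offsets agree, which is why the registers themselves can be replaced by an adder on cnt.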
Heh. There's a reason I'd been putting off making those changes to the partial unrolling originally; it was fairly obviously beneficial, but also rather fiddly.
Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns). Looks like I'd need to port over the patch for the extra pipeline stage to compute the initial value of H+K+W[0].
On my last run, I think ISE reported an actual period of 15 ns. I'm still getting used to the timing report in ISE, so I could be wrong. Regardless, that's with it targeting 50 MHz, so I'm sure it will give better results with tighter constraints. I will certainly try to patch it for the initial t1_partial; that's bound to be helpful.
I haven't been able to reproduce the 11ns synthesis run for the XC6SLX75 since fixing the values of K_next, and I can't entirely figure out why; best I've seen since then is 14-15ns. (That's with 100 MHz as the target.) You might have better luck with a fully unrolled design, but then again perhaps not.
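For anyone else getting used to the timing reports, the periods quoted above convert to clock speeds as follows. This is only arithmetic (frequency in MHz = 1000 / period in ns); the helper names are made up for the example.

```python
def period_ns_to_mhz(period_ns):
    """Convert a reported minimum clock period (ns) to a frequency (MHz)."""
    return 1000.0 / period_ns


def meets_target(period_ns, target_mhz):
    """True if the reported period is short enough for the target clock."""
    return period_ns <= 1000.0 / target_mhz


if __name__ == "__main__":
    for period in (11.045, 14.0, 15.0):
        print(f"{period} ns -> {period_ns_to_mhz(period):.2f} MHz")
    print(meets_target(11.045, 100))  # 100 MHz needs a period of 10 ns or less
```

So 11.045 ns is roughly 90.5 MHz, and a 14-15 ns result is roughly 67-71 MHz, both short of the 10 ns needed for 100 MHz.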
makomk, I also sent you a donation for the hard work you've done achieving 110 MHz on Altera, and getting a fully unrolled core working on the LX150 chip. Many thanks to both of you, and everyone who contributes to this project!
Thank you! Though I'm not sure of the extent to which I helped with that second one... I don't even have the tools to attempt such a thing.
However, it has been mentioned before that the Spartan-6 devices don't have fast carry chain routing on half of the slices. That may impede the ability to get two engines on an LX150.
Haven't done the math on that but it'd probably work out much the same as fitting one on the LX75: not enough carry chains for all the adders, won't fit without some trickery.
hrm... there has to be a better way to use those generate blocks, so that these values aren't treated as signals/wires to be used at runtime, but instead become constant integers or lookup tables/muxes fixed at synthesis time...
I have some changes to do this, but they're on a computer I don't have access to this second and I don't think I pushed them to any public repos. (For some reason I appeared to be seeing a negative effect on Cyclone IV clock speeds at LOOP_LOG2=0.)
You may well find the tables aren't actually being synthesized as LUT RAM in the end anyway.
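To make the "constants instead of wires" idea concrete, here is a rough Python sketch of what a generate block would work out at elaboration time: a small fixed table of K values per pipeline stage, indexed only by the loop counter. It is not the code from either repo. The mapping of stage s and counter value cnt to round LOOP*s + cnt is an assumption, and stage_tables, icbrt and first_primes are hypothetical names. The K values themselves are the standard SHA-256 round constants (first 32 fractional bits of the cube roots of the first 64 primes).

```python
def icbrt(n):
    """Integer cube root: the largest r with r**3 <= n."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_primes(count):
    """First `count` primes by trial division (plenty for 64 of them)."""
    primes = []
    n = 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


# floor(cbrt(p) * 2**32) keeps 32 fractional bits; the mask drops the integer part.
K = [icbrt(p << 96) & 0xFFFFFFFF for p in first_primes(64)]


def stage_tables(loop_log2):
    """Per-stage constant tables for a partially unrolled pipeline.

    ASSUMPTION: there are 64 // LOOP stages and stage s processes round
    LOOP * s + cnt, where cnt is the loop counter. The real design may
    order its rounds differently.
    """
    loop = 1 << loop_log2
    stages = 64 // loop
    return [[K[loop * s + cnt] for cnt in range(loop)] for s in range(stages)]


if __name__ == "__main__":
    tables = stage_tables(loop_log2=1)
    print(len(tables), [hex(k) for k in tables[0]])
```

With LOOP_LOG2=0 each table has a single entry, so the constant can be folded straight into the adder; with LOOP_LOG2=1 each stage only needs a two-entry mux, which fits the point that the tables may never end up as LUT RAM.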