Wow. Quartus II is claiming 108MHz and 78k LEs for my latest tweaks to the fully-unrolled DE2-115 code. I expect I could've hit at least 110MHz if the optimizer hadn't given up once it reached its 100MHz target. Surely this is too good to be true, especially since I haven't done anything fancy like precomputing W? Sadly I don't have a board to test this with, but ModelSim seems to indicate it's not obviously broken. Anyway, it's up on GitHub if you're feeling brave, and watch your cooling.
Edit: Does anyone know if code is available for the DE2 (not the DE2-115) board? I can get access to those through my university, so I'm considering making it a summer project.
The Cyclone II EP2C35-based one? You should be able to fit a partially-unrolled pipeline onto that in theory, though it probably wouldn't be massively fast. You'd have to port the HDL over to that board and compile a bitstream yourself, but that probably just means changing the device setting and pinout and turning down the level of loop unrolling.
(Edit 2: Apparently Fmax=110.34MHz for this design on the EP4CE115. Interesting. I think I should be able to improve on that a bit.)