BUT: You did point something out that I think I missed. In the code I linked you'll see that the pre-calculated T1 value is stored in a separate register, not tx_state[7] as you listed in your example. On looking at my code, I believe you are correct; tx_state[7] is never used (except for the last round) so it could be removed or replaced with the partial calculation. Good catch, Anoynomous!
Not sure if the compiler catches this optimization automatically or not.
I'm reasonably sure Altera's compiler for Cyclone IV does, because of the large decrease in resource usage. On Cyclone IV it takes fewer resources to store the partially pre-calculated T1 value than to store tx_state[`IDX(7)]: registering a logic output is practically free, but registering the output of another register ties up an entire LE per bit that can't be used for anything else. No idea if Xilinx's tools catch this, though.
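For anyone who wants to convince themselves that tx_state[7] really is redundant once the partial T1 is carried along, here is a rough software model in Python (not the Verilog from the linked code). It assumes the partial value is h + K + W for the coming round, which works because the next round's h is just this round's g. The function names and the random K/W test words are made up for illustration; only the initial state is the standard SHA-256 IV.

```python
import random

MASK = 0xFFFFFFFF

def rotr(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK

def big_sigma0(a):
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)

def big_sigma1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)

def ch(e, f, g):
    return ((e & f) ^ (~e & g)) & MASK

def maj(a, b, c):
    return (a & b) ^ (a & c) ^ (b & c)

def rounds_reference(state, k, w):
    """Textbook SHA-256 rounds, carrying all eight state words."""
    a, b, c, d, e, f, g, h = state
    for kt, wt in zip(k, w):
        t1 = (h + big_sigma1(e) + ch(e, f, g) + kt + wt) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)

def rounds_precalc(state, k, w):
    """Same rounds, but h is never stored between rounds.

    Instead 'partial' holds h + K + W for the upcoming round, computed one
    round early from g (which becomes the next round's h).
    """
    a, b, c, d, e, f, g, h = state
    n = len(k)
    partial = (h + k[0] + w[0]) & MASK if n else 0
    h_out = h
    for t in range(n):
        t1 = (partial + big_sigma1(e) + ch(e, f, g)) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        if t + 1 < n:
            partial = (g + k[t + 1] + w[t + 1]) & MASK
        h_out = g  # only the value left after the last round is ever read
        a, b, c, d, e, f, g = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f
    return (a, b, c, d, e, f, g, h_out)

# Standard SHA-256 initial hash values; K and W below are arbitrary test
# words, not the real round constants or a real message schedule.
IV = (0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19)
rng = random.Random(1)
K_TEST = [rng.getrandbits(32) for _ in range(64)]
W_TEST = [rng.getrandbits(32) for _ in range(64)]
```

Calling rounds_reference(IV, K_TEST, W_TEST) and rounds_precalc(IV, K_TEST, W_TEST) should give the same eight words. The model says nothing about LE counts, only that the data flow is equivalent.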
Double check me on this:
tx_pre_w <= s0(rx_w[2]) + rx_w[1]; // Calculate the next round's s0 + the next round's w[0].
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;
Right. Though tx_pre_w could be stored in w[0]'s slot, the one that gets passed on to the next loop, which would save a register.
Oooh, cunning - nice one Anoynomous! Costs a register overall due to having to get rx_w[2] out of storage, but might be worthwhile. In theory could it be cheaper to do this with s1(rx_w[14]) + rx_w[9] instead?
tx_pre_w <= s1(rx_w[15]) + rx_w[10]; // Calculate the next round's s1 + the next round's w[9].
tx_new_w <= s0(rx_w[1]) + rx_w[0] + rx_pre_w;
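To double check both proposals, here is a small Python model (again a sketch, not the actual Verilog) that compares them against the plain schedule new_w = s1(w[14]) + w[9] + s0(w[1]) + w[0]. It assumes a 16-word window that shifts down by one word each round, with the new word entering at w[15]; the function names are invented for this example.

```python
import random

MASK = 0xFFFFFFFF

def rotr(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK

def s0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def s1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def expand_reference(w, rounds):
    """Plain schedule: everything computed in the same round."""
    w = list(w)
    out = []
    for _ in range(rounds):
        new_w = (s1(w[14]) + w[9] + s0(w[1]) + w[0]) & MASK
        out.append(new_w)
        w = w[1:] + [new_w]
    return out

def expand_pre_s0(w, rounds):
    """First proposal: s0 + w[0] for the next round is computed one round early."""
    w = list(w)
    pre_w = (s0(w[1]) + w[0]) & MASK          # seed for the first round
    out = []
    for _ in range(rounds):
        next_pre = (s0(w[2]) + w[1]) & MASK   # tx_pre_w
        new_w = (s1(w[14]) + w[9] + pre_w) & MASK  # tx_new_w
        out.append(new_w)
        w = w[1:] + [new_w]
        pre_w = next_pre
    return out

def expand_pre_s1(w, rounds):
    """Second proposal: s1 + w[9] for the next round is computed one round early."""
    w = list(w)
    pre_w = (s1(w[14]) + w[9]) & MASK         # seed for the first round
    out = []
    for _ in range(rounds):
        next_pre = (s1(w[15]) + w[10]) & MASK  # tx_pre_w
        new_w = (s0(w[1]) + w[0] + pre_w) & MASK  # tx_new_w
        out.append(new_w)
        w = w[1:] + [new_w]
        pre_w = next_pre
    return out

# Arbitrary 16-word starting window, just for the comparison.
rng = random.Random(2)
W16 = [rng.getrandbits(32) for _ in range(16)]
```

All three functions should return the same 48 words for W16. In this model the first variant reads w[1], w[2], w[9] and w[14] each round, while the second reads w[0], w[1], w[10] and w[15], so which one is cheaper presumably comes down to which of those taps are already registered in the real design.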