Double check me on this:
tx_pre_w <= s0(rx_w[2]) + rx_w[1]; // Precompute next round's s0(w[1]) + w[0] (after the shift, rx_w[2] -> w[1] and rx_w[1] -> w[0]).
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;
Right. Though tx_pre_w can be stored in the w[0] slot that gets passed on to the next loop, which saves a register.
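
One way to double check both points is a quick software model. The sketch below is not the actual Verilog, just a Python stand-in under a few assumptions: rx_w[0] is the oldest word and rx_w[15] the newest, the window shifts down by one each round, and rx_pre_w is primed with s0(w[1]) + w[0] before the first round. The function names are made up for illustration. It compares the plain SHA-256 schedule against the split tx_pre_w / tx_new_w form, and against the variant where the precomputed sum is parked in the w[0] slot.

```python
MASK = 0xFFFFFFFF  # all arithmetic is modulo 2^32

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def s0(x):  # SHA-256 small sigma 0
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def s1(x):  # SHA-256 small sigma 1
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def schedule_direct(block):
    # Reference: w[i] = s1(w[i-2]) + w[i-7] + s0(w[i-15]) + w[i-16]
    w = list(block)
    for i in range(16, 64):
        w.append((s1(w[i - 2]) + w[i - 7] + s0(w[i - 15]) + w[i - 16]) & MASK)
    return w

def schedule_pre_w(block):
    # Split form with a separate pre_w register (assumed primed up front).
    rx_w = list(block)
    rx_pre_w = (s0(rx_w[1]) + rx_w[0]) & MASK
    out = list(block)
    for _ in range(48):
        tx_pre_w = (s0(rx_w[2]) + rx_w[1]) & MASK
        tx_new_w = (s1(rx_w[14]) + rx_w[9] + rx_pre_w) & MASK
        out.append(tx_new_w)
        rx_w = rx_w[1:] + [tx_new_w]  # shift the 16-word window
        rx_pre_w = tx_pre_w
    return out

def schedule_shared_slot(block):
    # Variant: slot 0 holds the precomputed sum instead of the raw w[0].
    rx_w = list(block)
    rx_w[0] = (s0(rx_w[1]) + rx_w[0]) & MASK
    out = list(block)
    for _ in range(48):
        tx_pre_w = (s0(rx_w[2]) + rx_w[1]) & MASK
        tx_new_w = (s1(rx_w[14]) + rx_w[9] + rx_w[0]) & MASK
        out.append(tx_new_w)
        rx_w = [tx_pre_w] + rx_w[2:] + [tx_new_w]  # pre_w lands in slot 0
    return out

if __name__ == "__main__":
    block = [(i * 0x9E3779B1 + 0x12345678) & MASK for i in range(16)]
    ref = schedule_direct(block)
    print(schedule_pre_w(block) == ref, schedule_shared_slot(block) == ref)
```

The shared-slot variant works in this model because the raw w[0] is only ever read as part of that one sum, so nothing else needs the original value once the sum has been formed.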