1) Firstly, the double sha256 is a total of 3 rounds (with 64 steps each), and the whole first round is constant across a full nonce range.
Its result (commonly known as the midstate) only needs to be computed once per nonce range.
2) Secondly, the first 3 steps of the 2nd round are constant across a full nonce range.
3) Thirdly, some of the W values are also constant across a full nonce range (easy to work out which)
4) Then finally, as you said, you don't need to complete the last 3 steps of the 3rd round.
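A rough Python sketch of point 1), in case it helps anyone following along. It is not taken from any miner; the names (compress, double_sha256_with_midstate, header76) are made up for illustration, and the header bytes are dummy data. The constants are derived from the primes instead of being typed in, and the result is checked against hashlib.

```python
import hashlib
import struct
from math import isqrt

MASK = 0xFFFFFFFF

def _primes(n):
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found

def _icbrt(n):
    # Integer cube root by binary search.
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = _primes(64)
# Fractional bits of the square roots (IV) and cube roots (K) of the primes.
IV = [isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [_icbrt(p << 96) & MASK for p in PRIMES]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def compress(state, block):
    """One 64-step SHA-256 compression (one 'round' in the quoted list)."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]

def double_sha256_with_midstate(midstate, tail12, nonce):
    # 2nd compression: header bytes 64..75, the nonce, then padding for 80 bytes.
    # The nonce is word 3 of this block, so words 0..2 do not depend on it.
    block2 = (tail12 + struct.pack("<I", nonce) + b"\x80" + b"\x00" * 39
              + struct.pack(">Q", 80 * 8))
    digest1 = struct.pack(">8I", *compress(midstate, block2))
    # 3rd compression: hash the 32-byte digest of the first hash.
    block3 = digest1 + b"\x80" + b"\x00" * 23 + struct.pack(">Q", 32 * 8)
    return struct.pack(">8I", *compress(IV, block3))

header76 = bytes(range(76))               # dummy header without the nonce
midstate = compress(IV, header76[:64])    # done once per nonce range
tail12 = header76[64:76]

for nonce in (0, 1, 0xDEADBEEF):
    full = header76 + struct.pack("<I", nonce)
    expected = hashlib.sha256(hashlib.sha256(full).digest()).digest()
    assert double_sha256_with_midstate(midstate, tail12, nonce) == expected
```

Only the last two compressions run per nonce, which is where the "128 steps" figure further down comes from.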
Thanks. I'm quoting this because it is a very nice reference for the state-of-the-art GPU/FPGA optimizations. I remembered the 4) on your list the most because it most clearly shows the shift-register structure inherent to the SHA-256.
Edit: Note to self: Kano is swapping the standard terminology (step vs. round). Using standard terminology, the first SHA-256 hash in Bitcoin consists of 2 steps of 64 rounds each.
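To make the shift-register remark concrete, here is a small Python sketch (my own illustration, not code from any miner). It uses random values in place of the real K+W inputs, since the property is purely structural: the e register computed in the 61st of the 64 iterations is shifted unchanged through f and g and comes out as h at the end. As far as I understand it, that is why the last 3 iterations of the final compression can be skipped: the output word that has to be zero for any valid share is IV[7] + h. The function names are hypothetical.

```python
import random

MASK = 0xFFFFFFFF
IV7 = 0x5BE0CD19  # last word of the SHA-256 initial state

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_iteration(state, kw):
    """One iteration of the compression loop; kw stands in for K[i] + W[i]."""
    a, b, c, d, e, f, g, h = state
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
          + ((e & f) ^ (~e & g)) + kw) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
          + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    # Only a and e are freshly computed; everything else just shifts along.
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def run(state, kws):
    for kw in kws:
        state = sha256_iteration(state, kw)
    return state

rng = random.Random(1)
state0 = tuple(rng.getrandbits(32) for _ in range(8))
kws = [rng.getrandbits(32) for _ in range(64)]  # random stand-ins for K+W

after61 = run(state0, kws[:61])
after64 = run(after61, kws[61:])

assert after61[4] == after64[7]  # e after 61 iterations == final h
assert after61[0] == after64[3]  # a after 61 iterations == final d

# The final output word is IV7 + h, so it is zero only for this value of e.
target_e = (-IV7) & MASK

def early_reject(e_after_61):
    """True if the candidate cannot have a zero last output word."""
    return e_after_61 != target_e
```

In hardware the comparison against target_e is a fixed 32-bit compare, so nearly every nonce is discarded without finishing the final 3 iterations.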
In ASIC terms it would be risky to implement any of 2), 3) or 4).
While you may gain a few % overall (6 out of 128 steps plus the W optimisations), it also means you can only sha256 an exact BTC block header.
If BTC continues to use sha256 but makes any changes to the block header, that wouldn't be a problem as long as none of items 2), 3) or 4) were implemented in the silicon, since you could change the firmware to deal with a different header.
At least for the chip discussed in this thread it appears that the block header structure is fixed:
0-31 writing midstate
32-43 writing data
44-47 reading nonce
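For illustration, a host-side Python sketch of that layout. Big caveats: I am assuming the numbers are byte offsets, that "data" is header bytes 64..75 (the last 4 bytes of the merkle root, then time and bits), and that the nonce comes back little-endian. None of that is confirmed for this chip, and the names are made up.

```python
import struct

# Assumed byte offsets, taken from the list above.
MIDSTATE = slice(0, 32)   # written by the host
DATA = slice(32, 44)      # written by the host
NONCE = slice(44, 48)     # read back from the chip

def build_write_buffer(midstate, tail12):
    """Lay out the 44 bytes the host writes; the nonce slot is left as zeros."""
    if len(midstate) != 32 or len(tail12) != 12:
        raise ValueError("expected a 32-byte midstate and 12 bytes of data")
    buf = bytearray(48)
    buf[MIDSTATE] = midstate
    buf[DATA] = tail12
    return buf

def read_nonce(buf):
    """Decode the nonce slot, assuming little-endian byte order."""
    return struct.unpack("<I", bytes(buf[NONCE]))[0]

buf = build_write_buffer(bytes(32), bytes(range(12)))
```

With only 44 writable bytes and a fixed 4-byte nonce slot, the firmware has no way to feed the chip a header with a different shape, which matches the concern above.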