1) Firstly, the double sha256 is a total of 3 rounds (compression-function calls, with 64 steps each), but the whole first round is constant across a full nonce range.
Its result (commonly known as the midstate) only needs to be computed once per nonce range.
2) Secondly, the first 3 steps of the 2nd round are constant across a full nonce range, because the nonce is only the 4th input word of that round.
3) Thirdly, some of the W values are also constant across a full nonce range (it is easy to work out which).
4) Then finally, as you said, you don't need to complete the last 3 steps of the 3rd round.
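To make 1 and 2 concrete, here is a rough Python sketch, not anyone's actual miner code. It assumes the standard 80-byte header layout with a little-endian nonce in the last 4 bytes. The helper names (`schedule`, `steps`, `compress`, `make_hasher`, `hash_nonce`) are made up for illustration, and K and the initial state are derived from primes as the sha256 spec defines them rather than typed out. Optimisations 3 and 4 are left out to keep it short.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def _primes(n):
    out, c = [], 2
    while len(out) < n:
        if all(c % p for p in out):
            out.append(c)
        c += 1
    return out

def _icbrt(n):
    # Integer cube root by binary search (avoids float rounding).
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Fractional bits of cube/square roots of the first primes (sha256 definition).
K = [_icbrt(p << 96) & MASK for p in _primes(64)]
IV = tuple(math.isqrt(p << 64) & MASK for p in _primes(8))

# Fixed padding for the 2nd round (80-byte message) and 3rd round (32-byte message).
PAD2 = b"\x80" + b"\x00" * 39 + struct.pack(">Q", 640)
PAD3 = b"\x80" + b"\x00" * 23 + struct.pack(">Q", 256)

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    # Expand a 64-byte block into the 64 W values.
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def steps(state, w, start, stop):
    # Run steps start..stop-1 of one round on the working variables.
    a, b, c, d, e, f, g, h = state
    for i in range(start, stop):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)

def add_state(x, y):
    return tuple((p + q) & MASK for p, q in zip(x, y))

def compress(state, block):
    return add_state(state, steps(state, schedule(block), 0, 64))

def make_hasher(header76):
    """header76: the 76 header bytes that come before the nonce."""
    midstate = compress(IV, header76[:64])            # 1) whole 1st round, done once
    tail = header76[64:76]                            # W[0..2] of the 2nd round
    pre3 = steps(midstate, struct.unpack(">3I", tail), 0, 3)  # 2) first 3 steps, done once

    def hash_nonce(nonce):
        w = schedule(tail + struct.pack("<I", nonce) + PAD2)
        h1 = add_state(midstate, steps(pre3, w, 3, 64))       # remaining 61 steps
        h2 = compress(IV, struct.pack(">8I", *h1) + PAD3)     # full 3rd round
        return struct.pack(">8I", *h2)

    return hash_nonce

header76 = bytes(range(76))  # dummy data, not a real block header
hash_nonce = make_hasher(header76)
for nonce in (0, 1, 0xDEADBEEF):
    full = header76 + struct.pack("<I", nonce)
    assert hash_nonce(nonce) == hashlib.sha256(hashlib.sha256(full).digest()).digest()
```

Note that `pre3` only works because the nonce sits exactly at W[3] of the 2nd round. Move any header field and that shortcut silently gives wrong hashes, while the plain `compress` path keeps working.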
In ASIC terms, it would be risky to implement any of 2, 3 or 4.
While you may gain a few % overall (6 out of 128 steps plus W optimisations), it also means you can only sha256 an exact BTC block header.
If BTC continues to use sha256 but makes any changes to the block header, that wouldn't be a problem as long as none of 2, 3 or 4 were implemented in the silicon, since you could change the firmware to deal with a different header.
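For a feel of what "a few %" means, here is the quick arithmetic as a sketch. The variable names are just for illustration, and the W savings from 3 are not counted since they depend on how the schedule is implemented.

```python
STEPS_PER_ROUND = 64

# Per nonce you run rounds 2 and 3; round 1 is the precomputed midstate.
per_nonce_steps = 2 * STEPS_PER_ROUND

# 2) skips the first 3 steps of round 2, 4) skips the last 3 steps of round 3.
skipped = 3 + 3

saving = skipped / per_nonce_steps
print(f"{skipped} of {per_nonce_steps} steps = {saving:.1%}")  # 6 of 128 steps = 4.7%
```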