Thanks, I had also wondered about doing something like that.
In the design I have, it would look something like this, after your step 3 (rough Go sketches of the wiring follow the list):
1) Open a stream to validate the new block
2) Open a stream to write the block to disk
3) Open a stream to download the list of block tx hashes
3a) This could optionally use the idea I have above, except you would be proving that chunks of tx hashes make up the block as you grab them. Might be overkill, and you'd want to do more than 512 at a time since the tx hashes alone are a lot smaller than the full txes.
4) Start reading from the tx hash stream and spin off requests for any missing txes; forward the txes to the block-writing stream in order as they come in (this is the first sketch below the list).
5) As txes get streamed into the block, the block in turn gets streamed into the block validator.
6) The double-spend check is the only serial part of the validator as the block streams through it. Looking up previous tx outputs and validating scripts is done completely in parallel and can be spread across multiple disks & CPUs (second sketch below the list). If you've already validated the tx outside of the block, you can just skip it here.
7) Once everything has finished streaming through, and no errors have occurred, commit.
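
To make the wiring concrete, here's a rough Go sketch of steps 3-5, with channels standing in for the streams. Everything named here (Hash, Tx, fetchTx, the in-memory pool) is a placeholder I made up for the sketch, not a real API. The per-hash channel is the trick that lets fetches complete out of order while the output stays in hash-list order:

```go
package main

import "fmt"

type Hash [32]byte

// Tx is a placeholder for a parsed transaction.
type Tx struct{ Hash Hash }

// fetchTx stands in for a network request for one missing tx.
func fetchTx(h Hash) Tx { return Tx{Hash: h} }

// streamBlock reads tx hashes (step 3) and forwards the matching txes,
// in hash-list order, toward the block writer and validator (steps 4-5).
func streamBlock(hashes <-chan Hash, pool map[Hash]Tx) <-chan Tx {
	pending := make(chan chan Tx, 64) // up to 64 fetches in flight (arbitrary)
	go func() {
		for h := range hashes {
			c := make(chan Tx, 1)
			pending <- c
			if tx, ok := pool[h]; ok {
				c <- tx // already have the tx: no request needed
			} else {
				go func(h Hash, c chan<- Tx) { c <- fetchTx(h) }(h, c)
			}
		}
		close(pending)
	}()
	out := make(chan Tx)
	go func() {
		for c := range pending {
			out <- <-c // blocks until this tx has arrived, preserving order
		}
		close(out)
	}()
	return out
}

func main() {
	hashes := make(chan Hash, 2)
	hashes <- Hash{0x01}
	hashes <- Hash{0x02}
	close(hashes)
	for tx := range streamBlock(hashes, map[Hash]Tx{}) {
		fmt.Printf("got tx %x...\n", tx.Hash[:2])
	}
}
```

The 64-slot buffer is arbitrary; it just bounds how many fetches are in flight at once.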
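
And a minimal sketch of the step 6 validator under the same kind of assumptions (Outpoint, checkScripts, and the type shapes are again placeholders). The spent-set map is the one serial pass; every script check fans out to its own goroutine:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Hash [32]byte

// Outpoint identifies a previous tx output being spent.
type Outpoint struct {
	TxHash Hash
	Index  uint32
}

// Tx carries just what this sketch needs.
type Tx struct{ Inputs []Outpoint }

// checkScripts stands in for looking up the previous outputs and running
// the input scripts; it touches no shared state, so any number of these
// can run at once across disks & CPUs.
func checkScripts(tx Tx) error { return nil }

// validateBlock consumes the tx stream from step 5.
func validateBlock(txes <-chan Tx) error {
	spent := make(map[Outpoint]bool)
	var wg sync.WaitGroup
	errc := make(chan error, 1) // first script error wins
	for tx := range txes {
		// Serial part: each output may be spent at most once.
		for _, in := range tx.Inputs {
			if spent[in] {
				return errors.New("double spend")
			}
			spent[in] = true
		}
		// Parallel part: script validation.
		wg.Add(1)
		go func(tx Tx) {
			defer wg.Done()
			if err := checkScripts(tx); err != nil {
				select {
				case errc <- err:
				default:
				}
			}
		}(tx)
	}
	wg.Wait()
	select {
	case err := <-errc:
		return err
	default:
		return nil // step 7: nothing failed, safe to commit
	}
}

func main() {
	txes := make(chan Tx, 2)
	out := Outpoint{Index: 0}
	txes <- Tx{Inputs: []Outpoint{out}}
	txes <- Tx{Inputs: []Outpoint{out}} // spends the same output twice
	close(txes)
	fmt.Println(validateBlock(txes)) // prints "double spend"
}
```

Feeding the out channel from the first sketch into validateBlock gives the 4→5→6 chain; step 7's commit happens only if it returns nil.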
If you use 3a, you can start streaming into step 4 before you've grabbed all the tx hashes. If you don't, you need to get them all and verify the merkle root first, to make sure you have the right hash list.
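
For the non-3a path, that up-front check is cheap. Here's a rough sketch of computing the root from the hash list, assuming Bitcoin's tree rules (double-SHA256 over concatenated pairs, duplicating the last hash when a level has an odd count); note these are internal-byte-order hashes, not the reversed hex that explorers display:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

func doubleSHA256(b []byte) [32]byte {
	first := sha256.Sum256(b)
	return sha256.Sum256(first[:])
}

// merkleRoot folds a non-empty tx hash list up to the root.
func merkleRoot(hashes [][32]byte) [32]byte {
	for len(hashes) > 1 {
		if len(hashes)%2 == 1 {
			hashes = append(hashes, hashes[len(hashes)-1]) // duplicate last
		}
		next := make([][32]byte, 0, len(hashes)/2)
		for i := 0; i < len(hashes); i += 2 {
			next = append(next, doubleSHA256(append(hashes[i][:], hashes[i+1][:]...)))
		}
		hashes = next
	}
	return hashes[0]
}

func main() {
	// Degenerate case: a one-tx block's merkle root is the tx hash itself.
	tx := doubleSHA256([]byte("placeholder tx"))
	root := merkleRoot([][32]byte{tx})
	fmt.Println(bytes.Equal(root[:], tx[:])) // true
}
```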