You are correct that this is noted in the original paper, but only in passing:
> A block header with no transactions would be about 80 bytes. If we suppose blocks are generated every 10 minutes, 80 bytes * 6 * 24 * 365 = 4.2MB per year. With computer systems typically selling with 2GB of RAM as of 2008, and Moore's Law predicting current growth of 1.2GB per year, storage should not be a problem even if the block headers must be kept in memory.
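The whitepaper's arithmetic is easy to verify. A quick back-of-the-envelope check, assuming exactly one 80-byte header every 10 minutes:

```python
# Whitepaper figure: an 80-byte block header every 10 minutes.
BYTES_PER_HEADER = 80
BLOCKS_PER_HOUR = 6          # one block per 10 minutes
HOURS_PER_YEAR = 24 * 365

bytes_per_year = BYTES_PER_HEADER * BLOCKS_PER_HOUR * HOURS_PER_YEAR
print(f"{bytes_per_year / 1e6:.1f} MB per year")  # prints "4.2 MB per year"
```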
In my view, the 10-minute target was chosen partly to limit wasted work: as the block time increases, wastage drops. There are likely several other factors behind the choice as well.
For instance, if many nodes were to find competing blocks at nearly the same time, forks would become more frequent. This would weaken the network and push confirmation times up.
For example, suppose everyone is mining block 999,999 starting at time t0, and miner A solves it first at t0 + t (where t is the time miner A takes to mine and broadcast the block). Taking network latency into account, miner B and the other miners only learn of A's block at roughly (t0 + t) + 1. Every miner i who also finishes the same block within the window

(t0 + t) < (t0 + t) + (delta_t)i < (t0 + t) + 1

has mined successfully, but their block becomes an orphan and the energy spent on it is wasted. The total work the network wastes on the orphaned blocks is then

sum over all such i of { (t + (delta_t)i) * (hr)i }

where (hr)i is miner i's hash rate.
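That wasted-work sum can be sketched numerically. All the figures below (block time, per-miner delays, hash rates) are made-up assumptions purely for illustration:

```python
# Illustration of the wasted-work sum: each late miner i burned
# (t + delta_t_i) seconds of hashing at rate hr_i before learning
# of miner A's block.
t = 600.0                       # assumed: miner A took one 10-min interval (s)
delays = [0.2, 0.5, 0.9]        # assumed delta_t_i values within the 1 s window
hashrates = [1e12, 5e11, 2e12]  # assumed hr_i values, hashes per second

wasted_hashes = sum((t + dt) * hr for dt, hr in zip(delays, hashrates))
print(f"total wasted work: {wasted_hashes:.4e} hashes")
```

Note how the total is dominated by t itself: with a 10-minute block time, the sub-second latency window adds very little, which is one way to see why a longer block interval reduces the *fraction* of work lost to orphans.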
I'm sure there are other reasons behind this that I've missed, but it's an interesting topic to look into, even just as a thought experiment.