TMTO works because scrypt writes the memory words in order from 0 to N-1 and then reads them in a pseudorandom/unpredictable order; if you don't save the odd-numbered words, you can recover any one of them on demand by reading the word before it and applying the mixing operation. If, on the other hand, the words were also written in a pseudorandom/unpredictable order, TMTO wouldn't be possible (or at least would be pointlessly time-expensive).
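A rough sketch of that trade-off, with a placeholder block type and a stand-in mix() rather than scrypt's actual BlockMix/Salsa20 code:

    /* Sketch of the trade-off described above -- block_t and mix() are
     * placeholders, not scrypt's real code. */
    #include <stdint.h>

    #define N 1024                          /* number of memory words */

    typedef struct { uint8_t b[128]; } block_t;

    /* Stand-in for the mixing operation: out = H(in). Any deterministic
     * function shows the structure. */
    static void mix(block_t *out, const block_t *in)
    {
        for (int k = 0; k < 128; k++)
            out->b[k] = (uint8_t)(in->b[k] * 167u + k + 13);
    }

    static block_t V_half[N / 2];           /* keep only even-numbered words */

    /* Fill phase: words are generated in order 0..N-1, each from the one
     * before it, but only every second one is actually written to memory. */
    void fill(const block_t *x0)
    {
        block_t cur = *x0, next;
        for (uint32_t i = 0; i < N; i++) {
            if ((i & 1) == 0)
                V_half[i / 2] = cur;        /* store even-numbered word i */
            mix(&next, &cur);               /* advance: word i -> word i+1 */
            cur = next;
        }
    }

    /* Read phase: the pseudorandom lookups hit this. An even index is a
     * plain read; an odd index costs one extra mix() to regenerate the word
     * from its stored predecessor -- half the memory, a bit more time. */
    void lookup(block_t *out, uint32_t j)
    {
        if ((j & 1) == 0)
            *out = V_half[j / 2];
        else
            mix(out, &V_half[j / 2]);       /* V[j-1] lives at slot j/2 */
    }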
I think what you call a word is a 128-byte block. Whatever order you generate them in could be trivially stored in a map of N indices, so you could still leave out alternate blocks and use the map to find the block from which to generate a missing one?!
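Roughly what I'm picturing, in pseudo-C (gen_pos, stored and generate_next are made-up names, just to illustrate the map idea):

    /* Even if blocks land in memory in a pseudorandom order, a map
     * recording each block's position in the generation sequence would
     * let you drop every second one and rebuild it from the block
     * generated just before it. */
    #include <stdint.h>

    #define N 1024

    typedef struct { uint8_t b[128]; } block_t;

    static uint32_t gen_pos[N];      /* gen_pos[i] = step at which block i was generated */
    static block_t  stored[N / 2];   /* only blocks from even generation steps */

    /* Stand-in for whatever mixing step turns the previously generated
     * block into the next one. */
    static void generate_next(block_t *out, const block_t *prev)
    {
        for (int k = 0; k < 128; k++)
            out->b[k] = (uint8_t)(prev->b[k] ^ (k + 1));
    }

    void load_block(block_t *out, uint32_t i)
    {
        uint32_t p = gen_pos[i];             /* block i was the p-th generated */
        if ((p & 1) == 0)
            *out = stored[p / 2];            /* even step: it was kept */
        else
            /* odd step: the map says its predecessor was generated at step
             * p-1, which is even and therefore stored -- regenerate from it */
            generate_next(out, &stored[(p - 1) / 2]);
    }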
So glad you actually wrote a whitepaper. It's a huge pet peeve of mine when people make some big showy announcement about something and the only clear technical explanation is either their source code or a bunch of random postings scattered across reddit and some web forums. Thank you for taking the time to do this.
Thanks! I consider a detailed whitepaper a necessity for getting proper exposure and criticism.
The paper needs a lot of updating to take the recent algorithmic improvements into account.
I'm still going through the paper (should finish this weekend) but just one nitpick for now:
Memory chips, in the form of DRAM, have only a tiny portion of their circuitry active at any time, and thus require orders of magnitude less power.
That can be true, but it depends on the access pattern... DRAMs have a lot of internal parallelism, so if you can blast reads/writes at a nice uniform stride that matches the bank size, you can actually get more or less the whole thing running at once. GPUs are designed around exploiting this. The cycle time at the pins is way, way shorter than the cycle time of the internal arrays. In fact, as you move inward from the pins towards the capacitors, each stage of circuitry has a longer cycle time than the previous one (sorta like a pyramid).
But yeah, what you say is true if you're doing serially-dependent reads and it's mostly true if the address pattern is pseudorandom.
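A quick toy way to see the difference (very rough, assumes a POSIX clock_gettime, array size just picked to be much bigger than cache):

    /* Contrast serially-dependent vs. independent reads over the same
     * array -- a sketch, not a calibrated benchmark. Build with e.g.
     * cc -O2 dram_toy.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define WORDS (1u << 25)                 /* 32M pointers ~ 256 MB, >> cache */

    static uint64_t rng = 0x9e3779b97f4a7c15ull;   /* tiny xorshift PRNG */
    static uint64_t next_rand(void)
    {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        size_t *a = malloc((size_t)WORDS * sizeof *a);
        if (!a) return 1;

        /* Sattolo shuffle: a[] becomes one big cycle, so the chase below
         * is a single chain of dependent loads wandering over the array. */
        for (size_t i = 0; i < WORDS; i++) a[i] = i;
        for (size_t i = WORDS - 1; i > 0; i--) {
            size_t j = next_rand() % i;
            size_t t = a[i]; a[i] = a[j]; a[j] = t;
        }

        /* 1) Serially dependent: the next address comes from the previous
         *    load, so requests can't be overlapped across banks. */
        double t0 = seconds();
        size_t p = 0, sink = 0;
        for (size_t i = 0; i < WORDS; i++) { p = a[p]; sink += p; }
        double chase = seconds() - t0;

        /* 2) Independent sequential reads: addresses are known up front,
         *    so many requests can be in flight at once. */
        t0 = seconds();
        size_t sum = 0;
        for (size_t i = 0; i < WORDS; i++) sum += a[i];
        double stream = seconds() - t0;

        printf("dependent: %.2fs  independent: %.2fs  (%zu %zu)\n",
               chase, stream, sink, sum);
        free(a);
        return 0;
    }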
Ah, thanks for elaborating. That explains how Cuckoo Cycle can handle so many threads that are all hammering main memory.
With your understanding of the parallelizability of memory accesses, do you see possible improvements in the bucketing mechanism in the current implementation?