Option #2: giving up most of the memory-hardness to get a primitive useful for an iterated hashing PoW.
Parameters used are N=1024,p=1,r=1 -> 128KiB of V.
Not exactly much, but enough to give current gen GPUs a major headache.
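For reference, scrypt's V array occupies 128 * r * N bytes per instance, so the figures above are easy to check. A quick sketch (note: p parallel lanes only need p copies of V if they actually run concurrently; a single thread can process lanes sequentially with one V):

```python
# Memory footprint of scrypt's V array: 128 * r * N bytes per instance.
def v_bytes(n, r):
    return 128 * r * n

# The parameter sets discussed here:
assert v_bytes(1024, 1) == 128 * 1024   # 128 KiB -- fits in CPU L2
assert v_bytes(4096, 1) == 512 * 1024   # 512 KiB per concurrent lane
```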
Basically, instead of aiming for "enough memory use to hit external memory on any device", this aims for "small enough to fit in L2 cache on CPUs, but big enough that running enough parallel instances on GPU/FPGA/VLSI takes too much on-chip memory/cache to be worth it."
If a future generation of GPUs ships with several MB of fast on-chip cache (which would make them competitive), a forced update with increased memory usage and parallelism is on the table (e.g. N=4096, p=4 or 8).
And it'd still be "fast enough" not to cause much of an issue with PoW verification speed: N=4096, p=8 is still only ~20ms per hash on a decent current CPU core.
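As a sanity check on verification cost, the standard library exposes scrypt directly (hashlib.scrypt, Python 3.6+ built against OpenSSL 1.1+). This is a rough timing sketch, not any coin's actual verifier; the input strings are placeholders:

```python
import hashlib
import time

def timed_scrypt(n, r, p, rounds=5):
    """Average wall-clock time of one scrypt evaluation, in seconds."""
    start = time.perf_counter()
    for i in range(rounds):
        # Placeholder inputs; a real PoW would hash a block header + nonce.
        hashlib.scrypt(b"header %d" % i, salt=b"nonce",
                       n=n, r=r, p=p, dklen=32)
    return (time.perf_counter() - start) / rounds

# Compare the current and proposed parameter sets on your own hardware:
# print("%.2f ms" % (timed_scrypt(1024, 1, 1) * 1e3))
# print("%.2f ms" % (timed_scrypt(4096, 1, 8) * 1e3))
```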
Notice that N=4096 p=8 r=1 is one of the parameter sets used in the scrypt paper for relative cost on specialized hardware.
Using the same cost/area figures for VLSI as the paper, rough relative estimates:
md5=1
double-sha256=6
scrypt(1024,1,1)=1000000
scrypt(4096,8,1)=40000000
Or, to put it another way, here's the best real-world relative performance achieved for sha256(sha256()), scrypt(1024,1,1), scrypt(4096,2,1) and scrypt(4096,8,1) on a 3GHz K10 core vs. an HD6970:
double-sha256 1:120
scrypt(1024,1,1) 1:5.2
scrypt(4096,2,1) 1:3.7
scrypt(4096,8,1) 1:4.4
That's without using vector instructions on the CPU, and with a time-memory tradeoff for scrypt on the HD6970 that stores only 1/4 or 1/8 of the V values and recreates the missing ones on the fly; a straightforward GPU version was slower by a factor of 3-5.
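The time-memory tradeoff works because each V[i] is derived from V[i-1], so you can keep every k-th entry and regenerate the rest on demand. A toy sketch of the idea (sha256 stands in for scrypt's real BlockMix/Salsa20-8 core, so this is not actual scrypt, just the ROMix structure):

```python
import hashlib

def H(b):
    # Stand-in for scrypt's BlockMix; scrypt really uses Salsa20/8.
    return hashlib.sha256(b).digest()

def integerify(x, n):
    return int.from_bytes(x[:8], "little") % n

def romix_tmto(seed, n=1024, k=4):
    """Toy ROMix storing only every k-th V entry (1/k of the memory)
    and recomputing skipped entries on the fly -- the same trade the
    GPU kernel above makes with 1/4 or 1/8 of V. k=1 is the full-
    memory baseline."""
    x = H(seed)
    stored = {}
    for i in range(n):            # fill phase: V[i] = H^i(seed)
        if i % k == 0:
            stored[i] = x
        x = H(x)
    for _ in range(n):            # mix phase: data-dependent reads
        j = integerify(x, n)
        v = stored[j - j % k]
        for _ in range(j % k):    # recompute the missing V[j]
            v = H(v)
        x = H(bytes(a ^ b for a, b in zip(x, v)))
    return x
```

Since the recomputed entries are identical to the stored ones, the output matches the full-memory version; the tradeoff only costs extra hash invocations, which is exactly what a GPU with spare ALUs but scarce fast memory is happy to pay.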