Short version: compared to (1024,1,1), increasing N and r actually helps GPUs and hurts CPUs.
Longer version:
While the working set is small enough to fit in L2, each CPU core can act mostly independently and has pretty large read/write BW; make it big enough to spill into external memory and you've got ~15 GB/s shared between all cores.
Meanwhile, GPU caches are too small to be of much use either way, so it comes down to bus efficiency: with random reads of 128 B per item, a 256-bit GDDR5 bus ends up at well under 20% of peak BW; at 1024 B per item that fraction goes up very significantly.
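To put rough numbers on that: scrypt's V array is N blocks of 128*r bytes, and each of the N random lookups in the second mixing loop fetches one whole block. A quick back-of-the-envelope sketch (standard scrypt sizing, nothing implementation-specific):

```python
# Rough numbers only: scrypt's V array is N blocks of 128*r bytes,
# and each of the N random lookups in the mixing loop fetches one block.
def scrypt_footprint(n, r):
    block = 128 * r        # bytes pulled in per random read (one V block)
    v_bytes = block * n    # size of the V array for one instance
    return block, v_bytes

for params in [(1024, 1), (8192, 8)]:
    block, total = scrypt_footprint(*params)
    print(f"scrypt(N={params[0]}, r={params[1]}): {block} B per read, {total // 1024} KiB of V")
```

That gives 128 B reads and a 128 KiB V for (1024,1,1), which sits comfortably in a CPU's L2, versus 1024 B reads and 8 MiB of V for (8192,8,1), which doesn't.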
End result: a 5870 ends up about 6 times as fast as a Phenom II for scrypt(8192,8,1) (without really trying to optimize either side, so YMMV).
The only way to make scrypt win on CPU-vs-GPU again would be to go WAAAY bigger, think a V array > 128MB, so GPUs don't have enough RAM to run enough parallel instances to mask latencies... but that also means it's REALLY slow (hash/sec? more like sec/hash!), and you need the same amount of memory just to check results... Now who wants a *coin where a normal node needs several seconds and hundreds of megs to gigs of RAM just to check a block PoW for validity?
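For a feel of how painful that gets, here's a quick timing sketch using Python's hashlib.scrypt (stdlib, OpenSSL-backed); N=2^20 with r=1 gives the ~128 MB V array from above, and note you have to raise maxmem because the default cap is around 32 MiB. The password/salt values are just placeholders, and the exact time will obviously depend on your machine:

```python
import hashlib, time

# V array of 128*N*r bytes: N = 2**20, r = 1 is 128 MiB per instance.
n, r, p = 2**20, 1, 1
start = time.time()
key = hashlib.scrypt(b"password", salt=b"block header", n=n, r=r, p=p,
                     maxmem=256 * 1024 * 1024,  # default cap (~32 MiB) is too small here
                     dklen=32)
print(f"one scrypt({n},{r},{p}) evaluation took {time.time() - start:.2f} s")
```

And a verifier has to pay exactly the same time and memory per check.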
A proof of work can both require tons of memory and be trivially verifiable.
See Cuckoo Cycle at https://github.com/tromp/cuckoo
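To see why verification is cheap there: the prover hands over the ~42 edge nonces of the claimed cycle, and the verifier just re-derives each edge's two endpoints from the header and checks that they close into a single cycle, with no big memory needed. A hedged sketch of that idea (NOT the reference code; the blake2b edge mapping and the node count M below are stand-ins for Cuckoo Cycle's siphash-based graph):

```python
import hashlib

M = 1 << 20  # nodes per side of the bipartite graph (illustrative, not Cuckoo's actual sizing)

def edge(header: bytes, nonce: int):
    # Stand-in for Cuckoo Cycle's siphash-based edge derivation:
    # each nonce maps to one edge between the two node partitions.
    h = hashlib.blake2b(header + nonce.to_bytes(4, "little")).digest()
    u = int.from_bytes(h[:8], "little") % M          # node on side 0
    v = int.from_bytes(h[8:16], "little") % M + M    # node on side 1
    return u, v

def verify_cycle(header: bytes, nonces, length=42):
    # Cheap check: the claimed nonces must form a single cycle of the required length.
    if len(nonces) != length or len(set(nonces)) != length:
        return False
    adj = {}
    for n in nonces:
        u, v = edge(header, n)
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    if any(len(nbrs) != 2 for nbrs in adj.values()):
        return False          # every node on a cycle is touched exactly twice
    # Walk the cycle without backtracking and make sure it closes after `length` steps.
    start = next(iter(adj))
    prev, cur, steps = None, start, 0
    while True:
        nxt = adj[cur][0] if adj[cur][0] != prev else adj[cur][1]
        prev, cur, steps = cur, nxt, steps + 1
        if cur == start:
            break
    return steps == length
```

Memory for the verifier is on the order of the proof size (a few dozen edges), no matter how much memory the prover had to burn to find the cycle in the first place.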