However, "LTC Scrypt" uses a mere 128KB of RAM. It all occurs on the GPU die (which has more than enough register space and L2 cache to hold the scratchpad). GPU latency to its own main memory (i.e. the 2GB of RAM on a graphics card) is incredibly long, and the latency from the GPU die off the card to system main memory is measured in fractional seconds. Utterly useless for Scrypt. If LTC required that memory to be used, a GPU would be far inferior to CPUs with their 2MB+ of L2 and 6MB+ of L3 low-latency cache. "Luckily" the modified parameters selected for LTC use a tiny fraction (~1%) of what is recommended by the Scrypt author for memory hardness even in low security applications and roughly 1/6000th of what is recommended for high security applications. That makes the scratchpad just small enough to fit inside a GPU and allow significant acceleration relative to a CPU.
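To put rough numbers on the scratchpad sizes being compared here, a minimal sketch follows. It assumes the usual scrypt sizing rule (the main scratchpad takes 128 * N * r bytes), the Litecoin parameters N=1024, r=1, and the two parameter sets suggested in the scrypt paper (N=2^14, r=8 for interactive logins and N=2^20, r=8 for sensitive storage). The function and constant names are made up for illustration.

```python
def scratchpad_bytes(n: int, r: int) -> int:
    """Size of scrypt's main scratchpad (the V array): 128 * N * r bytes."""
    return 128 * n * r


# Parameter sets are assumptions stated in the lead-in, not taken from the post.
LTC = scratchpad_bytes(1024, 1)            # Litecoin: N=1024, r=1
INTERACTIVE = scratchpad_bytes(2**14, 8)   # scrypt paper, interactive logins
SENSITIVE = scratchpad_bytes(2**20, 8)     # scrypt paper, sensitive storage

print("LTC:        ", LTC // 2**10, "KiB")
print("interactive:", INTERACTIVE // 2**20, "MiB")
print("sensitive:  ", SENSITIVE // 2**30, "GiB")
print("LTC is 1/%d and 1/%d of the recommended sizes"
      % (INTERACTIVE // LTC, SENSITIVE // LTC))
```

Under these assumptions the ratios come out as 1/128 and 1/8192, which is the same order of magnitude as the ~1% and 1/6000th figures quoted in the post.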
Try bumping the parameters up just a little: GPU performance falls off a cliff, while the drop in CPU performance is far more gradual. It doesn't matter if you attempt this on a system with 16GB (or even 32GB) of main memory. You can even try using a 1GB vs a 2GB graphics card with negligible change in performance. The small scratchpad ensures that neither the GPU's main memory nor the computer's main memory is used. The cache, inside the CPU die for CPU mining or inside the GPU die for GPU mining, is what is used. Ever wonder why GPU-accelerated password cracking programs don't include scrypt? The default parameters make the average GPU execution time <1 hash per second. Not a typo. Not 1 MH/s or 1 KH/s but <1 hash per second.
That is why "reaper" was so revolutionary, but only for the weakened version of Scrypt used by LTC. It requires much less memory, but still too much for a single SIMD unit, and GPU main memory has far too much latency. That makes LTC impossible to mine on a GPU, right? Well, people thought so for a year. Reaper used a workaround: by slaving multiple SIMD units together, it stores the scratchpad across the cache and registers of multiple SIMD units. Now this reduces the parallelism of the GPU (which is why a GPU is only up to 10x better than a CPU vs 100x better on SHA-256). The combined register/cache space across multiple SIMD units is large enough to contain the Scrypt scratchpad. This wouldn't be possible at the default parameters (~20MB of low latency memory), but it is certainly possible at the reduced parameters used by LTC.
That's not how scrypt GPU mining works. You are implying that the GPU memory is not used at all, but this is bullshit (just try to downclock the GPU memory and see the effect yourself). You are implying that the memory latency is somehow important, but this is also bullshit. The memory bandwidth is the limiting factor. You are implying that only a single 128K scratchpad is used per whole GPU (or per SIMD unit), but this is also wrong. In fact thousands of hashes are calculated simultaneously and each one of them needs its own scratchpad (of configurable size and not necessarily 128K). You really have no idea what you are talking about.
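A back-of-the-envelope sketch of the bandwidth argument, under loud assumptions: each hash gets its own full 128 KiB scratchpad (no time-memory trade-off), each hash writes the scratchpad once and then reads about the same amount back, nothing is served from cache, and arithmetic is free. The function names and the example figures (8192 hashes in flight, 200 GB/s) are hypothetical and not measurements of any real card.

```python
SCRATCHPAD = 128 * 1024  # bytes per hash at LTC parameters (N=1024, r=1)


def total_scratchpad_bytes(concurrent_hashes: int,
                           scratchpad: int = SCRATCHPAD) -> int:
    """GPU memory needed when every in-flight hash has its own scratchpad."""
    return concurrent_hashes * scratchpad


def bandwidth_bound_hashrate(bandwidth_bytes_per_s: float,
                             scratchpad: int = SCRATCHPAD) -> float:
    """Upper bound on hashes/s if memory bandwidth is the only limit.

    Assumes one full write pass plus one full read pass per hash.
    """
    traffic_per_hash = 2 * scratchpad
    return bandwidth_bytes_per_s / traffic_per_hash


# Hypothetical example: 8192 hashes in flight, 200 GB/s of memory bandwidth.
print(total_scratchpad_bytes(8192) / 2**30, "GiB of scratchpads")
print(round(bandwidth_bound_hashrate(200e9) / 1000), "kHash/s upper bound")
```

The point of the sketch is only that thousands of concurrent scratchpads cannot fit in on-die cache, and that a bandwidth-limited estimate lands in the hundreds of kHash/s rather than at <1 hash per second.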
About password hashing. That's a totally different application of the scrypt algorithm and it has different requirements. To prevent password bruteforcing, you want the calculation of a single hash to be as slow as possible (within reasonable limits, so that verifying passwords does not become too slow). That's why the recommended scrypt parameters are set so high. Just to give you an example, let's imagine that the LTC scrypt parameters are used for hashing passwords. With a GPU you can easily have ~1000 kHash/s LTC scrypt performance, which means that you can try 1000000 different passwords per second for bruteforcing purposes. And for example, when using only lowercase letters and not really long passwords, it's a matter of just seconds or minutes to bruteforce them at such a hashing speed. That's why the parameters used for LTC scrypt are not fit for password hashing. Check http://en.wikipedia.org/wiki/Password_strength for more information.
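The bruteforce arithmetic can be sketched as follows, assuming an exhaustive search over lowercase-only passwords of one fixed length at the ~1000 kHash/s rate mentioned above. The function names are made up for illustration.

```python
def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct passwords of exactly `length` symbols."""
    return alphabet_size ** length


def seconds_to_exhaust(alphabet_size: int, length: int,
                       hashes_per_second: float) -> float:
    """Worst-case time to try every password of the given length."""
    return keyspace(alphabet_size, length) / hashes_per_second


RATE = 1_000_000  # ~1000 kHash/s, the GPU figure from the post

for length in range(4, 9):
    secs = seconds_to_exhaust(26, length, RATE)
    print(f"{length} lowercase letters: {secs:,.1f} s")
```

For six lowercase letters this gives about 309 seconds for the whole keyspace, so "seconds or minutes" holds for short passwords. The time grows by a factor of 26 with each extra character.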
However, for mining purposes, making a single hash calculation as slow as possible is not a requirement. The absolute hashing speed is irrelevant. The difficulty is adjusted anyway, based on the total hashing speed of the cryptocurrency network. We just kinda care about the fairness between CPU/GPU/FPGA/ASIC, so that none of them gets a really huge advantage (normalized per device cost or transistor budget). And scrypt performance nicely depends both on the memory speed and on the speed of arithmetic calculations, doing a better job of levelling the difference than the SHA-256 used by Bitcoin.