It's not difficult, really. There are a lot of misconceptions about what scrypt can and can't do. I doubt there is a value or N or p for which GPU mining isn't advantageous.
Maybe for high Nfactor values, massive amounts of memory will be used. Then again gpus with big memory will sure take the advantage.
You can use less memory by just lowering the lookup_gap on the current GPU implementation and recomputing the lookup table on the fly. The amount of memory used is O(L^(-1) * m) where L is the lookup_gap and m is the size of the lookup table. Hence with L=2, threads only utilize 64 KB instead of 128 KB. Performance goes down per thread, but you can still calculate any size of N.