If you want to believe me, then I can vouch for mtrlt's gpu miner being significantly more efficient than any current cpu miner for scrypt.
From what I know of the gpu miner, option 3 (modifying the scrypt parameters) will have minimal impact. The pad size did not seem to matter much; the pad can be "compressed", for lack of a better word, by reconstructing missing values on the fly. So any increase in pad size will have a roughly equal impact on cpu miners until you exceed their cache size, at which point gpus may become even more efficient.
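To make the "compression" point concrete, here is a toy Python sketch of that time-memory trade-off. It is not scrypt's actual code: SHA-256 stands in for the real BlockMix/Salsa20 core, and the pad is tiny. Storing only every other pad entry and recomputing the odd-indexed ones on the fly produces the same output with half the memory, which is why pad size alone doesn't hurt gpus much:

```python
import hashlib

def H(x: bytes) -> bytes:
    # Stand-in for scrypt's BlockMix/Salsa20 core (illustration only).
    return hashlib.sha256(x).digest()

def integerify(x: bytes, n: int) -> int:
    # Interpret part of the state as a pad index.
    return int.from_bytes(x[:4], "little") % n

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))

def romix_full(b: bytes, n: int) -> bytes:
    # Classic ROMix shape: fill a pad of n entries, then do n
    # data-dependent lookups into it.
    v, x = [], b
    for _ in range(n):
        v.append(x)
        x = H(x)
    for _ in range(n):
        j = integerify(x, n)
        x = H(xor(x, v[j]))
    return x

def romix_half(b: bytes, n: int) -> bytes:
    # Trade-off variant: store only even-indexed pad entries; since
    # V[i+1] = H(V[i]), an odd entry is recomputed with one extra hash.
    v, x = {}, b
    for i in range(n):
        if i % 2 == 0:
            v[i] = x
        x = H(x)
    for _ in range(n):
        j = integerify(x, n)
        vj = v[j] if j % 2 == 0 else H(v[j - 1])
        x = H(xor(x, vj))
    return x
```

Both functions return the identical digest for the same input, so a miner can pick whichever side of the memory/compute trade suits its hardware.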
Right now salsa20/8 is used as a core part of scrypt (8 rounds for the inner loop). Replacing it with something like salsa20/2 (just two rounds in the inner loop) would significantly improve performance on CPUs, because only a quarter of the calculations would be involved. The memory access pattern would remain the same, so there would be almost no improvement for the miners that depend on memory performance (GPU miners and, to some extent, the Cell/BE miner). So what's the problem? The variants of salsa20 with fewer rounds are supposedly significantly less secure, at least in some cases:
http://fse2008.epfl.ch/docs/papers/day_3_sess_3/29_Lausanne_FSE08_camera_ready.pdf
http://www.ecrypt.eu.org/stream/papersdir/2007/010.pdf

But I don't know how exactly this all applies to scrypt, because cryptography is definitely not my forte. That's why I think that bringing up the issue to the scrypt author can make some sense. That is, after we get a better idea of realistic GPU performance.
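For reference, the round count being discussed is just a loop parameter in the Salsa20 core. A minimal Python sketch (a plain reimplementation from Bernstein's Salsa20 spec, not scrypt's actual code) shows that salsa20/8 versus salsa20/2 changes only how many double-rounds run; the data touched is identical, which is why the memory-bound miners wouldn't speed up:

```python
MASK = 0xffffffff

def rotl32(x: int, n: int) -> int:
    # 32-bit left rotation.
    return ((x << n) | (x >> (32 - n))) & MASK

def quarterround(y0, y1, y2, y3):
    # The Salsa20 add-rotate-xor quarter-round.
    y1 ^= rotl32((y0 + y3) & MASK, 7)
    y2 ^= rotl32((y1 + y0) & MASK, 9)
    y3 ^= rotl32((y2 + y1) & MASK, 13)
    y0 ^= rotl32((y3 + y2) & MASK, 18)
    return y0, y1, y2, y3

def doubleround(x):
    # Column round followed by row round on a 4x4 matrix of 32-bit words.
    x[0], x[4], x[8], x[12] = quarterround(x[0], x[4], x[8], x[12])
    x[5], x[9], x[13], x[1] = quarterround(x[5], x[9], x[13], x[1])
    x[10], x[14], x[2], x[6] = quarterround(x[10], x[14], x[2], x[6])
    x[15], x[3], x[7], x[11] = quarterround(x[15], x[3], x[7], x[11])
    x[0], x[1], x[2], x[3] = quarterround(x[0], x[1], x[2], x[3])
    x[5], x[6], x[7], x[4] = quarterround(x[5], x[6], x[7], x[4])
    x[10], x[11], x[8], x[9] = quarterround(x[10], x[11], x[8], x[9])
    x[15], x[12], x[13], x[14] = quarterround(x[15], x[12], x[13], x[14])

def salsa20_core(inp, rounds=8):
    # salsa20/8 runs 4 double-rounds, salsa20/2 just 1; the final
    # feed-forward addition is the same either way.
    x = list(inp)
    for _ in range(rounds // 2):
        doubleround(x)
    return [(a + b) & MASK for a, b in zip(x, inp)]
```

Dropping `rounds` from 8 to 2 cuts the arithmetic to a quarter without altering the function's input/output shape, which is exactly the CPU-only speedup described above.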
I think you will be stuck with option 2, finding a completely different hashing algorithm.
Not in an attempt to troll the thread, but if you look at solidcoin's hash code, you will see it has random reads and writes of varying size, spread out over a large memory range, and randomly aligned. These are key techniques for creating havoc with a gpu's memory access methods. I would suggest looking for code with similar traits if you really want to defeat gpus, or at least keep them on a level playing field with cpus.
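This is not solidcoin's actual code; the following is only an illustrative Python sketch of the access pattern described above, where a running state hash drives variable-length, unaligned reads and writes scattered over a large pad. Every offset, length, and write position here is an invented example, chosen just to show the shape of the technique:

```python
import hashlib

def scramble(seed: bytes, pad_size: int = 1 << 20, steps: int = 64) -> bytes:
    # Fill a large pad (1 MiB by default) deterministically from the seed.
    pad = bytearray(hashlib.sha256(seed).digest() * (pad_size // 32))
    state = hashlib.sha256(seed + b"state").digest()
    for _ in range(steps):
        # Derive an arbitrary (usually unaligned) offset and a variable
        # length (4..32 bytes) from the current state.
        off = int.from_bytes(state[0:4], "little") % (len(pad) - 64)
        length = 4 + state[4] % 29
        chunk = bytes(pad[off:off + length])
        # Mix the read data back into the state.
        state = hashlib.sha256(state + chunk).digest()
        # Write part of the new state back at another unaligned offset.
        woff = int.from_bytes(state[4:8], "little") % (len(pad) - 64)
        pad[woff:woff + length] = state[:length]
    return hashlib.sha256(state + bytes(pad[:64])).digest()
```

Because each offset depends on data produced by the previous step, the accesses can't be batched or coalesced in advance, which is precisely what frustrates a gpu's wide, aligned memory transactions.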
This is all nice. But can we be sure that these convoluted hash calculations can't be algorithmically optimized and reduced to something that can run orders of magnitude faster?