If you want to believe me, then I can vouch for mtrlt's gpu miner being significantly more efficient than any current cpu miner for scrypt.
From what I know of the gpu miner, option 3 of modifying the scrypt parameter will have minimal impact. The pad size did not seem to matter much, and the pad can be compressed, for lack of a better word, with on-the-fly value reconstruction. So any increase in pad size will have a relatively equal impact on cpu miners until you exceed their cache size, at which point gpus may become even more efficient.
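To make "compressed with on-the-fly reconstruction" concrete, here is a toy sketch of the general time-memory tradeoff in a ROMix-style loop. It is not taken from mtrlt's miner, and it makes several assumptions for illustration: SHA-256 stands in for scrypt's real mixing function, and the names `romix_full`, `romix_compressed`, and `stride` are hypothetical. The idea is to keep only every `stride`-th pad entry and recompute the missing ones from the nearest stored entry when they are needed.

```python
import hashlib


def H(x: bytes) -> bytes:
    # Stand-in mixing function (assumption): real scrypt uses BlockMix over Salsa20/8.
    return hashlib.sha256(x).digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))


def integerify(x: bytes, n: int) -> int:
    # Derive a pad index from the current state.
    return int.from_bytes(x[:8], "little") % n


def romix_full(seed: bytes, n: int) -> bytes:
    # Reference version: stores all n pad entries.
    x = H(seed)
    pad = []
    for _ in range(n):
        pad.append(x)
        x = H(x)
    for _ in range(n):
        j = integerify(x, n)
        x = H(xor(x, pad[j]))
    return x


def romix_compressed(seed: bytes, n: int, stride: int) -> bytes:
    # "Compressed" version: stores only every stride-th entry (about n / stride
    # entries) and rebuilds the others on the fly with up to stride - 1 extra hashes.
    x = H(seed)
    pad = []
    for i in range(n):
        if i % stride == 0:
            pad.append(x)
        x = H(x)
    for _ in range(n):
        j = integerify(x, n)
        y = pad[j // stride]          # nearest stored entry at or before index j
        for _ in range(j % stride):   # recompute forward to entry j
            y = H(y)
        x = H(xor(x, y))
    return x


if __name__ == "__main__":
    same = romix_full(b"block header", 1024) == romix_compressed(b"block header", 1024, 8)
    print("outputs match:", same)
```

Both functions return the same digest. The second trades roughly `stride` times less memory for extra hashing per lookup, which is the kind of trade that favors hardware with plenty of spare compute and limited fast memory per thread.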
I think you will be stuck with option 2, finding a completely different hashing algorithm.
Are you saying he has disproved the sequential memory hardness for the ROMix algorithm from the original scrypt paper? I don't see why mtrlt couldn't supply us with at least some of the mathematics behind his algorithm.
If mtrlt has really completed this code, he could easily create an account on a pool with a random username and password, start mining, and have someone log on and verify that he's actually getting 1M/s with four miners. It's really no loss to him; he would only need to mine for a few minutes for the hash rate to be apparent to the other person. Why hasn't he done this?