Doubling up keys in a clever way might get that to 90 but at the cost of probably unacceptable register pressure. I tried it once and threw away the code, but there are a few other ways to imagine doing it.
It's really hard to beat the raw number of ALUs those AMD devices have when the code is as trivially parallel as brute-force hashing.
have you heard of the lookup_gap feature of the OpenCL based miner? reduces the scratchpad size by factor 2 or 3 and replaces the lookups by some extra computation. Not sure if nvidia cards have the computational reserves, but we could try...the funnel shifter seems to help a bit in creating some breathing room. compute 3.5 devices are definitely memory limited when hashing.
Haven't seen it, but I'm guessing from your description that it stores only every other (or every third) scratchpad entry and dynamically recomputes when it needs to access an odd-numbered entry?
I've thought about it, but the nvidia kernels are so compute-bound that I never took it seriously. Do you know any numbers for how much it speeds up the OpenCL miner? It's not that hard to implement if it really seems worthwhile, but I'm skeptical for nvidia.
(It is, however, the obvious route to go for FPGA or ASIC.)