Doubling up keys in a clever way might get that to 90 but at the cost of probably unacceptable register pressure. I tried it once and threw away the code, but there are a few other ways to imagine doing it.
It's really hard to beat the raw number of ALUs those AMD devices have when the code is as trivially parallel as brute-force hashing.
have you heard of the lookup_gap feature of the OpenCL based miner? reduces the scratchpad size by factor 2 or 3 and replaces the lookups by some extra computation. Not sure if nvidia cards have the computational reserves, but we could try...the funnel shifter seems to help a bit in creating some breathing room. compute 3.5 devices are definitely memory limited when hashing.