I've been playing with the lookup gap. My GTX 780 went from 3.7 kH/s to 5.0 kH/s on Yacoin with -L 5. (-L 4 gives 4.948 kH/s, -L 3 gives 4.535 kH/s, and -L 2 was almost no improvement.)
[2014-01-18 22:09:22] GPU #0: Performing auto-tuning (Patience...)
[2014-01-18 22:09:22] GPU #0: cudaError 2 (out of memory) calling 'cudaMalloc((void **) &d_idata, mem_size)' (salsa_kernel.cu line 499)
Can't wait to try -L on my three 780 Ti cards at home. I'm hoping for 5-6 kH/s per device. Right now I am at a meeting of computer geeks, demoing one of my mining rigs...
I will have to improve the memory management a lot, both on Windows and on Linux. This out-of-memory problem is annoying.
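For anyone wondering why -L has anything to do with memory: roughly, the lookup gap keeps only every L-th scratchpad entry and recomputes the ones in between when they are needed, so it trades extra computation for a smaller allocation. Here is a toy sketch of that idea in Python. It is not the actual kernel code: `mix`, `build_gapped` and `lookup_gapped` are made-up names, and SHA-256 is only a stand-in for the real scrypt mixing step.

```python
import hashlib


def mix(block: bytes) -> bytes:
    # Stand-in for the real scrypt block-mix step (assumption: SHA-256
    # is used here only to keep the sketch short and self-contained).
    return hashlib.sha256(block).digest()


def build_full(seed: bytes, n: int) -> list:
    # Full scratchpad: V[i] = mix applied i times to the seed.
    table = []
    x = seed
    for _ in range(n):
        table.append(x)
        x = mix(x)
    return table


def build_gapped(seed: bytes, n: int, gap: int) -> list:
    # Gapped scratchpad: keep only V[0], V[gap], V[2*gap], ...
    stored = []
    x = seed
    for i in range(n):
        if i % gap == 0:
            stored.append(x)
        x = mix(x)
    return stored


def lookup_gapped(stored: list, index: int, gap: int) -> bytes:
    # Start from the nearest stored entry at or below `index`,
    # then recompute the missing steps (at most gap - 1 of them).
    x = stored[index // gap]
    for _ in range(index % gap):
        x = mix(x)
    return x
```

With `gap = 5` the toy table holds about one fifth of the entries, at the cost of up to four extra `mix` calls per lookup. Whether that is a net win on a real card depends on how memory-bound the kernel is, which is why the best -L value has to be found by trying.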