Thank you for the in-depth answer.

Any thoughts on how to get it faster? The miner reports a compute capability of 1.1 (is that correct for this card?),
and I've tried without autotune; it gets between 7 and 9 KHash.
You might want to try -l S14x3 or -l S28x3. Both are exact multiples of your multiprocessor count, running 3 warps per multiprocessor.
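
If you want to try a few more values along the same lines, here is a rough sketch of how such -l strings could be generated. It is only an illustration: the function name, the multiprocessor count of 14, and the set of multiples are my assumptions, not something the miner reports, so substitute the count for your own card.

```python
def launch_configs(mp_count, warps=3, multiples=(1, 2), prefix="S"):
    """Build candidate -l strings of the form <prefix><blocks>x<warps>.

    mp_count is the card's multiprocessor count (assumed, not detected here).
    Each block count is an exact multiple of mp_count.
    """
    return [f"{prefix}{mp_count * m}x{warps}" for m in multiples]


# Hypothetical card with 14 multiprocessors.
for config in launch_configs(14):
    print("-l", config)
```

Each printed line is one launch configuration to pass to the miner and benchmark by hand; nothing here measures the hash rate for you.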
Christian