If we are talking speed, I have a program (a variation of BitCrack) that gets around 400 MKey/s per GPU, but honestly, that is too slow. I would like to reach around 1,200 MKey/s per card (on a low-end 30xx card), with multiple cards per instance.
OK, I can help you with the coding if I think it will work.