On a related note: I'm assuming that my "workaround" for the misaligned address bug on 2xxx and 3xxx series cards is working for you? (I never had these cards, so I couldn't check myself.) I'm assuming it's working for you because you published speed specs, but I just want to make sure it doesn't crash mid-computation.
Honestly speaking, I did not launch very long computations, so I cannot say whether it crashes after one or seven hours, but for a few minutes it works smoothly.

I think I just wanted to have a ready solution that can be built with newer CUDA and for the higher ccap.