I am working on #2. My 1070, which has a single 8-pin power connector, is throttling (the core clock is reduced dynamically).
I rewrote part of the code and was able to push it from 250 to 262 MH/s at stock clocks, but compute 5.2 cards are slower.
I think I need two separate kernels, one for each architecture.