Maybe it is because of the different number of SMs on the 970, but I hardcoded the BLOCKS and THREADS numbers: in the silentarmy algorithm worksize=NR_ROWS, so in CUDA blocks=NR_ROWS/THREAD_PER_BLOCK. I don't have any 970 cards, but I have a 980, so I will take a look tomorrow.
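
For anyone following along, here is a rough host-side sketch of that launch-size arithmetic. It is not the miner's actual code: the values of NR_ROWS_LOG and THREAD_PER_BLOCK and the helper name blocks_for are assumptions picked for illustration. It only shows that the block count follows from the work size rather than from the number of SMs on the card.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for illustration only; the real build may use others.
constexpr std::uint32_t NR_ROWS_LOG = 20;
constexpr std::uint32_t NR_ROWS = 1u << NR_ROWS_LOG;   // work size
constexpr std::uint32_t THREAD_PER_BLOCK = 64;          // threads per CUDA block

// Hypothetical helper: number of blocks needed to cover work_items,
// rounding up so that a partial last block is still launched.
inline std::uint32_t blocks_for(std::uint32_t work_items,
                                std::uint32_t threads_per_block) {
    assert(threads_per_block > 0);
    return work_items / threads_per_block +
           (work_items % threads_per_block != 0 ? 1u : 0u);
}
```

With these assumed numbers, blocks_for(NR_ROWS, THREAD_PER_BLOCK) gives 16384 blocks on any card. If the work size is not an exact multiple of the block size, the kernel has to check its index against the work size, because the last block is only partly used.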
I have fixed it now. I have started modding, and I get 50 sol/s on the GTX 970 and 75 sol/s on the 1070 at standard clocks.
The power usage is only 60 W at the wall for the 970 and 95 W for the 1070.