I must say that I didn't have any problem benchmarking AMD cards: if the room was hot, I'd put the fans at high speed, run the miner, wait a couple of minutes for it to stabilize, and that's it. I could make 100 changes to a kernel in a day and check them all, accurately.
On NVIDIA I have throttling problems I can't easily fix (the cards reduce clock speed in a number of situations I just can't predict), overclocking/downclocking is more difficult because the cards tend to change clocks by themselves, and the hashrate fluctuates wildly and even changes between ccminer runs.
The rig is headless so I only have nvidia-smi to work with, and it can't set the fan speed.
So when I make a small kernel speedup, I spend more time benchmarking it (to be sure it's really an improvement) than making the improvement itself :-/
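One way to cut down on that benchmarking time is to collect several hashrate readings per build and compare them statistically instead of by eye. The following is only a rough sketch in Python: the function name `compare_runs` is hypothetical, the readings have to be collected by hand from the miner output, and the Welch-style t value is a rule of thumb, not a rigorous test.

```python
import math
import statistics


def compare_runs(baseline, candidate):
    """Compare two lists of hashrate readings (same unit, e.g. kH/s).

    Returns (diff, stderr, t), where diff is the difference of the means
    (candidate minus baseline), stderr is its standard error, and t is
    diff / stderr (a Welch-style t value).
    """
    if len(baseline) < 2 or len(candidate) < 2:
        raise ValueError("need at least two readings per build")

    diff = statistics.mean(candidate) - statistics.mean(baseline)

    # Sample variances; each build may have a different amount of noise.
    var_a = statistics.variance(baseline)
    var_b = statistics.variance(candidate)
    stderr = math.sqrt(var_a / len(baseline) + var_b / len(candidate))

    if stderr == 0:
        # No noise at all: any nonzero difference is trivially visible.
        t = 0.0 if diff == 0 else math.copysign(math.inf, diff)
    else:
        t = diff / stderr
    return diff, stderr, t
```

A |t| well above 2 suggests the difference is larger than the run-to-run noise; a value near 0 means more readings are needed, or the change does nothing measurable. Interleaving the runs (old, new, old, new, ...) should help when the clocks drift over time.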
Maybe there are some nvidia-smi settings to make it more stable?
Or maybe on Windows it's different...
In the end I may need a workstation with an NVIDIA card as its main card, and work on that.
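Even without fan control, nvidia-smi can at least log what the clocks are doing during a run, which may help explain the unpredictable throttling. A minimal Python sketch, assuming the installed driver supports the `clocks.sm`, `temperature.gpu` and `clocks_throttle_reasons.active` query fields (`nvidia-smi --help-query-gpu` lists what is available); the helper names are hypothetical and the meaning of the mask bits has to be looked up in the nvidia-smi documentation.

```python
import subprocess

# Query fields assumed to be supported by the installed driver.
QUERY_FIELDS = "index,clocks.sm,temperature.gpu,clocks_throttle_reasons.active"


def to_int(text, base=10):
    """Convert a field to int, or None for values like '[Not Supported]'."""
    try:
        return int(text, base)
    except ValueError:
        return None


def parse_sample(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    index, sm_clock, temp, reasons = [f.strip() for f in line.split(",")]
    return {
        "gpu": to_int(index),
        "sm_clock_mhz": to_int(sm_clock),
        "temp_c": to_int(temp),
        # Reported as a hex bitmask such as 0x0000000000000004.
        "throttle_mask": to_int(reasons, 16),
    }


def read_samples():
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [parse_sample(line) for line in out.splitlines() if line.strip()]
```

Calling `read_samples()` every few seconds while ccminer is running and saving the results next to the hashrate makes it possible to throw away benchmark runs in which the SM clock changed.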
Buy another 970 card. The Gigabyte Windforce OC never throttles and mines at a stable clock rate, so speedups are easy to verify. For very small changes, take a look at the generated PTX assembly code: fewer lines of code is usually better, but not always.
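To make the PTX comparison less tedious, the instruction lines can be counted with a small script. This is a rough Python sketch: it assumes the PTX was produced with something like `nvcc -ptx kernel.cu -o kernel.ptx` (`kernel.cu` is a placeholder file name), the function name `count_ptx_instructions` is hypothetical, and the line-based parsing is only approximate. PTX is also an intermediate format that still gets compiled to the final machine code, which is one reason fewer lines do not always mean more speed.

```python
from collections import Counter


def count_ptx_instructions(ptx_text):
    """Return a Counter of base opcodes (ld, add, bra, ...) in PTX text.

    Rough heuristic: an instruction is a line that ends with ';' and does
    not start with '.', which skips directives such as .reg and .param.
    Labels, braces and comment-only lines are skipped as well.
    """
    counts = Counter()
    for raw in ptx_text.splitlines():
        # Drop trailing // comments, then surrounding whitespace.
        line = raw.split("//", 1)[0].strip()
        if not line.endswith(";") or line.startswith("."):
            continue
        tokens = line.split()
        # Skip a leading predicate guard such as "@%p1".
        if tokens[0].startswith("@") and len(tokens) > 1:
            tokens = tokens[1:]
        # "ld.global.u32" is counted under its base opcode "ld".
        opcode = tokens[0].rstrip(";").split(".")[0]
        counts[opcode] += 1
    return counts
```

Running it on the PTX from before and after a change and comparing `sum(counts.values())` plus the per-opcode counts gives a quick first impression, but the hashrate still has the final word.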
You can also test your changes on a big rig with many cards. With many cards, small speedups of 0-1 kH/s per card can become visible.
To finance the cards, you can hope to get your ROI on them within a year by increasing the speed of the kernels.
