Completely disagree. This is a dangerous way to overclock and could lead to catastrophic failure of your rig if a card dies on its own, and they do.
It's discussed on the NVIDIA dev forum, with some Python code that could be adapted to nvOC if anyone is interested:
https://devtalk.nvidia.com/default/topic/769851/multi-nvidia-gpus-and-xorg-conf-how-to-account-for-pci-bus-busid-change-/

Hi Guys,
I discovered a serious and potentially dangerous flaw in the way nvOC handles overclocking and would like to make a suggestion for an improvement.
We really need overclocking tied to the specific PCIe slot (bus ID), not to an index that changes every time your hardware changes.
For example, say you have a GTX 1080 Ti in slot 2 and a GTX 1060 in slot 3. If the 1080 Ti goes offline for some reason, or you remove it, the 1080 Ti overclock now gets applied to whatever card comes next in that dumb index, which is your GTX 1060, potentially going POOF.
We need to apply overclocking to BUS ID:
+-----------------------------------------------------------------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:02:00.0 Off |                  N/A |
| 70%   56C    P2   152W / 151W |    652MiB /  8112MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:04:00.0 Off |                  N/A |
| 70%   61C    P2   120W / 120W |    592MiB /  6072MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:05:00.0 Off |                  N/A |
| 70%   52C    P2   118W / 120W |    614MiB /  8113MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
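Roughly what that could look like, as a minimal Python sketch (not nvOC code, just an illustration): key the overclock table on bus ID and look up the current index through nvidia-smi on every boot. The OC_BY_BUS_ID table, the offset values, and the choice of performance level [3] are assumptions for the example, not settings to copy.

#!/usr/bin/env python3
# Illustration only: map bus IDs to the current nvidia-smi index before applying OC.
import subprocess

# Hypothetical per-card settings, keyed on bus ID instead of GPU index.
OC_BY_BUS_ID = {
    "00000000:02:00.0": {"power_limit": 151, "core_offset": 100, "mem_offset": 600},
    "00000000:04:00.0": {"power_limit": 90,  "core_offset": 75,  "mem_offset": 500},
}

def current_index_by_bus_id():
    # nvidia-smi reports the index/bus-ID pairing the driver is using right now.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
        text=True)
    mapping = {}
    for line in out.strip().splitlines():
        index, bus_id = [field.strip() for field in line.split(",")]
        mapping[bus_id] = int(index)
    return mapping

def apply_overclocks():
    indexes = current_index_by_bus_id()
    for bus_id, oc in OC_BY_BUS_ID.items():
        if bus_id not in indexes:
            print(f"Card at {bus_id} is missing; skipping its OC instead of hitting the wrong GPU")
            continue
        i = indexes[bus_id]
        # Power limit via nvidia-smi; the driver still clamps to the card's valid range.
        subprocess.run(["sudo", "nvidia-smi", "-i", str(i), "-pl", str(oc["power_limit"])], check=False)
        # Clock offsets via nvidia-settings (needs a running X server, as nvOC already uses).
        subprocess.run(["nvidia-settings",
                        "-a", f"[gpu:{i}]/GPUGraphicsClockOffset[3]={oc['core_offset']}",
                        "-a", f"[gpu:{i}]/GPUMemoryTransferRateOffset[3]={oc['mem_offset']}"],
                       check=False)

if __name__ == "__main__":
    apply_overclocks()

With the table keyed on bus ID, a card that drops out simply has its entry skipped; its settings can never slide onto the next card in the index.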
Nothing to fix at all oO ...
You modified your rig, you have to modify your settings ...
How is OC by slot going to fix the scenario where a person just moves cards around in a rig, as opposed to removing one? Both scenarios are hardware changes, and common sense dictates that the user be aware of this potential because they went down the path of individual OC in the first place. It is not like they went there by mistake, right?
I think the concern is about when no changes are intentionally made.
Example: I have 12 cards in a rig. One card dies completely, mining stops, WDOG restarts the rig...
Rig comes back up, but the dead card is not recognized at all. GPU numbering is now different, so some OC settings are wrong and may be applying power, fan, and clock values to the wrong cards, perhaps making the rig unstable or putting more hardware at risk...
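One way to guard against exactly that restart scenario, sketched below in Python on the assumption that nvOC could run a pre-flight check before applying OC (EXPECTED_BUS_IDS is a made-up baseline for illustration): compare the bus IDs the driver sees now against the set recorded when the OC was configured, and refuse to apply per-index settings if they differ.

#!/usr/bin/env python3
# Illustration only: refuse to apply index-based OC if the set of detected
# cards no longer matches what the rig looked like when OC was configured.
import subprocess, sys

# Hypothetical baseline: bus IDs recorded when the rig was last configured.
EXPECTED_BUS_IDS = {"00000000:02:00.0", "00000000:04:00.0", "00000000:05:00.0"}

def detected_bus_ids():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        text=True)
    return {line.strip() for line in out.strip().splitlines()}

if __name__ == "__main__":
    found = detected_bus_ids()
    missing = EXPECTED_BUS_IDS - found
    extra = found - EXPECTED_BUS_IDS
    if missing or extra:
        print(f"GPU set changed (missing: {sorted(missing)}, new: {sorted(extra)});"
              " not applying per-index OC until settings are reviewed.")
        sys.exit(1)
    print("GPU set unchanged; safe to apply per-index OC.")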
Well, I don't think you will put the HW at risk, but it could certainly screw up some OC settings. My understanding is that Nvidia builds their GPUs with several different fail-safe mechanisms to prevent this, as do the vendors who build and warranty them. Consider the scenario where we attempt to apply a power limit (PL) that is too high. Here is an old 970 on my test rig whose limit is 220 watts, and I try to push it to 225:
m1@Testy:~$ # Overpower a GPU
m1@Testy:~$ nvidia-smi --query-gpu=name,pstate,temperature.gpu,fan.speed,utilization.gpu,power.draw,power.limit --format=csv
name, pstate, temperature.gpu, fan.speed [%], utilization.gpu [%], power.draw [W], power.limit [W]
GeForce GTX 970, P2, 49, 50 %, 100 %, 114.22 W, 115.00 W
m1@Testy:~$ # Find max power card can handle
m1@Testy:~$ nvidia-smi -a |grep "Max Power"
Max Power Limit : 220.00 W
m1@Testy:~$ # Set GPU Power above this
m1@Testy:~$ sudo nvidia-smi -pl 225
Provided power limit 225.00 W is not a valid power limit which should be between 100.00 W and 220.00 W for GPU 00000000:01:00.0
Terminating early due to previous errors.
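For what it's worth, a script can also check the valid range up front and clamp the request instead of relying on the driver to reject it. A minimal Python sketch, assuming the standard nvidia-smi query fields power.min_limit and power.max_limit; the 225 W figure just mirrors the test above:

#!/usr/bin/env python3
# Illustration only: query a card's valid power range and clamp a requested
# power limit to it before calling nvidia-smi -pl.
import subprocess

def power_limits(index):
    # Returns (min, max) valid power limits in watts for the given GPU index.
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=power.min_limit,power.max_limit",
         "--format=csv,noheader,nounits"],
        text=True)
    lo, hi = [float(x) for x in out.strip().split(",")]
    return lo, hi

def set_power_limit(index, requested_watts):
    lo, hi = power_limits(index)
    clamped = max(lo, min(hi, requested_watts))
    if clamped != requested_watts:
        print(f"GPU {index}: requested {requested_watts} W is outside {lo}-{hi} W, using {clamped} W")
    subprocess.run(["sudo", "nvidia-smi", "-i", str(index), "-pl", str(clamped)], check=False)

if __name__ == "__main__":
    # Example: the 225 W request from above would be clamped to the 970's 220 W max.
    set_power_limit(0, 225)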
Has anybody seen a HW failure from this or is this just theoretical?