Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019-1.4

Quote from: Stubo on November 28, 2017, 05:58:34 PM

Quote from: moofone on November 28, 2017, 04:56:57 PM

Completely disagree. This is a dangerous way to overclock and could lead to catastrophic failure of your rig if a card dies on its own... and they do.

It's discussed on the NVIDIA dev forum, with some Python code that could be adapted to nvOC if anyone is interested:

https://devtalk.nvidia.com/default/topic/769851/multi-nvidia-gpus-and-xorg-conf-how-to-account-for-pci-bus-busid-change-/
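The linked thread is about `BusID` entries in xorg.conf. One detail to watch when wiring bus IDs into a config: nvidia-smi prints them as hexadecimal `domain:bus:device.function`, while xorg.conf's `BusID` option expects `PCI:bus:device:function` in decimal. Below is a minimal Python helper for that conversion. It assumes every GPU is in PCI domain 0, so the domain part is simply dropped. The function name is hypothetical and not part of nvOC.

```python
def smi_bus_id_to_xorg(bus_id: str) -> str:
    """Convert an nvidia-smi bus ID such as '00000000:0A:00.0' into an
    xorg.conf BusID such as 'PCI:10:0:0'.

    nvidia-smi uses hex fields; xorg.conf expects decimal. The PCI domain
    is discarded, which assumes every GPU sits in domain 0.
    """
    _domain, bus, dev_func = bus_id.strip().split(":")
    device, function = dev_func.split(".")
    return "PCI:{}:{}:{}".format(int(bus, 16), int(device, 16), int(function, 16))


if __name__ == "__main__":
    for smi_id in ("00000000:02:00.0", "00000000:04:00.0", "00000000:0A:00.0"):
        print(smi_id, "->", smi_bus_id_to_xorg(smi_id))
```

The hex-to-decimal step only matters once bus numbers reach 0A and above. Below that point both forms look the same, which is how the mistake can slip through on smaller rigs.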

Quote from: Bibi187 on November 28, 2017, 04:51:02 PM

Quote from: moofone on November 28, 2017, 04:42:01 PM

Hi Guys,

I discovered a serious and potentially dangerous flaw in the way nvOC handles overclocking and would like to make a suggestion for an improvement.

We really need overclocking tied to the specific PCIe slot (bus ID), not to an index that changes every time your hardware changes.

For example, say you have a GTX 1080 Ti in slot 2 and a GTX 1060 in slot 3, and your 1080 Ti goes offline for some reason or you remove it. The 1080 Ti's overclock then goes to whichever card is now next in the dumb index, which is your GTX 1060, and that card could go POOF.
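The failure described above can be reproduced in a few lines of Python. The settings values and bus IDs here are hypothetical; physical slot numbers don't necessarily match PCI bus numbers. The sketch compares pairing settings with cards by position against looking them up by bus ID after the first card disappears.

```python
# Hypothetical per-card settings: (core clock offset MHz, power limit W).
SETTINGS_BY_INDEX = [(100, 250), (150, 120)]  # index 0 = 1080 Ti, index 1 = 1060
SETTINGS_BY_BUS = {
    "00000000:02:00.0": (100, 250),  # hypothetical bus ID of the 1080 Ti
    "00000000:03:00.0": (150, 120),  # hypothetical bus ID of the 1060
}


def assign_by_index(detected_bus_ids):
    """Pair settings with cards purely by position, like an index-based config."""
    return dict(zip(detected_bus_ids, SETTINGS_BY_INDEX))


def assign_by_bus_id(detected_bus_ids):
    """Look each detected card up by bus ID; unknown slots get no settings."""
    return {b: SETTINGS_BY_BUS[b] for b in detected_bus_ids if b in SETTINGS_BY_BUS}


# The 1080 Ti drops off the bus, so the 1060 is now enumerated first.
detected = ["00000000:03:00.0"]
print(assign_by_index(detected))   # {'00000000:03:00.0': (100, 250)}  <- 1080 Ti's settings
print(assign_by_bus_id(detected))  # {'00000000:03:00.0': (150, 120)}
```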

We need to apply overclocking to BUS ID:
+-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:02:00.0 Off | N/A |
| 70% 56C P2 152W / 151W | 652MiB / 8112MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 106... Off | 00000000:04:00.0 Off | N/A |
| 70% 61C P2 120W / 120W | 592MiB / 6072MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 1070 Off | 00000000:05:00.0 Off | N/A |
| 70% 52C P2 118W / 120W | 614MiB / 8113MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
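Here is a minimal sketch of applying settings by bus ID. It assumes the card list comes from `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`. It also relies on `nvidia-smi -i`, which accepts a PCI bus ID as well as an index, so power limits can be addressed by slot directly. The helper names and the `LIMITS` table are hypothetical. The wattages are the caps from the table above.

```python
import csv
import io


def parse_gpu_query(csv_text):
    """Parse `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`
    output into a {bus_id: index} dict."""
    gpus = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        index, bus_id = row
        gpus[bus_id.strip().upper()] = int(index)
    return gpus


def power_limit_commands(gpus, limits_by_bus):
    """Build `nvidia-smi -i <bus id> -pl <watts>` commands for every configured
    bus ID that is actually present. Missing cards are skipped, so their
    settings never slide onto another GPU."""
    commands = []
    for bus_id, watts in sorted(limits_by_bus.items()):
        if bus_id.upper() in gpus:
            commands.append(["nvidia-smi", "-i", bus_id, "-pl", str(watts)])
    return commands


# Hypothetical config: power limits keyed by bus ID.
LIMITS = {
    "00000000:02:00.0": 151,
    "00000000:04:00.0": 120,
    "00000000:05:00.0": 120,
}

# Simulated query output after the card at 04:00.0 has dropped off the bus.
sample = "0, 00000000:02:00.0\n1, 00000000:05:00.0\n"
for cmd in power_limit_commands(parse_gpu_query(sample), LIMITS):
    print(" ".join(cmd))
```

The card at 05:00.0 is now index 1 instead of 2, but it still gets its own 120 W limit. On a live rig, `sample` would come from `subprocess.run([...], capture_output=True, text=True).stdout`, and each command would be run as root. Clock offsets set with `nvidia-settings` are addressed as `[gpu:N]`. This sketch does not check that N lines up with the nvidia-smi index, so that part would need its own verification.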

Nothing to fix at all oO ...

You modified your rig, so you have to modify your settings...

How is OC by slot going to fix the scenario where a person just moves cards around in a rig, as opposed to removing one? Both scenarios are hardware changes. Common sense dictates that the user be aware of this potential, because they went down the path of individual OC in the first place. It's not like they went there by mistake, right?
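This point is fair: keying by bus ID alone doesn't catch two cards trading slots, because each slot's settings would land on the other card. One option (an assumption, not something taken from nvOC) covers both removal and rearrangement. Record which model sat at each bus ID when the rig was tuned, then compare on every boot. Model names could come from `nvidia-smi --query-gpu=pci.bus_id,name --format=csv,noheader`. The bus IDs and names below are hypothetical.

```python
def changed_slots(expected, detected):
    """Compare {bus_id: model} recorded when the rig was tuned with what is
    detected now. Return (bus_id, expected_model, found_model) for each slot
    whose card is missing (found_model is None) or is a different model."""
    problems = []
    for bus_id, model in sorted(expected.items()):
        found = detected.get(bus_id)
        if found != model:
            problems.append((bus_id, model, found))
    return problems


# Hypothetical rig: 1080 Ti tuned at bus 02, 1060 tuned at bus 03.
expected = {
    "00000000:02:00.0": "GeForce GTX 1080 Ti",
    "00000000:03:00.0": "GeForce GTX 1060",
}
# The two cards were physically swapped.
swapped = {
    "00000000:02:00.0": "GeForce GTX 1060",
    "00000000:03:00.0": "GeForce GTX 1080 Ti",
}
for bus_id, want, got in changed_slots(expected, swapped):
    print(f"{bus_id}: tuned for {want}, found {got}; not applying OC")
```

Two cards of the same model swapped between slots would still pass this check. Catching that case needs a per-board identifier rather than a model name.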

I think the concern is about when no changes are intentionally made.

Example: I have 12 cards in a rig. One card dies completely, mining stops, WDOG restarts the rig...

The rig comes back up, but the dead card is not recognized at all, so the GPU numbering is now different. Some OC settings are now wrong. They may be applying power limits, fan speeds, or overclocks to the wrong cards, perhaps making the rig unstable or putting more hardware at risk...
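For the watchdog-restart case above, one conservative policy (an assumption, not something taken from nvOC) is to skip per-card OC entirely whenever a configured card is missing. The alternative, applying whatever the shifted numbering implies, is exactly what this post warns about. A minimal sketch, with hypothetical bus IDs for a 12-card rig:

```python
def choose_oc_plan(expected_bus_ids, detected_bus_ids):
    """Decide what to do after a watchdog reboot.

    Returns ("apply", []) when every expected card is present, otherwise
    ("stock", missing) so that no per-card overclock is applied until
    someone has looked at the rig.
    """
    missing = sorted(set(expected_bus_ids) - set(detected_bus_ids))
    if missing:
        return ("stock", missing)
    return ("apply", [])


# Hypothetical 12-card rig on buses 01..0C; the card on bus 07 has died.
EXPECTED = [f"00000000:{n:02X}:00.0" for n in range(1, 13)]
detected = [b for b in EXPECTED if b != "00000000:07:00.0"]
print(choose_oc_plan(EXPECTED, detected))  # ('stock', ['00000000:07:00.0'])
```

Leaving every card at driver defaults is blunter than skipping only the missing bus ID. The trade-off is that it doesn't depend on any mapping being right while the hardware is in an unknown state. Whether the rig keeps mining at stock clocks in the meantime is a separate decision.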