Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019-1.4

Hi Stubo,

If you move a card, that's a conscious change, and I would expect you to track it by updating 1bash. In that scenario you are changing only two things, so that seems reasonable to me.

If we keep using the existing index-based method, taking one card out can potentially re-index all of the others.

In the scenario where a card fails, it is the same problem: if it's the first card, all the others may be re-indexed. With a per-bus-id method you have nothing at all to do in this case, because the settings are still correct.
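
To make that concrete, here is a rough Python sketch of the idea. It is not nvOC code: the function names and the offset values are made up for illustration. It assumes the input text is what `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader` prints, and it only resolves settings keyed by bus ID to whatever index each card has right now.

```python
# Hypothetical per-slot settings, keyed by PCI bus ID instead of GPU index.
# The offsets are placeholders, not recommended values.
OC_BY_BUS_ID = {
    "00000000:02:00.0": {"core_offset": 100, "mem_offset": 500},
    "00000000:04:00.0": {"core_offset": 50, "mem_offset": 300},
    "00000000:05:00.0": {"core_offset": 100, "mem_offset": 500},
}


def parse_gpus(smi_csv):
    """Parse 'index, bus_id' lines into a {bus_id: index} dict."""
    gpus = {}
    for line in smi_csv.strip().splitlines():
        index, bus_id = [field.strip() for field in line.split(",")]
        gpus[bus_id.upper()] = int(index)
    return gpus


def resolve_overclocks(settings_by_bus_id, gpus):
    """Return {current_index: settings} for the cards that are present.

    A slot with no card in it is skipped, so its settings can never
    land on a different card.
    """
    resolved = {}
    for bus_id, settings in settings_by_bus_id.items():
        index = gpus.get(bus_id.upper())
        if index is not None:
            resolved[index] = settings
    return resolved
```

If the card at 02:00.0 dies, the other two move up to index 0 and 1, but each one still gets the settings for its own slot and the dead card's settings are simply not applied.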

Another option would be to overclock per card UUID, but I think that overcomplicates things. It could work by generating a map between UUID and bus-id: once you have things set up the way you want, you run a command to build the map, and then it doesn't matter where you move cards around because it will track them. Add a new card? Re-run the map-building script. But again, that is kind of overcomplicated. I think by bus-id is best and the most logical.
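
For anyone curious what the UUID variant might look like, here is a minimal Python sketch. Again, this is not anything nvOC does today. The function names are hypothetical, the UUIDs in any example would be whatever your own cards report, and the input is assumed to be the output of `nvidia-smi --query-gpu=uuid,pci.bus_id --format=csv,noheader`.

```python
def build_uuid_map(smi_csv):
    """Parse 'uuid, bus_id' lines into a {uuid: bus_id} dict."""
    uuid_map = {}
    for line in smi_csv.strip().splitlines():
        uuid, bus_id = [field.strip() for field in line.split(",")]
        uuid_map[uuid] = bus_id.upper()
    return uuid_map


def settings_for_current_slots(settings_by_uuid, current_map):
    """Return {bus_id: settings} for every known card, wherever it sits now.

    Cards that are missing from current_map (dead or removed) are skipped.
    """
    return {
        current_map[uuid]: settings
        for uuid, settings in settings_by_uuid.items()
        if uuid in current_map
    }


def unknown_cards(settings_by_uuid, current_map):
    """Return UUIDs that are installed but have no saved settings yet."""
    return sorted(uuid for uuid in current_map if uuid not in settings_by_uuid)
```

The settings follow the card when it moves to another slot, and `unknown_cards` is the hint that a new card was added and the map needs to be rebuilt. It works, but it is an extra file to keep in sync, which is why I still prefer plain bus-id.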

Quote from: Stubo on November 28, 2017, 05:58:34 PM

Quote from: moofone on November 28, 2017, 04:56:57 PM

Completely disagree. This is a dangerous way to overclock and could lead to catastrophic failure of your rig if a card dies on its own... and they do.

It's discussed on the NVIDIA dev forum, with some Python code that could be adapted to nvOC if anyone is interested:

https://devtalk.nvidia.com/default/topic/769851/multi-nvidia-gpus-and-xorg-conf-how-to-account-for-pci-bus-busid-change-/

Quote from: Bibi187 on November 28, 2017, 04:51:02 PM

Quote from: moofone on November 28, 2017, 04:42:01 PM

Hi Guys,

I discovered a serious and potentially dangerous flaw in the way nvOC handles overclocking and would like to make a suggestion for an improvement.

We really need overclocking tied to the specific PCIe slot (bus id), not an index that changes every time your hardware changes.

For example, say you have a GTX 1080 Ti in slot 2 and a GTX 1060 in slot 3. If the 1080 Ti goes offline for some reason, or you remove it, the GTX 1060 takes its place in the dumb index, so the 1080 Ti overclock gets applied to your GTX 1060, which could potentially go POOF.

We need to apply overclocking to BUS ID:
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:02:00.0 Off |                  N/A |
| 70%   56C    P2   152W / 151W |    652MiB /  8112MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:04:00.0 Off |                  N/A |
| 70%   61C    P2   120W / 120W |    592MiB /  6072MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:05:00.0 Off |                  N/A |
| 70%   52C    P2   118W / 120W |    614MiB /  8113MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

Nothing to fix at all oO ...

You modified your rig, so you have to modify the settings ...

How is OC by slot going to fix the scenario where a person just moves cards around in a rig, as opposed to removing one? Both scenarios are hardware changes, and common sense dictates that the user be aware of this potential because they went down the path of individual OC in the first place. It is not like they went there by mistake, right?