Error on rig 2 - Two different rigs, crashing within a minute of each other. Tell me that isn't weird.
m1@m1-desktop:~$ tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
Watchdog for nvOC v0019-2.0 - Community Release
Version: v0019-2.0.011
LOG FILE: (Showing the last 10 recorded entries)
| 12 | 120W | 3.42 Sol/W |
+-----+-------------+--------------+
INFO 09:34:50: GPU3 Accepted share 186ms [A:454, R:1]
INFO 09:34:51: GPU7 Accepted share 187ms [A:477, R:1]
CRITICAL: Sun Apr 29 09:35:17 MST 2018 - GPU Utilization is too low: restarting 3main...
Mon Apr 30 22:35:29 MST 2018 - Lost GPU so restarting system. Found GPU's:
Unable to determine the device handle for GPU 0000:0F:00.0: GPU is lost. Reboot the system to recover this GPU
Mon Apr 30 22:35:30 MST 2018 - reboot in 10 seconds
If both rigs crash and freeze at the same time, it could be an electrical problem.
I had almost the same issue a while back: some of my rigs were crashing at the same time.
I found that whenever one of the room ventilation fans turned on, it put high-frequency noise on the electrical line, and 3-4 rigs would lose a GPU at the same moment and reboot.
After a month of pulling my hair out trying to find the problem, I replaced that fan and the problem was solved.
Open 5watchdog
Change:
echo "$(date) - Lost GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}
To:
echo "$(date) - Lost GPU $GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}
That way you can see which GPU number was lost and check whether it is always the same GPU.
If it's always the same GPU, remove it from the rig and test again; it may be a faulty GPU, riser, or power cable.
If the problem jumps to another GPU after you remove it, then it could be a power problem.
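If it helps, here is a rough sketch for tallying which GPU gets lost most often. It assumes the echo line has already been changed as described, so entries read "Lost GPU <number> so restarting system". The function name count_lost_gpus is just something made up for this example, and the log path in the usage comment is the one from the tail command earlier in the thread; point it at whichever file your watchdog actually writes to.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many times each GPU number shows up in
# "Lost GPU <n>" entries of a watchdog log.
# Assumes the modified echo line is in place, so entries look like:
#   "<date> - Lost GPU 3 so restarting system. Found GPU's:"

count_lost_gpus() {
    local log_file="$1"
    # Keep only the "Lost GPU <n>" part of each matching line,
    # take the number, then count how often each number repeats.
    grep -oE 'Lost GPU [0-9]+' "$log_file" \
        | awk '{ print $3 }' \
        | sort -n \
        | uniq -c \
        | awk '{ printf "GPU %s lost %s time(s)\n", $2, $1 }'
}

# Usage (path is the one from the tail command above; adjust to your setup):
#   count_lost_gpus /home/m1/nvoc_logs/watchdog-screenlog.0
```

If one GPU number dominates the output, that points at that card or its riser/cable. If the counts are spread across several GPUs, power is the more likely suspect.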
Hmmmm interesting about the electrical issue. I have two 8" hyperfans used as exhaust fans that run 100% 24/7. But could be a possibility.
Last night I recompiled the miners again, set the max fan limit to 90 (as suggested in another one of your posts), set the power restore to 80, and changed EWBF from 3_4 to 3_3.
I will try this for a day and see if anything happens. It's just so strange that on Hush, it worked completely fine.
I had this kinda happen before and it was the mining server "disconnecting". Switched pools and all was good.
Is there any way to make the watchdog wait an extended period of time for errors to clear themselves before it tries to restart 3main?
Also, if the watchdog is disabled, do the miners themselves still enforce a max temp limit? I notice that when EWBF starts it says max temp 90*. While I don't want temps that high, if it keeps the miner going and safe then I will consider it.
The watchdog and the temp control are 2 different scripts so even if you disable the watchdog, the temp control will still do its thing. If you want to expand the time between checks for the watchdog, change the interval of the main loop. At the bottom of the script, you will see this line:
Change this to a larger value like 15 or 20. NOTE that increasing this value on a rig with a lot of GPUs will dramatically increase the amount of time before the watchdog bounces the miner in the event that a problem is detected on a single GPU.
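To get a feel for why the GPU count matters, here is a back-of-the-envelope sketch. It assumes the pause is applied once per GPU check, so one full pass over the rig takes roughly the GPU count times the interval. That is an assumption for illustration, not something read out of 5watchdog, and estimate_pass_seconds is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical estimate of how long one full watchdog pass takes.
# ASSUMPTION: the loop pauses once per GPU, so the pass time scales
# as gpu_count * interval_seconds. Check the script itself to confirm.

estimate_pass_seconds() {
    local gpu_count="$1"
    local interval_seconds="$2"
    echo $(( gpu_count * interval_seconds ))
}

# Example: an 8-GPU rig with the interval set to 15 vs 20 seconds.
estimate_pass_seconds 8 15   # prints 120
estimate_pass_seconds 8 20   # prints 160
```

Under that assumption, a larger interval gives transient errors more time to clear, but a genuinely stuck GPU also goes unnoticed for longer.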