I have been using TRM for over 6 months with very happy results on my Ethereum mining farm. The most recent release, 0.7.17, has been causing random GPU mining crashes on rigs that have been stable for months. The bigger problem is that after a crash 0.7.17 frequently hangs and just sits there not mining until someone manually restarts the system. I have had 10-20 rigs not mining for hours because of this.
I recently went back to 0.7.15 and all the issues went away. I have been running a farm for about four years now and can tell an issue with the mining software apart from GPU crashes caused by OC, riser, or voltage issues, etc.
Here is the SMOS log from one of the many crashes I have seen since upgrading all my machines to 0.7.17:
[2020-11-10 04:36:13] Pool us2.ethermine.org received new job. (job_id: 0x238cea67918072b4b145002a593cb77015079123ffb74ce84a47d8ff1f78aafc)
[2020-11-10 04:36:14] Watchdog triggering miner shutdown after restart script execution.
[2020-11-10 04:36:14] Shutting down...
[2020-11-10 04:36:14] Watchdog thread exiting.
[2020-11-10 04:36:14] GPU10 thread exiting.
[2020-11-10 04:36:14] GPU 9 thread exiting.
[2020-11-10 04:36:14] GPU12 thread exiting.
[2020-11-10 04:36:14] GPU 2 thread exiting.
[2020-11-10 04:36:14] GPU 1 thread exiting.
[2020-11-10 04:36:14] GPU11 thread exiting.
[2020-11-10 04:36:14] GPU 3 thread exiting.
[2020-11-10 04:36:14] GPU 6 thread exiting.
[2020-11-10 04:36:14] GPU 7 thread exiting.
[2020-11-10 04:36:14] GPU 0 thread exiting.
[2020-11-10 04:36:14] GPU 8 thread exiting.
[2020-11-10 04:36:14] GPU 5 thread exiting.
[2020-11-10 04:36:24] GPU 4 thread 0 shutdown timed out.
[2020-11-10 04:36:24] Successful clean shutdown.
Miner ended or crashed. Restarting miner in 30 seconds...
Hi! Any chance you can hunt me down on Discord to do some one-on-one troubleshooting? I would love to get more data here; a full log as produced by --log_file would be a great start. There are zero kernel changes between these two versions, so GPU stability isn't really expected to be affected. I also wonder which watchdog/restart script is being executed above. AFAIK SMOS normally runs its own script, but since you don't even get a proper reboot above, something is weird. Also, it looks like GPU 4 is stuck above, so it would be interesting to hear whether there are any kernel/dmesg logs of interest, or whether this could even be a host-side hang.
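If you end up with a lot of logs from different rigs, a small script can help spot which GPU fails to shut down each time. Below is a rough Python sketch that scans log lines for the "thread exiting" and "shutdown timed out" messages seen in the excerpt above. The line format is inferred only from that excerpt, and the function name find_stuck_gpus and the regular expression are my own illustration, not part of TRM or SMOS, so treat it as a starting point and adjust it if your full logs look different.

```python
import re
import sys

# Matches the per-GPU shutdown lines from the log excerpt above, e.g.
#   [2020-11-10 04:36:14] GPU10 thread exiting.
#   [2020-11-10 04:36:24] GPU 4 thread 0 shutdown timed out.
# Assumption: this format is inferred from the excerpt, not from TRM docs.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+GPU\s*(?P<gpu>\d+)\s+thread(?:\s+\d+)?\s+"
    r"(?P<event>exiting|shutdown timed out)\.?\s*$"
)


def find_stuck_gpus(lines):
    """Return (exited, timed_out) as sorted lists of GPU indices."""
    exited, timed_out = set(), set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a per-GPU shutdown line
        gpu = int(match.group("gpu"))
        if match.group("event") == "exiting":
            exited.add(gpu)
        else:
            timed_out.add(gpu)
    return sorted(exited), sorted(timed_out)


if __name__ == "__main__":
    # Usage (hypothetical file name): python stuck_gpus.py miner.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        exited, timed_out = find_stuck_gpus(handle)
    print("GPUs that exited cleanly:", exited)
    print("GPUs that timed out on shutdown:", timed_out)
```

If the same GPU index keeps showing up in the timed-out list across crashes, that points more towards a specific card, riser, or slot; if it moves around between GPUs, a software or host-side cause becomes more plausible.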