Here is a little more detail about the watchdog function we built into ZhashOS.
Whenever a GPU crashes, we log the details in our zhash.log file, including which GPU had the problem and what the problem was.
This makes it much easier to identify the problem GPU and address it precisely.
Here is an example from our log file:
11-08-2018 19:34:15 GPU:0 has low Utilization:17
11-08-2018 19:35:45 GPU:0 has low Utilization:48
11-08-2018 19:35:46 Watchdog has detected a GPU failure. Restarting Miner
As you can see here, when a GPU produces multiple "problem" log entries within a 2-minute time frame, we know it is crashing or has already crashed, and the watchdog takes appropriate action by restarting the miner.
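
To make the rule concrete, here is a minimal Python sketch of this kind of check. It is only an illustration and not the actual ZhashOS watchdog code. The line format is taken from the sample log above; the function name find_failures, the threshold of two entries (our reading of "multiple"), and the month-day-year date order are all assumptions.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Line format assumed from the sample zhash.log entries shown above.
LOW_UTIL = re.compile(
    r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}) GPU:(\d+) has low Utilization:(\d+)$"
)
WINDOW = timedelta(minutes=2)  # the 2-minute time frame described above
THRESHOLD = 2                  # assumption: "multiple" means at least two entries


def find_failures(lines, window=WINDOW, threshold=THRESHOLD):
    """Return (gpu_id, timestamp) pairs where a GPU reached the threshold
    of low-utilization entries inside the time window (hypothetical helper)."""
    recent = defaultdict(deque)  # gpu_id -> timestamps of recent problem entries
    failures = []
    for line in lines:
        match = LOW_UTIL.match(line.strip())
        if not match:
            continue  # ignore lines that are not low-utilization entries
        # Assumption: dates are written month-day-year.
        ts = datetime.strptime(match.group(1), "%m-%d-%Y %H:%M:%S")
        gpu = int(match.group(2))
        entries = recent[gpu]
        entries.append(ts)
        # Drop entries that have fallen out of the window.
        while ts - entries[0] > window:
            entries.popleft()
        if len(entries) >= threshold:
            failures.append((gpu, ts))
            entries.clear()  # start counting again after flagging a failure
    return failures


sample = [
    "11-08-2018 19:34:15 GPU:0 has low Utilization:17",
    "11-08-2018 19:35:45 GPU:0 has low Utilization:48",
]
print(find_failures(sample))
```

With the two sample entries, which are 90 seconds apart, GPU 0 is flagged at the second entry. A real watchdog would then restart the miner at that point; that step is left out here because it depends on how the miner is launched.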