How are people monitoring rigs on smOS? Today I had a situation where the rig went down for 5 hours (overnight). My pool sends email notifications when a worker goes offline, but that isn't guaranteed to work, and it's just an email buried among all the other emails I receive. That's not a good way to catch and resolve issues.
Also, would enabling /var/log/messages help determine why smOS is crashing? I don't have aggressive OC settings: 0,1100,70 on mostly 1060s, with temps around 71C (which I think is kind of high).
Thanks in advance
Personally, I have several rigs, and one goes down maybe once every two months, so it's not something I've put much effort into creating a notification for. If you use Claymore, make sure you specify "-r 1" (forces a reboot if the miner hangs) and "-minspeed XXX" (where XXX is the minimum speed in MH/s the miner needs to maintain; if it stays below that for 5 minutes, the rig is rebooted). Doing this should fix 99% of freezes; beyond that, it's a serious hardware fault. Additionally, you can put the rig on a smart outlet such as a Wemo Insight and set up IFTTT with a ping check to automatically power the rig off and on if it stops responding. I know people who have done this, but even though I have a Wemo Insight on each rig for remote power management, I haven't bothered to use this feature since I rarely have a freeze.
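If you'd rather not rely on the pool's emails, the ping-and-power-cycle idea above can be approximated with a small script running on another always-on machine. This is a minimal sketch, not IFTTT's or Wemo's API: it assumes the Linux iputils `ping` (where `-W` is the reply timeout in seconds), and `on_down` is a hypothetical hook where you would trigger your smart-plug action or send a notification you'll actually see.

```python
import subprocess
from typing import Callable


def is_up(host: str, timeout_s: int = 2) -> bool:
    """Return True if `host` answers one ICMP ping (assumes Linux iputils `ping` flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class RigWatchdog:
    """Counts consecutive failed checks and calls `on_down` once per outage.

    `check` and `on_down` are supplied by the caller; `on_down` is a
    hypothetical hook (power-cycle a smart plug, send an alert, ...).
    """

    def __init__(self, check: Callable[[], bool], on_down: Callable[[], None],
                 max_failures: int = 3):
        self.check = check
        self.on_down = on_down
        self.max_failures = max_failures
        self.failures = 0

    def tick(self) -> bool:
        """Run one check; return True only on the tick that triggers `on_down`."""
        if self.check():
            self.failures = 0  # rig answered, so the outage (if any) is over
            return False
        self.failures += 1
        if self.failures == self.max_failures:
            self.on_down()
            return True
        return False  # below the threshold, or already fired for this outage
```

To use it, call `tick()` in a loop (for example once a minute with `time.sleep(60)`), passing `lambda: is_up("192.168.1.50")` as the check, where the address is a placeholder for your rig. Requiring several consecutive failures avoids power-cycling on a single dropped ping, and firing only once per outage means a rig that is still booting isn't cut off again.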
To figure out why it's freezing, you can SSH in and read the log, or, what I've found works well, SSH in and keep the connection open. When the rig reboots, the connection will drop, and the last output will still be visible on the computer you used to SSH into the miner. You'll likely see that one of the GPUs has a 0 hashrate. Reduce the clock on that card, and repeat until there are no more freezes.
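When reading back that last screen of output (or the log), the card to look at is the one reporting 0. A small helper can pick it out of the captured text. This sketch assumes Claymore-style per-GPU speed lines such as `ETH: GPU0 30.125 Mh/s, GPU1 0.000 Mh/s`; that format and the `stalled_gpus` name are assumptions, so adjust the regex to whatever your miner actually prints.

```python
import re
from typing import Iterable, List

# Assumed line format (Claymore-style), e.g. "ETH: GPU0 30.125 Mh/s, GPU1 0.000 Mh/s".
# Lines without "GPU<n> <number> Mh/s" (temperature/fan lines, totals) are ignored.
GPU_RATE = re.compile(r"GPU(\d+)\s+(\d+(?:\.\d+)?)\s*Mh/s", re.IGNORECASE)


def stalled_gpus(log_lines: Iterable[str], threshold: float = 0.0) -> List[int]:
    """Return GPU indices whose most recent reported hashrate is <= threshold."""
    latest = {}
    for line in log_lines:
        for idx, rate in GPU_RATE.findall(line):
            latest[int(idx)] = float(rate)  # later readings overwrite earlier ones
    return sorted(i for i, r in latest.items() if r <= threshold)
```

For example, save the SSH session's scrollback to a file and pass the open file (which iterates line by line) to `stalled_gpus`; it returns the indices of cards whose latest reading was 0, which are the ones to down-clock first.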