Is there a script for a "rig offline" Telegram notification that I'm missing? The Telegram messages are awesome, but it would be good to know if a rig powers down and stays down.
If the system is going to be rebooted, you will be notified by the script that sends the reboot command (watchdog or temp control), but if your rig freezes, nothing on the rig itself can notify you.
You can set up an external watchdog, such as an RPi, to check your rig's ping and SSH port and send you a message if they are unreachable.
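A minimal sketch of such an external check in Python, meant to run on the RPi (or any other always-on machine). It only tests the SSH port and posts through the Telegram Bot API `sendMessage` method. The function names, the bot token, and the chat ID are placeholders of mine, not something nvOC ships.

```python
import socket
import urllib.parse
import urllib.request


def port_is_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def build_telegram_request(token, chat_id, text):
    """Build the URL and form body for the Bot API sendMessage method."""
    url = "https://api.telegram.org/bot{}/sendMessage".format(token)
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return url, body


def notify_if_down(host, token, chat_id, check=port_is_open, send=None):
    """Send a Telegram message if the rig's SSH port is unreachable.

    token and chat_id are placeholders for your own bot credentials.
    Returns True if the rig was found down and a message was sent.
    """
    if check(host):
        return False
    url, body = build_telegram_request(
        token, chat_id, "Rig {} is unreachable".format(host))
    if send is None:
        # Default sender: plain HTTPS POST to the Bot API.
        def send(u, b):
            return urllib.request.urlopen(u, data=b, timeout=10).read()
    send(url, body)
    return True
```

Called from cron every minute or so, this sends a message on every run while the rig is down, so you would probably want to remember the last state and only report changes.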
Hi
About this, does nvOC use the Linux kernel watchdog?
Unfortunately no.
Actually, a card in one of my rigs has been making it hard freeze lately, and I have been reading about and playing with sysctl to get a reboot on kernel panic, with no success.
I changed
sudo reboot
in both watchdog and temp control to
sudo systemctl reboot --force
It didn't help a bit.
I also added this to /etc/sysctl.conf:
kernel.panic = 1
That didn't help either.
The system freezes after 3-4 days of running clean, with nothing in the logs, just a hard freeze.
If anyone has good ideas for implementing the Linux kernel hardware watchdog, it would be an amazing thing to add ...
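For reference, the kernel exposes a watchdog as a device node, usually /dev/watchdog. Opening it arms the timer, and any write counts as a keepalive. If the writes stop, which is what happens in a hard freeze, the hardware resets the machine. Below is a rough Python sketch of a feeder, assuming the motherboard has a watchdog chip with a loaded driver. The function name, the `healthy` hook, and the 10 second interval are my own placeholders.

```python
import time


def run_feeder(device="/dev/watchdog", interval=10.0, healthy=lambda: True,
               iterations=None, sleep=time.sleep):
    """Write keepalives to the watchdog device while healthy() is true.

    iterations exists only so the loop can be stopped in a test or a manual
    run; leave it as None on a real rig. Returns the number of keepalives.
    """
    writes = 0
    done = 0
    # buffering=0 so every write goes straight to the driver.
    with open(device, "wb", buffering=0) as wd:
        while iterations is None or done < iterations:
            if healthy():
                wd.write(b"\0")  # any write resets the timer
                writes += 1
            # When healthy() is false the device stays open but is not
            # written, so the timer can expire and reset the rig.
            sleep(interval)
            done += 1
        # Reached only when iterations is set: the "magic close" character
        # asks drivers that support it to disarm the timer before closing.
        wd.write(b"V")
    return writes
```

The interval has to be shorter than the timeout the driver is configured with, and the `healthy` hook could be tied to something like the miner still reporting a hashrate.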
My solution to this problem is to do a HW reset with a Raspberry Pi.
Basically you add a small circuit (a resistor and an optocoupler) to one of the GPIO outputs of the Raspberry Pi and you connect the output of the optocoupler to the reset switch pins of the motherboard. In other words, you allow the Raspberry Pi to "close" the reset switch of the motherboard.
On the RPi you run a script that checks for the SSH port on the mobo every 30 seconds or so (you could alternatively ping the motherboard). If you have 10 consecutive failures, the RPi resets the mobo.
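The check loop described above could look roughly like this in Python. It is a sketch, not my exact script: the BCM pin number, the 0.5 second pulse, and the assumption that driving the pin high is what closes the reset switch all depend on how the optocoupler is wired. The GPIO part uses the RPi.GPIO library and is kept in its own function so the counting logic can be tried on any machine.

```python
import socket
import time


def ssh_reachable(host, port=22, timeout=5.0):
    """Return True if the SSH port on host accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, reset, check=ssh_reachable, interval=30.0, max_failures=10,
          sleep=time.sleep, cycles=None):
    """Call reset() after max_failures consecutive failed checks.

    cycles limits the number of checks (for testing); None runs forever.
    Returns the number of resets triggered.
    """
    failures = 0
    resets = 0
    done = 0
    while cycles is None or done < cycles:
        if check(host):
            failures = 0  # any success clears the streak
        else:
            failures += 1
            if failures >= max_failures:
                reset()
                resets += 1
                failures = 0  # count again while the rig boots
        sleep(interval)
        done += 1
    return resets


def make_gpio_reset(pin=17, pulse=0.5):
    """Return a reset() that pulses a GPIO pin (pin and pulse are assumptions)."""
    import RPi.GPIO as GPIO  # imported here so the rest runs off the Pi

    GPIO.setmode(GPIO.BCM)
    GPIO.setup(pin, GPIO.OUT, initial=GPIO.LOW)

    def reset():
        GPIO.output(pin, GPIO.HIGH)  # drive the optocoupler input
        time.sleep(pulse)
        GPIO.output(pin, GPIO.LOW)

    return reset
```

On the Pi you would start it with something like `watch("192.168.1.50", make_gpio_reset())`, where the address is just an example. With the defaults, the rig gets about five minutes to come back up before another reset can fire.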
I could write a how-to guide in this thread and post the schematics if you are interested.
I'm actually making an electrical switchboard for my rigs to connect to an RPi+relay board to reboot rigs and also turn them on/off if needed.
But I'm looking into making the Linux kernel watchdog work so we don't need external watchdogs anymore.
The rig I'm talking about is a test rig at home with a 1070 and a P106 card, which were failing on my main rigs, so I brought them home to test the failures.