Post
Topic
Board Mining (Altcoins)
Re: Softcrash watchdog
by IAmNotAJeep on 10/07/2017, 17:24:38 UTC
Hey fullzero, I have a question.

Without a doubt my biggest problem right now is that when my miner crashes it takes the whole rig down with it. Everything gets stuck: SSH barely works, the system load average jumps to 14.5, and Xorg takes up 100% of the CPU. It's so bad that none of the standard reboot commands work; they just do nothing. The only thing that actually reboots the rig in this state is "echo b > /proc/sysrq-trigger", so I've set up a script that checks the load average and, if it's over 2, uses that command to reboot. It works, but I don't like this "solution": yesterday after a reboot nvOC got corrupted somehow, I lost my customized oneBash, and the whole system became read-only (thankfully I had a oneBash backup that was only a few days behind).

So the question is: what can I do to relieve this Xorg error? I run a 7-card rig and never plan on going for a higher number. What can I do with Xorg that would fix this?
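
For anyone who wants to try the same load check, here is a rough sketch of what I mean, not my exact script. LOAD_LIMIT, CHECK_INTERVAL and load_exceeds are just names I made up for this example. It has to run as root to write to /proc/sysrq-trigger. The extra "s" (sync) and "u" (remount read-only) triggers before "b" are my assumption about how to lower the chance of filesystem corruption on the forced reboot; I can't promise they prevent it.

```shell
#!/bin/bash
# Sketch of a load-average watchdog. LOAD_LIMIT, CHECK_INTERVAL and
# load_exceeds are illustrative names, not part of nvOC.
LOAD_LIMIT=2
CHECK_INTERVAL=10

# Succeeds (exit 0) when the 1-minute load average is above the limit.
# $1 = limit, $2 = file to read (defaults to /proc/loadavg)
load_exceeds() {
  local limit=$1 file=${2:-/proc/loadavg}
  # The first field of /proc/loadavg is the 1-minute average; awk handles the decimals
  awk -v limit="$limit" 'NR == 1 { over = ($1 > limit) } END { exit !over }' "$file"
}

# Only start the loop when called with --run, so the function can be tried safely
if [[ "${1:-}" == "--run" ]]; then
  while sleep "$CHECK_INTERVAL"; do
    if load_exceeds "$LOAD_LIMIT"; then
      echo "$(date) REBOOT due to high load" >> ~/watchdog.log
      echo s > /proc/sysrq-trigger  # ask the kernel to sync mounted filesystems
      echo u > /proc/sysrq-trigger  # remount filesystems read-only
      sleep 2                       # give the sync a moment to finish
      echo b > /proc/sysrq-trigger  # immediate reboot, same command as above
    fi
  done
fi
```

Start it with "sudo bash loadwatch.sh --run" (the file name is only an example). Without --run it just defines the function, so you can test load_exceeds against a copy of /proc/loadavg first.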

Thanks.

@ tempgoga

It seems that whenever a soft crash occurs, utilization on most of the cards drops to zero, so while the display/keyboard is unresponsive you can still catch the soft crash from nvidia-smi. The script below checks card utilization every 5 seconds; if any card is at 90% or below it counts down a minute (12 checks), and if mining hasn't resumed by then it reboots the system.
This seems to have worked at least once in my case (I only got one soft crash this weekend) and the system recovered as expected.
The threshold values work for my setup, but others may find different values optimal.

Also, if anyone knows a way to iterate the if && statements, we can get the card count from "cards=$(nvidia-smi -L | wc -l); echo $cards". The way below also works, with manual editing to adjust the watchdog for the number of cards in your individual system.
___________
 
#!/bin/bash
#m1
# A card counts as "mining" only while its utilization (%) is above this value
threshold=90
# Countdown: 12 checks x 5 seconds = 1 minute (must be set before the first check)
i=12
while sleep 5
 do number=$(nvidia-smi |grep % |awk '{print $13}' |tr -d %)
 set -- $number
 echo -e "$@"
# The "if and" statements below need to be manually adjusted to match the number of cards in your system
# If you have 5 cards, leave it as is; for a different number of cards, remove or add && statements as in the commented example below
        if [[ "$1" -gt "$threshold" ]] && \
           [[ "$2" -gt "$threshold" ]] && \
           [[ "$3" -gt "$threshold" ]] && \
           [[ "$4" -gt "$threshold" ]] && \
           [[ "$5" -gt "$threshold" ]]
# && \
#          [[ "$6" -gt "$threshold" ]]
         then i=12
         echo OK
         else echo $((i--))
        fi
        if [ "$i" -le 0 ]
         then echo "$(date) REBOOT due to soft crash" >>~/watchdog.log
         sleep 5
         sudo shutdown -r now
        fi
done
___________
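
On iterating the if && statements: one possible way is to move the comparison into a small function that loops over however many values nvidia-smi returns. This is only a sketch under my own assumptions. all_above is a name I made up, and the check against the card count from "nvidia-smi -L | wc -l" is my addition so that a card disappearing from the list also counts as a failure. The nvidia-smi pipeline is the same one as in the script above.

```shell
#!/bin/bash
# Sketch of the same watchdog without the hand-edited && chain.
# all_above is a hypothetical helper name, not from the original script.
threshold=90

# Succeeds only if at least one value is given and every value is above the limit.
# $1 = limit, remaining arguments = one utilization value per card
all_above() {
  local limit=$1 value
  shift
  (( $# > 0 )) || return 1                    # no readings at all counts as a failure
  for value in "$@"; do
    [[ "$value" =~ ^[0-9]+$ ]] || return 1    # non-numeric output (e.g. N/A) counts as a failure
    [ "$value" -gt "$limit" ] || return 1
  done
  return 0
}

# Only start the loop when called with --run
if [[ "${1:-}" == "--run" ]]; then
  cards=$(nvidia-smi -L | wc -l)              # expected number of cards, read once at startup
  i=12                                        # 12 checks x 5 seconds = 1 minute
  while sleep 5; do
    # Same pipeline as the script above: one utilization number per card
    number=$(nvidia-smi | grep % | awk '{print $13}' | tr -d %)
    set -- $number
    echo "$@"
    if (( $# == cards )) && all_above "$threshold" "$@"; then
      i=12
      echo OK
    else
      echo $((i--))
    fi
    if [ "$i" -le 0 ]; then
      echo "$(date) REBOOT due to soft crash" >> ~/watchdog.log
      sudo shutdown -r now
    fi
  done
fi
```

Run it with --run to start watching. With this version nothing needs editing when you add or remove cards, as long as the watchdog is restarted so the card count is read again.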