Re: Softcrash watchdog

Hey fullzero, I have a question.

Without a doubt my biggest problem right now is that when my miner crashes it takes the whole rig down with it. Everything gets stuck, SSH barely works, the average system load jumps to 14.5, and Xorg takes up 100% of the CPU. It's so bad that none of the standard reboot commands work; they just do nothing. The only thing that actually reboots the rig in this state is "echo b > /proc/sysrq-trigger", so I've set up a script that checks the average system load and, if it's over 2, uses that command to reboot. It works, but I don't like this "solution": yesterday after a reboot nvOC got corrupted somehow, I lost my customized oneBash, and the whole system became read-only (thankfully I had a oneBash backup that was only a few days behind).

So the question is: what can I do to relieve this Xorg error? I run a 7-card rig and never plan on going higher. What can I do with Xorg that would fix this?

Thanks.
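
A possible softer variant of the load-based reboot described above, as a sketch only: it asks the kernel to sync and remount filesystems read-only before the forced reboot, which may reduce the chance of the kind of corruption mentioned. The function names load_exceeds and emergency_reboot and the 2-second pauses are illustrative assumptions, not part of nvOC, and the sysrq writes need root.

```bash
#!/bin/bash
# Sketch only: load_exceeds and emergency_reboot are hypothetical helper names.

# Succeeds when the 1-minute load (first field of a /proc/loadavg line)
# is greater than the given limit.
load_exceeds() {
    local loadavg_line=$1 limit=$2
    awk -v lim="$limit" '{ exit !($1 > lim) }' <<< "$loadavg_line"
}

# Sync disks, remount read-only, then force the reboot (must run as root).
emergency_reboot() {
    echo s > /proc/sysrq-trigger   # sync dirty data to disk
    sleep 2
    echo u > /proc/sysrq-trigger   # remount filesystems read-only
    sleep 2
    echo b > /proc/sysrq-trigger   # immediate reboot
}

# Usage (as root, e.g. from a loop or cron):
#   load_exceeds "$(cat /proc/loadavg)" 2 && emergency_reboot
```

The check is kept separate from the reboot so the threshold can be tried out on a healthy rig without risking a restart.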

@ tempgoga

It seems that whenever a soft crash occurs, utilization on most of the cards drops to zero, so even while the display/keyboard is unresponsive you can catch the soft crash from nvidia-smi. The script below checks card utilization every 5 seconds; if it drops below the threshold (90%) it counts down a minute, and if mining hasn't resumed by then it reboots the system.
This seems to have worked at least once in my case (I only got one soft crash this weekend) and the system recovered as expected.
The threshold values work for my setup, but others may find different values optimal.

Also, if anyone knows a way to iterate the if && statements, we can get the card count from "cards=$(nvidia-smi -L | wc -l); echo $cards". The way below also works, but it needs manual editing to adjust the watchdog for the number of cards in your individual system.
___________

#!/bin/bash
#m1
threshold=90
# Countdown in 5-second ticks (12 ticks = 1 minute); set here so a low
# reading on the very first pass does not trigger an immediate reboot
i=12
while sleep 5
# Field 13 of each nvidia-smi table row containing % is the GPU-Util column
do number=$(nvidia-smi |grep % |awk '{print $13}' |tr -d %)
set -- $number
echo -e "$@"
# The "if and" statements below need to be manually adjusted to match the number of cards in your system
# If you have 5 cards, leave it as is; for a different number of cards, remove or add && statements as needed, as in the commented-out example below
if [[ "$1" -gt "$threshold" ]] && \
[[ "$2" -gt "$threshold" ]] && \
[[ "$3" -gt "$threshold" ]] && \
[[ "$4" -gt "$threshold" ]] && \
[[ "$5" -gt "$threshold" ]]
# && \
# [[ "$6" -gt "$threshold" ]]
then i=12
echo OK
else echo $((i--))
fi
if [ "$i" -le 0 ]
then echo "$(date) REBOOT due to soft crash" >>~/watchdog.log
sleep 5
sudo shutdown -r now
fi
done
___________
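
On the question of iterating the && checks: one way, sketched here under assumptions, is to loop over however many readings come back instead of naming $1 to $5 by hand. The function name all_above is hypothetical; the utilization query in the usage comment is the one quoted later in this thread, which without -i should print one value per card.

```bash
#!/bin/bash
# Sketch only: all_above is a hypothetical helper, not part of nvOC.

# Usage: all_above THRESHOLD VALUE...
# Succeeds only if at least one value is given and every value is
# greater than THRESHOLD, whatever the number of cards.
all_above() {
    local threshold=$1 value
    shift
    [ "$#" -gt 0 ] || return 1   # no readings at all counts as a failure
    for value in "$@"; do
        [ "$value" -gt "$threshold" ] || return 1
    done
    return 0
}

# In the watchdog loop the hand-written && chain could then become:
#   number=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
#   if all_above "$threshold" $number; then i=12; echo OK; else echo $((i--)); fi
```

Treating an empty reading list as a failure means the countdown also starts when nvidia-smi itself hangs or returns nothing.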

Hey, that's funny, I just made a script doing something similar, although it checks the power draw.
Here it is:

Code:

#!/bin/bash

# Miner restart script V001
# By Maxximus007
# for nvOC by fullzero
#
# POWERLIMIT MUST BE SET IN oneBash

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
# Creating a log file to record restarts
# (LOG_FILE must be defined before the first tee, otherwise nothing is logged)
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
touch "$LOG_FILE"
fi
echo "$(date) - Starting miner restart script." | tee -a "${LOG_FILE}"

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT_LOW_POWER=0

while [ $gpu -lt $GPUS ]
do
{ IFS=', ' read POWERDRAW POWERLIMIT; } < <( nvidia-smi -i $gpu --query-gpu=power.draw,power.limit --format=csv,noheader,nounits)

let POWER_DIFF=$( printf "%.0f" $POWERLIMIT )-$( printf "%.0f" $POWERDRAW )

# If the current draw is more than 30 W below the limit, count this GPU:
if [ "$POWER_DIFF" -gt "30" ]
then
let COUNT_LOW_POWER=COUNT_LOW_POWER+1
fi

let gpu=gpu+1
done

if [ $COUNT_LOW_POWER -eq $GPUS ]
then
echo "$(date) - Power draw is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
# If miner runs in screen 'miner' kill the screen
screen -X -S miner kill
# Best to restart oneBash - settings might be adjusted already
kill $(ps -ef | awk '$NF~"oneBash" {print $2}')
else
echo "$(date) - All good! Will check again in 60 seconds"
fi

done

You can combine the above with your code, and find the utilization like this:

Code:

nvidia-smi -i 1 --query-gpu=utilization.gpu --format=csv,noheader,nounits

You have to iterate over the GPU index (-i), starting at 0, to get them all.
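
As a possible shortcut to looping over -i indices, the same query without -i should list every GPU, one value per line, so the output can be piped through a small counter. A sketch, where count_below is a hypothetical helper name:

```bash
#!/bin/bash
# Sketch only: count_below is a hypothetical helper, not part of nvOC.

# Reads one utilization value per line on stdin and prints how many
# of them are below the given minimum.
count_below() {
    local min=$1 util count=0
    while read -r util; do
        [ -n "$util" ] || continue            # skip blank lines
        if [ "$util" -lt "$min" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Usage on a rig:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | count_below 90
```

Comparing the printed count with the number of GPUs gives the same "all cards are low" test, with a single nvidia-smi call per check.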

Okay I've combined the two, perhaps this will work for most of us:

Code:

#!/bin/bash

# Miner restart script V002
# By Maxximus007 && IAmNotAJeep
# for nvOC by fullzero
#

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
# Creating a log file to record restarts
# (LOG_FILE must be defined before the first tee, otherwise nothing is logged)
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
touch "$LOG_FILE"
fi
echo "$(date) - Starting miner restart script." | tee -a "${LOG_FILE}"

MIN_UTIL=90
RESTART=0

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT=0

while [ $gpu -lt $GPUS ]
do
{ IFS=', ' read UTIL; } < <( nvidia-smi -i $gpu --query-gpu=utilization.gpu --format=csv,noheader,nounits)

let UTILIZATION=$( printf "%.0f" $UTIL )

# If the current utilization is lower than the limit, count this GPU:
if [ $UTILIZATION -lt $MIN_UTIL ]
then
let COUNT=COUNT+1
fi

let gpu=gpu+1
done

if [ $COUNT -eq $GPUS ]
then
if [ $RESTART -gt 1 ]
then
echo "$(date) - Utilization is too low: reviving did not work so restarting system" | tee -a ${LOG_FILE}
sudo shutdown now -r
fi
echo "$(date) - Utilization is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
# If miner runs in screen 'miner' kill the screen
screen -X -S miner kill
# Best to restart oneBash - settings might be adjusted already
kill $(ps -ef | awk '$NF~"oneBash" {print $2}')
let RESTART=RESTART+1
else
# Mining is healthy again, so only consecutive failed revives lead to a reboot
RESTART=0
echo "$(date) - All good! Will check again in 60 seconds"
fi

done