Post
Topic
Board Mining (Altcoins)
Re: Softcrash watchdog
by
Maxximus007
on 10/07/2017, 18:09:08 UTC
Hey fullzero, i have a question,

without a doubt my biggest problem right now is that when my miner crashes it takes the whole rig down with it, everything gets stuck, SSH barely works, average system load jumps to 14.5!! and Xorg takes up 100% of the CPU, its so bad that none of the standard reboot commands work, they just do nothing, the only thing that actually reboots the rig in this state is "echo b > /proc/sysrq-trigger" so i've set up a script that checks the average system load and if its over 2 it uses the command to reboot, and it works, but i dont like this "solution", yesterday after a reboot nvOC got corrupted somehow, lost my customized oneBash and the whole system became read-only (thankfully i had a oneBash backup that was only a few days behind).

so the question is, what can i do to relive this Xorg error, i run a 7 card rig and never plan on going for a higher number, what can i do with Xorg that would fix this?

Thanks.

@ tempgoga

It seems that whenever a soft crash occurs most of the cards drop to zero, so while the display/keyboard is unresponsive you can catch the soft crash from nvidia-smi. The script below checks card utilization, if it drops below 90% it counts down a minute and if mining hasn't resumed it reboots the system.
This seems to have worked at least once in my case (only got one soft crash this weekend) and the system recovered as expected.
the threshold values work for my setup but others may find different values optimal

Also if anyone knows a way to iterate the if && statements we can get the card count from "cards=$(nvidia-smi -L | wc -l); echo $cards" but the way below also works with manual editing to adjust the watchdog for the number of cards in you individual system.
___________
 
#!/bin/bash
#m1
threshold=90
while sleep 5
 do number=$(nvidia-smi |grep % |awk '{print $13}' |tr -d %)
 set -- $number
 echo -e "$@"
# The "if and" statements below need to be manually adjusted to match the number of cards in your system
# If you have 5 cards, leave is as, if a different number of cards remove or add the && statements as needed as in the example below
        if [[ "$1" -gt "$threshold" ]] && \
           [[ "$2" -gt "$threshold" ]] && \
           [[ "$3" -gt "$threshold" ]] && \
           [[ "$4" -gt "$threshold" ]] && \
           [[ "$5" -gt "$threshold" ]]
# && \
#          [[ "$6" -gt "$threshold" ]]
         then i=12
         echo OK
         else echo $((i--))
        fi
        if [ $i -le 0 ]
         then echo $(date) REBOOT due to soft crash >>~/watchdog.log
         sleep -5
         sudo shutdown now -r
        fi
done
___________

Hey thats funny I just made a script doing something similar, although it checks the powerdraw.
Here it is:
Code:
#!/bin/bash

# Miner restart script V001
# By Maxximus007
# for nvOC by fullzero
#
# POWERLIMIT MUST BE SET IN oneBash

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
echo "$(date) - Starting miner restart script." | tee -a ${LOG_FILE}
# Creating a log file to record restarts
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT_LOW_POWER=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read POWERDRAW POWERLIMIT; } < <( nvidia-smi -i $gpu --query-gpu=power.draw,power.limit --format=csv,noheader,nounits)

  let POWER_DIFF=$( printf "%.0f" $POWERLIMIT )-$( printf "%.0f" $POWERDRAW )

  # If current draw is 30 Watt lower than the limit count them:
  if [ "$POWER_DIFF" -gt "30" ]
  then
    let COUNT_LOW_POWER=COUNT_LOW_POWER+1
  fi

  let gpu=gpu+1
done

if [ $COUNT_LOW_POWER -eq $GPUS ]
then
  echo "$(date) - Power draw is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  kill ps -ef | awk '$NF~"oneBash" {print $2}'
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

You can combine the above with your code, and find the utilization like this:
Code:
nvidia-smi -i 1 --query-gpu=utilization.gpu --format=csv,noheader,nounits
You have to iterate the GPU, starting at 0 to get them all
Okay I've combined the two, perhaps this will work for most of us:
Code:
#!/bin/bash

# Miner restart script V002
# By Maxximus007 && IAmNotAJeep
# for nvOC by fullzero
#

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
echo "$(date) - Starting miner restart script." | tee -a ${LOG_FILE}
# Creating a log file to record restarts
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi

MIN_UTIL=90
RESTART=0

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read UTIL; } < <( nvidia-smi -i $gpu --query-gpu=utilization.gpu --format=csv,noheader,nounits)

  let UTILIZATION=$( printf "%.0f" $UTIL )

  # If current utilizations lower than the limit count them:
  if [ $UTILIZATION -lt $MIN_UTIL ]
  then
    let COUNT=COUNT+1
  fi

  let gpu=gpu+1
done

if [ $COUNT -eq $GPUS ]
then
  if [ $RESTART -gt 1 ]
  then
    echo "$(date) - Utilization is too low: reviving did not work so restarting system" | tee -a ${LOG_FILE}
    sudo shutdown now -r
  fi
  echo "$(date) - Utilization is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  kill ps -ef | awk '$NF~"oneBash" {print $2}'
  let RESTART=RESTART+1
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done