Topic: Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0018
Board: Mining (Altcoins)
by fullzero on 24/07/2017, 23:48:57 UTC
Hi All,

A GPU dropped out of one of my rigs this weekend, and unfortunately the watchdog did not reboot the rig. On inspection, nvidia-smi does not simply report the reduced number of GPUs; it prints a warning instead. The watchdog was not prepared for that message and just errored out.

Therefore a new code block for Watchdog:
Code:
  # Matches only non-negative integers (a valid utilization reading)
  numtest='^[0-9]+$'

  for UTIL in $UTILIZATIONS
  do
    if ! [[ $UTIL =~ $numtest ]]
    then
      # Not numeric, so nvidia-smi printed a warning instead of a value:
      # we have lost a GPU, so log it and reboot
      echo "$(date) - Lost GPU so restarting system. Found GPUs:" | tee -a "${LOG_FILE}"
      echo "" | tee -a "${LOG_FILE}"
      # Hopefully the PCI bus info will help find the faulty GPU
      nvidia-smi --query-gpu=gpu_bus_id --format=csv | tee -a "${LOG_FILE}"
      echo "reboot in 10 seconds"
      echo ""
      sleep 10
      sudo reboot
    fi

    # If utilization is lower than the threshold, count the GPU as idle
    if [ "$UTIL" -lt "$THRESHOLD" ]
    then
      echo "$(date) - GPU under threshold found"
      echo ""
      let COUNT=COUNT-1
    fi
    let GPU=GPU+1
  done

Just replace the old "for UTIL in $UTILIZATIONS.." block with the new one.
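
For anyone who wants to see what the numeric test does before editing their watchdog, here is a small standalone sketch. It only reuses the numtest pattern from the block above. The check_utils function name and the sample strings are made up for illustration and are not part of nvOC, and the "GPU is lost" text is a stand-in for whatever warning nvidia-smi actually prints on your rig.

```shell
#!/usr/bin/env bash
# Standalone sketch of the numeric check used in the watchdog block.
# check_utils is a hypothetical helper, not part of nvOC.

numtest='^[0-9]+$'

check_utils() {
  local util
  # Unquoted on purpose: split the readings on whitespace,
  # the same way "for UTIL in $UTILIZATIONS" does.
  for util in $1
  do
    if ! [[ $util =~ $numtest ]]
    then
      # Any non-numeric word means a warning was mixed into the output
      echo "lost"
      return 0
    fi
  done
  echo "ok"
}

# Healthy rig: every reading is a plain integer
check_utils "98 97 99"

# Assumed failure case: warning text where a number should be
check_utils "98 GPU is lost"
```

The first call prints ok and the second prints lost. In the real watchdog the "lost" branch is where the logging and reboot happen.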

I don't think I've had a GPU fall off the bus yet.  I'll make the changes for the next 1bash plus files version.