Hi All,
A GPU was lost in a rig this weekend, and unfortunately the watchdog did not reboot the rig. On inspection, nvidia-smi does not simply report the reduced number of GPUs when this happens; it prints a warning instead. The watchdog was not prepared for this message and just errored out.
So here is a new code block for the watchdog:
numtest='^[0-9]+$'
for UTIL in $UTILIZATIONS
do
    if ! [[ $UTIL =~ $numtest ]]
    then
        # Not numeric, so nvidia-smi printed a warning instead of a reading:
        # we've lost a GPU, so log it and reboot.
        echo "$(date) - Lost GPU so restarting system. Found GPUs:" | tee -a "${LOG_FILE}"
        echo "" | tee -a "${LOG_FILE}"
        # Hopefully the PCI bus info will help find the faulty GPU
        nvidia-smi --query-gpu=gpu_bus_id --format=csv | tee -a "${LOG_FILE}"
        echo "reboot in 10 seconds"
        echo ""
        sleep 10
        sudo reboot
        # Leave the loop so the numeric test below never sees the warning text
        break
    fi
    # If utilization is lower than the threshold, count it:
    if [ "$UTIL" -lt "$THRESHOLD" ]
    then
        echo "$(date) - GPU under threshold found"
        echo ""
        let COUNT=COUNT-1
    fi
    let GPU=GPU+1
done
Just replace the old "for UTIL in $UTILIZATIONS ..." block with the new one.
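
If you want to try the numeric check without actually rebooting a rig, here is a small dry-run sketch. The function name first_bad_reading and the SAMPLE string are made up for illustration and are not part of the watchdog script. I am also assuming that UTILIZATIONS is filled from something like "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits", so check how your copy of the script sets it.

```shell
#!/usr/bin/env bash
# Sketch only: first_bad_reading is a hypothetical helper, not part of the watchdog.

# Print the first reading that is not a plain non-negative integer and return 1.
# Return 0 (printing nothing) when every reading is numeric.
first_bad_reading() {
    local numtest='^[0-9]+$'
    local util
    for util in "$@"
    do
        if ! [[ $util =~ $numtest ]]
        then
            echo "$util"
            return 1
        fi
    done
    return 0
}

# Dry run with made-up readings instead of live nvidia-smi output.
# SAMPLE is left unquoted on purpose so it splits into words,
# the same way $UTILIZATIONS does in the watchdog loop.
SAMPLE="98 97 Unable to determine the device handle"
if ! BAD=$(first_bad_reading $SAMPLE)
then
    echo "non-numeric reading: $BAD (the watchdog would reboot here)"
fi
```

Because the warning text gets split into separate words, the very first non-numeric word is enough to trigger the reboot branch; the loop never needs to understand the message itself.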