Post
Topic
Board Mining (Altcoins)
Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0018
by
argonaute
on 30/07/2017, 17:26:53 UTC
Hi Fullzero

I want to share with you a GPU failed that the watchdog is not able to detect

wdog screen:

GPU UTILIZATION:  Unable to determine the device handle for GPU 0000:09:00.0: GPU is lost. Reboot the system to recover this GPU

/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: Unable: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: to: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: determine: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: the: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: device: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: handle: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: for: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: 0000:09:00.0:: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: is: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: lost.: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: Reboot: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: the: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: system: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: to: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: recover: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: this: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
Tue Jul 25 16:57:01 CEST 2017 - All good! Will check again in 60 seconds


GPU UTILIZATION:  Unable to determine the device handle for GPU 0000:09:00.0: GPU is lost. Reboot the system to recover this GPU

/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: Unable: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: to: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: determine: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: the: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: device: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: handle: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: for: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: 0000:09:00.0:: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: is: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: lost.: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: Reboot: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: the: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: system: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: to: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: recover: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: this: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
Tue Jul 25 16:58:01 CEST 2017 - All good! Will check again in 60 seconds


the miner show/detect only 6 GPU over 7

nvidia-smi doesn't work
$ nvidia-smi
Unable to determine the device handle for GPU 0000:09:00.0: GPU is lost.  Reboot the system to recover this GPU

temp screen:
Provided power limit 75.00 W is not a valid power limit which should be between 115.00 W and 291.00 W for GPU 00000000:0A:00.0
Terminating early due to previous errors.
Tue Jul 25 17:01:07 CEST 2017 - All good, will check again soon

GPU 0, Target temp: 61, Current: 60, Diff: 1, Fan: 75, Power: 123.46

GPU 1, Target temp: 61, Current: 60, Diff: 1, Fan: 63, Power: 124.62

GPU 2, Target temp: 61, Current: 59, Diff: 2, Fan: 77, Power: 119.23

GPU 3, Target temp: 61, Current: 60, Diff: 1, Fan: 68, Power: 120.72

GPU 4, Target temp: 61, Current: 59, Diff: 2, Fan: 57, Power: 124.26

GPU 5, Target temp: 61, Current: Unable, Diff: 61, Fan: to, Power: determine

/home/m1/Maxximus007_AUTO_TEMPERATURE_CONTROL: line 125: [: Unable: integer expression expected
/home/m1/Maxximus007_AUTO_TEMPERATURE_CONTROL: line 158: [: the: integer expression expected
/home/m1/Maxximus007_AUTO_TEMPERATURE_CONTROL: line 171: [: to: integer expression expected
GPU 6, Target temp: 61, Current: 55, Diff: 6, Fan: 50, Power: 126.76

Tue Jul 25 17:01:37 CEST 2017 - Restoring Power limit for gpu:6. Old limit: 125 New limit: 75 Fan speed: 50

Provided power limit 75.00 W is not a valid power limit which should be between 115.00 W and 291.00 W for GPU 00000000:0A:00.0
Terminating early due to previous errors.
Tue Jul 25 17:01:37 CEST 2017 - All good, will check again soon


I believe this is the exact problem that Maxximus007 recently made a new code block to resolve.

Fullzero,

I'm getting this error as well, and looks like watchdog is not rebooting the system.
I believe I have the latest bash files.
are Maxximus007's changes to resolve this issue in the current bash files?
Thank you.



GPU UTILIZATION:  Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: Unable: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: to: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: determine: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: the: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: device: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: handle: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: for: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: 0000:01:00.0:: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: is: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: lost.: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: Reboot: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: the: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: system: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: to: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: recover: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: this: integer expression expected
/home/m1/IAmNotAJeep_and_Maxximus007_WATCHDOG: line 44: [: GPU: integer expression expected
Sat Jul 29 21:07:09 PDT 2017 - All good! Will check again in 60 seconds



I'm experiencing same issues sometimes, that's how I fixed, I made a python script to check for GPUs failure and reboots if it's not mining with all the GPUs,

nano deepredchecker.py
Paste the following:

#!/bin/python

import os

card_status = os.popen("nvidia-smi").read()

if "reboot" in card_status:
    os.popen("sudo shutdown -r now")
else:
    exit()


Type: chmod a+x deepredchecker.py
Type: crontab -e
Paste: */5 * * * * python /home/m1/deepredchecker.py > /dev/null
 At the bottom and save crontab.

Now, every 5 minutes (you can adjust it) it will check if all of your cards are up and running, in case of problems it will reboot your rig, so far this is making my life easier. Hope it helps you

I like that !! Smiley

thanks for the share