UNATTENDED RIG: Automatic overclocking, fan control, mail sending, hardware safeguards

DEAR BITCOINERS:

I am very glad to share the result of several months of work on my mining rig, which is equipped with 3 ATI Radeon HD 5870 cards.
It is an unattended, automatically controlled mining rig that has given me a lot of benefits (hardware protection, automatic settings based on temperature, fan control, mail sending, etc.). I hope it will be useful for others.
Of course I know that mining with GPUs is not really profitable now, but I am also waiting for an ASIC with 30 GH/s!!

My idea is to adapt all these scripts for ASIC mining as soon as possible and share them with you, if I see that you find this post valuable.

Note that I provide all this work absolutely free! However, I would be really grateful to receive donations to my BTC address. That would encourage me to publish other ideas and code with the BTC community.
(You can find my address at the end of the post).

First of all, let me indicate the benefits obtained from using these scripts:

All scripts are automatically started when switching on the system
All scripts can be manually started by a single command
All scripts can be manually stopped by a single command
There is a control script in charge of the following actions:
- Automatic overclocking of each GPU based on the GPU temperature, target temperature, etc.
- Automatic downclocking of each GPU based on the GPU temperature, target temperature, etc.
- Automatic restart of the mining processes if they stop abnormally (5 retries)
- Sending of mails if the retry limit is reached, in order to inform the user of the problem
- Automatic shutdown of the mining processes if the temperature is very high. This prevents a worse hardware failure in the GPUs
- Automatic restart of the mining processes once the temperature is safe again
- Automatic fan control for each GPU from 30% to 90% of speed based on the temperature
There is a monitor script continuously reading all system parameters for user supervision:
- Current GPU clocks
- Current fan speeds
- Current temperature of each GPU
- Temperatures of CPU, motherboard, etc.
- Current hashrate of each GPU
- Last 10 actions of the control script
The monitor script can be used through an ssh or telnet connection. In my case, I supervise my rig through SSH from my mobile by using the 'ConnectBot' application (see the Android or iPhone markets)

Before using the scripts, have a look at the base system. Your system may differ from mine, but I provide all the code so that you can make any changes needed to adapt it. The base system for running these scripts is the following:

Linux OS (I use Debian). All the scripts are written for the Linux shell (bash).
ATI graphics cards (I have 3 ATI HD 5870). Note that I use the 'aticonfig' tool for almost everything
ATI drivers already installed, AMD SDK, etc. These guides were very useful for me:
- http://eligius.st/wiki/index.php/Ubuntu_Miner_Guide
- http://ewoah.com/technology/a-very-good-guide-to-building-a-bitcoin-mining-rig-cluster-guide/
lm-sensors package installed (use apt-get or aptitude). It is used for retrieving the CPU and motherboard temperatures
screen package installed (apt-get install screen). This tool lets us run all the scripts from the GUI while sharing the console output with other sessions, so we can monitor the system over telnet or SSH (thanks to screen!!).
I use poclbm.py for GPU mining. Other mining software will work too, but you will need to change the related pieces of code in the scripts
My system is fully unattended, so I have also installed VNC software for remote connection to the desktop (my wallet is there)
I also have automatic login enabled, so that when I switch on the system, the user is logged in, the scripts are started automatically and the wallet application is also started and gets updated with the BTC network transactions. (In my case, the rig is installed in a cooled place very far from my house)
The control script uses the "reboot" and "halt" commands without a sudo password. To set this up, read this link: http://sleekmason.wordpress.com/fluxbox/using-etcsudoers-to-allow-shutdownrestart-without-password/
I use the "mail" command for sending mails, so it must be available to the script (look for it on the internet).
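
Before going further, it may help to verify that the tools listed above are actually installed. A minimal sketch; missing_tools is a hypothetical helper name and not part of the original scripts:

```shell
#!/bin/bash
# missing_tools CMD...
# Hypothetical helper: prints the names of the given commands that cannot be
# found in PATH, separated by spaces (prints an empty line if all are present).
missing_tools() {
    local tool missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Drop the leading space that the loop leaves behind
    echo "${missing# }"
}
```

For example, `missing_tools aticonfig sensors screen mail` should print nothing on a system that is ready for the scripts below.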
OK, after this brief introduction, let's go to the scripts. I hope you enjoy them!!

I have the following little scripts for starting the individual mining process on each of my GPUs:

gpu0.sh
Code:
#!/bin/bash
export DISPLAY=:0.0
cd /home/your_path/scripts
DISPLAY=:0.0 aticonfig --pplib-cmd "set fanspeed 0 75"
DISPLAY=:0 ./poclbm.py -d0 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu0.log

As you can see, I use a pool for mining (deepbit). You should change this line to your specific parameters for poclbm.py or other mining software.
Note that I pipe the output to the 'tee' command in order to store the output of the process in a log file called "mining_gpu0.log". This will be useful later for monitoring, as we will retrieve the hashrate of each GPU from these log files.
Note the parameter -r 5: it makes poclbm.py update its output (hashrate) once every 5 seconds. The reason for this will be discussed later...
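
To give an idea of how the hashrate can be read back from such a log file, here is a minimal sketch. It assumes that each status update in the log contains a single "<number> MH/s" figure and that updates may be separated by carriage returns; the exact poclbm.py output format is an assumption here, and last_hashrate is a hypothetical helper name, not part of the original scripts:

```shell
#!/bin/bash
# last_hashrate LOGFILE
# Hypothetical helper: prints the most recent "<number> MH/s" value found in
# LOGFILE, or nothing if the log does not contain such a value yet.
# ASSUMPTION: the miner reports its rate in the form "310.25 MH/s".
last_hashrate() {
    # Turn carriage returns into newlines, keep only the "<number> MH/s"
    # matches, take the last one and print just the number.
    tr '\r' '\n' < "$1" | grep -o '[0-9][0-9.]* MH/s' | tail -n 1 | cut -d ' ' -f 1
}
```

It would be called as `last_hashrate mining_gpu0.log`. If your miner prints a different unit or several figures per line, the grep pattern has to be adapted.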

You should create additional gpux.sh files, one for each GPU. In my case I have 3 GPUs, so I have these additional scripts:

gpu1.sh
Code:
#!/bin/bash
export DISPLAY=:0.1
cd /home/your_path/scripts
DISPLAY=:0.1 aticonfig --pplib-cmd "set fanspeed 0 75"
DISPLAY=:0 ./poclbm.py -d1 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu1.log

gpu2.sh
Code:
#!/bin/bash
export DISPLAY=:0.2
cd /home/your_path/scripts
DISPLAY=:0.2 aticonfig --pplib-cmd "set fanspeed 0 75"
DISPLAY=:0 ./poclbm.py -d2 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu2.log
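
Since the per-GPU scripts differ only in the GPU number, they could also be written once as a parametrized function. A minimal sketch, assuming that the OpenCL device id passed to -d matches the X screen number of each card (check this on your system); gpu_display, gpu_logfile and run_gpu are hypothetical names:

```shell
#!/bin/bash
# ASSUMPTION: GPU number N is X screen :0.N and OpenCL device N.
gpu_display() { echo ":0.$1"; }
gpu_logfile() { echo "mining_gpu$1.log"; }

# run_gpu N: performs the same steps as the gpuN.sh scripts, for GPU number N
run_gpu() {
    local gpu="$1"
    cd /home/your_path/scripts || return 1
    # Set the fan of this card to 75%
    DISPLAY=$(gpu_display "$gpu") aticonfig --pplib-cmd "set fanspeed 0 75"
    # Start the miner on this device and keep a copy of its output in the log
    DISPLAY=:0 ./poclbm.py -d"$gpu" -v -r 5 -w128 \
        http://your_user@mail.com:your_password@deepbit.net:8332 | tee "$(gpu_logfile "$gpu")"
}
```

A single script that ends with `run_gpu "$1"` could then be started as `./gpu.sh 0`, `./gpu.sh 1` and so on. I keep the separate files in the rest of the post.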

As I said, I use the "screen" Linux tool for sharing the output of the running commands. For example, we can run the script gpu0.sh in a shared console with the following command:
Code:
/usr/bin/screen -admS gpu0 ./gpu0.sh

This will create a shared console called "gpu0" that can be accessed through telnet or ssh with the following command:
Code:
screen -x gpu0
Therefore, we can watch the output of gpu0.sh. Note that to leave a shared console you have to press 'CTRL+A' and then 'D' (to detach from it). If you exit any other way (for example with CTRL+C), you will stop the execution of gpu0.sh in that console.
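
Each running screen session leaves a socket file named <pid>.<session name> in the user's screen directory, which gives scripts a simple way to test whether a session is still alive. A minimal sketch, assuming the sockets live in /var/run/screen/S-your_user (the location varies between distributions); session_count is a hypothetical helper name:

```shell
#!/bin/bash
# session_count DIR NAME
# Hypothetical helper: prints how many entries in DIR contain NAME in their
# file name (prints 0 if there are none or if DIR does not exist).
session_count() {
    ls "$1" 2>/dev/null | grep -c "$2"
}
```

For example, `session_count /var/run/screen/S-your_user gpu0` prints 1 while the gpu0 console is running and 0 after it has finished.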

Now, we can define the script start.sh that will launch the mining scripts:

start.sh
Code:
#!/bin/bash

cd /home/your_path/scripts
echo Starting mining scripts...
/usr/bin/screen -admS gpu0 ./gpu0.sh
/usr/bin/screen -admS gpu1 ./gpu1.sh
/usr/bin/screen -admS gpu2 ./gpu2.sh
...

Now you can add this script (/home/your_path/start.sh) to your startup programs group. You can easily do it from the system menu.

The control script has a lot of features. It is full of comments, so I expect you will find enough information there.
control.sh
Code:
#!/bin/bash
#---times constants
control_time=5 # Time cycle between control loops (5 seconds)
overclock_delay=180 # waiting time between overclocking commands, in multiples of control_time (180*5 s = 15 min)
downclock_delay=60 # waiting time between downclocking commands (60*5 = 5 min)
downclock_urgent=24 # waiting time between urgent downclocking commands (24*5 = 2 min)
timeCounter=0 # time counter

#---GPUs temperatures
target_temp=75 # Target temperature for the GPUs. Automatic overclocking/downclocking aims to reach this temperature as a maximum.
hightemp_alarm=80 # Alarm temperature: if exceeded, an urgent downclocking is performed
maxtemp_stop=83 # Maximum GPU temperature: the mining processes are stopped for safety reasons.
temp_recover=65 # Recovery temperature: after a mining stop due to high temperature, mining is restarted once this safe temperature is reached.
control_gap=3 # Margin below target_temp that must be exceeded before overclocking. (If the current temperature is very near the target, no overclocking is performed, so the temperature stays near but below the limit)

#---CPU temperatures
tempCPU_halt=70 # If the CPU or motherboard reaches this temperature, a HALT is performed to turn off the rig.

#---Clock limits
corefreq_min=800 # Minimum freq. to be set by the control algorithm
corefreq_max0=945 # Maximum freq. to be set by the control algorithm in GPU0 (In my case I checked that above 975Mhz this GPU hangs the X session)
corefreq_max1=955 # Maximum freq. to be set by the control algorithm in GPU1 (In my case I checked that above 995Mhz the mining process got zombie)
corefreq_max2=1025 # Maximum freq. to be set by the control algorithm in GPU2 (In my case I checked that above 1055Mhz the mining process got zombie)
mem_freq=300 # Fixed value for the memory clock (normally it is 1200MHz on these GPUs, but using 300MHz reduces the temperature without affecting the mining performance)

#--Mail sending
subject="Important advice from your RIG"
mail1="your_mail@mail.com"
mail2="other_mail@mail.com"
mail3="other_mail@mail.com"

#--control constants
retryMiningAfterFailure=1 # Mining scripts are automatically started after a failure
debug=0 # enable/disable debugging messages
numRetries=5 # Limit of retries for restarting the mining processes in the GPUs
reboot=1 # If zombie mining processes are detected, the control script can perform an automatic system reboot. This will recover mining in all GPUs.

#--FAN constants
FANGPU0=75
FANGPU1=75
FANGPU2=75

#--Internal variables
GPU0=0
GPU1=1
GPU2=2
mining_stopped=0 # Mining process has been stopped by the control algorithm
init_coreCLK0=900 # initial overclocking value for GPU0
init_coreCLK1=900 # initial overclocking value for GPU1
init_coreCLK2=900 # initial overclocking value for GPU2
counterLastCLK0=0 # Stores the time of the last overclocking/downclocking performed on GPU0
counterLastCLK1=0 # Stores the time of the last overclocking/downclocking performed on GPU1
counterLastCLK2=0 # Stores the time of the last overclocking/downclocking performed on GPU2
simulation=0 # If set to 1, overclocking commands are not sent; they are only logged, for debugging.
retriesGPU0=0
retriesGPU1=0
retriesGPU2=0
alertFailProcessGPU0=0
alertFailProcessGPU1=0
alertFailProcessGPU2=0

# ---------------------------------------------------------------------
# Function Debug: It outputs messages to the console only in debug mode
# Parameters: Text Message to be displayed
# -----------------------------------------------------------------
function debug(){
if (test $debug -eq 1)
then
echo -e "[Time: $timeCounter | $(date | awk '{print $3 $2 $4}')] $@"
fi
}

# ---------------------------------------------------------------------
# Function output: This function output messages to the console
# $1: Message
# $2: if $2=1 the message is sent by email
# ---------------------------------------------------------------------
function output(){
mensaje="[Time: $timeCounter | $(date | awk '{print $3 $2 $4}')] $1"
echo -e $mensaje
if test $2 -eq 1
then
echo -e "$mensaje" | mail -s "$subject" $mail1
echo -e "$mensaje" | mail -s "$subject" $mail2
echo -e "$mensaje" | mail -s "$subject" $mail3
fi
}

# ----------------------------------------------------------------
# Function FANCommand: This function sets the FAN speed of a GPU
# Params: $1:num_gpu: $GPU0,$GPU1,$GPU2
# $2:FAN_SPEED: Value from 0 to 100 %, e.g.: 100
# Use: FANCommand $GPU0 100
# ----------------------------------------------------------------
function FANCommand(){
case $1 in
0)
DISPLAY=:0.0 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
1)
DISPLAY=:0.1 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
2)
DISPLAY=:0.2 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
esac
output " New setting: FAN GPU$1 to $2 %" 0
}

# ----------------------------------------------------------------
# Function overclock: This function sets the clk of a GPU
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
# $2:clkfreq : core clk value to be set, e.g.: 850
# $3:memfreq : mem clk to be set, e.g.: 300
# Use: overclock 0 850 1200
# ----------------------------------------------------------------
function overclock(){
if test $simulation -eq 0
then
case $1 in
0)
aticonfig --adapter=0 --odsc=$2,$3 >/dev/null;;
1)
aticonfig --adapter=1 --odsc=$2,$3 >/dev/null;;
2)
aticonfig --adapter=2 --odsc=$2,$3 >/dev/null;;
esac
fi
output " New setting: Overclock GPU$1 to $2 / $3 Mhz" 0
}

# ----------------------------------------------------------------
# Function controlFAN: Calculates the FAN speed depending on the temperature and current FAN speed (num_gpu, currentTemp, currentFAN)
# It performs the FAN speed control of a GPU
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
# $2:currentTemp: Current value reported by the GPU (no decimals), e.g.: 56
# $3:currentFAN: Current FAN speed for this GPU, e.g.: 75 %
#
# Use: controlFAN $GPU0 56 75
# ----------------------------------------------------------------
function controlFAN(){

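# Default: keep the current speed. Inside a hysteresis band no branch below may assign a value,
# and without this line controlFAN would still hold the value computed for another GPU.
controlFAN=$3
#hysteresis of 2ºC around temperature threshold 55º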
if test $2 -lt 54
then
controlFAN=30
elif test $2 -lt 56
then
if test $3 -ne 45
then
controlFAN=30
fi
#hysteresis of 2ºC around temperature threshold 60º
elif test $2 -lt 59
then
controlFAN=45
elif test $2 -lt 61
then
if test $3 -ne 60
then
controlFAN=45
fi
#hysteresis of 2ºC around temperature threshold 65º
elif test $2 -lt 64
then
controlFAN=60
elif test $2 -lt 66
then
if test $3 -ne 75
then
controlFAN=60
fi
#hysteresis of 2ºC around temperature threshold 70º
elif test $2 -lt 69
then
controlFAN=75
elif test $2 -lt 71
then
if test $3 -ne 90
then
controlFAN=75
fi
else
controlFAN=90
fi

# It sends the FAN speed command only if the new setting is different to the current one.
case $1 in
0)
debug "FAN control GPU0: Current FAN=$FANGPU0, controlFAN:$controlFAN"
if test $controlFAN -ne $FANGPU0
then
FANCommand 0 $controlFAN
FANGPU0=$controlFAN
fi;;
1)
debug "FAN control GPU1: Current FAN=$FANGPU1, controlFAN:$controlFAN"
if test $controlFAN -ne $FANGPU1
then
FANCommand 1 $controlFAN
FANGPU1=$controlFAN
fi;;
2)
debug "FAN control GPU2: Current FAN=$FANGPU2, controlFAN:$controlFAN"
if test $controlFAN -ne $FANGPU2
then
FANCommand 2 $controlFAN
FANGPU2=$controlFAN
fi;;
esac
}

# ----------------------------------------------------------------
# Function controlTemp: Calculates the GPU clock correction to be performed depending on the GPU temperature (num_gpu, currentTemp, targetTemp)
# It performs the overclocking/downclocking of the GPU
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
# $2:currentTemp: Current temperature reported by the GPU (no decimals), e.g.: 56
# $3:TargetTemp: Target temperature desired in this GPU as maximum, e.g.: 78
# Outputs:
# counterLastCLK0, counterLastCLK1 and counterLastCLK2: Time information of the last CLK correction on each GPU.
#
# Use: controlTemp $GPU0 56 78
# ----------------------------------------------------------------
function controlTemp(){
offsetCLK=$(expr $3 - $2)

# temperature gap defined in 'control_gap' is guaranteed to avoid causing stress to the GPU when the current temperature is very near to the target temperature
if (test $offsetCLK -gt 0) && (test $offsetCLK -lt $control_gap)
then
debug "The correction ($offsetCLK) does not exceed the control GAP ($control_gap). CLK is maintained."
return 1
fi

# Demanded frequencies are limited to the specific clk ranges of each GPU
case $1 in
0)
demandaCLK=$(expr $coreCLK0 + $offsetCLK)
if test $demandaCLK -gt $corefreq_max0
then
demandaCLK=$corefreq_max0
fi;;
1)
demandaCLK=$(expr $coreCLK1 + $offsetCLK)
if test $demandaCLK -gt $corefreq_max1
then
demandaCLK=$corefreq_max1
fi;;

2)
demandaCLK=$(expr $coreCLK2 + $offsetCLK)
if test $demandaCLK -gt $corefreq_max2
then
demandaCLK=$corefreq_max2
fi;;
esac

if test $demandaCLK -lt $corefreq_min
then
demandaCLK=$corefreq_min
fi

debug "*** GPU$1 --> CurrentTemp:$2 - Target:$3 - Control:$demandaCLK ($offsetCLK)"

# Sending of overclock command, only if there is a change.
case $1 in
0)
if test $demandaCLK -ne $coreCLK0
then
overclock 0 $demandaCLK $mem_freq
counterLastCLK0=$timeCounter
debug "Time counterLastCLK0: $counterLastCLK0"
else
debug "GPU0: Limit is already reached: $corefreq_max0."
counterLastCLK0=$timeCounter
fi;;
1)
if test $demandaCLK -ne $coreCLK1
then
overclock 1 $demandaCLK $mem_freq
counterLastCLK1=$timeCounter
debug "Time counterLastCLK1: $counterLastCLK1"
else
debug "GPU1: Limit is already reached: $corefreq_max1."
counterLastCLK1=$timeCounter
fi;;
2)
if test $demandaCLK -ne $coreCLK2
then
overclock 2 $demandaCLK $mem_freq
counterLastCLK2=$timeCounter
debug "Time counterLastCLK2: $counterLastCLK2"
else
debug "GPU2: Limit is already reached: $corefreq_max2."
counterLastCLK2=$timeCounter
fi;;
esac
}

# ----------------------------------------------------------------
# Function checkOverclockTimeGuard: This function ensures a certain period of time between consecutive overclock commands.
# - Guard Time between overclocks: $overclock_delay*$control_time (180*5 = 15 minutes)
# - Guard Time between downclocks: $downclock_delay*$control_time (60*5 = 5 minutes)
# - Guard time between urgent downclocks: $downclock_urgent*$control_time (24*5 = 2 minutes)
# Params: $1:num_gpu : 0,1,2
# $2:timeCounter: Current value of the time counter
# $3:up_down: 0:overclock, 1:downclock, 2:urgent downclock
# Outputs:
# $return_correction: 0:Not to perform CLK correction. 1:CLK correction can be performed now.
# ----------------------------------------------------------------
function checkOverclockTimeGuard (){
return_correction=0
case $1 in
0)
if test $3 -eq 0
then ##overclock
due_time=$(expr $counterLastCLK0 + $overclock_delay)
if test $2 -ge $due_time
then
return_correction=1
fi
elif test $3 -eq 1
then ##normal downclocking
due_time=$(expr $counterLastCLK0 + $downclock_delay)
if test $2 -ge $due_time
then
return_correction=1
fi

elif test $3 -eq 2
then ##urgent downclocking
due_time=$(expr $counterLastCLK0 + $downclock_urgent)
if test $2 -ge $due_time
then
return_correction=1
fi
fi;;
1)
if test $3 -eq 0
then ##overclocking
due_time=$(expr $counterLastCLK1 + $overclock_delay)
if test $2 -ge $due_time
then
return_correction=1
fi
elif test $3 -eq 1
then ##normal downclocking
due_time=$(expr $counterLastCLK1 + $downclock_delay)
if test $2 -ge $due_time
then
return_correction=1
fi
elif test $3 -eq 2
then ##urgent downclocking
due_time=$(expr $counterLastCLK1 + $downclock_urgent)
if test $2 -ge $due_time
then
return_correction=1
fi
fi;;
2)
if test $3 -eq 0
then ##overclocking
due_time=$(expr $counterLastCLK2 + $overclock_delay)
if test $2 -ge $due_time
then
return_correction=1
fi
elif test $3 -eq 1
then ##normal downclocking
due_time=$(expr $counterLastCLK2 + $downclock_delay)
if test $2 -ge $due_time
then
return_correction=1
fi
elif test $3 -eq 2
then ##urgent downclocking
due_time=$(expr $counterLastCLK2 + $downclock_urgent)
if test $2 -ge $due_time
then
return_correction=1
fi
fi;;
esac
debug "-------GPU$1: due_time: $due_time, correction: $return_correction"
}

# ---------------------------------------------------------------------------------------------------------
# MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN
# ---------------------------------------------------------------------------------------------------------
# Enable overclocking on each adapter
aticonfig --adapter=0 --od-enable
aticonfig --adapter=1 --od-enable
aticonfig --adapter=2 --od-enable
overclock 0 $init_coreCLK0 $mem_freq
overclock 1 $init_coreCLK1 $mem_freq
overclock 2 $init_coreCLK2 $mem_freq
FANCommand 0 30
FANCommand 1 30
FANCommand 2 30
output "Automatic control algorithm has been started" 1 #This is sent by email.
output "Automatic FAN speed control starts from 30%" 0
while true; do

#Fetching of current temperatures
tempGPU0=$(aticonfig --adapter=0 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
tempGPU1=$(aticonfig --adapter=1 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
tempGPU2=$(aticonfig --adapter=2 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
tempCPU=$(sensors |grep CPU |grep Temperature | awk '{print $3}'|cut -c2-3)
tempMB=$(sensors |grep NB |grep Temperature | awk '{print $3}'|cut -c2-3)

#Fetching of current GPU CLK frequencies
coreCLK0=$(aticonfig --adapter=0 --odgc |grep Clocks | awk '{print $4}')
memCLK0=$(aticonfig --adapter=0 --odgc |grep Clocks | awk '{print $5}')
coreCLK1=$(aticonfig --adapter=1 --odgc |grep Clocks | awk '{print $4}')
memCLK1=$(aticonfig --adapter=1 --odgc |grep Clocks | awk '{print $5}')
coreCLK2=$(aticonfig --adapter=2 --odgc |grep Clocks | awk '{print $4}')
memCLK2=$(aticonfig --adapter=2 --odgc |grep Clocks | awk '{print $5}')

#It detects if there are mining processes already running
miningIsActive0=$(ls /var/run/screen/S-your_user/ |grep gpu0 |wc -l) # we look for the 'screen' session lock file
miningIsActive1=$(ls /var/run/screen/S-your_user/ |grep gpu1 |wc -l) # we look for the 'screen' session lock file
miningIsActive2=$(ls /var/run/screen/S-your_user/ |grep gpu2 |wc -l) # we look for the 'screen' session lock file

loadGPU0=$(aticonfig --adapter=0 --odgc |grep GPU |awk '{print $4}' | cut -c1-2) # The load of each GPU is also read
loadGPU1=$(aticonfig --adapter=1 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)
loadGPU2=$(aticonfig --adapter=2 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)
debug " --> Temps: $tempGPU0, $tempGPU1, $tempGPU2"
debug " --> Clks: $coreCLK0, $coreCLK1, $coreCLK2"

# --------------------------------Temperature of CPU and Motherboard ---------------------------------------
if (test $tempCPU -gt $tempCPU_halt) || (test $tempMB -gt $tempCPU_halt)
then
output "ERR: Temperature of CPU/MB is too high! $tempCPU / $tempMB.... \nSWITCHING OFF THE SYSTEM. \n Check the CPU FAN condition and switch the RIG on manually." 1
/usr/bin/halt
fi

# Checking of zombie mining processes.
num_defunc=$(ps -Al |grep py|grep defunc| wc -l)
if test $num_defunc -gt 0
then
if test $reboot -eq 1
then
output "### ERR: There are one or more zombie mining processes:
\nMaybe a mining process is hung and blocked.
\nIt is necessary to restart the system to recover the mining (sudo reboot).
\n$(ps -Al |grep py|grep defunc| wc -l)
\n --> PERFORMING AN AUTOMATIC REBOOT OF THE SYSTEM...." 1 # Sent by email
/usr/bin/reboot
else
output "### ERR: There are one or more zombie mining processes:
\nMaybe a mining process is hung and blocked.
\nIt is necessary to restart the system to recover the mining (sudo reboot).
\n$(ps -Al |grep py|grep defunc| wc -l)" 1 # Sent by email
fi
fi

# --------------------------------Init checkings -------------------------------
if (test $miningIsActive0 -eq 0)
then
if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1)
then
if test $retriesGPU0 -lt $numRetries
then
output "### ERR the mining process in GPU0 is not started ......" 0
output "*** Starting mining on GPU0..." 0
overclock 0 $init_coreCLK0 $mem_freq
coreCLK0=$init_coreCLK0
/usr/bin/screen -admS gpu0 ./gpu0.sh
retriesGPU0=$(expr $retriesGPU0 + 1)
elif test $alertFailProcessGPU0 -eq 0
then
output "### ERR Mining retries limit has been reached in the process GPU0.sh
\n*** Check that the process is not zombie and start it manually " 1 # Sent by email
alertFailProcessGPU0=1
fi
fi
fi

if (test $miningIsActive1 -eq 0)
then
if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1)
then
if test $retriesGPU1 -lt $numRetries
then
output "### ERR the mining process in GPU1 is not started ......" 0
output "*** Starting mining on GPU1..." 0
overclock 1 $init_coreCLK1 $mem_freq
coreCLK1=$init_coreCLK1
/usr/bin/screen -admS gpu1 ./gpu1.sh
retriesGPU1=$(expr $retriesGPU1 + 1)
elif test $alertFailProcessGPU1 -eq 0
then
output "### ERR Mining retries limit has been reached in the process GPU1.sh
\n*** Check that the process is not zombie and start it manually " 1 # Sent by email
alertFailProcessGPU1=1
fi
fi
fi

if (test $miningIsActive2 -eq 0)
then
if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1)
then
if test $retriesGPU2 -lt $numRetries
then
output "### ERR the mining process in GPU2 is not started ......" 0
output "*** Starting mining on GPU2..." 0
overclock 2 $init_coreCLK2 $mem_freq
coreCLK2=$init_coreCLK2
/usr/bin/screen -admS gpu2 ./gpu2.sh
retriesGPU2=$(expr $retriesGPU2 + 1)
elif test $alertFailProcessGPU2 -eq 0
then
output "### ERR Mining retries limit has been reached in the process GPU2.sh
\n*** Check that the process is not zombie and start it manually " 1 # Sent by email
alertFailProcessGPU2=1
fi
fi
fi

# --------------------------------Automatic switching off control -----------------------------
if (test $tempGPU0 -gt $maxtemp_stop) || (test $tempGPU1 -gt $maxtemp_stop) || (test $tempGPU2 -gt $maxtemp_stop)
then
if (test $mining_stopped -eq 0)
then

if test $retryMiningAfterFailure -eq 1
then
output "ERR: Extreme temperature in GPUs ($tempGPU0, $tempGPU1, $tempGPU2 ºC) - Switching off the mining...
\n After some minutes it will be started again ..." 1 # Sent by email
else
output "ERR: Extreme Temperature in GPUs ($tempGPU0, $tempGPU1, $tempGPU2 ºC) - Switching off the mining...
\n Start the mining process manually ..." 1 # Sent by email
fi
./stop.sh
mining_stopped=1
fi

else # As soon as the GPUs temperatures are below temp_recover, mining is started again.
if (test $mining_stopped -eq 1) && (test $retryMiningAfterFailure -eq 1)
then
# Restart only when ALL GPUs have cooled down below temp_recover
if (test $tempGPU0 -lt $temp_recover) && (test $tempGPU1 -lt $temp_recover) && (test $tempGPU2 -lt $temp_recover)
then
# Set safe GPU clock values
output "The temperature of the GPUs has recovered to $tempGPU0 / $tempGPU1 / $tempGPU2" 0
output "GPU clocks are reset to their initial values ($init_coreCLK0 / $init_coreCLK1 / $init_coreCLK2 MHz core, $mem_freq MHz memory)." 0
overclock 0 $init_coreCLK0 $mem_freq
overclock 1 $init_coreCLK1 $mem_freq
overclock 2 $init_coreCLK2 $mem_freq
coreCLK0=$init_coreCLK0
coreCLK1=$init_coreCLK1
coreCLK2=$init_coreCLK2
retriesGPU0=0
retriesGPU1=0
retriesGPU2=0
output " --> Starting mining." 0
./start.sh
mining_stopped=0
fi
fi
fi

#------------------------------ Overclocking control on GPU0 ----------------------------------

if (test $mining_stopped -eq 0) && (test $miningIsActive0 -eq 1)
then
#The temperature is within the control margins, below target temp
if (test $tempGPU0 -lt $target_temp)
then
checkOverclockTimeGuard $GPU0 $timeCounter 0
if test $return_correction -eq 1
then
controlTemp $GPU0 $tempGPU0 $target_temp
fi

#The temperature is outside the control margins, below the alarm temp
elif (test $tempGPU0 -lt $hightemp_alarm)
then
checkOverclockTimeGuard $GPU0 $timeCounter 1 #downclocking
if test $return_correction -eq 1
then
controlTemp $GPU0 $tempGPU0 $target_temp
fi

# Overtemp alarm
elif (test $tempGPU0 -lt $maxtemp_stop)
then
output "Alarm! GPU0 very hot, temperature: $tempGPU0. Performing urgent downclocking ...." 1 #Sent by email
checkOverclockTimeGuard $GPU0 $timeCounter 2 #urgent downclocking
if test $return_correction -eq 1
then
controlTemp $GPU0 $tempGPU0 $target_temp
fi
fi
# FAN Speed control
controlFAN $GPU0 $tempGPU0 $FANGPU0
fi

#------------------------------ Overclocking control on GPU1 ----------------------------------

if (test $mining_stopped -eq 0) && (test $miningIsActive1 -eq 1)
then
#The temperature is within the control margins, below target temp
if (test $tempGPU1 -lt $target_temp) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0)
then
checkOverclockTimeGuard $GPU1 $timeCounter 0
if test $return_correction -eq 1
then
controlTemp $GPU1 $tempGPU1 $target_temp
fi

#The temperature is outside the control margins, below the alarm temp
elif (test $tempGPU1 -lt $hightemp_alarm) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0)
then
checkOverclockTimeGuard $GPU1 $timeCounter 1 #downclocking
if test $return_correction -eq 1
then
controlTemp $GPU1 $tempGPU1 $target_temp
fi

# Overtemp alarm
elif (test $tempGPU1 -lt $maxtemp_stop) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0)
then
output "Alarm! GPU1 very hot, temperature: $tempGPU1. Performing urgent downclocking ...." 1 #Sent by email
checkOverclockTimeGuard $GPU1 $timeCounter 2 #urgent downclocking
if test $return_correction -eq 1
then
controlTemp $GPU1 $tempGPU1 $target_temp
fi
fi
# FAN Speed control
controlFAN $GPU1 $tempGPU1 $FANGPU1
fi
#------------------------------ Overclocking control on GPU2 ----------------------------------
if (test $mining_stopped -eq 0) && (test $miningIsActive2 -eq 1)
then
#The temperature is within the control margins, below target temp
if (test $tempGPU2 -lt $target_temp) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0)
then
checkOverclockTimeGuard $GPU2 $timeCounter 0
if test $return_correction -eq 1
then
controlTemp $GPU2 $tempGPU2 $target_temp
fi

#The temperature is outside the control margins, below the alarm temp
elif (test $tempGPU2 -lt $hightemp_alarm) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0)
then
checkOverclockTimeGuard $GPU2 $timeCounter 1 #downclocking
if test $return_correction -eq 1
then
controlTemp $GPU2 $tempGPU2 $target_temp
fi

# Overtemp alarm
elif (test $tempGPU2 -lt $maxtemp_stop) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0)
then
output "Alarm! GPU2 very hot, temperature: $tempGPU2. Performing urgent downclocking ...." 1 #Sent by email
checkOverclockTimeGuard $GPU2 $timeCounter 2 #urgent downclocking
if test $return_correction -eq 1
then
controlTemp $GPU2 $tempGPU2 $target_temp
fi
fi
# FAN Speed control
controlFAN $GPU2 $tempGPU2 $FANGPU2
fi
timeCounter=$(expr $timeCounter + 1)
sleep $control_time;
done

A brief description of the control script:
- You can change the minimum/maximum clock settings for each of your GPUs. I identified the limits manually, hanging the system several times in the process.
- I also noticed that when playing with the limits, before the system hung, a mining process sometimes became a zombie (as shown by ps). In this situation I was not able to recover the process, not even by trying to kill the parent process... the only way was to restart the system. For this reason the control algorithm restarts the system when it finds zombie processes.
- You can play with all the constants to tune the script to your own system. Almost everything is configurable (number of retries, mail sending, debugging logs, halt/reboot commands, etc.)
- Replace the email addresses with yours
- 1. The algorithm first obtains the GPU temperatures, CPU temperatures and current clock settings, checks whether the mining processes are active, etc.
- 2. If the CPU temperature is very high (70ºC), the script switches off the system and reports it by email (this protects the hardware from CPU overtemperature)
- 3. It checks whether there are zombie processes (as discussed before). If so, the script can reboot the system and report it by email (depending on the constant reboot=1)
- 4. It checks whether any of the GPUs is not mining... if so, it retries the mining by starting the script gpux.sh. There is a retry limit of 5. If it is reached, this is also reported by email
- 5. If a GPU reaches a very high temperature (83ºC), it stops all mining processes. After the temperature has recovered, it restarts the mining.
- 6. For each GPU, the script performs automatic control of the GPU clock by overclocking, downclocking and urgent downclocking when needed.
- 7. For each GPU, the script performs automatic control of the fan speed.
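
The clock decision of step 6 can be isolated into two small functions without side effects, which makes it easy to try out values without touching the hardware. A minimal sketch of the same rule that control.sh applies (1 MHz of correction per degree of distance from the target, no change inside the control gap, result limited to the allowed range); next_clock and guard_elapsed are hypothetical names, not functions of the original script:

```shell
#!/bin/bash
# next_clock CURRENT_CLK TEMP TARGET GAP MIN MAX
# Prints the core clock (MHz) to request next.
next_clock() {
    local clk="$1" temp="$2" target="$3" gap="$4" min="$5" max="$6"
    local offset=$((target - temp))
    # Just below the target (inside the gap): leave the clock alone
    if [ "$offset" -gt 0 ] && [ "$offset" -lt "$gap" ]; then
        echo "$clk"
        return 0
    fi
    # Positive offset raises the clock, negative offset lowers it
    local demand=$((clk + offset))
    if [ "$demand" -gt "$max" ]; then demand=$max; fi
    if [ "$demand" -lt "$min" ]; then demand=$min; fi
    echo "$demand"
}

# guard_elapsed NOW LAST DELAY
# Succeeds when at least DELAY control cycles have passed since the last
# clock change (NOW and LAST are values of the cycle counter).
guard_elapsed() {
    [ "$1" -ge $(( $2 + $3 )) ]
}
```

With the constants of the script, `next_clock 900 70 75 3 800 945` prints 905 and `next_clock 900 74 75 3 800 945` prints 900, because 74ºC is already inside the 3ºC gap below the target.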
As I did for the gpux.sh scripts, I like to launch the control script from another script, piping the output to the tee command in order to store the logs in a file. That way we can easily read them from the monitor script:

start_control.sh
Code:
#!/bin/bash
cd /home/your_user/scripts
./control.sh | tee control.log

Now, let's go to the monitor script:

monitor.sh
Code:
#!/bin/bash
while true; do
echo "---------------- GPUs Health ----------------"
aticonfig --adapter=0 --od-gettemperature | tail -n1 | awk '{print "GPU0 Temperature: " $5}' ;
aticonfig --adapter=1 --od-gettemperature | tail -n1 | awk '{print "GPU1 Temperature: " $5}' ;
aticonfig --adapter=2 --od-gettemperature | tail -n1 | awk '{print "GPU2 Temperature: " $5}' ;
echo $(aticonfig --adapter=0 --odgc | grep GPU);
echo $(aticonfig --adapter=1 --odgc | grep GPU);
echo $(aticonfig --adapter=2 --odgc | grep GPU);
echo "GPU FANS: $(DISPLAY=:0.0 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4) / $(DISPLAY=:0.1 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4) / $(DISPLAY=:0.2 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4)"
echo "Overclocking...."
echo "- Core Clocks: $(aticonfig --adapter=0 --odgc |grep Clocks |cut -d ' ' -f 18) / $(aticonfig --adapter=1 --odgc |grep Clocks |cut -d ' ' -f 18) / $(aticonfig --adapter=2 --odgc |grep Clocks |cut -d ' ' -f 18) Mhz."
echo "- Mem Clocks: $(aticonfig --adapter=0 --odgc |grep Clocks |cut -d ' ' -f 29) / $(aticonfig --adapter=1 --odgc |grep Clocks |cut -d ' ' -f 29) / $(aticonfig --adapter=2 --odgc |grep Clocks |cut -d ' ' -f 29) Mhz."
#echo " "
echo ---------------- PC health -------------------
echo $(sensors |grep CPU |grep Temperature) | cut -d ' ' -f 1,2,3
echo $(sensors |grep NB |grep Temperature) | cut -d ' ' -f 1,2,3
echo $(sensors |grep SB |grep Temperature) | cut -d ' ' -f 1,2,3
echo "HDD Avail: $(df -h |grep sda1 |cut -d ' ' -f 20)"
#echo " "
echo "---------------- Mining rate ------------------"
# Check if there are Screen lock files....
IsMining_gpu0=$(ls /var/run/screen/S-your_user/ |grep gpu0 |wc -l)
IsMining_gpu1=$(ls /var/run/screen/S-your_user/ |grep gpu1 |wc -l)
IsMining_gpu2=$(ls /var/run/screen/S-your_user/ |grep gpu2 |wc -l)
# Last Hashrate report
if test $IsMining_gpu0 -ge 1
then
echo "Mining on GPU0: " $(cat mining_gpu0.log | cut -d '[' -f 2)
else
echo "Mining on GPU0: ERR! Mining Process is stopped!"
fi

if test $IsMining_gpu1 -ge 1
then
echo "Mining on GPU1: " $(cat mining_gpu1.log | cut -d '[' -f 2)
else
echo "Mining on GPU1: ERR! Mining process is stopped!"
fi
if test $IsMining_gpu2 -ge 1
then
echo "Mining on GPU2: " $(cat mining_gpu2.log | cut -d '[' -f 2)
else
echo "Mining on GPU2: ERR! Mining process is stopped!"
fi
#Erase mining logs... next loop we will find only the hashrate.
echo "" > mining_gpu0.log
echo "" > mining_gpu1.log
echo "" > mining_gpu2.log
controlIsActive=$(ls /var/run/screen/S-your_user/ |grep control |wc -l)
#echo " "
echo "---------- Logs Mining Controller -------------"
if test $controlIsActive -ge 1
then
tail -5 control.log
else
echo "Control algorithm is OFF"
fi
sleep 5;
clear
done

A few notes for the monitor script:
- Note that the GPU logs are erased on each loop. The time between loops is 5 seconds, the same as the reporting interval of poclbm.py. This way the logs should always contain a single hashrate report. In addition, the logs do not grow forever.
- The monitor script also displays the last 5 lines of the control script logs, stored in the file control.log.
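
If the two 5-second timers drift apart, a log may hold zero or two reports when the monitor reads it. A possible alternative, shown here only as a sketch, is to pick the last report instead of relying on the timing. The function name last_hashrate is hypothetical, and the sketch assumes that every report contains a '[' character (as in the monitor output) and that the miner may separate reports with carriage returns instead of newlines.

```shell
#!/bin/bash
# Hypothetical helper: print the text after '[' from the last report in a log file.
# Assumptions: each report contains '[', and reports are separated by
# newlines or by carriage returns.
last_hashrate() {
    local log=$1
    tr '\r' '\n' < "$log" | grep '\[' | tail -n 1 | cut -d '[' -f 2
}

# Example use inside the monitor loop (the file name is the one used in monitor.sh):
# echo "Mining on GPU0: $(last_hashrate mining_gpu0.log)"
```

The logs can still be erased on each loop to keep them small; the helper only makes the displayed value independent of how many reports arrived since the last loop.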
As before, I launch the monitor.sh script from start_monitor.sh:

start_monitor.sh
Code:
#!/bin/bash
cd /home/your_user/scripts
monitorIsRunning=$(ls /var/run/screen/S-your_user/ |grep monitor |wc -l)
if test $monitorIsRunning -ge 1
then
echo "Monitor script is already running in another Screen. Getting attached..."
screen -x monitor
else
echo "Monitor script is not running. Starting..."
/usr/bin/screen -admS monitor ./monitor.sh
fi

As you can see, this script works both for starting the monitor and for attaching to the screen in which the monitor is already running.
I made a symbolic link to this file called "m" (see the ln command). Since then, all I need to do to monitor my rig is to enter "m" in the console. (This is very convenient when accessing the rig from my mobile.)
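
For reference, the link can be created along these lines. This is only a sketch: the function name make_shortcut and the directories in the example are assumptions, and the link directory must be in your PATH for "m" to work from anywhere.

```shell
#!/bin/bash
# Hypothetical helper: create a short command name that points to a script.
# Usage: make_shortcut TARGET LINK_DIR [NAME]   (NAME defaults to "m")
make_shortcut() {
    local target=$1 link_dir=$2 name=${3:-m}
    # -s creates a symbolic link, -f replaces an existing link with the same name
    ln -sf "$target" "$link_dir/$name"
}

# Example (both paths are assumptions, adapt them to your system):
# make_shortcut /home/your_user/scripts/start_monitor.sh /home/your_user/bin
```

Remember that start_monitor.sh must be executable (chmod +x) for the link to run it.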

This is the output of the monitor script (updated every 5 seconds):

---------------- GPUs Health ----------------
GPU0 Temperature: 57.50
GPU1 Temperature: 54.50
GPU2 Temperature: 57.50
GPU load : 98%
GPU load : 97%
GPU load : 98%
GPU FANS: 45% / 45% / 45%
Overclocking....
- Core Clocks: 945 / 955 / 1020 Mhz.
- Mem Clocks: 300 / 300 / 300 Mhz.
---------------- PC health -------------------
CPU Temperature: +31.0°C
NB Temperature: +43.0°C
SB Temperature: +31.0°C
HDD Avail: 2GB
---------------- Mining rate ------------------
Mining on GPU0: 402.743 MH/s (~458 MH/s)]
Mining on GPU1: 405.513 MH/s (~568 MH/s)]
Mining on GPU2: 435.513 MH/s (~598 MH/s)]
---------- Logs Mining Controller -------------
[Time: 18616 | 3ene15:50:30] New setting: FAN GPU0 to 30 %
[Time: 18617 | 3ene15:50:35] New setting: FAN GPU0 to 45 %
[Time: 18618 | 3ene15:50:40] New setting: FAN GPU1 to 45 %
[Time: 18783 | 3ene16:05:29] New setting: FAN GPU1 to 30 %
[Time: 18787 | 3ene16:05:51] New setting: FAN GPU1 to 45 %

Now we can complete the start.sh script by adding the control and monitor scripts:

start.sh
Code:
#!/bin/bash

cd /home/your_path/scripts
echo Starting mining scripts...
/usr/bin/screen -admS gpu0 ./gpu0.sh
/usr/bin/screen -admS gpu1 ./gpu1.sh
/usr/bin/screen -admS gpu2 ./gpu2.sh
echo Starting monitor script...
/usr/bin/screen -admS monitor ./monitor.sh
echo Starting automatic control script...
/usr/bin/screen -admS control ./start_control.sh
echo " "
echo For monitoring the RIG, enter m.

And of course, we will need a stop.sh script for stopping all the mining scripts, monitor and control scripts:

stop.sh
Code:
#!/bin/bash
screen -X -S gpu0 kill
screen -X -S gpu1 kill
screen -X -S gpu2 kill
screen -X -S monitor kill
screen -X -S control kill
killall screen

That's all folks!!

I hope you liked this post and that it will be useful for your mining systems!!

If you liked this post and want to send me a donation, I will be very grateful; it will give me energy to share other work.
BTC Address: 1NKJuhGCx7HM2skXdzAkfnxJyfsubh475A