Board: Hardware
Topic: Re: KnC die #0 disabled
by dlasher on 07/11/2013, 23:22:59 UTC
Monitordcdc has more changes: the interval for checking VRMs that output zero current was decreased from 1 minute to 20 seconds (15 checks in 1 minute vs 5 checks in 20 seconds). When a VRM has more than 3 failures (= zero current output) in this 20-second interval, the die powered by this VRM is restarted (this was not present in 0.98). I am not sure why die 0 is restarted only when other dies have failed too (maybe die 0 is somehow connected to other dies?).
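
The rule described above (restart when more than 3 of the checks in a window read zero current) could be sketched roughly like this. This is not the actual monitordcdc code: `count_zero_readings` and `needs_restart` are made-up names, and the readings are passed in as plain integers instead of being read from the VRMs.

```shell
#!/bin/sh
# Hypothetical sketch of the "more than 3 zero readings" rule.
# Readings are integers passed as arguments; the real script
# gets them from the VRMs over I2C.

# Print how many of the given readings are zero.
count_zero_readings() {
    zeros=0
    for reading in "$@"; do
        if [ "$reading" -eq 0 ]; then
            zeros=$((zeros + 1))
        fi
    done
    echo "$zeros"
}

# Succeed (exit status 0) when more than 3 readings are zero.
needs_restart() {
    [ "$(count_zero_readings "$@")" -gt 3 ]
}
```

For example, `needs_restart 0 0 0 0 12 && echo "restart die"` prints the message, while `needs_restart 0 0 0 7 12` does not trigger because only 3 readings are zero.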

I hacked up the monitordcdc script, adding a 'logger' line that writes to syslog each time it tries to restart a die. (You have to start /etc/init.d/syslog.busybox as well, and then run tail -f /var/log/messages.)

Like this:
Code:
                if [ "$failed1" = "1" ] ; then
                        i2cset -y 2 0x2$channel 0xe5 1
                        logger -t "dcdc" "i2cset -y 2 0x2$channel 0xe5 1 "



What I see looks like it's trying to restart individual dies, no matter which one it is.

Quote
Nov  7 23:10:49 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:12:35 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:12:35 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:14:23 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:16:09 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:17:57 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:17:57 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:19:43 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2

0x20 = module 0
0x22 = module 1
0x24 = module 2
0x26 = module 3 (I think)

0, 1, 2, 3 = dies

So that equates to 3rd module, 3rd die, then 1st module, 4th die, then repeatedly 3rd module, 3rd die.
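
If you want to decode the logged commands without doing it by hand, something like this might work. It is only a sketch: `decode_restart` is a made-up helper, and it assumes the 0x20 / 0x22 / 0x24 pattern (the address goes up by 2 per module) holds for every module.

```shell
#!/bin/sh
# Hypothetical helper: turn the address and die number from a logged
# "i2cset -y 2 <addr> 0xe5 <die>" command into zero-based module/die.
# Assumption: module = (addr - 0x20) / 2, as in the table above.
decode_restart() {
    # $1 = I2C address as logged (e.g. 0x24), $2 = die number as logged
    module=$(( ($1 - 0x20) / 2 ))
    echo "module $module die $2"
}
```

So `decode_restart 0x24 2` gives "module 2 die 2" (3rd module, 3rd die), and `decode_restart 0x20 3` gives "module 0 die 3" (1st module, 4th die).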

In my case, the 1st module's 4th die (0x20, die 3) can be restarted, and it is restarted every few minutes when it stops, but the 3rd module's 3rd die (0x24, die 2) never restarts, even though the script tries repeatedly.
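
To see at a glance which die the script keeps hammering, the logged lines can be tallied per address/die pair. A rough sketch, assuming awk and sort are available on the box and that the log lines look like the ones quoted above; `count_restarts` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: count logged restart attempts per (address, die).
# Reads syslog lines on stdin and only looks at the "dcdc: i2cset" ones.
# The address is the third-to-last field, the die number is the last.
count_restarts() {
    awk '/dcdc: i2cset/ { n[$(NF-2) " " $NF]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}
```

Run it as `count_restarts < /var/log/messages`. Each output line is a count followed by the address and die, most frequent first.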