@todxx @kerney666 I'm struggling with some stability issues: Linux / 0.4.1 / cnr / 15+15 on an 8x64 rig. It can run fine for 1 hr or 6 hrs, but it always seems to fall over eventually. I'm definitely at my OC/UV thresholds, but where I could normally isolate crashes to specific cards and adjust their settings, here it's a different card every time. It's not temps (all under 45°C) or hardware errors (none reported by the miner or in syslog). I'm wondering if it follows network hiccups or dev fee switching? I can't find any messages about either in the logs, though. Is there any way we can get network/dev fee notices in the logs? I'm also seeing init discrepancies, though not as bad as on 0.4.0: sometimes random cards underperform by 10-15 h/s run-to-run (it was more like 40-50 h/s on 0.4.0), so maybe it's related to that?
Also, I tried 0.4.2, and it falls over within seconds with the same settings. It seems much harsher on init for some reason.
Hey pbfarmer, what's the current status here? It feels like we're really at the edge of a cliff, since the changes between 0.4.1 and 0.4.2 are tiny. In 0.4.2 we drive the jobs in a simpler, more straightforward way, but that should really have zero effect on stability.
There is no dev fee switching in the miner. We mine user and dev concurrently, so there is no interruption, varying pressure or potential hiccup from that side. CN/r will, by design, have variable compute pressure: for some block heights you'll be unlucky and get a huge number of multiplications, and the next block you get just a few muls plus a bunch of simple xors and subs instead. So maaaaybe the worst case here could be a problem if you're tuned for a more average load.
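To illustrate the variable-pressure point, here's a toy Python sketch. It is NOT the real CN/r program generator: the op set, the fixed 60-op program length and the cost weights in `OP_COST` are made-up assumptions. It just seeds a PRNG with the block height, draws a random op mix and sums a rough cost, so you can see how much the per-block load can swing.

```python
import random

# Hypothetical relative costs: muls are assumed ~4x pricier than simple ops.
# Real GPU latencies differ per architecture; these numbers are illustrative only.
OP_COST = {"MUL": 4, "ADD": 1, "SUB": 1, "XOR": 1, "ROR": 1, "ROL": 1}
OPS = list(OP_COST)


def toy_program(height: int, length: int = 60) -> list[str]:
    """Toy stand-in for a per-block random program, seeded by block height."""
    rng = random.Random(height)  # same height -> same program
    return [rng.choice(OPS) for _ in range(length)]


def relative_cost(program: list[str]) -> int:
    """Sum of the assumed per-op costs for one program."""
    return sum(OP_COST[op] for op in program)


if __name__ == "__main__":
    costs = {h: relative_cost(toy_program(h)) for h in range(1_800_000, 1_800_010)}
    for h, c in costs.items():
        print(h, c)
    print("min", min(costs.values()), "max", max(costs.values()))
```

The spread between the cheapest and most expensive toy program is the kind of swing that could, in principle, push a card tuned for the average case over the edge on an unlucky block.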
There is a small clue in your 0.4.0 vs 0.4.1 description, though: we did change an initial delay in 0.4.1. CN mining on GPUs is all about keeping a certain delay between the two threads to get proper overlap. If the threads gravitate toward each other and the delay/offset creeps too close to zero, you lose a little hashrate. If they fully coincide, it will be very visible. This hasn't really been a problem for us before, but for some reason CN/r is a little bitchy.
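A toy model of the overlap point, assuming each thread alternates a fixed compute-heavy phase (`compute_ticks`) with a memory-heavy phase inside a cycle of `cycle_ticks`. All names and numbers here are hypothetical, and this is not how the miner actually schedules its threads. It only counts how many ticks per cycle both threads want the ALUs at the same time for a given offset.

```python
def compute_overlap(offset: int, compute_ticks: int, cycle_ticks: int) -> int:
    """Ticks per cycle where both threads sit in their compute phase.

    Thread A computes during ticks [0, compute_ticks); thread B does the same
    but shifted by `offset` ticks, wrapping around the cycle.
    """
    a = {t % cycle_ticks for t in range(compute_ticks)}
    b = {(t + offset) % cycle_ticks for t in range(compute_ticks)}
    return len(a & b)


if __name__ == "__main__":
    # Hypothetical numbers: 40% of a 100-tick cycle is compute-heavy.
    for offset in range(0, 60, 10):
        print(f"offset={offset:2d}  contended ticks={compute_overlap(offset, 40, 100)}")
```

With these numbers the contended ticks go 40, 30, 20, 10, 0, 0 for offsets 0 through 50: a small drift toward zero costs a little, full coincidence costs a lot.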
One interesting thing related to this: if the threads do coincide, the power draw profile for that GPU changes. There will be longer full-throttle periods at the beginning and end of each cycle, and longer periods in between with a much lower power draw. So, and this is a wild guess, maybe that's what happens by random chance over time to one of the GPUs. If you do have logs, it might be visible that the hashrate of the GPU that dies is decreasing in the print(s) just before it dies.
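If you want to scan your logs for that pattern, here's a rough Python sketch. I don't want to guess at the exact log line format, so it assumes you've already extracted `(gpu_index, hashrate)` pairs in log order; the function name and the `window` and `min_drop_pct` thresholds are all made up. It flags any GPU whose last few readings are strictly decreasing by at least the given percentage.

```python
from collections import defaultdict


def declining_before_end(samples, window=3, min_drop_pct=2.0):
    """Flag GPUs whose last `window` hashrate readings strictly decrease.

    samples: iterable of (gpu_index, hashrate) tuples in log order.
    Returns {gpu_index: [last `window` readings]} for flagged GPUs.
    """
    history = defaultdict(list)
    for gpu, hs in samples:
        history[gpu].append(float(hs))

    flagged = {}
    for gpu, readings in history.items():
        tail = readings[-window:]
        if len(tail) < window:
            continue  # not enough prints for this GPU
        strictly_down = all(b < a for a, b in zip(tail, tail[1:]))
        drop_pct = 100.0 * (tail[0] - tail[-1]) / tail[0] if tail[0] > 0 else 0.0
        if strictly_down and drop_pct >= min_drop_pct:
            flagged[gpu] = tail
    return flagged


if __name__ == "__main__":
    # Made-up sample data, not real miner output.
    samples = [(0, 2000), (1, 2010), (0, 2001), (1, 1990), (0, 1999), (1, 1950)]
    print(declining_before_end(samples))
```

On the made-up data only GPU 1 is flagged (2010 -> 1990 -> 1950, roughly a 3% drop); GPU 0 wobbles but isn't strictly decreasing.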
It's still very surprising that 0.4.2 dies within a few seconds, wow. I need to ponder this a bit more, then maybe I'll give you a few test builds with additional logging.