@todxx @kerney666 - I'm struggling w/ some stability issues. linux/0.4.1/cnr/15+15 on an 8x64 rig - it can run fine for 1 hr or 6 hrs, but seems to always fall over eventually. I'm definitely at oc/uv thresholds, but where normally I could isolate crashes to specific cards and adjust settings, in this case it's a different card every time. It's not temps (all under 45C) or hardware errors (none reported by miner or syslog). I'm wondering if it's maybe following network hiccups or dev fee switching? Tho I can't find any messaging in the logs - any way we can get network/dev fee notices in the logs? I'm also seeing init discrepancies - tho not as bad as 0.4.0 - but sometimes random cards underperform by 10-15 h/s run-to-run (was more like 40-50 on 0.4.0), so possibly something to do w/ that?
Also, tried 0.4.2, and it falls over w/in seconds for the same settings - seems much harsher on init for some reason.
Hey pbfarmer, what's the current status here? It feels like we're really at the edge of a cliff, since the changes between 0.4.1 and 0.4.2 are so tiny. In 0.4.2 we drive the jobs in a simpler, more straightforward way, but it should really have zero effect on stability.
There is no dev fee switching in the miner. We mine user + dev concurrently, so there is no interruption, variable pressure or potential hiccup from that perspective. CN/r will by design have a variable compute pressure: for some block heights you will be unfortunate and get a huge amount of multiplications, and for the next block you just get a few muls and a bunch of simple xors and subs instead. So maaaaybe the worst case scenario here could be problematic if you're tuned to a more average load.
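To get a feel for how much that per-block load could swing, here is a toy Python model. It is NOT the real CN/r program generator: the 64-instruction program length, the op list, the uniform draw and the relative costs are all assumptions picked for illustration, with muls simply treated as the expensive op.

```python
import random
import statistics

# Assumed op set and relative costs for the toy model (not taken from the miner).
OPS = ["mul", "add", "sub", "ror", "rol", "xor"]
COST = {"mul": 3, "add": 1, "sub": 1, "ror": 1, "rol": 1, "xor": 1}


def block_pressure(height, length=64):
    """Return a made-up 'compute pressure' score for one block height."""
    rng = random.Random(height)  # seed by height: same block, same program
    program = [rng.choice(OPS) for _ in range(length)]
    return sum(COST[op] for op in program)


def pressure_stats(start, count):
    """Return (min, mean, max, std dev) of the score over `count` blocks."""
    loads = [block_pressure(h) for h in range(start, start + count)]
    return min(loads), statistics.mean(loads), max(loads), statistics.pstdev(loads)


if __name__ == "__main__":
    lo, mean, hi, sd = pressure_stats(start=0, count=1000)
    print(f"min={lo} mean={mean:.1f} max={hi} stdev={sd:.1f}")
```

Even in this crude model the heaviest blocks sit well above the mean, which is the point: a rig tuned against the average can still meet a block that pushes harder.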
There is a little clue in your 0.4.0 vs 0.4.1 description though: we did change an initial delay in 0.4.1. CN mining on GPU is all about keeping a certain delay between the two threads to get a proper overlap. If the threads gravitate toward each other and the delay/offset creeps too close to zero, you will lose a little of your hashrate. If they fully coincide, the loss will be very visible. This hasn't really been a problem for us before, but for some reason CN/r is a little bitchy.
One interesting thing related to this is that if the threads do coincide, the power draw profile for that GPU will change. There will be longer full throttle periods at the beginning and end of each cycle, and longer periods in between with a much lower power draw. So, this is a wild guess, but maybe this is what happens by random chance over time for one of the GPUs. If you do have logs, it could potentially be visible that the hashrate for the GPU that dies is decreasing in the print(s) just before it dies.
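For anyone wanting to check their own logs for this, here is a rough Python sketch of the idea. It assumes you have already pulled the per-GPU hashrate readings out of the log into (gpu index, h/s) pairs in print order. The function name `flag_declining_gpus`, the window size and the 2% threshold are all made up for illustration, and the sample numbers are invented, not real log output.

```python
from collections import defaultdict


def flag_declining_gpus(samples, window=3, min_drop=0.02):
    """Return {gpu: fractional drop} for GPUs whose last `window` readings
    fall steadily and end at least `min_drop` below their earlier average."""
    per_gpu = defaultdict(list)
    for gpu, rate in samples:
        per_gpu[gpu].append(rate)

    flagged = {}
    for gpu, rates in per_gpu.items():
        if len(rates) < window + 1:
            continue  # not enough history to compare against
        earlier = rates[:-window]
        baseline = sum(earlier) / len(earlier)
        tail = rates[-window:]
        # Strictly decreasing over the final window of prints.
        falling = all(b < a for a, b in zip(tail, tail[1:]))
        drop = (baseline - tail[-1]) / baseline
        if falling and drop >= min_drop:
            flagged[gpu] = round(drop, 4)
    return flagged


# Hypothetical readings: GPU 0 is steady, GPU 1 sags before the crash.
samples = [
    (0, 2100), (1, 2105), (0, 2102), (1, 2104), (0, 2098),
    (1, 2080), (0, 2101), (1, 2050), (0, 2100), (1, 2010),
]

if __name__ == "__main__":
    print(flag_declining_gpus(samples))
```

If the GPU that hangs never shows up here across several crashes, that would argue against the thread-coincidence guess.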
It's still very surprising that 0.4.2 dies within a few seconds, wow. Need to ponder this a bit more, then maybe give you a few test builds with additional logging.
I've been peeling off cards one by one as they die - down 3 right now, but on a ~12hr run so far, which is promising.
Thanks for all the insights - this is helpful... Any sense of the distribution/std. dev of mathematical difficulty (as related to compute pressure) between blocks? I.e. would a run of just a few blocks be highly indicative of compute pressure, or do we need to look at much larger windows (hence the failures after 5-6 hours)?
Re the thread overlap - I wonder if some sort of occasional thread re-sync mechanism may avoid any concern over drift? Regardless, I can watch the logs closely to see if something happens to h/r near the time of crash/hang. That being said, the h/r discrepancies I mentioned are visible immediately from launch, so it sounds like there can be situations where the threads aren't ideally offset even following init. For instance, w/ all 8 cards online, I usually saw ~16.82, but sometimes as high as 16.87, or as low as 16.76 at startup.
As for 0.4.2, lemme know any tests you need. I also saw this behavior on another rig, and mentioned it in a reply to another post here. In that case, the miner locked up right after platform detection (I assume trying to init the first GPU), though 0.4.1 had been fine on the same settings. I bumped the cclock down or voltage up (can't remember which) and it's fine now, so I will try again on this rig once I get a sense for which GPUs are being difficult. Could it just be that 0.4.2 is somehow finding/testing the threshold earlier?
EDIT: of course, immediately after I hit submit, rig goes down again... No indication of h/r loss in logs leading up to hang.