I could see things not getting updated being an SD card issue. If the pattern of LEDs on the dead board is the same as on the others, then you don't have a blown chip; normally with a blown chip the pattern is drastically different or the LEDs don't light at all. You said green/red LEDs, so I am guessing the boards are green. Check the cables: swap two of them and see if the dead board moves or stays the same. If it moves, it may be a bad cable; if it stays the same, it might be a messed-up board. Most troubleshooting steps past this point are very advanced and require probing the board and such. I don't know if I would buy a new board, really; the cost vs. the payback is going to be a bit off.
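The cable swap logic above can be written down as a small decision helper. This is only a sketch: the board labels and the function name `interpret_cable_swap` are made up for illustration, and it assumes exactly one board is dead before and after the swap.

```python
from typing import Tuple


def interpret_cable_swap(dead_before: str, dead_after: str,
                         swapped_pair: Tuple[str, str]) -> str:
    """Interpret a cable swap between two hashboards.

    dead_before / dead_after: label of the board that is dead before and
    after swapping the cables of the two boards in swapped_pair.
    Labels are hypothetical (e.g. "board1", "board2").
    """
    a, b = swapped_pair
    if dead_before not in swapped_pair:
        # The dead board's cable was not part of the swap, so nothing is learned.
        return "inconclusive"
    other = b if dead_before == a else a
    if dead_after == dead_before:
        # Fault stayed with the board even on a different cable.
        return "board suspect"
    if dead_after == other:
        # Fault followed the cable to the other board.
        return "cable suspect"
    return "inconclusive"


if __name__ == "__main__":
    print(interpret_cable_swap("board2", "board3", ("board2", "board3")))
    print(interpret_cable_swap("board2", "board2", ("board2", "board3")))
```

If the result is "cable suspect", replacing that one cable is the cheap next step before touching the board.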
Ok, so after some troubleshooting with no luck, I decided to let it be for a few days. Suddenly tonight I had another board basically die. This has to be a power supply issue, right? Should I try switching out the 6+2 pin PCIe cables? Or switching around where they're plugged in? The PSU is fully modular, so I have a few open plugs to try, but I'd imagine if one part of the power rail fails, it all does... I'm starting to get really worried that I dropped over $900 on a rig that is just falling apart, and then another $280 on a PSU that I got no reimbursement for from Zoomhash, but I'm holding out hope that it's just the PSU. Is there a way I can test the output of the PSU? Having 2 boards die on a 1300W gold-rated PSU shouldn't happen. Hoping for a very fast response before I throw this thing off of a cliff.
I've got 7 A2s. I had a couple from when they first shipped and got 4 more when pricing on scrypt devices was at its bottom.
Almost all of them have required some maintenance, but I've only had 1 hashboard fail completely. I think it burned up from heat overload: a fan had died on it and the hashboard was sitting fairly loose in the chassis, so very little heat was being transferred to the chassis. I've done a fair amount of tweaking, soldering, etc. to keep these operational. One HUGE point of weakness is the solder joints of the PCIe power connectors to the board. I'd say 1 out of 4 have lost full conductivity, mostly at the solder joint between pin and board. Poor conductivity on the power side leads to intermittent drops of the board.
Since you swapped out your PSU, you've had to disconnect and reconnect these, and there's a good chance that one or two of them now have a degraded connection.
Sometimes boards can benefit from removal and reinsertion of the data cables as well, which are another weak point on these boards. I've literally pulled some of the L-shaped pins out of their solder points on the board, or broken pins off. Be extremely careful when removing the cables. Again, a pin with a poor connection may result in the hashboard just blinking and never synchronizing, much the same as a bad power connection.
If the problem is either the PCIe connector pins or the data cable connector pins, it can be fixed. I've actually had a unit fall 4 feet from a shelf, which broke 2 of the PCIe connector pins off the board and cracked the board at the corner. There is an additional 6-pin power connection on the board as well, underneath the 8-pin. I've soldered 6-pin connectors to this second set of connections and got the boards working. With the boards that had broken pins on the data cable connector, I either resoldered those pins back onto the board or stripped the 10-pin data cable and soldered the wires directly to the board. In all cases the boards have worked again. The one board with the suspected burnt chip is the only one I have not been able to resurrect.
A good way to start troubleshooting is to disconnect all data cables (I suggest doing this at the controller board instead of at each individual hashboard, to avoid pin stress on the hashboard). Leave the one board you want to troubleshoot plugged in, and disconnect power to every hashboard except that single board as well. This reduces the chance that another board is interfering with the one being tested. In some cases one failed power connection can cause all the connected boards to fail synchronization.
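The one-board-at-a-time procedure above can be laid out as a simple checklist generator. This is a rough sketch only: the board labels, `isolation_plan` and `summarize` are hypothetical names, and the "synced" results are whatever you observe by hand on the rig.

```python
def isolation_plan(boards):
    """For each board, list what stays connected and what gets unplugged.

    Data cables are unplugged at the controller end, and power is removed
    from every board except the one under test.
    """
    plan = []
    for board in boards:
        others = [b for b in boards if b != board]
        plan.append({"connect": board, "disconnect": others})
    return plan


def summarize(results):
    """results maps board label -> True if it synchronized when tested alone."""
    failing = sorted(b for b, synced in results.items() if not synced)
    if not failing:
        # Assumption: if every board works alone, the fault may lie in how
        # the boards, cables, or PSU behave with everything connected.
        return "all boards sync alone; check cabling and power with all connected"
    return "fails even when isolated: " + ", ".join(failing)


if __name__ == "__main__":
    boards = ["board1", "board2", "board3"]  # hypothetical labels
    for step in isolation_plan(boards):
        print("Test", step["connect"], "with", step["disconnect"], "unplugged")
    # Example observations, made up for illustration:
    print(summarize({"board1": True, "board2": False, "board3": True}))
```

A board that fails even when it is the only one connected is the one to inspect for bad power or data pins first.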
You could also remove the heatsink and look visually at the chips to see if any appear burned.
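On the earlier question about testing the PSU output: one common approach is to measure the 12V rail with a multimeter while the rig is hashing and compare it to the ATX tolerance of plus or minus 5% (11.40 V to 12.60 V). A minimal sketch of that comparison follows; the connector labels and readings are made up for illustration, and a reading within tolerance does not rule out a bad solder joint on the board itself.

```python
# ATX tolerance for the +12 V rail is +/-5% of nominal.
LOW_V = 11.40
HIGH_V = 12.60


def rail_in_spec(volts):
    """Return True if a measured +12 V reading is within ATX tolerance."""
    return LOW_V <= volts <= HIGH_V


def out_of_spec(readings):
    """readings maps a connector label -> measured volts under load.

    Returns the sorted labels whose reading falls outside tolerance.
    """
    return sorted(name for name, v in readings.items() if not rail_in_spec(v))


if __name__ == "__main__":
    # Hypothetical multimeter readings, one per PCIe power cable.
    readings = {"cable1": 12.05, "cable2": 11.92, "cable3": 11.21}
    print(out_of_spec(readings))
```

If only one cable reads low while the others are fine, moving that cable to another modular port on the PSU is a quick way to see whether the drop follows the cable or the port.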