
Re: General Troubleshooting ideas

Posted: Wed Jun 04, 2014 7:33 pm
by jrweiss
Are the machines in the same location and/or on the same electrical circuit? If so, it may be as simple as a temporary line voltage fluctuation that the PSUs can't handle. Are they on UPSes? If not, you may try one (or more) on the unstable machine[s].

Re: General Troubleshooting ideas

Posted: Wed Jun 04, 2014 8:02 pm
by ChasingTheDream
Thanks for the ideas.

They are located in a basement on three different 20-amp circuits. Each of the three troubled machines is on a different circuit. There is one somewhat stable machine and one somewhat unstable machine on each circuit.

In a previous message I mentioned that these machines had run for literally weeks unattended doing Scrypt mining at a much higher power draw than they are currently running under. The issues did not start until I moved them to F@H. They are not running through a UPS, so that is something I could try, but this is another case where I can't help but wonder why there is suddenly a difference. I could move my most troubled machine back to Scrypt mining for a few days to see if it runs stable or still displays this odd behavior. It would be interesting.

Re: General Troubleshooting ideas

Posted: Wed Jun 04, 2014 8:19 pm
by P5-133XL
The log you supplied is of a machine that is folding normally with no apparent problems, which is a good thing. However, to diagnose a problem we need to see a failure.

A 99% out-of-sync condition is most likely a video driver failure. There is a common failure mode when the video driver resets: folding appears to continue in the advanced/web control until it reaches 99.9% and stays there, while the log stops recording at the point of the driver failure. A pause/unpause or a client restart will resume folding from the last good checkpoint, but the PPD will suffer because all that time was wasted.
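
One rough way to catch that state early: since the log stops updating at the failure point, a small watchdog that checks how long the log file has been idle will flag a stalled slot before too much time is wasted. A minimal Python sketch follows; the log path is the usual v7 default on Windows, so treat it as an assumption for your install.

Code:

# stall_watch.py -- warn if the F@H log has not been written to recently.
import os
import time

LOG_PATH = r"C:\ProgramData\FAHClient\log.txt"   # assumption: default v7 data folder on Windows
STALL_SECONDS = 30 * 60                          # treat 30 min of log silence as suspicious

while True:
    idle = time.time() - os.path.getmtime(LOG_PATH)
    if idle > STALL_SECONDS:
        print("WARNING: log idle for %d minutes -- possible driver reset/stall" % (idle // 60))
    time.sleep(300)                              # check every 5 minutes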

The problem with a video driver failure is that it is non-specific as to cause. It is merely a report to the OS that the video card quit responding normally. The driver then resets the video card, very commonly also dropping the clock rate to a very low number. The reset invariably crashes the folding cores running on that card.

Personally, I would limit folding to one GPU per problem machine and see if that makes a difference. The basic troubleshooting goal is to change only one thing at a time and isolate all variables. Simple is good. Only add complexity later, once things start working properly, and then in a very controlled manner so it won't be hard to identify the problem when things start going wrong. You have enough machines with enough problems that a log sheet keeping track of what you did and what happened would not be a bad idea, especially if you are running multiple experiments at a time trying to isolate the problems.

Re: General Troubleshooting ideas

Posted: Wed Jun 04, 2014 8:25 pm
by PantherX
ChasingTheDream wrote:...In a previous message I mentioned that these machines had run for literally weeks unattended doing Scrypt mining at a much higher power draw than they are currently running under. The issues did not start until I moved them to F@H. They are not running through a UPS, so that is something I could try, but this is another case where I can't help but wonder why there is suddenly a difference. I could move my most troubled machine back to Scrypt mining for a few days to see if it runs stable or still displays this odd behavior. It would be interesting.
Please note that folding may use components of the GPU that mining or gaming doesn't use. Thus, stress testing with mining or gaming doesn't tell you anything about folding, since those components aren't being exercised in the first place. For problematic GPUs, the first step is to remove any kind of overclock, including factory ones, and revert to the AMD/Nvidia stock settings to see if the issue is resolved. A fair share of issues have been solved with this step alone. In your case it didn't solve the issue, so we need to rule out other potential causes.

Re: General Troubleshooting ideas

Posted: Wed Jun 04, 2014 11:44 pm
by jrweiss
Is it possible there is a conflict between MSI Overdrive and Catalyst? You might try uninstalling Overdrive and reverting to factory clocks all around (GPU and RAM).

As for electric power, warm weather in some areas causes brownouts (voltage drops) as air conditioners come online. They could be local or widespread. Is there any correlation with the times of the crashes/hangs?

Re: General Troubleshooting ideas

Posted: Thu Jun 05, 2014 5:41 am
by ChasingTheDream
Thanks for all the tips, guys. I'm really short on time right now, so I keep popping back in as often as I can.

I'll remove one GPU from one of the machines tomorrow and focus on it. Whatever we find there will most likely apply to the others. I do wonder about PCIe 2 vs. PCIe 3. The six nearly identical machines are all PCIe 3, and they are nowhere near as stable as the three older multipurpose machines, which are all PCIe 2. Tonight I did a BIOS upgrade on all six of the nearly identical machines to make sure they were all on the same version.

Do errors that would appear in the logs get cleared out when I reboot the computers? If so, that is probably why I never see errors in the F@H logs. As mentioned in my previous posts, often I have no choice but to reboot because the computers will not respond to keyboard or mouse input after they get out of sync. I have seen a video driver failure before, and I did not lose control of the computer the way I do when "something" worse happens.

Regarding the AC: it is on at least half the time, day and night, because of all the GPUs running in the house. I have not noticed any link between the AC and the machines acting odd.

MSI Afterburner is only on the machines because of all the "out of sync WU" errors that were happening anyway. The cards wouldn't run an hour at the factory clock speeds without an out-of-sync condition.

I'll check the machines in the morning again and hopefully I'll be able to get a log with some errors in it.

Re: General Troubleshooting ideas

Posted: Thu Jun 05, 2014 6:30 am
by P5-133XL
There is a log folder off the main data folder for v7 that contains, by default, the previous 16 logs.
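
If you want to skim those quickly, a rough Python sketch like the one below lists the rotated logs and pulls out any ERROR/WARNING lines; the data-folder path shown is the usual Windows default, so adjust it for your install.

Code:

# list_fah_logs.py -- show the rotated v7 logs and any lines mentioning errors.
import glob
import os

LOG_DIR = r"C:\ProgramData\FAHClient\logs"       # assumption: default v7 data folder on Windows

for path in sorted(glob.glob(os.path.join(LOG_DIR, "*.txt")), key=os.path.getmtime):
    print("==", os.path.basename(path))
    with open(path, errors="replace") as fh:     # tolerate any odd bytes in old logs
        for line in fh:
            if "ERROR" in line or "WARNING" in line:
                print("   " + line.rstrip())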

Re: General Troubleshooting ideas

Posted: Thu Jun 05, 2014 8:12 am
by davidcoton
Error messages get lost if a PC crashes before writing them to the disk. This is unusual with software errors, but common with hardware errors. In any case, FAH does not log hardware failures (sometimes the software consequences are logged if the fault is not fatal). Hardware errors may be logged by the OS, again only if enough of the PC is still working. Your symptoms point to a hardware error outside FAH -- maybe induced by FAH use, but FAH is not the problem itself.

I agree the general approach should be to take one troublesome PC and strip it down to make a simpler system (one GPU, possibly less memory). Get that folding without crashing before building it up bit by bit. As you say, the problem is probably similar on each PC so finding a solution on one may lead to fixing them all.

David

Re: General Troubleshooting ideas

Posted: Thu Jun 05, 2014 3:51 pm
by bruce
I recommend you avoid physically removing the extra GPUs. Just leave them unused by pausing all GPU slots except one, and avoid mining with them, until the simpler configuration is debugged. Physical changes can introduce configuration changes.
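
You can do the pausing from FAHControl, or, if you prefer to script it, the v7 client also accepts plain-text commands on localhost port 36330. A rough Python sketch, assuming the default local port with no password and that slot 1 is the spare GPU ("unpause <n>" reverses it later):

Code:

# pause_extra_slots.py -- pause every GPU slot except the one you want to keep folding.
import socket

HOST, PORT = "127.0.0.1", 36330      # default v7 remote-command port, localhost only
EXTRA_SLOTS = [1]                    # assumption: slot 0 keeps folding, slot 1 is the spare GPU

with socket.create_connection((HOST, PORT), timeout=10) as sock:
    print(sock.recv(4096).decode(errors="replace"))     # client greeting/banner
    for slot in EXTRA_SLOTS:
        sock.sendall(("pause %d\n" % slot).encode())     # send "unpause <n>" to re-enable later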

Re: General Troubleshooting ideas

Posted: Thu Jun 05, 2014 4:34 pm
by ChasingTheDream
This morning I had two machines down again after the motherboard BIOS upgrade last night, which is normal for my machines, so the BIOS upgrade made no difference.

Bruce, I wish I had seen your message before I removed a GPU, because doing so seems to have completely hosed the F@H client. I've tried a fresh install of the F@H client, a reinstall of the video drivers, etc. I guess I'll try putting the GPU back in place and see if that makes the F@H client happy, because it is completely unresponsive at the moment. If not, I guess I'll wipe the computer and reinstall everything fresh when I get a few hours. It will just have to sit until then.

Update: Putting the GPU back in place made the F@H client happy again. Unfortunately this means I lost two WUs. Sorry about that; it wasn't intentional. I have one of the GPUs set to finish, but it will be a while. Then I can focus on just one.

I was able to see the logs on one of the machines that was clearly unstable. Again there were no errors or warnings, so I doubt the logs are going to give us any clues as to what is going wrong. I also checked the Windows Event Viewer for the period while I was asleep, when the instability set in, and there were no error messages.

I have a strong suspicion it has something to do with the video drivers, but I had the same issues on 13.12, so I'm not sure there is anything I can do about it if that is the case.

Do you think CrossFire could have anything to do with it? I do not have CrossFire enabled on the six nearly identical machines, but I do on the only other machine I have with two of these GPUs. It is vastly more stable using the same video drivers, but it is also running at PCIe 2, on a different motherboard, etc.

Another update... I set the BIOS down to PCIe 2, enabled CrossFire, and moved the clock speeds of the cards back up to something that seemed reasonable: core 947, memory 1000. The factory defaults are core 1040, memory 1300, so even my "reasonable" speeds are quite low considering where the cards start.

The cards were off and running again, but I noticed they didn't seem to be running very fast. I checked Windows Event Viewer immediately, and now I can see ATI driver failures every 10 minutes or so. After a number of these driver crashes, the clock speeds have plummeted to the lowest possible settings: core 520, memory 650. So it does appear to be a driver issue, and I'm guessing that after dozens of driver failures the system becomes unstable.
I did not restart this machine because I wanted to see if stability would degrade, and sure enough it degraded to the point where the system no longer responds to keyboard or mouse input. I now have no choice but to press the reset button.
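
For reference, those "display driver stopped responding and has recovered" resets normally land in the System log as Event ID 4101 from the "Display" source; the sketch below pulls the recent ones from a script. The event ID and source are the usual values for these TDR resets, so treat them as an assumption for your setup.

Code:

# list_tdr_events.py -- dump recent display-driver reset (TDR) events from the System log.
import subprocess

# XPath filter for Event ID 4101 from the "Display" source
query = "*[System[Provider[@Name='Display'] and (EventID=4101)]]"
subprocess.call(["wevtutil", "qe", "System", "/q:" + query,
                 "/f:text", "/c:20", "/rd:true"])   # newest 20 matches, as plain text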

On a different machine I'm seeing errors like the following:

amdacpusrsvc
acpusrsvc: IOCTL_ACPKSD_KSD_TO_USR_SVC_SET_FB_APERTURES: FAILED

I'm seeing these errors on the machine that actually locked up overnight again. It appears I was looking in the wrong place in Event Viewer for the errors!

I suspect that the longer the driver and other AMD errors go on, the less stable the machines become, which is why I see machines in various states: a machine becomes more and more unresponsive until it hard-locks.

When I get the system down to one GPU, what steps should I take next? So far literally everything I've tried has made no difference. The only thing left I can think of trying is seeing whether the GPU will run at a higher clock rate when it runs alone.

Re: General Troubleshooting ideas

Posted: Fri Jun 06, 2014 2:34 am
by PantherX
A quick search indicates that you may not be alone in this issue (https://forums.whirlpool.net.au/archive/2269797). You can try some of the tips mentioned in that thread to see if they help you figure out what the issue is.

BTW, you stated that you were using the 14.4 WHQL drivers; which version did you initially download:
1) The first release
2) The re-release

Do make sure that you get the re-release since the first release was causing issues on some systems (viewtopic.php?p=263689#p263689).

Re: General Troubleshooting ideas

Posted: Fri Jun 06, 2014 4:04 am
by jrweiss
Also, if you upgraded the drivers without cleaning the Registry, you might try an uninstall, Registry clean, and re-install. CCleaner (piriform.com) and DriverSweeper are popular tools for Registry cleaning.

Re: General Troubleshooting ideas

Posted: Fri Jun 06, 2014 4:11 am
by ChasingTheDream
I downloaded the 14.4 drivers on 5/13/2014 so I should be on the re-released drivers.

I read the thread posted above. Unfortunately I've done everything in it already and it made no difference. These GPUs have been the bane of everything I've done. They are by far the most unstable cards I've ever had, but I'm stuck with them at this point.

I have dropped both the core and memory clock speeds on all six of the nearly identical machines down to 650, and three of them are down again as I type this. They didn't make it four hours, and it isn't always the three that I would consider the most unstable. It looks like the machines are all equally unstable.

I guess the only thing left to try is dropping their speeds to the minimums allowed and seeing if the machines can run for more than 24 hours. I don't have 5-6 hours a day to restart machines every time they fail, so clearly I'm going to have to come up with something else. As a side note, while I was physically rebooting the machines I heard two others auto-restart from Windows crashes. So in the span of 4-5 hours, five of the six nearly identical machines failed again, and they were running at essentially half speed. I've now set their clocks so they are running at exactly half speed.

Three of the computers had clean installs of Windows 7 x64 on them and went straight to 14.4. It hasn't made any difference in terms of stability.

How can I physically remove a GPU from the computer without completely confusing the F@H client? I'm rapidly running out of options.

The only options I can think of that are left are dropping back to 13.12 (which is where I saw the out-of-sync conditions in the first place, so I don't expect any difference) and trying to run with only one GPU in all the machines.

Re: General Troubleshooting ideas

Posted: Fri Jun 06, 2014 5:02 am
by PantherX
ChasingTheDream wrote:...I read the thread posted above. Unfortunately I've done everything in it already and it made no difference. These GPUs have been the bane of everything I've done. They are by far the most unstable cards I've ever had, but I'm stuck with them at this point...
Did you specifically do either of these:
1) Leave the system on the BIOS screen for a few hours.
2) Boot into Linux using a live CD (a live USB may also work).

If your system crashed in either case, then it is, unfortunately, a hardware issue.
ChasingTheDream wrote:...I have dropped both the core and memory clock speeds on all six of the nearly identical machines down to 650, and three of them are down again as I type this. They didn't make it four hours, and it isn't always the three that I would consider the most unstable. It looks like the machines are all equally unstable...
If all the systems are indeed equally unstable, then you can focus only on the common settings to help you narrow down the potential cause of the issue.
ChasingTheDream wrote:...I don't have 5-6 hours a day to restart machines every time they fail, so clearly I'm going to have to come up with something else. As a side note, while I was physically rebooting the machines I heard two others auto-restart from Windows crashes. So in the span of 4-5 hours, five of the six nearly identical machines failed again, and they were running at essentially half speed. I've now set their clocks so they are running at exactly half speed...
Troubleshooting six systems simultaneously would be a significant undertaking. I would suggest that you set five systems to finish their WUs and then switch them off. That way you won't need to worry about them and can focus all your energy and time on a single system, so you can find the solution more quickly.

Alternatively, you can create a scheduled task to automatically reboot the system every 3 hours (http://www.instructables.com/id/Shutdow ... uter-on-a/).
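
If you'd rather set that up from a script than click through Task Scheduler, something along these lines should work using the built-in schtasks and shutdown tools; the task name and the 60-second warning delay are just placeholders.

Code:

# schedule_reboot.py -- register a Windows task that restarts the PC every 3 hours.
import subprocess

subprocess.call([
    "schtasks", "/Create",
    "/TN", "FAH periodic reboot",     # task name -- placeholder, pick anything
    "/TR", "shutdown /r /t 60",       # restart with a 60-second warning
    "/SC", "HOURLY", "/MO", "3",      # run every 3 hours
    "/RU", "SYSTEM",                  # run whether or not a user is logged on
])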
ChasingTheDream wrote:...How can I physically remove a GPU from the computer without completely confusing the F@H client? I'm rapidly running out of options...
The easiest method is to do a fresh installation (remember to select the option to delete the data), but make sure all assigned WUs are finished first to prevent unnecessarily dumping them.

Re: General Troubleshooting ideas

Posted: Fri Jun 06, 2014 6:00 am
by ChasingTheDream
Thanks for the info PantherX.

I didn't leave a machine sitting on the BIOS screen for several hours. I suppose I can. The issue isn't keeping the machines up; the issue is keeping them up while running the F@H client. As I've mentioned previously, these machines ran for weeks unattended while Scrypt mining. They only have trouble when I try to use them for F@H, and I'm not sure why. I'm aware the memory usage is quite different with F@H.

The problem described in the posted thread was different from what I run into. That person was experiencing a shutdown after 30 minutes no matter what he was doing. That is not the case with my machines. It seems that the AMD drivers, F@H, and my combination of hardware just aren't compatible. I've got a single-GPU system putting out more PPD now than my dual-GPU systems using the same GPUs. That is rather funny.

I guess the issue is that I'm not exactly sure what else can be done to troubleshoot. I've swapped GPUs and memory, done fresh Windows installs and fresh AMD driver installs, underclocked the GPUs, underclocked the system memory, moved power cables, switched GPU slots, updated the BIOS on all machines, switched to PCIe 2, and enabled CrossFire. None of it has made any difference at all.

The only things left I can even think of are running a memory stress test and maybe a CPU stress test. I will be stunned if all the memory in all the computers is bad, though, and the same goes for the CPUs. I would be more inclined to think it is a hardware defect if it were just one machine; it's hard to imagine I have the same defect in six nearly identical machines.

Thanks for the restart link. I'm going to have to pursue that.

What would you suggest I try next? I'm pretty much out of ideas.