General Troubleshooting ideas

If you're new to FAH and need help getting started or you have very basic questions, start here.

Moderators: Site Moderators, FAHC Science Team

Rel25917
Posts: 303
Joined: Wed Aug 15, 2012 2:31 am

Re: General Troubleshooting ideas

Post by Rel25917 »

ChasingTheDream wrote:
There is clearly something the machines don't like about trying to fold with two of my GPU's at the same time, but it is still a mystery as to what it is.
It's possible you just got unlucky and ended up with a motherboard/GPU combo that doesn't work 100% together. My main gaming computer has an Asus motherboard with one x16 PCIe slot, one x16-sized slot that runs at x4, and two x1 slots. If I have anything in an x1 slot, the x4 slot drops to x1. With a Titan in the x16, a 770 in the x4, and a sound card in an x1 slot, a weird thing happens: if folding is running on the Titan and the 770 has been paused for a while, my sound channel mapping gets borked when I start the 770 (left and right start coming from left and rear-left, or both rears, or whatever). A few button presses in the sound software gets them back to normal. As long as folding stays running, nothing gets messed up when a new unit starts on the 770.
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: General Troubleshooting ideas

Post by ChasingTheDream »

PantherX wrote:FYI, you could maybe head over to the EVGA Forum (http://forums.evga.com/EVGA-Z87-Series-f88.aspx) and ask about your motherboard since that is the only common component across all your unstable systems?
I've got a thread going on their forum. It is titled "Anyone have issues folding with Z87 Classified Motherboards and 290X TRIX cards?" On that board my id is imaniceguy67.

They are the group that suggested I overvolt the GPUs, which I tried, but it didn't help.

I do think I have a hardware incompatibility, but I'm hoping it is BIOS-driven. I've even called EVGA support about it. They say that if the boards will mine with the GPUs and can run the stress tests, there is no way it can be the boards. EVGA support thinks it is the F@H client or a driver issue.

When I get time I may try to start a thread on the AMD forum as well.

One other thing I've noticed is that the two-GPU systems with EVGA motherboards are running 20% - 30% slower (in PPD) than the one machine that will run for days with two of the same GPUs but has an ASUS motherboard. I've removed CPU folding on all the machines, so it shouldn't be a bottleneck in terms of cores or threads.

@Rel25917: I have this fear as well. I upgraded the BIOS on all the machines hoping that it would take care of any incompatibilities. So far nothing has worked.
Rel25917
Posts: 303
Joined: Wed Aug 15, 2012 2:31 am

Re: General Troubleshooting ideas

Post by Rel25917 »

Have you gone through the video card settings looking for power-saving options? I know on Nvidia cards you would sometimes need to set the power management mode to "Prefer maximum performance" to keep the clock speeds up. Not sure if AMD cards have anything similar.
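If it helps, here is a minimal sketch of how I'd watch for downclocking while a WU is running. It is Nvidia-only and assumes nvidia-smi is on the PATH (AMD cards would need something like GPU-Z's logging instead), so treat it as an illustration rather than a fix:

Code: Select all

import subprocess, time

# Poll core/memory clocks, power and temperature every 10 s for about an hour.
# Sketch only: requires nvidia-smi on the PATH (Nvidia cards).
FIELDS = "clocks.sm,clocks.mem,power.draw,temperature.gpu"

for _ in range(360):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv,noheader"],
        capture_output=True, text=True, check=True)
    print(time.strftime("%H:%M:%S"), out.stdout.strip())
    time.sleep(10)

If the SM clock keeps dropping well below the 3D clock while folding, the power management setting (or a driver power-saving feature) is a likely suspect.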
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: General Troubleshooting ideas

Post by PantherX »

ChasingTheDream wrote:....One other thing I've noticed is the two GPU systems with EVGA motherboards are running 20% - 30% slower (in PPD) compared to the one machine that will run for days running two of the same GPU's but has an ASUS motherboard. I've removed CPU folding on all the machines so it shouldn't be a bottleneck regarding cores or threads...
Assuming that the Projects were the same when you compared them, it seems that you might be saturating the PCI-E lanes, which can have a negative impact on PPD (viewtopic.php?p=254816#p254816).

In your BIOS, can you disable additional features you don't need, e.g. onboard sound, etc.? Also, see if changing the PCI-E lane configuration changes the percentage difference in PPD.
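For a rough idea of how much the slot configuration matters on paper, here is a back-of-the-envelope sketch (theoretical one-direction bandwidth after line encoding; real-world throughput is lower):

Code: Select all

# Theoretical per-direction PCIe bandwidth for common slot configurations.
# Gen2 = 5 GT/s with 8b/10b encoding; Gen3 = 8 GT/s with 128b/130b encoding.
per_lane_gbps = {"2.0": 5.0 * 8 / 10, "3.0": 8.0 * 128 / 130}  # usable Gbit/s per lane

for gen, lanes in [("2.0", 16), ("2.0", 8), ("2.0", 4),
                   ("3.0", 16), ("3.0", 8), ("3.0", 4)]:
    gbytes = per_lane_gbps[gen] * lanes / 8  # GB/s per direction
    print(f"PCIe {gen} x{lanes}: ~{gbytes:.1f} GB/s")

Note that a Gen2 x16 slot (~8 GB/s) has roughly twice the bandwidth of a Gen3 x4 slot (~4 GB/s), so "PCI-E 3" on the box does not say much on its own; what matters is how many lanes each card actually gets when both slots are populated.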
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: General Troubleshooting ideas

Post by bruce »

Let's not exclude the possibility that it's related to RAM settings without further evaluation. It wouldn't be high on my list, but memory timing only becomes really critical when every other component is pushing the limits of concurrent system resources.
folding_hoomer
Posts: 349
Joined: Sun Feb 10, 2013 6:06 pm
Hardware configuration: Sys 1: I7 2700K@4,4GHz with NH-C14
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI Z68A-GD65 (G3), various operating systems (WinXP, Ubuntu: 10.4.3 LTS, 12.04.2 LTS)
Optional: GTX560TI 448@stock/OC´d

Sys 2: I7 3930K@4,4GHz with Corsair H110
16GB G.Skill Ripjaws X DDR3 1866MHz CL 9-10-9-28
ASUS Rampage IV Formula, Ubuntu 10.10

Sys 3 i7 875K@3,826 GHz with Scythe Mine2
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI P55-GD80, Win7 64Bit Pro
Sapphire Radeon HD5870@1,163V 900/1250MHz
Sapphire Radeon HD7870@1,218V 1200/1300MHz

Sys 4 i7 2600K@4,4GHz with Scythe Mine2
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI Z68A-GD65 (G3), various operating systems (WinXP, Ubuntu: 10.4.3 LTS, 12.04.2 LTS)
Optional: GTX560TI 448@stock/OC´d

Optional:
ASUS P5Q Pro with Q9550
ASUS P5Q Pro with Q6300
Location: Bavaria, Germany

Re: General Troubleshooting ideas

Post by folding_hoomer »

If you try to compare two systems, don't forget the influence of CPU clocks.
As long as every GPU has its own CPU core, PPD varies with different CPU clocks, because:
- at every Core_17 GPU WU start, the CPU is used to "initialize" the data for the GPU - the faster the CPU, the shorter the delay;
- while folding, checkpoints are written - after the CPU has checked the integrity of the previously calculated data - with the same "result": the faster the CPU, the shorter the "pause" of the GPU.
This may result in a difference of several thousand PPD with exactly the same GPUs . . .
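To put that in rough numbers, here is an illustrative sketch; every timing below is an assumed example value, not a measurement from any real system:

Code: Select all

# Illustrative only: estimate PPD lost to CPU-side pauses (startup + checkpoints).
# All inputs are assumed example values, not measurements.
baseline_ppd       = 120_000   # what the GPU would earn with zero pauses
wu_wall_time_s     = 6 * 3600  # assumed wall time per work unit
startup_pause_s    = 60        # CPU "initializes" the WU before the GPU starts
checkpoints        = 20        # checkpoints written during the WU
checkpoint_pause_s = 10        # GPU stall per checkpoint (faster CPU => shorter)

idle_fraction = (startup_pause_s + checkpoints * checkpoint_pause_s) / wu_wall_time_s
print(f"GPU idle {idle_fraction:.1%} of the WU "
      f"=> roughly {baseline_ppd * idle_fraction:,.0f} PPD lost "
      "(a bit more in practice, since the quick-return bonus amplifies slowdowns)")

With a slower CPU the pauses stretch out, and that is how two identical GPUs can end up several thousand PPD apart.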
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: General Troubleshooting ideas

Post by ChasingTheDream »

P5-133XL wrote:If the two GPU's will run individually but not together, then power is by far the most likely cause of problems as long as temps stay reasonable.
I missed this earlier. To rule this out I've requested a power quality report from my utility company. They will run a report that tests voltage drops coming to the meter on the house. I'm sure it will boil down to them wanting to sell me "services", but if the power quality is bad, it is selectively bad. For instance, all the machines that are having issues are powered by a new circuit breaker box that was put in specifically for this purpose, while the machine I've referred to as solid (but running a different motherboard) is powered by a circuit in the same circuit breaker box as all the machines with stability problems.


Also remember these machines were used for Scrypt mining prior to folding. They ran for weeks unattended while pulling considerably more power and did not have these issues. So unless folding uses power in a different way than mining does, I'm still baffled as to how it can be a power issue, but I'll work to rule it out regardless.

Regarding temps: while mining, these GPUs would hit 90C and stay there for days on end without issues. While folding I have not seen them higher than 78C, and they seem quite stable temperature-wise. They don't move much.

One thing that is "different" about the machine that runs well with two GPUs, besides the different MB, PSU, etc., is that it uses a cheap supplemental power supply because its main PSU is not large enough to power two GPUs. I'm tempted to pull a supplemental PSU and put it on one of the machines that are having issues just to see if there is any difference. The PSUs in the machines with issues are huge, so it can't be a lack-of-power issue, but as many have said it could be a dirty-power issue, although it would be highly selective.

I'm running out of troubleshooting time, and I suspect I won't hear back from the power company for a week, so I'm going to move a supplemental PSU over and see what happens. I may grab a UPS and try that as well, but I won't use the UPS until I see what happens with a supplemental PSU. I suspect that solution would be cheaper than whatever services the utility company wants to sell me.

Regarding the RAM speeds: I've run all the machines at 1333 for days, and it actually seemed to make them more unstable. They are much more stable at 2133. I ran the memory test at both 1333 and 2133 and there were no issues found in either test; the 2133 test ran for 9 hours if I remember right.

I've also switched the F@H client to verbose logging hoping to see some errors or warnings. So far there is nothing that indicates what the issue is. I've seen a couple of warnings about invalid work units. That is it.
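In case it saves anyone some scrolling, this is the sort of quick filter I've been using to pull the warning/error lines out of the log. The path is the usual v7 default data directory on Windows and is an assumption; adjust it to wherever your install keeps its data:

Code: Select all

from pathlib import Path

# Assumed default v7 data directory on Windows; change if your install differs.
LOG = Path(r"C:\ProgramData\FAHClient\log.txt")

for line in LOG.read_text(errors="replace").splitlines():
    if "WARNING" in line or "ERROR" in line:
        print(line)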
Rel25917 wrote:Have you gone through the video card settings looking for power saving settings? I know on nvidia cards you would sometimes need to set the power management mode to prefer maximum performance to keep the clock speeds up. Not sure if amd cards have anything similar.
The only option I'm familiar with along these lines for AMD is ULPS (Ultra Low Power State mode). I disabled that long ago. Again, no improvement.
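Since ULPS can quietly come back after a driver reinstall, a quick way to confirm it is actually off for every adapter entry (not just the first card) is to read the registry directly. A minimal, read-only sketch, assuming the standard display-adapter class key:

Code: Select all

import winreg

# Standard display-adapter class key; each GPU gets a numbered subkey (0000, 0001, ...).
CLASS_KEY = r"SYSTEM\CurrentControlSet\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318}"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CLASS_KEY) as cls:
    i = 0
    while True:
        try:
            sub = winreg.EnumKey(cls, i)
        except OSError:
            break
        i += 1
        try:
            with winreg.OpenKey(cls, sub) as k:
                value, _ = winreg.QueryValueEx(k, "EnableUlps")
                print(f"{sub}: EnableUlps = {value}  (0 = disabled)")
        except OSError:
            pass  # subkey without the value (non-AMD adapter, Properties key, etc.)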
PantherX wrote:
ChasingTheDream wrote:....One other thing I've noticed is the two GPU systems with EVGA motherboards are running 20% - 30% slower (in PPD) compared to the one machine that will run for days running two of the same GPU's but has an ASUS motherboard. I've removed CPU folding on all the machines so it shouldn't be a bottleneck regarding cores or threads...
Assuming that the Projects were the same when you compared them, it seems that you might be saturating the PCI-E Lanes which may have a negative impact on the PPD (viewtopic.php?p=254816#p254816).

In your BIOS, can you disable additional features which you don't need, i.e. sound, etc? Also, see if changing the PCI-E Lane configuration changes the percentage difference in PPD.
I've switched PCI-E lanes in the BIOS and it made no difference, so I switched it back. I went from PCI-E 3 to PCI-E 2 with no effect. I found it odd and rather humorous that the computer with the ASUS motherboard is actually only a PCI-E 2 board, yet it has run at a consistently higher estimated PPD and is more stable than the motherboards that are actually PCI-E 3. Of course the GPUs are PCI-E 3 as well.

Regarding disabling more features in the BIOS, I can start doing that. Unfortunately, what the BIOS says and what it means isn't entirely obvious. The CPU power-saving option said nothing about CPU power saving; I would never have known what it was. I actually got that setting from EVGA support.

I'll try to rule out the power before I start messing with the BIOS more.

The machine that is now using the "other" GPU is still running fine after about 12 hours.
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: General Troubleshooting ideas

Post by ChasingTheDream »

folding_hoomer wrote:If you try to compare two systems, don't forget the influence of CPU clocks.
As long as every GPU has its own CPU core, PPD varies with different CPU clocks, because:
- at every Core_17 GPU WU start, the CPU is used to "initialize" the data for the GPU - the faster the CPU, the shorter the delay;
- while folding, checkpoints are written - after the CPU has checked the integrity of the previously calculated data - with the same "result": the faster the CPU, the shorter the "pause" of the GPU.
This may result in a difference of several thousand PPD with exactly the same GPUs . . .

The machines that are unstable are all using i7-4770K CPUs that are not overclocked. The stable machine with all the different components is using an i7-2600K CPU that runs slower, although I can't remember the exact speed off the top of my head. The machine with the "slower" CPU reports the higher PPD, and it also uses PCI-E 2 instead of PCI-E 3.
davidcoton
Posts: 1094
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: General Troubleshooting ideas

Post by davidcoton »

ChasingTheDream wrote:
P5-133XL wrote:If the two GPU's will run individually but not together, then power is by far the most likely cause of problems as long as temps stay reasonable.
I missed this earlier....
It's not a matter of utility power, and it is something that may have been considered already, though the evidence suggests a revisit might be necessary.
The power issue is whether one rail of the PSU is being overloaded with both GPUs folding. The earlier suggestion was to try running each power input of each GPU from a separate PSU connector; that way, voltage drops on each cable/connector are minimised. If the PSU's 12V is single-rail, good. If not, the load must be spread across the rails -- any one rail could be overloaded. But mining should have shown this -- possibly the parts of the GPU used for mining are more tolerant than those used for folding.
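As a rough sanity check on the numbers (the board power and rail limits below are assumptions; check the PSU label for the real per-rail rating):

Code: Select all

# Back-of-the-envelope 12 V load per GPU -- assumed figures, not measurements.
gpu_board_power_w = 290          # ballpark for an R9 290X under load
slot_power_w      = 75           # drawn through the motherboard slot
cable_power_w     = gpu_board_power_w - slot_power_w   # via the 6+8-pin leads
amps_per_gpu      = cable_power_w / 12.0

print(f"~{amps_per_gpu:.0f} A per GPU through the PCIe power cables")
print(f"~{2 * amps_per_gpu:.0f} A if both GPUs hang off the same rail/cable group")
print("Compare that against the per-rail OCP limit printed on the PSU "
      "(often 30-40 A on multi-rail units).")

Two cards sharing one rail or one cable group can sit uncomfortably close to a typical multi-rail limit, which is exactly why spreading the inputs across connectors (or rails) is worth trying first.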

A supplementary PSU could also solve this problem (if it is a problem), but separate power cables might be cheaper and neater.

Failing that, I'm afraid it looks like a hardware/driver compatibility issue, somewhere between mobo and GPUs. Good luck trying to get anyone to investigate or admit it's their problem :cry:

David
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4, 4x512MB RAM, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460s for approx. 70K PPD
Location: Salem, OR USA

Re: General Troubleshooting ideas

Post by P5-133XL »

I agree it is unlikely to be a utility power issue. That would show up even with one GPU running.
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: General Troubleshooting ideas

Post by ChasingTheDream »

davidcoton wrote: It's not a matter of utility power, and it is something that may have been considered already, though the evidence suggests a revisit might be necessary.
The power issue is whether one rail of the PSU is being overloaded with both GPUs folding. The earlier suggestion was to try running each power input of each GPU from a separate PSU connector; that way, voltage drops on each cable/connector are minimised. If the PSU's 12V is single-rail, good. If not, the load must be spread across the rails -- any one rail could be overloaded. But mining should have shown this -- possibly the parts of the GPU used for mining are more tolerant than those used for folding.

A supplementary PSU could also solve this problem (if it is a problem), but separate power cables might be cheaper and neater.

Failing that, I'm afraid it looks like a hardware/driver compatibility issue, somewhere between mobo and GPUs. Good luck trying to get anyone to investigate or admit it's their problem :cry:

David
Yeah this came up earlier but I'll go over it again real quick.

Three of the unstable machines use a 1200W Corsair AX1200i PSU. It is a single-rail PSU.
The other three unstable machines use an Antec HCP-1300 Platinum 1300W ATX12V/EPS12V PSU. This is a four-rail PSU.
The symptoms across all the systems are exactly the same.

The PSU packages do not come with enough single-plug cables to try them on all machines, but I can do it on a couple (one with each PSU type).
Some people also suspected I could be getting brownouts and that was causing my power issues, but I did hear back from my power company and they said we do not have brownouts in this area. I've never experienced any that I knew of, but I had to rule it out. They have assured me the power in my area is within their expectation of a 5% variance.

I looked into trying a UPS just to rule it out, but I would need either a line-interactive or online UPS to really know the power coming out of the UPS is exactly what is expected, and I'm not going to go to that expense just to find out.

I'm winding down on this effort. I've been fighting it for over four weeks now and have tried dozens of things. If I can't get things working in the next few days (via PSU cables or cheap supplemental PSUs), I think I'm going to sell off the second 290X TRI-X cards in all six of the machines that are having this issue and just wait for the newer NVIDIA cards to come out later this year or early next year. Once I get the new NVIDIA cards I'll sell off the remaining 290X TRI-X cards. I'll have the most overpowered single-GPU systems in the world, but at least they should be stable.

To say I'm not impressed with the AMD GPUs would be an understatement. When mining I had all sorts of compatibility issues with these cards as well, and ironically, to get them to work I swapped out all the ASUS motherboards in these same machines for the EVGA motherboards. That is why I'm not willing to tear all the machines apart again.

In any event, this is going to come to a conclusion real soon.
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: General Troubleshooting ideas

Post by ChasingTheDream »

Quick update: late last night I swapped the PSU cables on two machines to single-plug cables and spread out where they plug into the PSU. Both machines were down again this morning, so it apparently didn't make a difference. Tonight I'll try a supplemental PSU, but I have to rob it from another machine, so I'll only try it for a day or two.

The machine that is now folding on the second GPU with the first GPU idle is still working fine. No incidents.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: General Troubleshooting ideas

Post by bruce »

Start making a list. Assuming either GPU operates by itself but the two together do not, note which things are used (or used harder) when both GPUs are operational and which are not. Check off the ones in Common and Mixed that you've eliminated. A written list is a good way to decide whether you've included everything.


Separate
* The GPU, itself.
* FAHCore_xx
* CPU resources
...

Common
* Power Supply
* Cooling subsystem
* Drivers
...

Mixed (part Separate, and part Common)
* RAM and RAM drivers and RAM bus
* PCIe
...
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: General Troubleshooting ideas

Post by ChasingTheDream »

Another update. This is specifically for the machine that ran for days on a single GPU and then switched and ran days again on the 2nd GPU.

Earlier today I enabled both GPUs again and installed a supplemental PSU to power one GPU. I used the same supplemental PSU that is used in the machine I've talked about before, which runs two GPUs and is solid as a rock although it uses older technology. In any event, with both GPUs enabled and a supplemental PSU, I couldn't keep the machine running for more than 30 minutes without fatal AMD driver failures. I then removed the supplemental PSU and used the machine's own PSU to power both GPUs, and it ran for 4-5 hours before a GPU hung again due to driver errors. So it is back to the same pattern it was showing before I limited the machine to one GPU at a time. GPU temps have not exceeded 78C on either GPU according to both MSI AB and GPU-Z. CPU temps have never gotten higher than 50C.

My wife has already voted, though, and that is to get rid of the GPUs, although she used much more colorful words. She cannot believe how much time I have put into this without making any progress, and in all honesty neither can I. So I'm at the point where I'm strongly thinking of just selling off the 2nd 290X TRI-X cards in the machines that are having the issues.

@Bruce: I appreciate the list, but we have been through the stress testing and multiple other tests on these items multiple times. I think I've got an incompatibility between my motherboard, the AMD drivers, and the F@H client. Maybe at some point in the future an update will fix it; then again, maybe not. I've got one more week to figure out what I'm going to do, but if I decide to sell the GPUs I need to do it soon. My spare time will drop to next to nothing in 10 days, and I am just not going to have time to mess around with this anymore. The machines need to be able to run without me being ready to hit a reset button every few hours.
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: General Troubleshooting ideas

Post by PantherX »

Given the time constraint, this is what I would do:
Set F@H to finish the WUs, then do a complete uninstallation (with the option to delete data selected). Power down the systems and remove the second GPU. Boot up the systems and ensure all updates are applied and drivers are up to date. Install V7.4.4, configure it correctly (including remote monitoring) and start folding. Hopefully, all 6 systems can fold without any issues. Worst case, if a system is acting up, you can fix it within this week, since you will not have any spare time afterwards.
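On the remote-monitoring piece, besides FAHControl you can also poll the clients over the v7 command interface. A minimal sketch, assuming the default command port 36330 and that the monitoring machine's IP is in the client's allow list (the reply is a PyON blob that FAHControl normally parses for you; the host IP below is just an example):

Code: Select all

import socket

HOST, PORT = "192.168.1.50", 36330   # example client IP -- substitute your own

with socket.create_connection((HOST, PORT), timeout=5) as s:
    s.recv(4096)                      # discard the welcome banner
    s.sendall(b"queue-info\r\n")      # ask for the work-queue status
    s.settimeout(2)
    data = b""
    try:
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    except socket.timeout:
        pass
    print(data.decode(errors="replace"))

That way you can glance at all six systems from one desk instead of walking around waiting for one to hang.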
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
Post Reply