General Troubleshooting ideas
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 56
- Joined: Mon Jun 02, 2014 10:56 pm
General Troubleshooting ideas
Hello,
I'm new to folding and have been running into a few things that have me a bit puzzled. I've got 9 total computers and 6 of them are identical with I7-4770K processors, 16 gig ram, and 2 290X TRIX cards in each.
What I have been running into is what appeared to be hung WUs that would hang at 99.99%. After doing some research I read that it was likely due to GPU driver failures and overclocking on my video cards. After digging through the logs I could see the completed percentages were not in sync with the client, and my GPUs are factory overclocked. Consequently, any time I saw an out-of-sync condition, I started downclocking that card in small increments to find the speeds at which it would run.
Slowly I got the machines more stable after underclocking them significantly. They would run for 24 hours without some sort of intervention. However, I've still had two computers that just aren't happy. They are having constant issues even at much lower GPU clock speeds (core speed and memory speed) than the other machines. So I wondered whether these machines would run at even the slowest clock speeds MSI Afterburner allows for my GPUs, and sure enough, within 24 hours those machines were hung up again. Not to mention that when they do hang up like this I have to literally press the reset button on the computer. They will not reboot on their own, so I can't do it remotely.
I've done fresh installs of Windows 7 x64, and I'm using Catalyst 14.4 on all the machines. I just can't seem to get these two machines to behave and I'm running out of ideas. I read that I should try running them with no CPU slot, so I'm letting the WU finish and I'm trying that. I'll know more in 24 hours or so.
If this doesn't work and the machines are already literally running at half regular speed, I'm really not sure what else to check. I did read I should try to get the PCI-E slots in the BIOS from auto down to 2 and see if that helps, but that seems to be reaching a bit since the other machines don't seem to be having the same issues.
I also suspected the GPU's themselves were having issues so I literally swapped the GPU's from one of my other machines into one of the "troubled" machines and that same machine still had the issues.
So my questions are do you think I'm even on the right track?
How long should I expect the FAH client to run before issues like these ensue? Maybe reboots every 12 to 24 hours are just needed and I'm expecting too much.
How often should I reboot all the machines?
I know you need my logs in order for any real troubleshooting to occur and I'll be getting those tomorrow. I'm just curious if I'm even on the right track.
Thanks!
Last edited by ChasingTheDream on Tue Jun 03, 2014 4:00 am, edited 2 times in total.
-
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
- Contact:
Re: General Troubleshooting ideas
On the right track. Also interested in System and GPU temps while running. And what size PSU in those systems? The logs should do the rest.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
-
- Posts: 2948
- Joined: Sun Dec 02, 2007 4:36 am
- Hardware configuration: Machine #1:
Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).
Machine #2:
Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.
Machine 3:
Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32
I am currently folding just on the 5x GTX 460's for aprox. 70K PPD - Location: Salem. OR USA
Re: General Troubleshooting ideas
Check your Windows event logs for problems. Post problem folding logs.
You should not need to reboot at all, if your machines are running correctly.
PCI-e bus speed unlikely to be a problem unless you have OC'ed it.
On your problem machines:
Display real-time CPU (temp, Task Manager) and GPU (clock, % GPU usage, temp) stats in the systray. It's generally a good idea, and over time you'll get a sense of what's normal and be able to spot problems at a glance.
Check your PSU and specifically make sure you are not overloading a rail. This is the most likely issue: if the video cards in different machines are powered from different rails, some rails can end up overloaded while others are not.
Memtest86+. Everything goes through RAM, video cards included.
Under-clock the CPU (everything also goes through the CPU, including video card data) and possibly under-clock the RAM or make the timings more lenient.
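For the rail check, a rough budget can be worked out with simple arithmetic. The sketch below is only an illustration: the function name `overloaded_rails`, the 40 A rail limit, and the 250 W per-card draw are placeholder assumptions, not the specifications of any PSU or GPU mentioned in this thread. Substitute the figures printed on your PSU label and your own measured or rated card power.

```python
def overloaded_rails(rails, devices, volts=12.0):
    """Return {rail_name: load_in_amps} for every rail loaded past its limit.

    rails:   {rail_name: max_amps} taken from the PSU label (placeholders here)
    devices: list of (device_name, watts, rail_name) tuples
    """
    load = {name: 0.0 for name in rails}
    for _device, watts, rail in devices:
        # Current drawn on a 12 V rail is power divided by voltage.
        load[rail] += watts / volts
    return {name: amps for name, amps in load.items() if amps > rails[name]}


# Hypothetical two-rail PSU, 40 A per rail (placeholder values).
rails = {"12V1": 40.0, "12V2": 40.0}

# Both cards cabled to the same rail: 500 W / 12 V is about 41.7 A, over the limit.
both_on_one = [("gpu0", 250.0, "12V1"), ("gpu1", 250.0, "12V1")]

# One card per rail: about 20.8 A each, within the limit.
one_per_rail = [("gpu0", 250.0, "12V1"), ("gpu1", 250.0, "12V2")]

print(overloaded_rails(rails, both_on_one))
print(overloaded_rails(rails, one_per_rail))
```

This ignores everything else on the rail (CPU, drives, fans) and the share of GPU power delivered through the PCIe slot, so treat it as a first pass rather than a verdict.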
-
- Posts: 56
- Joined: Mon Jun 02, 2014 10:56 pm
Re: General Troubleshooting ideas
Hey guys thanks for the quick response!
On the problem machines, one of the PSUs is a 1200 watt Corsair and the other is a 1300 watt Antec. The Corsair is a single-rail PSU and, ironically, that is the machine that has the most trouble. The Antec is a 4-rail PSU, I believe, but the two machines next to it have the same PSU and use the same connections. I can still move the connections, though, just to see what happens. All the machines are ridiculously overpowered in all honesty, but that is another story...
There is a CPU temp gauge on each machine and all the CPUs are liquid cooled. I've never seen the CPU temps higher than 50 C and they are not overclocked. GPU % utilization according to MSI Afterburner is usually above 90% and most of the time shows 100%. Every now and then I see dips, but they don't tend to last long. The GPU temps run around 78 C. In fact, I've never seen them higher.
Great suggestions guys! I'm going to go down and look at the RAM because the RAM clock speeds are 2133 in all the machines. So that suggestion tripped a red alert in my head.
I'll underclock it and see what happens. I would have never thought of that!
I'll be back with logs next time unless the RAM fixes it.
Thanks again!
Re: General Troubleshooting ideas
Try running a single card at a time. If they can run one card stable but hang running 2 you might try different power supplies. You may have got lucky and got 2 flaky power supplies.
Reboots are not needed often if the hardware is solid. The computer I'm typing this on has been running for 120 days straight with no reboots, running the CPU client for much of that time. It's a Windows XP box with an Intel E8400 overclocked from 3 GHz to 4 GHz. I have had this computer run without a reboot for almost a full year before it got shut off due to an extended power outage (a small UPS gets it through short losses).
Dips in gpu usage are normal when they write a checkpoint.
-
- Posts: 704
- Joined: Tue Dec 04, 2007 6:56 am
- Hardware configuration: Ryzen 7 5700G, 22.40.46 VGA driver; 32GB G-Skill Trident DDR4-3200; Samsung 860EVO 1TB Boot SSD; VelociRaptor 1TB; MSI GTX 1050ti, 551.23 studio driver; BeQuiet FM 550 PSU; Lian Li PC-9F; Win11Pro-64, F@H 8.3.5.
[Suspended] Ryzen 7 3700X, MSI X570MPG, 32GB G-Skill Trident Z DDR4-3600; Corsair MP600 M.2 PCIe Gen4 Boot, Samsung 840EVO-250 SSDs; VelociRaptor 1TB, Raptor 150; MSI GTX 1050ti, 526.98 driver; Kingwin Stryker 500 PSU; Lian Li PC-K7B. Win10Pro-64, F@H 8.3.5. - Location: @Home
- Contact:
Re: General Troubleshooting ideas
Are you limiting CPU/SMP Folding to 6 cores? If not, try it. The periodic dips in GPU load may be due to CPU congestion...
Ryzen 7 5700G, 22.40.46 VGA driver; MSI GTX 1050ti, 551.23 studio driver
Ryzen 7 3700X; MSI GTX 1050ti, 551.23 studio driver [Suspended]
-
- Posts: 2948
- Joined: Sun Dec 02, 2007 4:36 am
- Hardware configuration: Machine #1:
Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).
Machine #2:
Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.
Machine 3:
Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32
I am currently folding just on the 5x GTX 460's for aprox. 70K PPD - Location: Salem. OR USA
-
- Posts: 56
- Joined: Mon Jun 02, 2014 10:56 pm
Re: General Troubleshooting ideas
Thanks again for all the suggestions!
Here is a little more background. The machines were used for quite intensive (and wasteful) purposes prior to being switched to F@H. The machines had been running for a couple months as Scrypt miners with GPU temps in the upper 80C's and a 30% higher power draw than they are currently using. They ran for weeks on end with no issues. It is certainly possible the power supplies are now bad but the timing would be incredible since it started when I switched to F@H a few weeks ago.
I've got two machines that simply won't run more than 12 hours without an issue. It seems a 3rd machine has become quite unstable as well. It actually seems that downclocking the RAM made it angry. I've looked at Windows Event Viewer and no errors or even warnings are present at the time of the issues. Additionally, today on the most unstable machine I was watching the GPU temps via GPU-Z to see if the temps showing in MSI Afterburner agreed with it. I was folding at the same time and, as I watched, I saw a video driver failure. So out of curiosity I checked to see if I still had mouse and keyboard input, and I did. I also restarted the computer remotely and it worked as expected.
The reason I found that interesting is that when the "problems" occur I can still usually move the mouse, but the F@H client will not respond to clicks or any keyboard input. In fact, I can't even use the Windows Start button to initiate a system restart, or CTRL-ALT-DEL. I have seen the systems completely lock as well, but the vast majority of the time when a system becomes unresponsive the conditions are the same, and this is happening on three different computers consistently. The only thing I can do is hit the reset button.
I'm baffled by it. There must be a reason. I downclocked the RAM from 2133 to 1333 and it made no difference. I've moved GPUs from stable machines to the most unstable machine. It didn't make a difference. The CPUs aren't folding at this point at all (I did let the WUs finish before removing their slots), but so far nothing has made a difference. When I look in the F@H logs and check warnings and errors, I have never seen anything displayed. Whatever is causing the instability has never really left a sign that I have seen, but the GUI and the logs are always out of sync when I see the instability.
In the morning, I'm sure I'll have 2-3 machines down again so I'll grab the F@H logs and figure out how to send them up here, but honestly I haven't seen anything in them. I have not run the memory test program mentioned before, so I need to do that, and I may take the memory from one machine that appears more stable and move it to the most unstable machine to see what happens. I know F@H uses memory differently than what the machines were doing before, but the failure rate on the RAM would have to be awfully high to have issues on three of my nine computers, and I've underclocked them significantly already. The RAM is rated for 2133 and it is now 1333.
I may remove one of the cards from each troubled machine just to see if the machines will run "better", or try different slots. There has to be a reason and I'm guessing it is going to take quite a bit of time to figure it out. I thought reducing the GPU speeds would stabilize the machines and it did for most, but not the three in question. I've got the GPUs underclocked to about 60% of their factory settings.
One other interesting note. The most troubled machine has always shown an estimated PPD of half of the other machines. When you check the logs the points are what you would expect but it is always half in the GUI. I don't know if that is a clue or not so I thought I would mention it.
Thanks again!
-
- Posts: 2948
- Joined: Sun Dec 02, 2007 4:36 am
- Hardware configuration: Machine #1:
Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).
Machine #2:
Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.
Machine 3:
Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32
I am currently folding just on the 5x GTX 460's for aprox. 70K PPD - Location: Salem. OR USA
Re: General Troubleshooting ideas
When supplying logs, simply run the Advanced Control, go to the Log tab, click Refresh, click Copy, and over here click Paste. Make sure there is [code] and [/code] around the log that was pasted.
Since you are remoting in, are you using Microsoft's remote desktop?
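If copying from the GUI is awkward on a machine that is barely responding, the same wrapping can be done from the log text directly. This is only a sketch: `wrap_log` is a made-up helper name, the 200-line default is arbitrary, and it assumes you have already read the client's log file into a string yourself.

```python
def wrap_log(log_text, max_lines=200):
    """Keep the last max_lines lines of a log and wrap them in BBCode [code] tags."""
    lines = log_text.splitlines()
    tail = lines[-max_lines:]
    return "[code]\n" + "\n".join(tail) + "\n[/code]"


sample = "15:21:02: Args:\n15:21:02: Config: config.xml\n"
print(wrap_log(sample))
```

Paste the printed result straight into the reply box so the forum renders it as a code block.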
-
- Posts: 56
- Joined: Mon Jun 02, 2014 10:56 pm
Re: General Troubleshooting ideas
Thanks for the info on adding the logs.
I use TeamViewer for my remote connections.
In fact, that is always my indicator that I've got a problem because I either can't connect to the remote machine or it is extremely slow. At that point I know that machine needs a restart which almost always means reset.
I try to run the little circuit as often as I can through my machines to see if the logs and the GUI agree. When they don't, I would down clock the GPU's a bit and then reboot the machine. I've read that I could just stop and then start folding again but the stability issues have made me just go with the reboot.
I saw that the F@H client has remote connection capability built in but I haven't had the time to get to it yet due to the other issues.
Re: General Troubleshooting ideas
The remote connection capability is well worth spending some time on at your earliest convenience.
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
- Contact:
Re: General Troubleshooting ideas
Run only one GPU slot in the problem systems as recommended earlier. Does that one slot run at half speed?
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
-
- Site Moderator
- Posts: 6986
- Joined: Wed Dec 23, 2009 9:33 am
- Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB
Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400 - Location: Land Of The Long White Cloud
- Contact:
Re: General Troubleshooting ideas
Welcome to the F@H Forum ChasingTheDream,
You have stated that out of 9 systems, 6 are identical. Thus, the 3 systems that are problematic, are they identical or not? Moreover, if you can swap parts (CPU, GPU, PSU, RAM, MOBO) with a good system, you can potentially identify any hardware issue.
I use TeamViewer and this doesn't cause any issues while folding on GPUs in Windows and Ubuntu.
Are all of your 9 systems dedicated to folding or are they multipurpose (file server, print server, etc)? If they are multipurpose, what other applications were running when those problematic systems hung up?
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
-
- Posts: 349
- Joined: Sun Feb 10, 2013 6:06 pm
- Hardware configuration: Sys 1: I7 2700K@4,4GHz with NH-C14
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI Z68A-GD65 (G3), various operating systems (WinXP, Ubuntu: 10.4.3 LTS, 12.04.2 LTS)
Optional: GTX560TI 448@stock/OC´d
Sys 2: I7 3930K@4,4GHz with Corsair H110
16GB G.Skill Ripjaws X DDR3 1866MHz CL 9-10-9-28
ASUS Ranpage IV Formula, Ubuntu 10.10
Sys 3 i7 875K@3,826 GHz with Scythe Mine2
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI P55-GD80, Win7 64Bit Pro
Sapphire Radeon HD5870@1,163V 900/1250MHz
Sapphire Radeon HD7870@1,218V 1200/1300MHz
Sys 4 i7 2600K@4,4GHz with Scythe Mine2
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI Z68A-GD65 (G3), various operating systems (WinXP, Ubuntu: 10.4.3 LTS, 12.04.2 LTS)
Optional: GTX560TI 448@stock/OC´d
Optional:
ASUS P5Q Pro with Q9550
ASUS P5Q Pro with Q6300 - Location: Bavaria, Germany
Re: General Troubleshooting ideas
Did you enable the sleep states for the CPU?
If so, please disable them (incl. C1) and check the stability again.
-
- Posts: 56
- Joined: Mon Jun 02, 2014 10:56 pm
Re: General Troubleshooting ideas
Thanks for the welcome and the help! A log is at the bottom of this message.
The three machines I have the most trouble with are part of the six that are essentially identical. I say essentially because three have 1200 watt Corsair PSUs and Samsung SSDs, and three have 1300 watt Antec PSUs and Intel SSDs. Of the most unstable machines, two have a Corsair PSU and Samsung SSD, and one has an Antec PSU and Intel SSD. Other than that they are the same. Same CPU, same memory, same MB, same GPUs.
The six (nearly) identical machines are all dedicated folding machines. They do nothing else and only run the F@H client, Norton, MSI Afterburner, and TeamViewer.
The three remaining machines are multi-purpose machines and ironically they are by far the most stable, even though I use the same F@H settings (FULL) and the same GPU's (in two of them). They are also much older machines.
Last night I rebooted all the machines prior to going to bed and for the first time I actually saw a PPD estimate higher than "half" on the most troubled machine. I have no idea why after three weeks it suddenly has a higher estimate. To be clear, the estimate never did match reality. The logs showed it was running at the same pace as the other machines. It's just that the estimates were always half of reality. No idea why.
I have moved GPU's from one machine to another and it made no difference. I moved memory from one machine to another and it made no difference. In all honesty I doubt I will disassemble the machines and swap motherboards etc to try to find the components that are causing issues. None of the 6 machines will run more than 48 hours without issues. Some of them have a major victory if they run 12 hours. Realistically I can't really swap all that hardware. It would be far too costly and time consuming so I'll most likely just have to write some automation that attempts to force the machines to reboot when it detects an "out of sync" condition between the logs and the GUI in the F@H client so I can get the most out of what I have.
Any advice you could give me on that would be great! Right now I'm thinking of reading in the log and checking how much time has passed between WU updates and if the time is exceeded by say double the last amount then reboot the machine. I haven't actually started but those are some of my thoughts. I've also seen some people are using a program called ProcessLasso to essentially do the same type of thing but they are monitoring CPU utilization for the FAH processes.
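The log-based check described above could be sketched roughly like this. Several things here are assumptions rather than anything confirmed in this thread: that progress lines look like `HH:MM:SS:WU00:FS01:0x17:Completed N out of M steps (P%)`, that the slot tag (`FS00`, `FS01`) identifies the GPU slot, and that log times are UTC (the log header ends in `Z`). The names `progress_times`, `gap`, and `is_stalled` are made up for illustration, and the default factor of 2 is just the "say double" from the paragraph above.

```python
import re
from datetime import timedelta

# Assumed progress-line shape; adjust the pattern to match your own log.
PROGRESS = re.compile(
    r"^(\d\d):(\d\d):(\d\d):WU\d+:(FS\d+):.*Completed \d+ out of \d+ steps"
)


def progress_times(log_lines, slot):
    """Time of day (as a timedelta) of each progress line for one slot."""
    times = []
    for line in log_lines:
        m = PROGRESS.match(line)
        if m and m.group(4) == slot:
            h, mi, s = int(m.group(1)), int(m.group(2)), int(m.group(3))
            times.append(timedelta(hours=h, minutes=mi, seconds=s))
    return times


def gap(earlier, later):
    """Elapsed time between two times of day, allowing one midnight wrap."""
    delta = later - earlier
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta


def is_stalled(log_lines, now, slot, factor=2.0):
    """True if the time since the last progress line exceeds
    factor times the interval between the previous two progress lines."""
    times = progress_times(log_lines, slot)
    if len(times) < 2:
        return False  # not enough history to judge
    last_interval = gap(times[-2], times[-1])
    since_last = gap(times[-1], now)
    return since_last > factor * last_interval
```

Here `now` is the current UTC time of day as a `timedelta`. What to do when `is_stalled` returns true (restart the slot, reboot the machine) is left out on purpose, and a machine that has hard locked will not be able to run the check at all, so this only helps with the "still responds but out of sync" case.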
Here is a log from a machine that was hard locked this morning. Ironically, it is actually one of the more stable machines. I rebooted all machines before I went to sleep last night and that seemed to help them run for 12 hours. I obviously still had one machine down, which is better than 2-3, but I did have a 99.9% out-of-sync condition on another machine as well. It just didn't lock it up this time.
The three machines I have the most trouble with are part of the six that are essentially identical. I say essentially, because three have 1200 watt Corsair PSU's and Samsung SSD's, and three have 1300 watt Antec PSU's and intel SSD's. Of the most unstable machines two have Corsair PSU's and Samsung SSD, and one has an Antec PSU and Intel SSD. Other than that they are the same. Same CPU, same memory, same MB, same GPU's.
The six identical (nearly) machines are all dedicated folding machines. They do nothing else and only run the F@H client, Norton, MSI Afterbuner, and TeamViewer.
The three remaining machines are multi-purpose machines and ironically they are by far the most stable, even though I use the same F@H settings (FULL) and the same GPU's (in two of them). They are also much older machines.
Last night I rebooted all the machines prior to going to bed and for the first time I actually saw an PPD estimate higher than "half" on the most troubled machine. I have no idea why after three weeks it suddenly has a higher estimate. To be clear the estimate never did match reality. The logs showed it was running at the same pace as the other machines. It's just that the estimates were always half of reality. No idea why.
I have moved GPUs from one machine to another and it made no difference. I moved memory from one machine to another and it made no difference. In all honesty, I doubt I will disassemble the machines and swap motherboards etc. to try to find the components that are causing issues. None of the 6 machines will run more than 48 hours without issues. For some of them it is a major victory if they run 12 hours. Realistically I can't swap all that hardware. It would be far too costly and time consuming. So I'll most likely just have to write some automation that attempts to force the machines to reboot when it detects an "out of sync" condition between the logs and the GUI in the F@H client, so I can get the most out of what I have.
Any advice you could give me on that would be great! Right now I'm thinking of reading in the log and checking how much time has passed between WU updates, and if that time exceeds, say, double the last interval, rebooting the machine. I haven't actually started but those are some of my thoughts. I've also seen that some people use a program called ProcessLasso to do essentially the same thing, but they monitor CPU utilization for the FAH processes.
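To make the log-watching idea concrete, here is a rough Python sketch of the check I have in mind. It assumes progress lines look like the "Completed N out of M steps" lines in the client log and that the log timestamps are UTC time of day with no date. The names `progress_times` and `stalled_slots` and the factor of 2 are my own placeholders, not anything that ships with the F@H client.

```python
import re

# Matches progress lines such as
# "15:30:06:WU01:FS01:0x17:Completed 550000 out of 5000000 steps (11%)"
PROGRESS = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):WU\d+:(FS\d+):0x\w+:Completed (\d+) out of (\d+) steps"
)

DAY = 24 * 60 * 60  # seconds; log timestamps wrap at midnight


def progress_times(log_lines):
    """Return {slot: [seconds-of-day of each progress update]}."""
    times = {}
    for line in log_lines:
        m = PROGRESS.match(line.strip())
        if m:
            h, mi, s, slot = int(m[1]), int(m[2]), int(m[3]), m[4]
            times.setdefault(slot, []).append(h * 3600 + mi * 60 + s)
    return times


def stalled_slots(log_lines, now, factor=2.0):
    """Slots whose wait since the last update exceeds factor * the last interval.

    `now` is the current UTC time of day in seconds. The log carries no
    dates, so differences are taken modulo 24 hours.
    """
    stalled = []
    for slot, stamps in sorted(progress_times(log_lines).items()):
        if len(stamps) < 2:
            continue  # need two updates to know the normal pace
        last_interval = (stamps[-1] - stamps[-2]) % DAY
        waiting = (now - stamps[-1]) % DAY
        if last_interval > 0 and waiting > factor * last_interval:
            stalled.append(slot)
    return stalled
```

A scheduled task could call `stalled_slots` on the current log every few minutes and trigger a restart (for example with the Windows `shutdown /r /t 0` command) when it returns anything. This only helps with the "out of sync" case where Windows is still responsive. A script running on the same machine can't recover it from a hard lock.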
Here is a log from a machine that was hard locked this morning. Ironically it is actually one of the more stable machines. I rebooted all machines before I went to sleep last night and that seemed to help them run for 12 hours. I still had one machine down, which is better than 2-3, but I also had a 99.9% out of sync condition on another machine. It just didn't lock up this time.
Code:
*********************** Log Started 2014-06-04T15:21:02Z ***********************
15:21:02:************************* Folding@home Client *************************
15:21:02: Website: http://folding.stanford.edu/
15:21:02: Copyright: (c) 2009-2014 Stanford University
15:21:02: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:21:02: Args:
15:21:02: Config: C:/Users/Miner8/AppData/Roaming/FAHClient/config.xml
15:21:02:******************************** Build ********************************
15:21:02: Version: 7.4.4
15:21:02: Date: Mar 4 2014
15:21:02: Time: 20:26:54
15:21:02: SVN Rev: 4130
15:21:02: Branch: fah/trunk/client
15:21:02: Compiler: Intel(R) C++ MSVC 1500 mode 1200
15:21:02: Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
15:21:02: /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
15:21:02: Platform: win32 XP
15:21:02: Bits: 32
15:21:02: Mode: Release
15:21:02:******************************* System ********************************
15:21:02: CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
15:21:02: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
15:21:02: CPUs: 8
15:21:02: Memory: 15.95GiB
15:21:02: Free Memory: 14.79GiB
15:21:02: Threads: WINDOWS_THREADS
15:21:02: OS Version: 6.1
15:21:02: Has Battery: false
15:21:02: On Battery: false
15:21:02: UTC Offset: -5
15:21:02: PID: 2416
15:21:02: CWD: C:/Users/Miner8/AppData/Roaming/FAHClient
15:21:02: OS: Windows 7 Home Premium
15:21:02: OS Arch: AMD64
15:21:02: GPUs: 2
15:21:02: GPU 0: ATI:5 Hawaii [Radeon R9 200X Series]
15:21:02: GPU 1: ATI:5 Hawaii [Radeon R9 200X Series]
15:21:02: CUDA: Not detected
15:21:02:Win32 Service: false
15:21:02:***********************************************************************
15:21:02:<config>
15:21:02: <!-- Network -->
15:21:02: <proxy v=':8080'/>
15:21:02:
15:21:02: <!-- Slot Control -->
15:21:02: <power v='FULL'/>
15:21:02:
15:21:02: <!-- User Information -->
15:21:02: <passkey v='********************************'/>
15:21:02: <team v='224497'/>
15:21:02: <user v='ChasingTheDream'/>
15:21:02:
15:21:02: <!-- Folding Slots -->
15:21:02: <slot id='1' type='GPU'/>
15:21:02: <slot id='2' type='GPU'/>
15:21:02:</config>
15:21:02:Trying to access database...
15:21:02:Successfully acquired database lock
15:21:02:Enabled folding slot 01: READY gpu:0:Hawaii [Radeon R9 200X Series]
15:21:02:Enabled folding slot 02: READY gpu:1:Hawaii [Radeon R9 200X Series]
15:21:02:WU01:FS01:Starting
15:21:02:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Miner8/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 01 -suffix 01 -version 704 -lifeline 2416 -checkpoint 15 -gpu 0 -gpu-vendor ati
15:21:02:WU01:FS01:Started FahCore on PID 3412
15:21:02:WU01:FS01:Core PID:3432
15:21:02:WU01:FS01:FahCore 0x17 started
15:21:02:WU02:FS02:Starting
15:21:02:WU02:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Miner8/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 02 -suffix 01 -version 704 -lifeline 2416 -checkpoint 15 -gpu 1 -gpu-vendor ati
15:21:02:WU02:FS02:Started FahCore on PID 3440
15:21:02:WU02:FS02:Core PID:3468
15:21:02:WU02:FS02:FahCore 0x17 started
15:21:02:WU01:FS01:0x17:*********************** Log Started 2014-06-04T15:21:02Z ***********************
15:21:02:WU01:FS01:0x17:Project: 13000 (Run 2283, Clone 0, Gen 7)
15:21:02:WU01:FS01:0x17:Unit: 0x00000015538b3db75312218466243e89
15:21:02:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
15:21:02:WU01:FS01:0x17:Machine: 1
15:21:02:WU01:FS01:0x17:Digital signatures verified
15:21:02:WU01:FS01:0x17:Folding@home GPU core17
15:21:02:WU01:FS01:0x17:Version 0.0.52
15:21:02:WU02:FS02:0x17:*********************** Log Started 2014-06-04T15:21:02Z ***********************
15:21:02:WU02:FS02:0x17:Project: 13000 (Run 2029, Clone 2, Gen 26)
15:21:02:WU02:FS02:0x17:Unit: 0x00000039538b3db75311d9953885ed4a
15:21:02:WU02:FS02:0x17:CPU: 0x00000000000000000000000000000000
15:21:02:WU02:FS02:0x17:Machine: 2
15:21:02:WU02:FS02:0x17:Digital signatures verified
15:21:02:WU02:FS02:0x17:Folding@home GPU core17
15:21:02:WU02:FS02:0x17:Version 0.0.52
15:21:02:WU01:FS01:0x17: Found a checkpoint file
15:21:03:WU02:FS02:0x17: Found a checkpoint file
15:23:08:WU01:FS01:0x17:Completed 500000 out of 5000000 steps (10%)
15:23:08:WU01:FS01:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
15:23:08:WU02:FS02:0x17:Completed 3750000 out of 5000000 steps (75%)
15:23:08:WU02:FS02:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
15:30:06:WU01:FS01:0x17:Completed 550000 out of 5000000 steps (11%)
15:30:06:WU02:FS02:0x17:Completed 3800000 out of 5000000 steps (76%)
15:36:56:WU01:FS01:0x17:Completed 600000 out of 5000000 steps (12%)
15:36:56:WU02:FS02:0x17:Completed 3850000 out of 5000000 steps (77%)
15:43:56:WU01:FS01:0x17:Completed 650000 out of 5000000 steps (13%)
15:43:56:WU02:FS02:0x17:Completed 3900000 out of 5000000 steps (78%)
15:50:45:WU02:FS02:0x17:Completed 3950000 out of 5000000 steps (79%)
15:50:46:WU01:FS01:0x17:Completed 700000 out of 5000000 steps (14%)
15:57:34:WU02:FS02:0x17:Completed 4000000 out of 5000000 steps (80%)
15:57:35:WU01:FS01:0x17:Completed 750000 out of 5000000 steps (15%)
16:04:34:WU02:FS02:0x17:Completed 4050000 out of 5000000 steps (81%)
16:04:35:WU01:FS01:0x17:Completed 800000 out of 5000000 steps (16%)
16:11:23:WU02:FS02:0x17:Completed 4100000 out of 5000000 steps (82%)
16:11:24:WU01:FS01:0x17:Completed 850000 out of 5000000 steps (17%)
16:18:24:WU02:FS02:0x17:Completed 4150000 out of 5000000 steps (83%)
16:18:25:WU01:FS01:0x17:Completed 900000 out of 5000000 steps (18%)
16:25:12:WU02:FS02:0x17:Completed 4200000 out of 5000000 steps (84%)
16:25:13:WU01:FS01:0x17:Completed 950000 out of 5000000 steps (19%)
16:32:01:WU02:FS02:0x17:Completed 4250000 out of 5000000 steps (85%)
16:32:02:WU01:FS01:0x17:Completed 1000000 out of 5000000 steps (20%)
16:39:01:WU02:FS02:0x17:Completed 4300000 out of 5000000 steps (86%)
16:39:02:WU01:FS01:0x17:Completed 1050000 out of 5000000 steps (21%)
16:45:50:WU02:FS02:0x17:Completed 4350000 out of 5000000 steps (87%)
16:45:51:WU01:FS01:0x17:Completed 1100000 out of 5000000 steps (22%)
16:52:50:WU02:FS02:0x17:Completed 4400000 out of 5000000 steps (88%)
16:52:51:WU01:FS01:0x17:Completed 1150000 out of 5000000 steps (23%)
16:59:39:WU02:FS02:0x17:Completed 4450000 out of 5000000 steps (89%)
16:59:40:WU01:FS01:0x17:Completed 1200000 out of 5000000 steps (24%)
17:06:28:WU02:FS02:0x17:Completed 4500000 out of 5000000 steps (90%)
17:06:29:WU01:FS01:0x17:Completed 1250000 out of 5000000 steps (25%)
17:13:29:WU02:FS02:0x17:Completed 4550000 out of 5000000 steps (91%)
17:13:30:WU01:FS01:0x17:Completed 1300000 out of 5000000 steps (26%)
17:20:18:WU02:FS02:0x17:Completed 4600000 out of 5000000 steps (92%)
17:20:19:WU01:FS01:0x17:Completed 1350000 out of 5000000 steps (27%)
17:27:18:WU02:FS02:0x17:Completed 4650000 out of 5000000 steps (93%)
17:27:19:WU01:FS01:0x17:Completed 1400000 out of 5000000 steps (28%)
17:34:07:WU02:FS02:0x17:Completed 4700000 out of 5000000 steps (94%)
17:34:08:WU01:FS01:0x17:Completed 1450000 out of 5000000 steps (29%)
17:40:56:WU02:FS02:0x17:Completed 4750000 out of 5000000 steps (95%)
17:40:57:WU01:FS01:0x17:Completed 1500000 out of 5000000 steps (30%)