General Troubleshooting ideas

Post by **jrweiss** » Fri Jun 13, 2014 2:19 am

ChasingTheDream wrote:The machine that is now folding on the second GPU with the first GPU idle is still working fine. No incidents.

I'm just reaching here, but it could well go back to the 'possible Motherboard incompatibility' discussed earlier. Whether it's the MoBo, BIOS, drivers, or combination thereof, maybe the F@H core or client doesn't like the lane switching in the primary GPU from x16 to x8...

Might the primary GPU slot 'downshift' to x8 later in the boot sequence so that the primary GPU is initially reported as x16? Might there be something in the code somewhere that's looking for all 16 PCIe lanes, but burps when only 8 are found?

Post by **ChasingTheDream** » Sat Jun 14, 2014 5:12 pm

I am running into more interesting things. I have not pulled GPU's yet and decided I can use my phone to manipulate the machines while away through TeamViewer, which is what I use to get to all the machines right now anyway.

What is interesting right now is that the machines with single rail PSU's have suddenly become much more stable than the machines with the 4 rail PSU's. I'm starting to strongly suspect the WU's. I just don't see how machines can go from not being able to run a few hours to running days with no interventions while nothing in the hardware setup has changed.

On the 4 rail PSU systems (1300 watt Antecs) I've made sure the two GPU's are on different rails but it doesn't seem to make any difference.

Another odd dynamic that has come up in the last few days is that single rail PSU systems allow for remote reboot when the GPU drivers fail. They would not do that for weeks before. They would hang on rebooting just like the 4 rail PSU systems and would require me to actually press the reset button. It is very odd and I would be tempted to say it must be the 4 rail PSU's but I witnessed the exact opposite behavior for weeks prior to this sudden change in behavior so I'm not sure what to make of it.

On a side note: What are your thoughts or initial thoughts on the GTX 880? I know it isn't out yet and it is just a rumor but it appears some of the rumors are pretty specific. If / when that card comes out I'm planning on selling all my existing GPU's and replacing them with NVIDIA cards. It can't be more unstable than what I'm running now but I will also most likely only be running one GTX 880 per system rather than two.

Post by **bruce** » Sat Jun 14, 2014 5:25 pm

On the 4-rail PSUs, does each GPU pull from more than one rail? Have you measured the amps being used on individual rails?

Post by **PantherX** » Sat Jun 14, 2014 5:36 pm

ChasingTheDream wrote:...What are your thoughts or initial thoughts on the GTX 880? I know it isn't out yet and it is just a rumor but it appears some of the rumors are pretty specific. If / when that card comes out I'm planning on selling all my existing GPU's and replacing them with NVIDIA cards. It can't be more unstable than what I'm running now but I will also most likely only be running one GTX 880 per system rather than two.

While I prefer to wait for the actual hardware release, do note that if those GPUs are based on the Maxwell platform, FahCore_17 can't run on them unless Nvidia fixes their OpenCL bug in their drivers. Once the fix is released, then you can see what performance it brings. Until then, if you have a Maxwell GPU, you will be assigned FahCore_15 WUs.

Post by **ChasingTheDream** » Sat Jun 14, 2014 5:47 pm

bruce wrote:On the 4-rail PSUs, does each GPU pull from more than one rail? Have you measured the amps being used on individual rails?

Each GPU is on its own rail. Unfortunately I don't know how to measure amps coming off individual rails.

PantherX wrote:While I prefer to wait for the actual hardware release, do note that if those GPUs are based on the Maxwell platform, FahCore_17 can't run on them unless Nvidia fixes their OpenCL bug in their drivers. Once the fix is released, then you can see what performance it brings. Until then, if you have a Maxwell GPU, you will be assigned FahCore_15 WUs.

I read about some problems with the Maxwell platform although not in any detail since I don't have the cards. I did see that the GTX 880 will be based on the Maxwell platform so I hope the issues are fixed prior to the release of the new cards. I jumped on these TRI-X cards prior to their release (pre-order) so I couldn't read about the issues people were having since I was among the first to have them and run into the issues. I certainly won't be in a huge hurry with the GTX 880.

Post by **ChasingTheDream** » Sat Jun 14, 2014 7:32 pm

I have one machine that keeps having both GPU's stop. So out of curiosity I checked the WU's being run and none of my other machines are running these particular WU's and the other machines are running smoothly.

The WU's I'm seeing that seem to be having a difficult time running are 10466 and 10467. I actually thought I was having driver crashes but I don't see any messages in Windows Event Viewer.

I am seeing the following messages in the logs:

*********************** Log Started 2014-06-14T19:21:36Z ***********************
19:22:17:WU02:FS00:0x17:WARNING:Console control signal 1 on PID 3136
19:22:17:WU00:FS01:0x17:WARNING:Console control signal 1 on PID 3164
19:22:21:WU02:FS00:0x17:ERROR:103: Lost client lifeline
19:23:18:WARNING:FS01:Killing WU00
19:23:19:WU02:FS00:0x17:WARNING:Console control signal 1 on PID 4548
19:24:01:WU02:FS00:0x17:ERROR:103: Lost client lifeline
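
One way to see which slot is failing most often is to tally warnings and errors per folding slot. The following is a minimal Python sketch, assuming every log line starts with an HH:MM:SS timestamp and carries an `FSnn` tag as in the excerpt above. The helper name `tally_by_slot` is made up for illustration and is not part of FAHClient.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
#   HH:MM:SS:<colon-separated tags>:<message>
TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2}:")
LEVEL_RE = re.compile(r"\b(WARNING|ERROR)\b")
SLOT_RE = re.compile(r"\bFS\d{2}\b")


def tally_by_slot(lines):
    """Count WARNING/ERROR lines per folding slot (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if not TIME_RE.match(line):
            continue  # skip banners such as the "Log Started" line
        level = LEVEL_RE.search(line)
        slot = SLOT_RE.search(line)
        if level and slot:
            counts[(slot.group(0), level.group(1))] += 1
    return counts


# The six log lines quoted above, copied verbatim.
SAMPLE = """\
*********************** Log Started 2014-06-14T19:21:36Z ***********************
19:22:17:WU02:FS00:0x17:WARNING:Console control signal 1 on PID 3136
19:22:17:WU00:FS01:0x17:WARNING:Console control signal 1 on PID 3164
19:22:21:WU02:FS00:0x17:ERROR:103: Lost client lifeline
19:23:18:WARNING:FS01:Killing WU00
19:23:19:WU02:FS00:0x17:WARNING:Console control signal 1 on PID 4548
19:24:01:WU02:FS00:0x17:ERROR:103: Lost client lifeline
""".splitlines()

for (slot, level), n in sorted(tally_by_slot(SAMPLE).items()):
    print(slot, level, n)
```

Running the same tally over a full log.txt would show whether one slot accounts for most of the failures, which is cheaper than swapping cards to find out.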

Update: Issues with these WU's have been non-stop. The GPU's shut off almost as fast as I can start them. Here is an error that I have found in Event Viewer.

Faulting application name: FahCore_17.exe, version: 0.0.0.0, time stamp: 0x527bdd17
Faulting module name: ntdll.dll, version: 6.1.7601.18247, time stamp: 0x521ea8e7
Exception code: 0xc0000374
Fault offset: 0x000ce753
Faulting process id: 0xc84
Faulting application start time: 0x01cf882a250cd73c
Faulting application path: C:\Users\Miner4\AppData\Roaming\FAHClient\cores\web.stanford.edu\~pande\Win32\AMD64\ATI\R600\Core_17.fah\FahCore_17.exe
Faulting module path: C:\Windows\SysWOW64\ntdll.dll
Report Id: 9edba10c-f41e-11e3-83a7-24fd521d30ae
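
Two fields in that Event Viewer record can be decoded. Exception code 0xc0000374 is the NTSTATUS value STATUS_HEAP_CORRUPTION, which says the process's heap was damaged but does not by itself say whether the cause is hardware, driver, or WU. The sketch below also assumes the "Faulting application start time" field is a Windows FILETIME (100 ns ticks since 1601-01-01 UTC); the lookup table is deliberately tiny and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Two well-known NTSTATUS values; this is not a complete table.
KNOWN_CODES = {
    0xC0000374: "STATUS_HEAP_CORRUPTION",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def describe_exception(code_text):
    """Map a hex exception code string to a name, if it is in the table."""
    return KNOWN_CODES.get(int(code_text, 16), "unknown (not in this small table)")


def filetime_to_utc(hex_text):
    """Convert a hex FILETIME string to a timezone-aware UTC datetime."""
    ticks = int(hex_text, 16)  # 100-nanosecond intervals
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


print(describe_exception("0xc0000374"))
print(filetime_to_utc("0x01cf882a250cd73c").isoformat())
```

Converting the start time to UTC makes it possible to line the crash up against the timestamps in the FAHClient log, which are also in UTC.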

I just walked downstairs to reset the machine and it hardlocked literally within 30 seconds of rebooting. I don't think I can complete these WU's.

Here are some more errors and warnings in the log:

*********************** Log Started 2014-06-15T02:36:53Z ***********************
02:36:54:WARNING:WU03:FS02:WorkServer connection failed on port 8080 trying 80
02:36:54:WARNING:WU01:FS02:Exception: Could not get IP address for assign3.stanford.edu: No such host is known.
02:36:54:ERROR:WU01:FS02:Exception: Could not get an assignment
02:36:54:WARNING:WU01:FS02:Exception: Could not get IP address for assign3.stanford.edu: No such host is known.
02:36:54:ERROR:WU01:FS02:Exception: Could not get an assignment
02:36:55:WARNING:WU03:FS02:Exception: Failed to send results to work server: Failed to connect to 128.143.231.202:80: A socket operation was attempted to an unreachable network.
02:36:55:WARNING:WU03:FS02:WorkServer connection failed on port 8080 trying 80
02:36:55:ERROR:WU03:FS02:Exception: Failed to connect to 128.143.199.97:80: A socket operation was attempted to an unreachable network.
02:36:56:WARNING:WU03:FS02:WorkServer connection failed on port 8080 trying 80
02:36:56:WARNING:WU03:FS02:Exception: Failed to send results to work server: Failed to connect to 128.143.231.202:80: A socket operation was attempted to an unreachable network.
02:36:56:WARNING:WU03:FS02:WorkServer connection failed on port 8080 trying 80
02:36:56:ERROR:WU03:FS02:Exception: Failed to connect to 128.143.199.97:80: A socket operation was attempted to an unreachable network.
02:37:56:WARNING:WU03:FS02:Server did not like results, dumping

The WU's that appear to be getting dumped are actually CPU WU's though so it could be related to all the other issues.

Post by **PantherX** » Sun Jun 15, 2014 12:24 pm

ChasingTheDream wrote:...I am seeing the following messages in the logs:

*********************** Log Started 2014-06-14T19:21:36Z ***********************
19:22:17:WU02:FS00:0x17:WARNING:Console control signal 1 on PID 3136
19:22:17:WU00:FS01:0x17:WARNING:Console control signal 1 on PID 3164
19:22:21:WU02:FS00:0x17:ERROR:103: Lost client lifeline
19:23:18:WARNING:FS01:Killing WU00
19:23:19:WU02:FS00:0x17:WARNING:Console control signal 1 on PID 4548
19:24:01:WU02:FS00:0x17:ERROR:103: Lost client lifeline...

Something is causing the client to be terminated (viewtopic.php?p=164030#p164030).

ChasingTheDream wrote:...Issues with these WU's have been non-stop. The GPU's shut off almost as fast as I can start them....I don't think I can complete these WU's...

Which Project do those WUs belong to? Can you post the PRCGs of those WUs so we can check them?

ChasingTheDream wrote:...Here are some more errors and warnings in the log:

*********************** Log Started 2014-06-15T02:36:53Z ***********************
02:36:54:WARNING:WU03:FS02:WorkServer connection failed on port 8080 trying 80
02:36:54:WARNING:WU01:FS02:Exception: Could not get IP address for assign3.stanford.edu: No such host is known.
02:36:54:ERROR:WU01:FS02:Exception: Could not get an assignment
02:36:54:WARNING:WU01:FS02:Exception: Could not get IP address for assign3.stanford.edu: No such host is known.
02:36:54:ERROR:WU01:FS02:Exception: Could not get an assignment
02:36:55:WARNING:WU03:FS02:Exception: Failed to send results to work server: Failed to connect to 128.143.231.202:80: A socket operation was attempted to an unreachable network.
02:36:55:WARNING:WU03:FS02:WorkServer connection failed on port 8080 trying 80
02:36:55:ERROR:WU03:FS02:Exception: Failed to connect to 128.143.199.97:80: A socket operation was attempted to an unreachable network.
02:36:56:WARNING:WU03:FS02:WorkServer connection failed on port 8080 trying 80
02:36:56:WARNING:WU03:FS02:Exception: Failed to send results to work server: Failed to connect to 128.143.231.202:80: A socket operation was attempted to an unreachable network.
02:36:56:WARNING:WU03:FS02:WorkServer connection failed on port 8080 trying 80
02:36:56:ERROR:WU03:FS02:Exception: Failed to connect to 128.143.199.97:80: A socket operation was attempted to an unreachable network.
02:37:56:WARNING:WU03:FS02:Server did not like results, dumping

The WU's that appear to be getting dumped are actually CPU WU's though so it could be related to all the other issues.

With the exception of the last line, all messages indicate a connectivity issue. The last line states that the completed WU was rejected by the Server (could be a corruption in the transmission or some other reason). Without knowing what FS02 is and what the PRCG is, I can't say that it is the CPU Slot WUs. IIRC, you weren't folding on the CPU, so did you enable CPU folding again or not?
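
The two kinds of connectivity failure in that log ("Could not get IP address" versus "Failed to connect") can be told apart from the affected machine with a small probe. This is a rough Python sketch using only the standard `socket` module; the function name `probe` and the three result strings are made up here, and it only tests name resolution and a TCP connect, not whether the server would accept a WU.

```python
import socket


def probe(host, port, timeout=5.0):
    """Return 'dns_failure', 'connect_failure', or 'ok' for host:port."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"  # corresponds to "Could not get IP address"
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            return "ok"
        except OSError:
            continue  # try the next resolved address, if any
    return "connect_failure"  # corresponds to "Failed to connect"


if __name__ == "__main__":
    # Host and ports as they appear in the log above; they may no longer exist.
    for port in (8080, 80):
        print("assign3.stanford.edu", port, probe("assign3.stanford.edu", port))
```

If both results come back as failures for every host, the problem is more likely the machine's own network stack (for example after a driver crash) than the F@H servers.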

Post by **ChasingTheDream** » Sun Jun 15, 2014 4:44 pm

I've enabled CPU folding on all machines again. My apologies. I should have mentioned it. What I found was running the GPU's at higher clock speeds (but still underclocked) seemed to make them more stable so then I thought what if I enable CPU folding again. CPU folding has not negatively impacted stability but this one machine simply is having all sorts of issues now and ironically it was one of the "stable" machines before. That is why I started looking at the work units because it is all that has been changing through this process.

The projects the GPU's are working on are 10466 and 10467. The machine has actually been trying to work on these two WUs for well over 36 hours at this point. I just can't keep it running long enough to make any progress. 10466 is still at only 3% complete after 36 hours because of all the failures. I suspect it is actually the WU causing the issues. As I type this, the machine has had driver failures that I saw within 60 seconds of a restart four times in a row.

I have also seen this machine get the same messages regarding "dumping" the WU when it tried to upload its last three CPU WU's that it completed. In fact, the clue that it has happened is I'll see a WU that is "new". So far it has always been the CPU WU. Not sure what is going on. It seems every day is an adventure with this stuff.

Here is a quick update. After the 5th reboot inside a few minutes of starting I'm now getting an index error when I launch the F@H client. It appears one of my GPU's is no longer recognized. So I need to go downstairs and see what is going on but I'm guessing I have a GPU that is failing. It truly is amazing the amount of work that goes into keeping these machines running at even a reduced pace. In any event, I'm off to see what happened this time.

Post by **PantherX** » Sun Jun 15, 2014 5:20 pm

ChasingTheDream wrote:I've enabled CPU folding on all machines again....

How exactly did you enable it?

ChasingTheDream wrote:...The projects the GPU's are working on are 10466 and 10467. The machine has actually been trying to work on these two WUs for well over 36 hours at this point. I just can't keep it running long enough to make any progress. 10466 is still at only 3% complete after 36 hours because of all the failures. I suspect it is actually the WU causing the issues. As I type this, the machine has had driver failures that I saw within 60 seconds of a restart four times in a row...

That's the issue. You see, Project 10466 and Project 10467 are FahCore_17 WUs, which use the GPU. For some reason, it seems that your FAHClient is attempting to run them on the CPU, which is an unsupported configuration and would explain all the issues that you are encountering. Could you please post the log file so we can see exactly what FAHClient detected and is attempting to do?

ChasingTheDream wrote:...I have also seen this machine get the same messages regarding "dumping" the WU when it tried to upload its last three CPU WU's that it completed. In fact, the clue that it has happened is I'll see a WU that is "new". So far it has always been the CPU WU. Not sure what is going on...

Considering that your CPU is attempting to process a GPU WU, I am not surprised that the Server is dumping them since it would be failing validation checks.

ChasingTheDream wrote:...It seems every day is an adventure with this stuff...

I believe that your systems are under attack by Gremlins.

ChasingTheDream wrote:...It truly is amazing the amount of work that goes into keeping these machines running at even a reduced pace. In any event, I'm off to see what happened this time.

Hopefully, this will be the last system which you need to troubleshoot. However, once you have all your systems up and folding, do post the log file with the initial configuration and F@H settings of each system to ensure that they are configured correctly.

Post by **ChasingTheDream** » Sun Jun 15, 2014 6:21 pm

PantherX wrote:How exactly did you enable it?

I went into configure ---> slots ---> add ---> ensured CPU was selected but specified 5 threads rather than letting the client decide. I've read you need a spare thread to feed the GPU's. I don't know if that is true or not but I always choose one less thread than the client would use.

PantherX wrote: That's the issue. You see, Project 10466 and Project 10467 are FahCore_17 WUs, which use the GPU. For some reason, it seems that your FAHClient is attempting to run them on the CPU, which is an unsupported configuration and would explain all the issues that you are encountering. Could you please post the log file so we can see exactly what FAHClient detected and is attempting to do?

It appears my explanation was a bit confusing so I'll try to clarify. The two GPU's were processing project 10466 and 10467. The CPU is processing project 6098 right now. I don't know what it was working on at the time of the uploads. It appears project 10466 simply would not work. I have no idea why. It could be the GPU was actually failing or something in the WU itself but I was getting driver failures within seconds of the F@H client starting.

After my last post, I went down to see what was happening. One of the GPU's was no longer recognized by Windows after repeated reboots. So I removed the 2nd GPU from the system. Of course that makes the F@H client angry because the configuration doesn't match what it expects anymore. There is no "slots" tab under configuration to make a correction when this happens so I did the following.

I closed the F@H client, went to the appropriate appdata directory and found config.xml. I assumed this is the file the client uses to figure out what the hardware setup is once it has been initialized. I removed the 2nd GPU from the file and launched the client. It still wasn't happy so I then assumed if the config.xml file couldn't be located by the client it would create a new one. Consequently I deleted the config.xml file and launched the client.

The client launched and saw that the machine only has a single GPU now and the CPU, so it proceeded to pick up where it left off on the WU with the exception of the WU being processed by the second GPU. In other words the GPU that was working on 10467 still is working on 10467. The CPU WU is 6098 and it picked up where it left off as well. I'm not sure what happened to 10466 that was on the second GPU, but I can say it wasn't actually being processed anyway.

Probably not the best way to do it but the F@H client seems happy again.

PantherX wrote:Considering that your CPU is attempting to process a GPU WU, I am not surprised that the Server is dumping them since it would be failing validation checks.

There was a miscommunication here. See above.

PantherX wrote:I believe that your systems are under attack by Gremlins.

Man it must be something!

PantherX wrote:Hopefully, this will be the last system which you need to troubleshoot. However, once you have all your systems up and folding, do post the log file with the initial configuration and F@H settings of each system to ensure that they are configured correctly.

I don't expect them to run "smoothly" at this point. I suspect that just isn't going to happen. I would like to see just a few failures a day though rather than a few failures every hour.

Since I have removed the 2nd GPU in the machine that was failing every few seconds it has not had an issue since. It's only been an hour but it is solid with one GPU. I don't know if that is because of GPU removal or because WU 10466 is no longer being processed. After some time, I'll swap GPU's in that system and see what happens.

Regarding the logs, I've found the logs directory. Does posting a log show you what you need or do you need more than what is in the log files?

Post by **bruce** » Sun Jun 15, 2014 7:37 pm

Do you know if you have installed drivers that include support for Intel's OpenCL? ... or if it can be removed?

Post by **ChasingTheDream** » Sun Jun 15, 2014 8:38 pm

bruce wrote:Do you know if you have installed drivers that include support for Intel's OpenCL? ... or if it can be removed?

I don't believe I do. I went out to look at what it is and it looks like it is part of an SDK for development. There is nothing like that on those machines. They only fold. They have the following:

Intel Network Connections
Intel Update Manager
Intel Management Engine Components
Intel Rapid Storage Technology
Intel USB 3.0 eXtensible Host Controller Driver

Those are from the programs listing in control panel. I can't say if the OpenCL drivers are actually in one of these products though.

Post by **PantherX** » Sun Jun 15, 2014 8:54 pm

ChasingTheDream wrote:...I went into configure ---> slots ---> add ---> ensured CPU was selected but specified 5 threads rather than letting the client decide. I've read you need a spare thread to feed the GPU's. I don't know if that is true or not but I always choose one less thread than the client would use...

That would be the correct method to add the CPU Slot. Regarding the free CPU per GPU rule, it varies on AMD systems due to drivers. Thus, what you can do is open up Task Manager and see how many CPU cycles FahCore_17 is using (the exceptions are at the start of the WU and during checkpoints). If it generally uses very few CPU cycles, you can use all CPUs for CPU folding. If you notice that it is using a significant amount of CPU cycles, then a free CPU would be sufficient for it.

ChasingTheDream wrote:...It appears my explanation was a bit confusing so I'll try to clarify. The two GPU's were processing project 10466 and 10467. The CPU is processing project 6098 right now. I don't know what it was working on at the time of the uploads. It appears project 10466 simply would not work. I have no idea why. It could be the GPU was actually failing or something in the WU itself but I was getting driver failures within seconds of the F@H client starting...

Okay, so you were folding on both GPUs and the CPU, thus, had three active folding slots when the issue occurred. This seems to be the initial issue where 2 GPUs within the same system wouldn't fold, is that right?

ChasingTheDream wrote:...After my last post, I went down to see what was happening. One of the GPU's was no longer recognized by Windows after repeated reboots. So I removed the 2nd GPU from the system. Of course that makes the F@H client angry because the configuration doesn't match what it expects anymore. There is no "slots" tab under configuration to make a correction when this happens so I did the following.

I closed the F@H client, went to the appropriate appdata directory and found config.xml. I assumed this is the file the client uses to figure out what the hardware setup is once it has been initialized. I removed the 2nd GPU from the file and launched the client. It still wasn't happy so I then assumed if the config.xml file couldn't be located by the client it would create a new one. Consequently I deleted the config.xml file and launched the client...

That is weird. Did you make sure that FAHClient was running since Advanced Control (AKA FAHControl) can only display those tabs once it is connected with FAHClient? Generally, manually editing the config.xml isn't recommended since a typo might cause issues. However, once you deleted the config.xml, FAHClient re-detected the hardware and created the matching Slots.
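
Since a typo in a hand-edited config.xml is the main risk, the file can be changed with an XML parser instead of a text editor. The sketch below is only an illustration: it assumes the slots are stored as `<slot id='…' type='…'/>` elements directly under the root element, which should be checked against your own config.xml first, and the helper names are hypothetical. FAHClient should be stopped before the file is touched, as was done above.

```python
import shutil
import xml.etree.ElementTree as ET


def list_slots(config_path):
    """Return (id, type) for each <slot> directly under the root element."""
    root = ET.parse(config_path).getroot()  # raises ET.ParseError on malformed XML
    return [(slot.get("id"), slot.get("type")) for slot in root.findall("slot")]


def remove_slot(config_path, slot_id):
    """Remove the slot with the given id, keeping a .bak copy of the original."""
    tree = ET.parse(config_path)
    root = tree.getroot()
    targets = [s for s in root.findall("slot") if s.get("id") == str(slot_id)]
    if not targets:
        return False  # nothing to do, file left untouched
    shutil.copy2(config_path, str(config_path) + ".bak")
    for slot in targets:
        root.remove(slot)
    tree.write(config_path)
    return True
```

For example, `list_slots(path)` on a three-slot machine would show which id belongs to the GPU that was pulled, and `remove_slot(path, that_id)` drops only that entry while leaving a backup to restore if the client still complains.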

ChasingTheDream wrote:...The client launched and saw that the machine only has a single GPU now and the CPU, so it proceeded to pick up where it left off on the WU with the exception of the WU being processed by the second GPU. In other words the GPU that was working on 10467 still is working on 10467. The CPU WU is 6098 and it picked up where it left off as well. I'm not sure what happened to 10466 that was on the second GPU, but I can say it wasn't actually being processed anyway...

It is good to hear that your CPU and GPU carried on folding the WUs which were primarily assigned to them. Regarding the second GPU's WU, it will either be discarded or be shifted to GPU 1's Slot once the current WU finishes. Do note that Project 10466 to Project 10469 were recently released to full F@H (viewtopic.php?f=24&t=26459).

ChasingTheDream wrote:...Since I have removed the 2nd GPU in the machine that was failing every few seconds it has not had an issue since. It's only been an hour but it is solid with one GPU. I don't know if that is because of GPU removal or because WU 10466 is no longer being processed. After some time, I'll swap GPU's in that system and see what happens...

Okay, so with a single GPU, you have been folding without issues on the CPU and GPU simultaneously? If so, that sounds good. Hopefully, it will continue like this for a long time.

ChasingTheDream wrote:...Regarding the logs, I've found the logs directory. Does posting a log show you what you need or do you need more than what is in the log files?

I initially asked for the log file since I was wondering what happened to FAHClient that it assigned a GPU WU to a CPU Slot. However, that was a miscommunication and has been sorted out.

BTW, this is only 1 out of 6 systems. Are the other 5 systems folding fine with 1 GPU or are you only testing on a single system and once satisfied, will make the appropriate changes to the other 5 systems?

ChasingTheDream wrote:...I can't say if the OpenCL drivers are actually in one of these products though.

Unless Intel has changed stuff, it seems that you don't have OpenCL packages installed. Since I have installed it, this is what appears in the Programs and Features list:
Intel® SDK for OpenCL - CPU Only Runtime Package

Post by **ChasingTheDream** » Mon Jun 16, 2014 2:42 am

PantherX wrote: That would be the correct method to add the CPU Slot. Regarding the free CPU per GPU rule, it varies on AMD systems due to drivers. Thus, what you can do is open up Task Manager and see how many CPU cycles FahCore_17 is using (the exceptions are at the start of the WU and during checkpoints). If it generally uses very few CPU cycles, you can use all CPUs for CPU folding. If you notice that it is using a significant amount of CPU cycles, then a free CPU would be sufficient for it.

Thanks for the info!

PantherX wrote: Okay, so you were folding on both GPUs and the CPU, thus, had three active folding slots when the issue occurred. This seems to be the initial issue where 2 GPUs within the same system wouldn't fold, is that right?

Yes that is right. This is the same situation it has always been but much much worse. I have not seen things fail repeatedly like that literally within seconds of startup. Makes me wonder about the GPU but to see if it is the GPU I'll swap it with the GPU running now to see if the "suspect" GPU will run alone. In all honesty I expect it to run fine as the lone GPU.

I also had a strong suspicion about the WU itself since it seemed to be simply impossible to process it so I scanned my other machines to see if a WU from project 10466 was running anywhere. Sure enough I see that project being processed on a machine with two GPU's right now and it is processing fine. Which again makes me wonder if I have a GPU failing. I'll find out soon.

PantherX wrote:That is weird. Did you make sure that FAHClient was running since Advanced Control (AKA FAHControl) can only display those tabs once it is connected with FAHClient? Generally, manually editing the config.xml isn't recommended since a typo might cause issues. However, once you deleted the config.xml, FAHClient re-detected the hardware and created the matching Slots.

I didn't know FAHControl needed FAHClient. In fact, I thought they were one and the same because I don't launch the FAHClient directly. I have seen what I described before though when I had an issue where Windows seemed to lose a GPU. I had the same issue and FAHControl would appear to hang. I can't remember if it said updating or connecting but it never seemed to get anywhere.

I guess my question would be why would the FAHClient not launch as it always does when I launch FAHControl? It is something I can look for in the future though.

PantherX wrote:It is good to hear that your CPU and GPU carried on folding the WUs which were primarily assigned to them. Regarding the second GPU's WU, it will either be discarded or be shifted to GPU 1's Slot once the current WU finishes. Do note that Project 10466 to Project 10469 were recently released to full F@H (viewtopic.php?f=24&t=26459).

In this case it looks like it was discarded. At the pace it was on with that machine it would have taken 10,000 years to finish anyway.

PantherX wrote: Okay, so with a single GPU, you have been folding without issues on the CPU and GPU simultaneously? If so, that sounds good. Hopefully, it will continue like this for a long time.

That's correct. I ran tests before on a different machine where I ran one GPU only and then switched. They each ran fine. After that I added the CPU's back to folding because I only removed them to see if things became more stable. They didn't, so I added them back in. 5 of the 6 machines I'm still running 2 GPU's and a CPU. I also dropped the slider to Medium to see if it would make a difference. The machines appear more stable but they still have a few issues a day spread among them. If I dropped to a single GPU and used the CPU for folding I'm sure I would run without incident and that may be what I end up doing.

Right now I'm still trying to find a way to make what I have work but I think I've reached a point where it is safe to say running 2 GPU's in the 6 machines I've been talking about will never be smooth. The question is can I find a way to keep more than 3 of them running for 24 hours at a time because if I can I'll still do better than I would if I just ran 6 GPU's in those machines total. It appears I can generally keep 4-5 machines running with 2 GPU's for a 24 hour period of time. So it is a net gain in PPD and WU completed. I just don't know if that will be maintainable when I'm not available as much.

I can use TeamViewer to remote in to the machines via my phone but as I've mentioned often times the machines become so corrupt after a driver failure that I have to hit the reset button. If I catch it soon enough the machine will reboot fine on its own but if the video driver crashes multiple times the machine is not able to reboot on its own. So we will see how it goes. If it isn't doable then I'll just have to sell some GPU's.
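
The "more than 3 machines" break-even above can be checked with a few lines of arithmetic. This Python sketch makes simplifying assumptions that are not in the post: every GPU contributes equally, a crashed machine contributes nothing, and CPU slots and clock-speed differences between the two setups are ignored.

```python
MACHINES = 6  # number of dual-GPU-capable machines discussed in this thread


def gpus_folding(machines_up, gpus_per_machine):
    """GPUs actually folding, given how many machines stay up."""
    return machines_up * gpus_per_machine


# Baseline: all six machines run one GPU each and stay up.
baseline = gpus_folding(MACHINES, 1)


def min_machines_for_gain():
    """Smallest number of dual-GPU machines that beats the baseline."""
    for up in range(MACHINES + 1):
        if gpus_folding(up, 2) > baseline:
            return up
    return None


for up in range(MACHINES + 1):
    dual = gpus_folding(up, 2)
    verdict = "gain" if dual > baseline else ("even" if dual == baseline else "loss")
    print(f"{up} machines up x 2 GPUs = {dual} GPUs vs {baseline}: {verdict}")
```

Under those assumptions three dual-GPU machines only match the six-GPU baseline, so the dual setup pays off only while at least four machines stay up, before counting the time spent resetting them.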

PantherX wrote: BTW, this is only 1 out of 6 systems. Are the other 5 systems folding fine with 1 GPU or are you only testing on a single system and once satisfied, will make the appropriate changes to the other 5 systems?

See the post above this regarding the current state of my machines, but to be very clear I don't expect to actually get the systems running smooth at this point if I use two GPU's. I have now seen that running with a single GPU on a system even with CPU folding enabled is solid and would be incident free. I may even be able to turn up the clock speeds on the GPU if I only used one. So that is something I need to look at.

Someone in the EVGA forum told me they had another folder that ran into a similar situation where it just appeared to be impossible to get two GPU's to fold consistently with that person's hardware. Apparently that person stopped folding after fighting with it for a while and not being able to find a solution. They want me to try overvolting my CPU and the PCI-E lanes, but I ran into the stability issues even with the CPU folding disabled so it would have to come down to the bus lanes if that were the case. I need to look in the BIOS to see if I can even find settings to do it but I'll try it if I can.

In any event, I intend to keep folding. I have no intention of going back to "mining" even if the hardware can do it.

Edit: So I wrote an explanation as to what I was doing and I was going to try to keep two GPU's per system running and right after I posted the message I did a check of my machines and 4 out of 6 machines were down. LOL See I think the machines just like messing with me.

In case you are wondering the machine that is using one GPU was still working flawlessly. So it is looking more and more likely that I'll drop them all to one GPU just so I don't have to check every 10 minutes. Of course I would prefer to keep the machines using two GPU's but it just doesn't seem possible if I want to keep my sanity.

On the machine that is now running one GPU I set the clock speeds of the GPU back to its defaults which were Core: 1040, Memory 1300. It was running at Core: 947, Memory 1000. I'll see if the GPU is still running in the morning.

PantherX wrote: Unless Intel has changed stuff, it seems that you don't have OpenCL packages installed. Since I have installed it, this is what appears in the Programs and Features list:
Intel® SDK for OpenCL - CPU Only Runtime Package

Yeah I do not have that entry in my programs list so I think it is safe to assume I don't have this installed.

Post by **PantherX** » Mon Jun 16, 2014 12:51 pm

ChasingTheDream wrote:...I also had a strong suspicion about the WU itself since it seemed to be simply impossible to process it so I scanned my other machines to see if a WU from project 10466 was running anywhere. Sure enough I see that project being processed on a machine with two GPU's right now and it is processing fine...

It is always possible that you got a bad WU (viewtopic.php?f=19&t=16526). In that case, there is nothing that you can do about it.

ChasingTheDream wrote:...I didn't know FAHControl needed FAHClient. In fact, I thought they were one and the same because I don't launch the FAHClient directly. I have seen what I described before though when I had an issue where Windows seemed to lose a GPU. I had the same issue and FAHControl would appear to hang. I can't remember if it said updating or connecting but it never seemed to get anywhere.

I guess my question would be why would the FAHClient not launch as it always does when I launch FAHControl? It is something I can look for in the future though...

There are three components of V7:
FAHClient -> It manages all the FahCores and the transmission of WUs. On Windows, with default settings, it starts automatically when the user logs in.
FAHControl -> It connects to FAHClient, calculates PPD, TPF, etc., and provides a GUI to control FAHClient.
FAHViewer -> It connects to FAHClient and, if possible, shows the protein that is being worked on.

In your case, you only need to focus on FAHClient and FAHControl. If FAHClient is not running, then FAHControl will display "connecting". If you look at the startup details, you will notice "hideconsole.exe" (or something similar), which means that FAHClient will start automatically when the user logs in. This is the default setting. In the Task Manager, you should see an entry for FAHClient.exe whenever it is running. The next time you see that FAHControl isn't displaying information, look for FAHClient.exe in the Task Manager to make sure that it is running.
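If checking the Task Manager by hand on six machines gets tedious, the same check can be scripted. The following is only a rough sketch, assuming Python is installed on the folding machines and that the client's process image is named FAHClient.exe as described above. It shells out to the standard Windows `tasklist` command; the function names are made up for illustration.

```python
import subprocess


def fahclient_in_tasklist(output: str, image: str = "FAHClient.exe") -> bool:
    """Return True if tasklist-style output contains a row for `image`."""
    wanted = image.lower()
    for line in output.splitlines():
        parts = line.split()
        # In the default table format the image name is the first column.
        if parts and parts[0].lower() == wanted:
            return True
    return False


def fahclient_running(image: str = "FAHClient.exe") -> bool:
    """Ask Windows for processes matching `image` (only works on Windows)."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image}"],
        capture_output=True,
        text=True,
        check=False,
    )
    return fahclient_in_tasklist(result.stdout, image)
```

Calling `fahclient_running()` on a machine where FAHControl is stuck on "connecting" should tell you whether the client process is missing or whether the problem lies elsewhere.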

ChasingTheDream wrote:...I also dropped the slider to Medium to see if it would make a difference. The machines appear more stable but they still have a few issues a day spread among them...

IIRC, Medium only means that GPUs will fold when the system is idle (http://folding.stanford.edu/home/faq/fa ... ion/#ntoc3) and CPU folding will occur with the configured number of CPUs. In your case, it shouldn't really make a difference since all your systems are dedicated to folding. BTW, are those systems headless or not?

ChasingTheDream wrote:...Right now I'm still trying to find a way to make what I have work but I think I've reached a point where it is safe to say running 2 GPU's in the 6 machines I've been talking about will never be smooth. The question is can I find a way to keep more than 3 of them running for 24 hours at a time because if I can I'll still do better than I would if I just ran 6 GPU's in those machines total. It appears I can generally keep 4-5 machines running with 2 GPU's for a 24 hour period of time. So it is a net gain in PPD and WU completed. I just don't know if that will be maintainable when I'm not available as much...

You can use Task Scheduler to reboot the system every 6 or 12 hours. That might help you run those systems unattended.
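As a rough sketch of that idea, the snippet below builds a `schtasks` command that registers an hourly-interval task running `shutdown /r`. The task name, the 60-second delay and the helper function are assumptions chosen for illustration, and the command would need to be run from an elevated prompt. Note that an HOURLY schedule fires at fixed intervals from its start time, not a set number of hours after the last boot.

```python
from typing import List


def build_reboot_task_command(hours: int = 6,
                              task_name: str = "FoldingPeriodicReboot") -> List[str]:
    """Build a schtasks command that reboots the machine every `hours` hours."""
    # schtasks accepts an interval of 1 to 23 for the HOURLY schedule type.
    if not 1 <= hours <= 23:
        raise ValueError("hours must be between 1 and 23")
    return [
        "schtasks", "/Create",
        "/TN", task_name,                # hypothetical task name
        "/TR", "shutdown /r /f /t 60",   # forced restart after a 60 s delay
        "/SC", "HOURLY",
        "/MO", str(hours),               # repeat every N hours
        "/RU", "SYSTEM",                 # run without a logged-in user
        "/F",                            # overwrite an existing task of the same name
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it on Windows.
    print(" ".join(build_reboot_task_command(12)))
```

Printing the command first, rather than executing it directly, leaves room to check it against the machine's setup. A forced restart will interrupt any WU in progress, which the client should resume from its last checkpoint.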

ChasingTheDream wrote:...I can use TeamViewer to remote in to the machines via my phone but as I've mentioned oftentimes the machines become so corrupt after a driver failure that I have to hit the reset button. If I catch it soon enough the machine will reboot fine on its own but if the video driver crashes multiple times the machine is not able to reboot on its own...

Oh, I was going to suggest the reboot option present in TeamViewer but if you can't connect to it then it's not useful at all.

ChasingTheDream wrote:...Someone in the EVGA forum told me they had another folder that ran into a similar situation where it just appeared to be impossible to get two GPU's to fold consistently with that person's hardware. Apparently that person stopped folding after fighting with it for a while and not being able to find a solution. They want me to try overvolting my CPU and the PCI-E lanes, but I ran into the stability issues even with the CPU folding disabled, so it would have to come down to the bus lanes if that were the case. I need to look in the BIOS to see if I can even find settings to do it, but I'll try it if I can...

It has been quite a while since I came across incompatible hardware. Unfortunately, the root cause hasn't been identified yet.

ChasingTheDream wrote:...In any event, I intend to keep folding. I have no intention of going back to "mining" even if the hardware can do it...

Kudos to you for not giving up!

ChasingTheDream wrote:...So I wrote an explanation of what I was doing and how I was going to try to keep two GPU's per system running, and right after I posted the message I checked my machines and 4 out of 6 were down. LOL. See, I think the machines just like messing with me. ...

I guess the systems intercepted your post and decided to show you who's in control of the systems... at least you have systems with A.I.
