Problems with 155.247.166.220?

Posted: Thu Oct 01, 2015 1:49 pm
by patermann
I have two slots, one of which is assigned to work server 155.247.166.220. It finished a work unit yesterday but failed to upload it to the work server, so it sent it to the collection server 155.247.166.219 instead. Since then, however, the slot has been sitting idle, and there are no messages for it in the log after the work unit was uploaded. (FAHClient shows the slot as "Ready" and the status as "Download", but everything else is 0/Unknown, and it has been that way for more than 24 hours now.) The other slot is assigned to work server 171.64.65.99 and is folding away merrily. Everything looks OK on the server status page, so I do not understand why it is not working. Any ideas?

Thanks,
patermann

Re: Problems with 155.247.166.220?

Posted: Thu Oct 01, 2015 4:17 pm
by Joe_H
There is a known problem where the client sometimes fails to detect, and then retry, a failed download or upload attempt. It is not a frequent issue, but once it happens the only known way out is to restart the FAHClient process. The easiest way to do this is a system reboot or, on Windows, pausing folding and then stopping and restarting the FAHClient process from the taskbar icon. Restarting the FAHClient process without a reboot is a bit more involved on Linux and OS X.

Re: Problems with 155.247.166.220?

Posted: Thu Oct 01, 2015 5:27 pm
by bruce
Server 155.247.166.220 seems to be operating normally, so you must have encountered the bug Joe_H is talking about.

Re: Problems with 155.247.166.220?

Posted: Mon Feb 15, 2016 1:43 pm
by Ricky
Somehow I have developed this problem on one of my two Windows machines. I don't necessarily believe it has to do with any particular server. If I reboot, I can go about a day running multiple WUs on the CPU slot and the two GPU slots. After about a day of folding, the CPU slot will stall on a download and never restart without a reboot. This problem started about a week ago.

When I reboot, one GPU work unit gets dumped most of the time for bad platform size. I believe this is from the client not reassigning the right WU to the right GPU on restart. One card is a GTX980 and the other is a GTX980ti. So I have to finish the GPU work units before rebooting to guarantee there is nothing to dump. This issue reduces the folding efficiency of the computer substantially. It would be best to give up on folding with the CPU slot if I can't resolve its stalled download problem.

UPDATE:

I had a GPU stall on a download while the CPU still had its stalled download. I turned off the Windows firewall. The GPU immediately started its download and completed it. I then rebooted to get the CPU to download, but lost the GTX980ti work unit. There were two copies of FAH listed in the firewall monitor: one was private and the other was public. I have now set FAHControl in the firewall to both public and private.

Re: Problems with 155.247.166.220?

Posted: Mon Feb 15, 2016 4:52 pm
by bruce
Ricky wrote: I believe this is from the client not reassigning the right WU to the right GPU on restart. One card is a GTX980 and the other is a GTX980ti.
That's not what's happening. The GTX980 and the GTX980 ti are both classified as GPU type 2, subtype 5 (see GPUs.txt) so FAH treats those two GPUs identically. Even if they're reassigned as you suggest, the WU can be completed by the other GPU.

Internally they are in fact different, a GM204 and a GM200, but the drivers present the same OpenCL 1.2 interface, so FAH doesn't distinguish between them. What is not clear is what internal differences might be found within the drivers, but that's a question for NVidia. (I expect that they also present the same CUDA interface, but I have not confirmed that. Not that it matters, though, since it isn't currently being used by the FAHCores.)
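
To illustrate the classification point, here is a minimal Python sketch of how two cards can land in the same type/subtype bucket. It assumes GPUs.txt uses a colon-separated layout of vendor ID, device ID, type, subtype, and description. The two sample lines (including the device IDs) and the parse_gpus helper are hypothetical, written for illustration and not copied from the live file or from the client's code.

```python
# Hypothetical sample lines in the ASSUMED GPUs.txt layout:
#   vendor_id:device_id:type:subtype:description
# The IDs below are illustrative, not copied from the real file.
SAMPLE = """\
0x10de:0x13c0:2:5:GM204 [GeForce GTX 980]
0x10de:0x17c8:2:5:GM200 [GeForce GTX 980 Ti]
"""


def parse_gpus(text):
    """Map each description to its (type, subtype) pair."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first four colons only, so the description
        # may itself contain colons.
        _vendor, _device, gpu_type, subtype, description = line.split(":", 4)
        entries[description] = (int(gpu_type), int(subtype))
    return entries


classes = parse_gpus(SAMPLE)
for name, cls in classes.items():
    print(name, cls)
```

Under these assumptions both entries map to (2, 5), which is why a WU started on one card can be finished by the other.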

UPDATE: When a firewall misbehaves, it can be tricky to fix, other than by moving to a different firewall and hoping it doesn't act the same way.