Re: work units not sending. Also, idle not working properly
Posted: Sun May 25, 2014 1:12 pm
by 7im
Volnaiskra wrote:My 'unsent' cores stacked up again, resulting in FAH greying out and locking. I forcibly closed it, deleted all of the work folders, restarted PC, and it's right as rain again. Since there doesn't seem to be any known solution to my problem, I guess I'll just get in the habit of purging the work folders from time to time. But thanks for all your help, guys.
But I'd also like to report this as a bug. Because regardless of what's causing the "acknowledging WU received" packets to be lost, I'm sure FAH could deal with such situations better, rather than getting confused and breaking down each time. I'm no software developer, but surely one or more of the following methods could feasibly prevent these situations becoming so problematic:
1. if there's some sort of limit to how many WUs FAH can handle at one time before bugging out, then raise the limit
2. Periodically check for WUs that have been 99%+ sending for more than several days. When found, restart the upload process
3. Periodically check for WUs that have been 99%+ sending for more than several days. When found, check against the database to see whether they were actually accepted
4. Periodically check for WUs that have been 99%+ sending for more than a week or so. When found, move them into some sort of quarantine area where they won't conflict with regular FAH operations
5. Periodically check for WUs that have been 99%+ sending for more than a week or so. When found, treat them as losses and delete their work files.
Is there somewhere I can post this as a bug/feature request? Or could one of you guys do it for me?
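Suggestions 2 to 5 in the quoted list share one step: find WUs that have been stuck at "99%+ sending" for longer than some threshold, then apply a policy to them. A minimal Python sketch of that check follows. It is not FAHClient code; the `QueuedWU` record, its fields, and the threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueuedWU:
    """Hypothetical record for one work unit waiting in the upload queue."""
    wu_id: str
    percent_sent: float        # upload progress, 0-100
    sending_since: datetime    # when the current upload state began


def find_stale(queue, now, min_percent=99.0, max_age=timedelta(days=7)):
    """Return WUs that are almost fully sent but have been stuck too long."""
    return [
        wu for wu in queue
        if wu.percent_sent >= min_percent and now - wu.sending_since > max_age
    ]


def quarantine(queue, now):
    """Split the queue into (active, quarantined), as in suggestion 4."""
    stale_ids = {wu.wu_id for wu in find_stale(queue, now)}
    active = [wu for wu in queue if wu.wu_id not in stale_ids]
    stale = [wu for wu in queue if wu.wu_id in stale_ids]
    return active, stale
```

The same `find_stale` result could instead feed a restart of the upload (suggestion 2) or a deletion (suggestion 5); only the action taken on the returned list would differ.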
When a problem only happens on one computer, the problem is with that computer, not with fah. If this was a fah problem, we would have many reports of this problem from many different donors.
Reporting this to fah will not help fix your computer.
Re: work units not sending. Also, idle not working properly
Posted: Sun May 25, 2014 9:44 pm
by bruce
As PantherX has suggested, items 2,3,4 are designed as follows:
The client is totally responsible for those decisions. When an acknowledgement is received from the server, the local copy of the result is deleted. Otherwise the local copy is saved and will be retried later. The time between retries increases exponentially to keep from overloading a server that has recently recovered from an outage. If the WU expires, the local copy is deleted and the client will no longer try to upload it.
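The retry behaviour described here can be sketched in a few lines of Python. This is an illustration only, not the client's actual code: the base delay, growth factor, cap, and function names are all assumptions.

```python
from datetime import datetime


def retry_delays(base_seconds=60, factor=2, max_seconds=6 * 3600):
    """Yield the wait before each retry, growing exponentially up to a cap.

    All three parameter values are hypothetical.
    """
    delay = base_seconds
    while True:
        yield min(delay, max_seconds)
        delay *= factor


def next_action(ack_received: bool, now: datetime, deadline: datetime) -> str:
    """Decide what to do with the local copy after an upload attempt."""
    if ack_received:
        return "delete"   # server confirmed receipt
    if now >= deadline:
        return "delete"   # WU expired, stop retrying
    return "retry"        # keep the local copy and try again later
```

With these assumed defaults the first waits are 60, 120, 240 and 480 seconds, so a server coming back from an outage is not hit by every client at once.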
As the number of simultaneous uploads increases, additional server resources must be allocated. The limit on that number is designed to prevent service degradation beyond a certain point due to running out of those resources. The actual number varies, depending on the server hardware.
Re: work units not sending. Also, idle not working properly
Posted: Sun May 25, 2014 10:32 pm
by Volnaiskra
7im wrote:
When a problem only happens on one computer, the problem is with that computer, not with fah. If this was a fah problem, we would have many reports of this problem from many different donors.
Reporting this to fah will not help fix your computer.
You've misunderstood. There are two totally separate issues at play. One is the problem of acknowledgements routinely going missing. That problem is particular to my system.
The second problem is what FAH does once there is a number of accumulated missed acknowledgements. At the moment, it shits itself and stops working properly. I doubt that is particular to my system. I'd bet money that it would be recreatable on anyone's system that managed to get enough missed acknowledgements. There could be many reasons for getting missed acknowledgements (bad internet connection, bad server, firewall, corrupt packet, ISP filtering, cleaning lady tripping over the ethernet cable), and they are all irrelevant. The issue is how the FAH software deals with such an eventuality when/if it happens. At the moment, it does not seem to understand what is happening with its own WU queue, and consequently breaks down. I'm suggesting that it could be improved so that it better understands its own work queue and doesn't get routinely tripped up by this remarkably simple, albeit rare, situation.
PantherX wrote:By chance, do you remember how many WUs you had in your queue before it greys out? Is it the same number always?
1) Assuming that the number of WUs in your queue is causing the issue, you could change this setting:
max-queue <integer=16>
Maximum units per slot in the work queue.
Yes, it always seems to be the same amount, though I haven't usually counted it. I think I deleted 10 work folders last time.
Where do I change that setting: in the config.xml, or in advanced configuration in FAHcontrol? And do you know the maximum value?
2) AFAIK, if the acknowledgement isn't received, it will automatically attempt to upload next time. Does this not happen in your case?
No, definitely not. If there is meant to be such a feature, then it doesn't seem to be working properly. I've never seen more than one attempt at an upload in the logs.
4) This is already present to a certain extent. If the WU is still present and the expiration deadline is reached, it is deleted from the queue. You just have to wait long enough for the WU to expire.
I figured that would be the case, though it doesn't help in my situation, since the WUs accumulate much faster than they expire.
Re: work units not sending. Also, idle not working properly
Posted: Mon May 26, 2014 12:51 am
by bruce
Volnaiskra wrote:
PantherX wrote:By chance, do you remember how many WUs you had in your queue before it greys out? Is it the same number always?
1) Assuming that the number of WUs in your queue is causing the issue, you could change this setting:
max-queue <integer=16>
Maximum units per slot in the work queue.
Yes, it always seems to be the same amount, though I haven't usually counted it. I think I deleted 10 work folders last time.
Where do I change that setting: in the config.xml, or in advanced configuration in FAHcontrol? And do you know the maximum value?
Yes, it is in FAHControl and/or config.xml. They're the same thing. We always recommend that you use FAHControl to modify the settings stored in config.xml rather than editing the file directly. The client will use the default value unless you set it to a different value.
I'm not aware of a maximum value, but all that will do is repeatedly send more WUs over and over. The fundamental problem is that the confirmation message is not reaching the client, and nothing you can do inside of FAH will fix that problem. Once that problem is cured, I'm confident that the other issues will be resolved, as well.
What permission settings are associated with the work files? Are these on Windows or OS-X? Was your installation using the default settings "for me only" or did you install "for everybody"? (By default, a Windows installation will place the files in %APPDATA%\fahclient)
Re: work units not sending. Also, idle not working properly
Posted: Mon May 26, 2014 2:37 am
by 7im
I understood just fine. There is no one else reporting this issue, no one else is losing enough acknowledgements for the client to shit itself. With so many other issues to fix yet, this will get a very low priority after I report it to the developers. With limited development resources, bigger bugs get squashed first.
Re: work units not sending. Also, idle not working properly
Posted: Mon May 26, 2014 7:35 am
by Volnaiskra
bruce wrote:I'm not aware of a maximum value, but all that will do is repeatedly send more WUs over and over. The fundamental problem is that the confirmation message is not reaching the client, and nothing you can do inside of FAH will fix that problem. Once that problem is cured, I'm confident that the other issues will be resolved, as well.
I'm just after 'damage minimisation' at the moment. Currently, it takes about a week for FAH to lock up, so I'd have to remember to purge the work folders once a week. If I can raise the maximum value and make it so I only have to purge them once a month, that would be a win.
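As a rough worked estimate: assuming the lock-up is triggered by the queue limit and that stuck WUs accumulate at a steady rate (neither of which is confirmed in this thread), the value needed to stretch the interval between purges scales linearly. The function name below is made up for illustration.

```python
import math


def queue_limit_for(days_wanted, days_observed, limit_observed):
    """Scale the queue limit linearly with the desired time between purges.

    Assumes a constant accumulation rate, which is only a guess.
    """
    return math.ceil(limit_observed * days_wanted / days_observed)
```

With the default max-queue of 16 and a lock-up after about 7 days, `queue_limit_for(30, 7, 16)` gives 69 for a purge roughly once a month.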
What permission settings are associated with the work files? Are these on Windows or OS-X? Was your installation using the default settings "for me only" or did you install "for everybody"? (By default, a Windows installation will place the files in %APPDATA%\fahclient)
Windows7-64 Home. I'm the only user (and hence administrator) of the OS.
I've tried both installation types previously, but currently I'm on the "for me only", which has placed it in %APPDATA%\roaming\fahclient.
7im wrote:I understood just fine. There is no one else reporting this issue, no one else is losing enough acknowledgements for the client to shit itself. With so many other issues to fix yet, this will get a very low priority after I report it to the developers. With limited development resources, bigger bugs get squashed first.
I never said I expected it to be made a high priority. I just pointed out that there appears to be a design flaw or bug, and that it should probably get recorded so the developers can deal with it as they see fit.
Re: work units not sending. Also, idle not working properly
Posted: Mon May 26, 2014 8:25 am
by davidcoton
I also have seen the lost ack and subsequent failure to retry, though on isolated WUs, not every time. In my case a restart of FAH will cause a retry (usually producing "GOT ALREADY", confirming that it is the Ack that went AWOL). I agree this seems to be a bug in the error recovery, which is secondary but "ought" to be fixed. Ticket #983 (major) relates to similar failures to retry on download, but suggests there is no upload problem. Perhaps the ticket needs updating?
David
Re: work units not sending. Also, idle not working properly
Posted: Mon May 26, 2014 9:00 am
by Volnaiskra
bruce wrote:
Yes, it is in FAHControl and/or config.xml. They're the same thing. We always recommend that you use FAHControl to modify the settings stored in config.xml rather than editing the file directly. The client will use the default value unless you set it to a different value.
just like this?

Re: work units not sending. Also, idle not working properly
Posted: Mon May 26, 2014 11:36 am
by PantherX
Volnaiskra wrote:...Yes, it always seems to be the same amount, though I haven't usually counted it. I think I deleted 10 work folders last time...
If you can provide us with the exact steps on how to replicate this issue, a bug report can be opened. If the Work Queue is full, it should still be visible in Advanced Control (AKA FAHControl) and Web Control. Do make sure that you are using V7.4.4 and that the steps ensure that the issue can be recreated on-demand.
Volnaiskra wrote:...No, definitely not. If there is meant to be such a feature, then it doesn't seem to be working properly. I've never seen more than one attempt at an upload in the logs...
Thanks for informing us. I have added this information to Ticket 983 (https://fah.stanford.edu/projects/FAHCl ... comment:14).
Volnaiskra wrote:...just like this?

Yep, that should do it. You can make sure by viewing the initial section of the log file which will contain the latest configuration. I am not sure but if this doesn't work, maybe you would have to restart FAHClient for the effect to take place.
Re: work units not sending. Also, idle not working properly
Posted: Mon May 26, 2014 7:06 pm
by bruce
davidcoton wrote:I also have seen the lost ack and subsequent failure to retry, though on isolated WUs, not every time. In my case a restart of FAH will cause a retry (usually producing "GOT ALREADY", confirming that it is the Ack that went AWOL). I agree this seems to be a bug in the error recovery, which is secondary but "ought" to be fixed. Ticket #983 (major) relates to similar failures to retry on download, but suggests there is no upload problem. Perhaps the ticket needs updating?
David
Those are different things.
1) If your internet connection had a transient problem, FAH should time out, kill the current (up-/down-)load and restart that data transfer rather than remaining permanently hung with the same (up-/down-)load. That problem is addressed by the ticket. (That's a failure-to-retry issue.)
2) If your connection is simply unreliable, dropping an occasional acknowledgement, the client SHOULD try to upload the WU again and wait for the message to be received.
3) If all acknowledgement messages are being filtered by something in your router or security software (or whatever), I contend that the bug is not part of FAH. Development cannot (and should not) try to fix this problem. You have to unblock your security software so they won't be filtered. The FAH server can do nothing to bypass that filter.
Re: work units not sending. Also, idle not working properly
Posted: Thu May 29, 2014 12:39 am
by Volnaiskra
bruce wrote:3) If all acknowledgement messages are being filtered by something in your router or security software (or whatever), I contend that the bug is not part of FAH. Development cannot (and should not) try to fix this problem. You have to unblock your security software so they won't be filtered. The FAH server can do nothing to bypass that filter.
No, in this situation, the bug wouldn't be part of FAH. But how FAH reacts to blocked acks - no matter what circumstances they occur in - is the responsibility of FAH. If you wear your new shirt outside, and the rain makes the colours run and ruins the shirt....do you blame the rain or the shirt? It seems to me that if there's a known Achilles heel in a piece of software, a developer should want to fix it if possible.
By the way, just for clarification: I've never stated that ALL of my acks go missing. It's only some of them.
Re: work units not sending. Also, idle not working properly
Posted: Thu May 29, 2014 1:00 am
by 7im
Please post a list of what projects work, and which ones stack up. A list of the specific work units (PRCG numbers) also, if possible.
Re: work units not sending. Also, idle not working properly
Posted: Thu May 29, 2014 4:16 am
by bruce
Volnaiskra wrote:But how FAH reacts to blocked acks - no matter what circumstances they occur in - is the responsibility of FAH. If you wear your new shirt outside, and the rain makes the colours run and ruins the shirt....do you blame the rain or the shirt? It seems to me that if there's a known Achilles heel in a piece of software, a developer should want to fix it if possible.
I agree: The developer should want to fix it. I contend that it's not possible. Perhaps you can suggest how to do that given the information that the client has access to.
After your client sends a project, either an acknowledgement is 1) received or it is 2) not received.
1) If the upload is received and the acknowledgement reply is received, the client discards the local copy and does not resend it.
2A) If the upload is incomplete or otherwise not successfully completed, the server waits for the upload to finish and eventually it gives up and resets the connection. There will be no acknowledgement. The client SHOULD attempt to send it again.
2B) If the server receives the WU, acknowledges it, but the acknowledgment is blocked, the client sees the same result as 2A -- no acknowledgement. It has to respond as it does in 2A.
The client cannot decide WHY the acknowledgement is missing. It only knows that it didn't get it. How do you propose to program the client to distinguish between 2A and 2B?
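The point can be made concrete with a small Python sketch. The enum and function names are hypothetical; the only thing taken from the cases above is that the client's single observable is whether an acknowledgement arrived.

```python
from enum import Enum


class ServerOutcome(Enum):
    """What really happened, which the client cannot see directly."""
    RECEIVED_ACK_DELIVERED = "1"
    UPLOAD_INCOMPLETE = "2A"
    RECEIVED_ACK_BLOCKED = "2B"


def client_observes_ack(outcome):
    """The client's only observation: did an acknowledgement arrive?"""
    return outcome is ServerOutcome.RECEIVED_ACK_DELIVERED


def client_action(ack_seen):
    """The client's decision depends on that observation alone."""
    return "discard local copy" if ack_seen else "keep local copy and retry"
```

Cases 2A and 2B map to the same observation, so any logic built on it has to treat them identically.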
Re: work units not sending. Also, idle not working properly
Posted: Fri May 30, 2014 5:55 am
by Volnaiskra
bruce wrote:The client cannot decide WHY the acknowledgement is missing. It only knows that it didn't get it. How do you propose to program the client to distinguish between 2A and 2B?
We seem to be talking about two different things.
I don't see why the client should need to distinguish between 2A and 2B at all. It just needs to stop locking and freezing when too many unconfirmed acks have stacked up. Whether those unconfirmed acks are from 2A or 2B is irrelevant. In fact, since 2B doesn't even really exist as far as FAH is concerned, the problem really is just in identifying why FAH malperforms when too many 2As build up.
Re: work units not sending. Also, idle not working properly
Posted: Fri May 30, 2014 6:21 am
by bruce
Yep. We're talking about two different things.
In my mind, a WU should upload and receive an acknowledgement within a very few retries unless the Work Server and the Collection Server are both down. Therefore, the number of unsent WUs never gets to be a large number.
Suppose your connection is so unreliable that 10 WUs accumulate. If you restart your client, I suspect that all 10 of them will attempt to share the connection by trying to upload concurrently. (I never have seen more than 2 WUs so I'm not sure what actually happens with 10. Perhaps only the first two or three share the bandwidth.) That's a good design if you have a connection that is reliable and fast. If you have a connection that's unreliable and slow, it's best to attempt to upload the WUs serially even though the total time spent uploading will increase. There may already be a ticket to enhance the client to upload serially.
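A back-of-the-envelope Python sketch of why serial uploads can suit an unreliable link better. It assumes an idealised link shared evenly between concurrent uploads and ignores protocol overhead and retries; the function name and the example sizes are hypothetical.

```python
def time_in_flight(num_wus, wu_megabytes, link_mb_per_s, concurrent):
    """Seconds each upload's connection must stay open.

    Concurrent uploads split the bandwidth evenly between them;
    a serial upload gets the whole link to itself.
    """
    sharers = num_wus if concurrent else 1
    return wu_megabytes * sharers / link_mb_per_s
```

For 10 WUs of 5 MB each on a 0.5 MB/s link, every concurrent upload has to stay open for 100 seconds, against 10 seconds per upload when sent one at a time. The longer a single transfer stays open, the more exposed it is to a dropped connection or a lost acknowledgement.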