Re: 171.64.65.56
Posted: Sun Aug 15, 2010 7:57 pm
by weedacres
The last few days have been particularly bad.
Between 171.67.108.25 and 171.64.65.56 returning 503s and only occasionally completing an upload, and -send rarely working, I've had to do a lot of restarts to get anything uploaded.
Re: 171.64.65.56
Posted: Sun Aug 15, 2010 9:05 pm
by Grandpa_01
Yes, they are having some pretty bad issues with this server. Most of the -smp WUs I complete take between 1 and 3 hours from the time I complete them for the server to accept them. I am sure Stanford will be doing something about the issue soon; after all, the purpose of the smp and bonus points was to encourage us to get them back as fast as possible. I don't think this hurry-up-and-wait thing was part of the plan.
Re: 171.64.65.56
Posted: Sun Aug 15, 2010 9:23 pm
by Grandpa_01
bruce wrote:
FlipBack wrote:My work unit has finally been returned successfully. After like 22 hours. Oh well, at least it is in. Sounds like the server just has a high load or something...
Something like that. The Net Load is around 200, which is probably the limit that it can handle. . . . too many people all hitting that same server at the same time.
I suppose everybody is running Langouste. Doesn't it increase the number of connections that each person has? The standard client uses one at a time but I think now folks are using three and the server can't handle it. One gets aborted but the server probably doesn't realize that, a second is started for an immediate download, and a third for the upload.
I looked at the server log after reading your comment about Langouste. The log only goes back to the 7th, but the server has been pretty busy since then. If it is Langouste that is causing this problem, perhaps they can figure out a way to bring the ban hammer down on those who are using it.
Re: 171.64.65.56
Posted: Sun Aug 15, 2010 10:59 pm
by bruce
Grandpa_01 wrote:I looked at the server log after reading your comment about Langouste. The log only goes back to the 7th, but the server has been pretty busy since then. If it is Langouste that is causing this problem, perhaps they can figure out a way to bring the ban hammer down on those who are using it.
I do not know if it's fair to call this the cause. It's strictly speculation on my part about what might have changed.
If it does happen to be the cause, it seems a shame that saving the relatively small amount of time required to upload a WU before downloading a new one is considered worth the congestion that we see and the extra hassle that it's causing everyone.
By default, the client tries to upload once or twice when the WU is finished. If those attempts happen to fail, the client usually does not retry for 6 hours. Why was the client designed that way? So that people do not contribute to an overloaded server. Under those conditions, I've seen reports of people repeatedly restarting their client (with or without -send all) to try to force their WU to upload. This DOES contribute to a server overload. Each person who intentionally attempts to jump to the front of the line not only makes others wait longer but also adds to the turbulence in the line, which slows others down even more. I can understand selfish behavior, but that doesn't mean I condone it.
Hopefully Stanford will stop assigning any new work from that server until most of the backlog of uploads can be accepted -- and then assign new work very sparingly. I have no idea whether that will mean that folks won't be able to get their favorite project; that might happen, but it would at least be a small improvement.
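To put rough numbers on the retry behavior described above, here is a minimal Python sketch. It assumes two initial attempts (the post says "once or twice"), a fixed 6-hour retry interval, and that every manual restart triggers exactly one upload attempt. The function names are made up for illustration and are not part of the FAH client.

```python
import math

RETRY_INTERVAL_HOURS = 6.0  # from the post: the client waits about 6 hours
INITIAL_ATTEMPTS = 2        # assumption: "once or twice" taken as twice


def upload_attempt_times(hours_until_server_accepts,
                         initial_attempts=INITIAL_ATTEMPTS,
                         retry_interval=RETRY_INTERVAL_HOURS):
    """Hours after WU completion at which the default client tries to upload.

    The server is modeled as rejecting every attempt made before
    `hours_until_server_accepts`; the list ends with the first success.
    """
    if hours_until_server_accepts <= 0:
        return [0.0]  # the very first attempt succeeds
    times = [0.0] * initial_attempts  # immediate attempts, all rejected
    t = retry_interval
    while t < hours_until_server_accepts:
        times.append(t)  # rejected retry
        t += retry_interval
    times.append(t)      # first retry that lands after the server recovers
    return times


def attempts_with_manual_restarts(hours_until_server_accepts,
                                  restart_every_hours):
    """Upload attempts made by someone restarting the client on a fixed
    schedule (assumption: one attempt at completion, one per restart)."""
    if hours_until_server_accepts <= 0:
        return 1
    return math.ceil(hours_until_server_accepts / restart_every_hours) + 1


default_attempts = len(upload_attempt_times(22))
restart_attempts = attempts_with_manual_restarts(22, 0.25)
print(default_attempts, restart_attempts)
```

Under these assumptions, a server that stays unavailable for 22 hours sees 6 attempts from a client left alone and 89 from one restarted every 15 minutes. The exact figures depend entirely on the assumed intervals; the point is only the order-of-magnitude difference per donor.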
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 12:25 am
by weedacres
From what I saw of Langouste, it does not try to send data any more than the client itself. If it can't upload, it waits for the client to try again in 6 hours. Its only function, from what I could see, was to immediately download a new work unit while trying to upload the last. It uses the -send command to accomplish this, and with -send being so unreliable I ended up with more stuck work units than without it, so I stopped using it.
The problem is that often after 6 hours the server is no more able to receive the now-aging files than it was when they were completed. PG's bonus system encourages people to get the data in as soon as it's completed. If they can't handle the load, then perhaps they should change something.
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 3:07 am
by bruce
weedacres wrote:From what I saw of Langouste, it does not try to send data any more than the client itself. If it can't upload, it waits for the client to try again in 6 hours. Its only function, from what I could see, was to immediately download a new work unit while trying to upload the last. It uses the -send command to accomplish this, and with -send being so unreliable I ended up with more stuck work units than without it, so I stopped using it.
The problem is that often after 6 hours the server is no more able to receive the now-aging files than it was when they were completed. PG's bonus system encourages people to get the data in as soon as it's completed. If they can't handle the load, then perhaps they should change something.
The every-6-hours bit is only part of the problem. Doesn't it (A) abort a connection (which the server probably waits an hour to close), (B) start a download, and (C) start an upload? By my count, that's three connections, whereas the original client would only be using one.
I'm not saying I'm sure of how it works, but it does open more connections, and if that's clogging up an over-committed server, it's a problem. Sure, you could do the same thing manually, but either of those actions contributes to congestion that really needs to be reduced. I have the same problem with people who restart their client unnecessarily.
I have no doubt that Stanford will figure out how to manage the server better, but that's probably not going to happen until tomorrow, and the real question is whether we, as a community, continue to fight over the over-committed resources or if we collectively decide to help each other between now and then.
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 3:54 am
by Hyperlife
I'm sure tear will jump in here, but from my experience with it, Langouste doesn't appear to open any more connections than the standard client does.
The main client's initial connection to upload a completed WU never makes it to the server because Langouste interrupts it (by using a proxy port, I believe). The forked client then creates the first (and only) upload connection while the main client moves on to request a new WU. The request gets sent to an assignment server initially, and then to a work server which may or may not be the same work server that's receiving the WU from the forked client.
If the forked client fails to upload the WU, then Langouste does nothing more until the next WU is completed -- the main client will then retry every six hours as usual.
My guess is that Langouste is not causing the problem here. More likely it's from people manually restarting their clients instead of waiting for the six-hour retry -- they don't want to lose bonus points by waiting that long. They now have an incentive to hit the server over and over again for the faster return.
People who haven't actually used Langouste or looked at the source code shouldn't be speculating about its role in this problem.
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 4:19 am
by weedacres
Hyperlife wrote:I'm sure tear will jump in here, but from my experience with it, Langouste doesn't appear to open any more connections than the standard client does.
The main client's initial connection to upload a completed WU never makes it to the server because Langouste interrupts it (by using a proxy port, I believe). The forked client then creates the first (and only) upload connection while the main client moves on to request a new WU. The request gets sent to an assignment server initially, and then to a work server which may or may not be the same work server that's receiving the WU from the forked client.
If the forked client fails to upload the WU, then Langouste does nothing more until the next WU is completed -- the main client will then retry every six hours as usual.
My guess is that Langouste is not causing the problem here. More likely it's from people manually restarting their clients instead of waiting for the six-hour retry -- they don't want to lose bonus points by waiting that long. They now have an incentive to hit the server over and over again for the faster return.
People who haven't actually used Langouste or looked at the source code shouldn't be speculating about its role in this problem.
Hyperlife's description matches what I've observed.
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 4:31 am
by Grandpa_01
So from what I am reading in the descriptions of what Langouste is doing, it is interrupting the upload attempt and making the connection. That is 1 connection per FAH client. Then the FAH client makes a connection to download the next WU; that appears to be 2 connections. So if my math is right, 100 people running Langouste will use as many connections as 200 people not using it. Somehow in my mind it seems Langouste could very easily be contributing to the problem.
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 4:45 am
by Hyperlife
Grandpa_01 wrote:So from what I am reading in the descriptions of what Langouste is doing, it is interrupting the upload attempt and making the connection. That is 1 connection per FAH client. Then the FAH client makes a connection to download the next WU; that appears to be 2 connections. So if my math is right, 100 people running Langouste will use as many connections as 200 people not using it. Somehow in my mind it seems Langouste could very easily be contributing to the problem.
Which is no different than a client running without Langouste. Every client will make two connections (actually 3, if you count the assignment server connection) during a WU-return-and-request cycle; a client running with Langouste merely makes the second (WU download) connection happen a few minutes earlier than before. The second connection is also completed in a few seconds, so the two connections are hardly equivalent in server processor and/or net load cost.
Langouste was released nearly a year ago in September 2009. Don't you think that every work server would have experienced high net load from day one if Langouste was the problem?
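The two counts being argued over here (total connections per cycle versus connections open at the same moment) can be separated with a small Python sketch. Everything numeric in it is an assumption for illustration: a 10-minute upload, a 6-second download, and the download landing on the same work server as the upload, which the thread says may or may not be the case. It models the behavior as described in the posts, not Langouste's actual source.

```python
UPLOAD_MINUTES = 10.0    # hypothetical upload duration
DOWNLOAD_MINUTES = 0.1   # "a few seconds", taken as 6 seconds


def work_server_connections(with_langouste):
    """(start, end) times in minutes of the work-server connections one
    client makes in a single WU-return-and-request cycle."""
    upload = (0.0, UPLOAD_MINUTES)
    if with_langouste:
        # Download is requested right away, while the upload is running.
        download = (0.0, DOWNLOAD_MINUTES)
    else:
        # Standard client: download starts only after the upload ends.
        download = (UPLOAD_MINUTES, UPLOAD_MINUTES + DOWNLOAD_MINUTES)
    return [upload, download]


def peak_concurrent(intervals):
    """Largest number of intervals open at the same instant."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # connection opens
        events.append((end, -1))    # connection closes
    # At equal times, process closes before opens so that back-to-back
    # connections are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    open_now = peak = 0
    for _, delta in events:
        open_now += delta
        peak = max(peak, open_now)
    return peak


standard = work_server_connections(with_langouste=False)
langouste = work_server_connections(with_langouste=True)
print(len(standard), peak_concurrent(standard))
print(len(langouste), peak_concurrent(langouste))
```

In this toy model both setups make the same number of work-server connections per cycle, and the only difference is a brief overlap while the short download runs alongside the upload. Whether that overlap matters to a loaded server is exactly the part this sketch cannot answer.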
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 5:54 am
by 7im
Server issues have been ongoing, one might even say since late 2009. And Langouste for Windows was just recently released, potentially increasing the problems we are seeing.
See, Hyperlife, it's just as easy to infer that Langouste IS the problem as it was for you to infer it ISN'T.
I'm not taking sides. Unsupported conjecture is not helpful, for or against. Until someone can state authoritatively how Langouste does or does not make connections (and how many), and how Stanford's servers do or do not time out those connections, we're not helping the situation by pointing fingers back and forth.
Gather facts first. Then if any finger pointing or defending needs to be done, it can be.
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 12:27 pm
by toTOW
This is the server for p670x ... many people must be happy to not get them
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 2:26 pm
by Hyperlife
7im wrote:Gather facts first. Then if any finger pointing or defending needs to be done, it can be.
I've used Langouste since soon after release, and I've looked at the source code. Have you?
Please follow your own advice. I've already done so.
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 3:18 pm
by 7im
Hyperlife wrote:7im wrote:Gather facts first. Then if any finger pointing or defending needs to be done, it can be.
I've used Langouste since soon after release, and I've looked at the source code. Have you?
Please follow your own advice. I've already done so.
My own advice is to not point fingers, and I have followed that advice.
Since you seem to be an authority on Langouste, please break it down for us and describe how the connections are made, and in what order, to what parts of Stanford or local ports, etc. Thanks.
Re: 171.64.65.56
Posted: Mon Aug 16, 2010 3:23 pm
by AtwaterFS
Hyperlife wrote:
My guess is that Langouste is not causing the problem here. More likely it's from people manually restarting their clients instead of waiting for the six-hour retry -- they don't want to lose bonus points by waiting that long. They now have an incentive to hit the server over and over again for the faster return.
I think the problem is the whole bonus scheme - it's pretty obnoxious... It's a "thumbed nose" to part-time folders and in conjunction w/ buggy SMP3 has led to a LOT of aggravation.
Bonus scheme -> emphasis on faster results and full-time folding -> more results produced -> increased strain on infra -> downtime and pulled projects -> loss of points due to bonus system -> donor aggro...