171.64.65.56 not responding

Moderators: Site Moderators, FAHC Science Team

VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: 171.64.65.56 is in Reject status

Post by VijayPande »

ArVee wrote:.65.56 is in Reject mode again. I really have to ask how many times this needs to be reported before it sinks in that there may be a workload or balancing problem. I mean c'mon, this is beyond ineptitude and right into ridiculous. Why don't you just get to it and address this with an eye to a permanent solution?
This issue has been answered in another thread as well as a recent blog post (giving the full reason, timeline for update, etc). I can understand your frustration here, and I suggest that you check out the details of our plan:

http://folding.typepad.com/news/2008/11 ... asily.html
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: 171.64.65.56 is in Reject status

Post by codysluder »

VijayPande wrote:...The SMP and GPU servers will get it last, since they need additional code to bring the v5 code path up to spec with what the GPU and SMP needs. However, we expect this won't be too onerous to get done.
The high-performance projects will get fixed last, yet they're the ones with the shortest deadlines, so logically they should be first. How about setting up (or reserving) additional Collection Server resources so that the high-performance WUs can be returned promptly even after the old binaries hang? The uniprocessor loads on the collection servers may not be huge, but uniprocessor WUs can tolerate additional delays more easily.

Also, how about some small SMP projects for those of us on dial-up?
AgrFan
Posts: 63
Joined: Sat Mar 15, 2008 8:07 pm

Re: 171.64.65.56 is in Reject status

Post by AgrFan »

ArVee wrote:.65.56 is in Reject mode again. I really have to ask how many times this needs to be reported before it sinks in that there may be a workload or balancing problem. I mean c'mon, this is beyond ineptitude and right into ridiculous. Why don't you just get to it and address this with an eye to a permanent solution?
I'd suggest running the Windows SMP client until more work is available from this server. 171.64.65.64 is fully functional and has plenty of work available.
VijayPande wrote:This issue has been answered in another thread as well as a recent blog post (giving the full reason, timeline for update, etc). I can understand your frustration here, and I suggest that you check out the details of our plan:

http://folding.typepad.com/news/2008/11 ... asily.html
I really don't understand how faulty server code is the answer to this problem. This server was fully functional for quite some time when the 2605 WUs were readily available.

The big questions are a) why is this server low on work, b) why does work not get uploaded to a collection server when this server goes down, and c) why can't a temporary fix be implemented in the low-level HTTP code/library to stop the binaries from hanging when this server gets overloaded?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 171.64.65.56 is in Reject status

Post by bruce »

AgrFan wrote:The big questions are a) why is this server low on work, b) why does work not get uploaded to a collection server when this server goes down, and c) why can't a temporary fix be implemented in the low-level HTTP code/library to stop the binaries from hanging when this server gets overloaded?
All servers make new WUs from the results that are returned. If a server is overloaded, it may be able to assign all the WUs (small data transfers) while not being able to accept the large data transfers associated with uploads. Also, projects do end. Then it takes human thought to learn what an old project told them and devise a project to answer the new questions that were discovered. Then new projects must be prepared and tested before they can be distributed. When an individual server is overloaded or down, there usually is redundancy provided by assigning work from other servers.

The collection servers are running at maximum capacity. I'm not sure when new server capacity will come on-line or how it will be allocated.

Vijay said "likely in the low-level HTTP code/library" which means they have not been able to fully identify why the binary hangs so they also don't know exactly what to fix. Older versions of Linux contain bugs that are fixed in newer versions so the best approach is to upgrade to a new version. (If anyone knows how to do the suggested temporary fix, let us know.)

BTW, the server appears to be running fine right now with a reasonably light load.
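
On the suggested temporary fix: this thread does not describe the client's HTTP layer, so the following is only a minimal sketch of the general idea (bound every network call with a timeout, then move on to the next server), not the Folding@home client's actual code. The function name `try_upload`, the `connect` parameter, and the host names in the usage note are all hypothetical.

```python
import socket


def try_upload(payload, servers, connect=socket.create_connection, timeout=30.0):
    """Send payload to the first reachable (host, port); return it, or None.

    Hypothetical sketch: a real client would speak its own upload protocol
    and check the server's reply. Here we only show that a timeout turns an
    indefinite hang into an error that can be handled.
    """
    for host, port in servers:
        try:
            # The timeout applies to the connect and to later socket calls,
            # so an overloaded server raises instead of blocking forever.
            with connect((host, port), timeout=timeout) as conn:
                conn.sendall(payload)
            return (host, port)
        except OSError:
            # Covers timeouts, refused connections, and resets:
            # fall through to the next server in the list.
            continue
    return None
```

A call might look like `try_upload(data, [("work.example", 8080), ("collect.example", 8080)])`. A timeout only stops the hang; it does not help if the fallback server refuses a WU it has no record of, which is a separate problem raised later in this thread.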
AgrFan
Posts: 63
Joined: Sat Mar 15, 2008 8:07 pm

Re: 171.64.65.56 is in Reject status

Post by AgrFan »

bruce wrote: All servers make new WUs from the results that are returned. If a server is overloaded, it may be able to assign all the WUs (small data transfers) while not being able to accept the large data transfers associated with uploads. Also, projects do end. Then it takes human thought to learn what an old project told them and devise a project to answer the new questions that were discovered. Then new projects must be prepared and tested before they can be distributed. When an individual server is overloaded or down, there usually is redundancy provided by assigning work from other servers.
It sounds like many of the 26xx projects may be ending soon with new projects coming online in the near future. This would explain the lack of work on this server. It would be helpful if Stanford gave periodic status updates of the projects being serviced by this server, specifically the A2 units since they are the units people are the most interested in for the higher point production. If it was known that A2 units are getting scarce, donors may be willing to run other clients temporarily until more A2 units are available. I stopped running the CPU client for the bonus Amber units when Vince Voelz posted in the forum that those projects were soon coming to an end and their priority was going to be lowered. I know Stanford doesn't want to regularly publicize this kind of information but as a donor it does make us feel like our contributions are valued and adds a nice touch to the whole folding experience.
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: 171.64.65.56 is in Reject status

Post by kasson »

The 26xx projects are far from monolithic (they represent a fairly broad range of different scientific projects that I'm working on; some are A2 and some are not). Looking specifically at projects 2668, 2669, 2671, 2673, and 2675, we have obtained a number of interesting results and will certainly provide details once the papers complete scientific review. We do anticipate more projects in this series, but it is always difficult to predict such scientific directions in advance.
WickedPixie
Posts: 6
Joined: Mon Jan 21, 2008 5:40 pm

Re: 171.64.65.56 is in Reject status

Post by WickedPixie »

kasson wrote: Looking specifically at projects 2668, 2669, 2671, 2673, and 2675, we have obtained a number of interesting results and will certainly provide details once the papers complete scientific review. We do anticipate more projects in this series, but it is always difficult to predict such scientific directions in advance.

According to the Psummary description,
These projects study how influenza virus recognizes and infects cells. We are developing new simulation methods to better understand these processes.
Are there any SMP projects, or Uniprocessor projects for that matter, doing AD & PD research?

I'd be glad to switch to Uniprocessor projects if they have something to do with AD & PD research...
bollix47
Posts: 2957
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: 171.64.65.56 is in Reject status

Post by bollix47 »

Tue Dec 16 14:40:21 PST 2008 171.64.65.56 SMP vspg4 kasson full Reject 2.14 63 12 17883 1877 0 2.01 144 144
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: 171.64.65.56 is in Reject status

Post by kasson »

It restarted at 14:52 and is now functioning.
bollix47
Posts: 2957
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: 171.64.65.56 is in Reject status

Post by bollix47 »

Thanks for the update.

It seems a pity when a WU that only takes 6 hours to complete has to wait another 2-3 hours (or 6 without intervention) to upload because a server has a problem and the collection server has no record of the WU.

Makes it difficult to comply with Pande Group's desire for setting up our computers to return WUs as fast as possible.

It seems that the auto restart for servers takes 2-3 hours to kick in. If that is the case perhaps the client's autosend feature could be changed from 6 hours to 3, at least with the SMP and GPU clients where a fast turnaround is expected?

Hopefully we'll see less of this as the server code is updated.
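
The effect of the autosend interval can be sketched with simple arithmetic. This assumes a simplified model that is not confirmed anywhere in this thread: the first upload attempt fails right as the outage begins, and the client then retries at a fixed interval with no manual intervention. The function name `hours_until_upload` is hypothetical.

```python
import math


def hours_until_upload(outage_hours, retry_interval_hours):
    """Hours from the first failed attempt to the first retry after the outage.

    Assumed model (not the client's documented behaviour): the first attempt
    fails at t = 0, and retries happen at fixed multiples of the interval.
    """
    if outage_hours <= 0:
        return 0  # server was up, so the first attempt succeeds
    # First retry time that falls at or after the end of the outage.
    return math.ceil(outage_hours / retry_interval_hours) * retry_interval_hours


print(hours_until_upload(2.5, 6))  # 6: a 2.5-hour outage costs the full 6 hours
print(hours_until_upload(2.5, 3))  # 3: a 3-hour interval halves the wait
```

Under this model a shorter interval cuts the delay for outages in the 2-3 hour range, at the cost of more retry traffic against a server that may already be overloaded.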
314159
Posts: 232
Joined: Sun Dec 02, 2007 2:46 am
Location: http://www.teammacosx.org/

Re: 171.64.65.56 is in Reject status

Post by 314159 »

171.64.65.56 is once again in Reject status.

Help!

Thanks,

John
John (from the central part of the Commonwealth of Virginia, U.S.A.)

A friendly visitor to what hopefully will remain a friendly Forum.
With thanks to all of the dedicated volunteers on the staff here!!
JadeMiner
Posts: 3
Joined: Tue Jul 22, 2008 9:27 am

Re: 171.64.65.56 is in Reject status

Post by JadeMiner »

171.64.65.56 SMP vspg4 kasson full Reject

Looks like this server is in reject status again.
314159
Posts: 232
Joined: Sun Dec 02, 2007 2:46 am
Location: http://www.teammacosx.org/

Re: 171.64.65.56 is in Reject status

Post by 314159 »

Hey JadeMiner,

Looks as if we were posting simultaneously. :)

The GOOD NEWS is that this server came back up about 2 minutes later (an early Xmas Gift, I guess).

I had another machine successfully complete, send, and receive a WU from this server just moments ago.

Thanks to whoever fixed this! (If it was a software reset, thanks to whoever coded that.) 8-)

An early Season's Greetings to ALL; especially to our good and dedicated friends at the Pande Group.

John
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: 171.64.65.56 is in Reject status

Post by VijayPande »

bollix47 wrote: Hopefully we'll see less of this as the server code is updated.
Yes, further rolling out the new server code is a very high priority for 2009 (hopefully in January, but that will depend on QA for GPU2 and SMP server codes).
314159
Posts: 232
Joined: Sun Dec 02, 2007 2:46 am
Location: http://www.teammacosx.org/

Re: 171.64.65.56 is in Reject status

Post by 314159 »

This server has been relatively reliable recently. :)

It was in REJECT MODE for close to an hour, but that was apparently corrected a few minutes ago. (THANK YOU!)

My question:

While the server no longer shows REJECT and is pingable, a column on the serverstats page labeled "NMJ" is presently colored blue and has a "1" in it. This was the only server in this condition when I last looked.

May I ask what this "NMJ" status means? - hopefully not "no more jobs" - :( <--sad - not mad :)

Thank you.

John