
Re: 171.64.65.56 is in Reject status

Posted: Thu Nov 27, 2008 3:28 am
by Xilikon
Same here; I have a LinSMP box that has been trying to get work for a while now (uploads worked fine).

Re: 171.64.65.56 is in Reject status

Posted: Thu Nov 27, 2008 3:41 am
by P5-133XL
My problem has been fixed: I just got a WU from the server.

Thanks.

Re: 171.64.65.56 is in Reject status

Posted: Thu Nov 27, 2008 12:03 pm
by noorman
Now I know why I find this server so loaded (CPU and network) ...
EDIT: my system was also waiting to upload and to get new work; it's been going again for ~8.5 hrs now.

It also seems to have a low count of available WUs!
(That may be because it wasn't accepting the uploads from which new work is created.)

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 5:06 pm
by AgrFan
171.64.65.56 is down again :(

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 5:47 pm
by noorman
AgrFan wrote: 171.64.65.56 is down again :(

Passed on to Dr. Kasson and Prof. Pande.

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 6:26 pm
by VijayPande
It should be back up shortly. Thanks for the heads-up.

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 6:37 pm
by P5-133XL
Is there a reason that this particular server seems to have an exceptional amount of downtime? And what can be done on a long-term basis ...

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 7:00 pm
by VijayPande
P5-133XL wrote: Is there a reason that this particular server seems to have an exceptional amount of downtime?
It has to do with the HTTP library that the server code uses. Physically, the server is fine. However, when it gets overloaded, it simply fails to continue and the binary must be killed.
And what can be done on a long-term basis ...
We have been looking into solutions for the past two years. Two years ago, we refactored the server code to clean it up and try to isolate the problem (creating the v4 branch of the server code from v3). That did help, but not enough. One feature of the v4 code is that it can detect when it has this problem and restart itself, but we don't want it restarting itself falsely (that can lead to other problems), so the timeout is somewhat long.
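To illustrate that restart feature: a watchdog with a deliberately long timeout kills and relaunches the server only after a sustained stretch of inactivity, so a transient overload does not trigger a false restart. A minimal sketch in Python (purely illustrative; the real server code, its health check, and its timeout value are not public, so the names and numbers below are assumptions):

Code:

import os
import subprocess
import time

HEALTH_TIMEOUT = 30 * 60   # assumed value; the actual v4 timeout is only described as "somewhat long"
CHECK_INTERVAL = 60        # polling period (also assumed)

def last_activity(logfile):
    # Use the server log's mtime as a crude liveness signal.
    try:
        return os.path.getmtime(logfile)
    except OSError:
        return 0.0

def watchdog(cmd, logfile):
    # Run the server binary; kill and relaunch it only after a long
    # period of no activity, to avoid false restarts under load spikes.
    while True:
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            time.sleep(CHECK_INTERVAL)
            if time.time() - last_activity(logfile) > HEALTH_TIMEOUT:
                proc.kill()   # a wedged binary "must be killed", as above
                proc.wait()
                break
        # loop around and restart the server process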

About a year ago, we took a radically different approach and worked with an outside programming company (Joseph Coffland from Cauldron) to rewrite the server code from scratch. That new server code (v5) is now in beta testing. Projects 3798 and 3799 run on the new server platform. We are rolling it out gradually to avoid any sort of disaster, but so far so good. I expect it will see its first duty in production projects in early 2009 (likely January 2009), with a more complete rollout throughout the year.

The SMP and GPU servers will get it last, since they need additional code to bring the v5 code path up to spec with what the GPU and SMP clients need. However, we expect this won't be too onerous to get done.

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 7:14 pm
by AgrFan
Thanks for the update, Vijay.

Can you provide some information on the future of the 26xx units and where you are with the full migration to the A2 core?

I've been receiving primarily 2669 and 2675 units lately on my Linux boxes. Are we getting to the end of the cycle for the 26xx series? Once the new server code is rolled out, will the next series of SMP units be 37xx?

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 7:26 pm
by TheWolf
I've worked 312 P3798 units as of checking just a minute ago.
Picked up another 4 as I was checking, finished all of those without a problem, then picked up a P4623.
Didn't know I had been getting them till a few minutes ago. Guess they're giving them out to anyone in spurts?

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 7:55 pm
by VijayPande
TheWolf wrote: I've worked 312 P3798 units as of checking just a minute ago.
Picked up another 4 as I was checking, finished all of those without a problem, then picked up a P4623.
Didn't know I had been getting them till a few minutes ago. Guess they're giving them out to anyone in spurts?
Please keep in mind that new WUs are created only after the old ones come back. So if a given project is low on WUs, there won't be any until old ones come back. That can lead to the "spurty" behavior you're seeing.
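The serial dependency behind that: each (Run, Clone) trajectory is a chain, and Gen N+1 can only be built from the returned result of Gen N. A toy model (illustrative Python only; the class and method names are invented, not the actual server's data structures):

Code:

class Trajectory:
    # One (run, clone) chain: Gen N+1 exists only after Gen N returns.
    def __init__(self, run, clone):
        self.run, self.clone = run, clone
        self.next_gen = 0
        self.outstanding = False

    def assign(self):
        # Hand out the next gen only if the previous one has come back.
        if self.outstanding:
            return None                       # nothing available yet
        self.outstanding = True
        return (self.run, self.clone, self.next_gen)

    def complete(self):
        # A result came back, so the following gen can now be created.
        self.outstanding = False
        self.next_gen += 1

When many clients return results at nearly the same time, a burst of new gens becomes assignable at once, which is exactly the "spurty" pattern described above.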

Re: 171.64.65.56 is in Reject status

Posted: Fri Nov 28, 2008 9:01 pm
by alberts1414
Server is up and running; I just got my backlog sent.
Cheers, Jeff

Re: 171.64.65.56 is in Reject status

Posted: Sat Nov 29, 2008 6:13 am
by TheWolf
VijayPande wrote:
TheWolf wrote: I've worked 312 P3798 units as of checking just a minute ago.
Picked up another 4 as I was checking, finished all of those without a problem, then picked up a P4623.
Didn't know I had been getting them till a few minutes ago. Guess they're giving them out to anyone in spurts?
Please keep in mind that new WUs are created only after the old ones come back. So if a given project is low on WUs, there won't be any until old ones come back. That can lead to the "spurty" behavior you're seeing.
Thanks for the heads-up.
I can see why now: Q6600 at 8x400 = 3,200 MHz, 2x1 GB RAM at 400 MHz, and it's only taking 1 min 12 sec to finish a WU.
Sorry, I couldn't figure out how to put in a scroll box on this site. I guess everything looks OK with the log below? (A quick way to check the per-unit times is sketched after the log.)

Code:

[18:53:54] Project: 3798 (Run 82, Clone 2, Gen 42)
[18:53:54] 
[18:53:54] Assembly optimizations on if available.
[18:53:54] Entering M.D.
[18:54:00] Protein: p3798
[18:54:00] 
[18:54:00] Writing local files
[18:55:18] Extra SSE boost OK.
[18:55:18] Writing local files
[18:55:18] Completed 0 out of 1500 steps  (0%)
[18:55:42] Writing local files
[18:55:42] Completed 500 out of 1500 steps  (33%)
[18:56:06] Writing local files
[18:56:06] Completed 1000 out of 1500 steps  (67%)
[18:56:30] Writing local files
[18:56:30] Completed 1500 out of 1500 steps  (100%)
[18:56:30] Writing final coordinates.
[18:56:30] Past main M.D. loop
[18:57:30] 
[18:57:30] Finished Work Unit:
[18:57:30] - Reading up to 550656 from "work/wudata_09.arc": Read 550656
[18:57:30] - Reading up to 0 from "work/wudata_09.xtc": Read 0
[18:57:30] goefile size: 0
[18:57:30] Leaving Run
[18:57:34] - Writing 573212 bytes of core data to disk...
[18:57:34] Done: 572700 -> 518168 (compressed to 90.4 percent)
[18:57:34]   ... Done.
[18:57:34] - Shutting down core
[18:57:34] 
[18:57:34] Folding@home Core Shutdown: FINISHED_UNIT
[18:57:38] CoreStatus = 64 (100)
[18:57:38] Sending work to server
[18:57:38] Project: 3798 (Run 82, Clone 2, Gen 42)


[18:57:38] + Attempting to send results [November 28 18:57:38 UTC]
[18:57:52] + Results successfully sent
[18:57:52] Thank you for your contribution to Folding@Home.
[18:57:52] + Number of Units Completed: 129

[18:57:56] - Preparing to get new work unit...
[18:57:56] + Attempting to get work packet
[18:57:56] - Connecting to assignment server
[18:57:56] - Successful: assigned to (171.64.122.139).
[18:57:56] + News From Folding@Home: Welcome to Folding@Home
[18:57:56] Loaded queue successfully.
[18:58:02] + Closed connections
[18:58:02] 
[18:58:02] + Processing work unit
[18:58:02] Core required: FahCore_78.exe
[18:58:02] Core found.
[18:58:02] Working on queue slot 00 [November 28 18:58:02 UTC]
[18:58:02] + Working ...
[18:58:02] 
[18:58:02] *------------------------------*
[18:58:02] Folding@Home Gromacs Core
[18:58:02] Version 1.86 (August 28, 2005)
[18:58:02] 
[18:58:02] Preparing to commence simulation
[18:58:02] - Assembly optimizations manually forced on.
[18:58:02] - Not checking prior termination.
[18:58:02] - Expanded 237515 -> 1167708 (decompressed 491.6 percent)
[18:58:02] - Starting from initial work packet
[18:58:02] 
[18:58:02] Project: 3798 (Run 82, Clone 0, Gen 36)
[18:58:02] 
[18:58:02] Assembly optimizations on if available.
[18:58:02] Entering M.D.
[18:58:08] Protein: p3798
[18:58:08] 
[18:58:08] Writing local files
[18:59:26] Extra SSE boost OK.
[18:59:26] Writing local files
[18:59:26] Completed 0 out of 1500 steps  (0%)
[18:59:50] Writing local files
[18:59:50] Completed 500 out of 1500 steps  (33%)
[19:00:14] Writing local files
[19:00:14] Completed 1000 out of 1500 steps  (67%)
[19:00:38] Writing local files
[19:00:38] Completed 1500 out of 1500 steps  (100%)
[19:00:38] Writing final coordinates.
[19:00:39] Past main M.D. loop
[19:01:39] 
[19:01:39] Finished Work Unit:
[19:01:39] - Reading up to 550656 from "work/wudata_00.arc": Read 550656
[19:01:39] - Reading up to 0 from "work/wudata_00.xtc": Read 0
[19:01:39] goefile size: 0
[19:01:39] Leaving Run
[19:01:43] - Writing 573212 bytes of core data to disk...
[19:01:43] Done: 572700 -> 518906 (compressed to 90.6 percent)
[19:01:43]   ... Done.
[19:01:43] - Shutting down core
[19:01:43] 
[19:01:43] Folding@home Core Shutdown: FINISHED_UNIT
[19:01:46] CoreStatus = 64 (100)
[19:01:46] Sending work to server
[19:01:46] Project: 3798 (Run 82, Clone 0, Gen 36)


[19:01:46] + Attempting to send results [November 28 19:01:46 UTC]
[19:02:00] + Results successfully sent
[19:02:00] Thank you for your contribution to Folding@Home.
[19:02:00] + Number of Units Completed: 130

[19:02:04] - Preparing to get new work unit...
[19:02:04] + Attempting to get work packet
[19:02:04] - Connecting to assignment server
[19:02:05] - Successful: assigned to (171.64.122.139).
[19:02:05] + News From Folding@Home: Welcome to Folding@Home
[19:02:05] Loaded queue successfully.
[19:02:08] + Closed connections
[19:02:08] 
[19:02:08] + Processing work unit
[19:02:08] Core required: FahCore_78.exe
[19:02:08] Core found.
[19:02:08] Working on queue slot 01 [November 28 19:02:08 UTC]
[19:02:08] + Working ...
[19:02:08] 
[19:02:08] *------------------------------*
[19:02:08] Folding@Home Gromacs Core
[19:02:08] Version 1.86 (August 28, 2005)
[19:02:08] 
[19:02:08] Preparing to commence simulation
[19:02:08] - Assembly optimizations manually forced on.
[19:02:08] - Not checking prior termination.
[19:02:08] - Expanded 238158 -> 1167708 (decompressed 490.3 percent)
[19:02:08] - Starting from initial work packet
[19:02:08] 
[19:02:08] Project: 3798 (Run 70, Clone 4, Gen 30)
[19:02:08] 
[19:02:08] Assembly optimizations on if available.
[19:02:08] Entering M.D.
[19:02:15] Protein: p3798
[19:02:15] 
[19:02:15] Writing local files
[19:03:32] Extra SSE boost OK.
[19:03:32] Writing local files
[19:03:32] Completed 0 out of 1500 steps  (0%)
[19:03:56] Writing local files
[19:03:56] Completed 500 out of 1500 steps  (33%)
[19:04:21] Writing local files
[19:04:21] Completed 1000 out of 1500 steps  (67%)
[19:04:45] Writing local files
[19:04:45] Completed 1500 out of 1500 steps  (100%)
[19:04:45] Writing final coordinates.
[19:04:45] Past main M.D. loop
[19:05:45] 
[19:05:45] Finished Work Unit:
[19:05:45] - Reading up to 550656 from "work/wudata_01.arc": Read 550656
[19:05:45] - Reading up to 0 from "work/wudata_01.xtc": Read 0
[19:05:45] goefile size: 0
[19:05:45] Leaving Run
[19:05:49] - Writing 573212 bytes of core data to disk...
[19:05:49] Done: 572700 -> 518953 (compressed to 90.6 percent)
[19:05:49]   ... Done.
[19:05:49] - Shutting down core
[19:05:49] 
[19:05:49] Folding@home Core Shutdown: FINISHED_UNIT
[19:05:52] CoreStatus = 64 (100)
[19:05:52] Sending work to server
[19:05:52] Project: 3798 (Run 70, Clone 4, Gen 30)


[19:05:52] + Attempting to send results [November 28 19:05:52 UTC]
[19:06:16] + Results successfully sent
[19:06:16] Thank you for your contribution to Folding@Home.
[19:06:16] + Number of Units Completed: 131

[19:06:20] - Preparing to get new work unit...
[19:06:20] + Attempting to get work packet
[19:06:20] - Connecting to assignment server
[19:06:20] - Successful: assigned to (171.64.122.139).
[19:06:20] + News From Folding@Home: Welcome to Folding@Home
[19:06:20] Loaded queue successfully.
[19:06:34] + Could not get Work unit data from Work Server
[19:06:34] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[19:06:45] + Attempting to get work packet
[19:06:45] - Connecting to assignment server
[19:06:53] - Successful: assigned to (171.64.122.139).
[19:06:53] + News From Folding@Home: Welcome to Folding@Home
[19:06:54] Loaded queue successfully.
[19:06:56] + Closed connections
[19:06:56] 
[19:06:56] + Processing work unit
[19:06:56] Core required: FahCore_78.exe
[19:06:56] Core found.
[19:06:56] Working on queue slot 02 [November 28 19:06:56 UTC]
[19:06:57] + Working ...
[19:06:57] 
[19:06:57] *------------------------------*
[19:06:57] Folding@Home Gromacs Core
[19:06:57] Version 1.86 (August 28, 2005)
[19:06:57] 
[19:06:57] Preparing to commence simulation
[19:06:57] - Assembly optimizations manually forced on.
[19:06:57] - Not checking prior termination.
[19:06:57] - Expanded 237921 -> 1167708 (decompressed 490.7 percent)
[19:06:57] - Starting from initial work packet
[19:06:57] 
[19:06:57] Project: 3798 (Run 64, Clone 0, Gen 46)
[19:06:57] 
[19:06:57] Assembly optimizations on if available.
[19:06:57] Entering M.D.
[19:07:03] Protein: p3798
[19:07:10] 
[19:07:10] Writing local files
[19:08:20] Extra SSE boost OK.
[19:08:20] Writing local files
[19:08:20] Completed 0 out of 1500 steps  (0%)
[19:08:44] Writing local files
[19:08:45] Completed 500 out of 1500 steps  (33%)
[19:09:08] Writing local files
[19:09:08] Completed 1000 out of 1500 steps  (67%)
[19:09:32] Writing local files
[19:09:32] Completed 1500 out of 1500 steps  (100%)
[19:09:32] Writing final coordinates.
[19:09:32] Past main M.D. loop
[19:10:32] 
[19:10:32] Finished Work Unit:
[19:10:32] - Reading up to 550656 from "work/wudata_02.arc": Read 550656
[19:10:32] - Reading up to 0 from "work/wudata_02.xtc": Read 0
[19:10:32] goefile size: 0
[19:10:32] Leaving Run
[19:10:37] - Writing 573212 bytes of core data to disk...
[19:10:37] Done: 572700 -> 518265 (compressed to 90.4 percent)
[19:10:37]   ... Done.
[19:10:37] - Shutting down core
[19:10:37] 
[19:10:37] Folding@home Core Shutdown: FINISHED_UNIT
[19:10:41] CoreStatus = 64 (100)
[19:10:41] Sending work to server
[19:10:41] Project: 3798 (Run 64, Clone 0, Gen 46)


[19:10:41] + Attempting to send results [November 28 19:10:41 UTC]
[19:11:05] + Results successfully sent
[19:11:05] Thank you for your contribution to Folding@Home.
[19:11:05] + Number of Units Completed: 132

[19:11:09] - Preparing to get new work unit...
[19:11:09] + Attempting to get work packet
[19:11:09] - Connecting to assignment server
[19:11:09] - Successful: assigned to (171.64.122.139).
[19:11:09] + News From Folding@Home: Welcome to Folding@Home
[19:11:09] Loaded queue successfully.
[19:11:12] + Closed connections
[19:11:12] 
[19:11:12] + Processing work unit
[19:11:12] Core required: FahCore_78.exe
[19:11:12] Core found.
[19:11:12] Working on queue slot 03 [November 28 19:11:12 UTC]
[19:11:12] + Working ...
[19:11:12] 
[19:11:12] *------------------------------*
[19:11:12] Folding@Home Gromacs Core
[19:11:12] Version 1.86 (August 28, 2005)
[19:11:12] 
[19:11:12] Preparing to commence simulation
[19:11:12] - Assembly optimizations manually forced on.
[19:11:12] - Not checking prior termination.
[19:11:12] - Expanded 238712 -> 1167708 (decompressed 489.1 percent)
[19:11:12] - Starting from initial work packet
[19:11:12] 
[19:11:12] Project: 3798 (Run 86, Clone 3, Gen 43)
[19:11:12] 
[19:11:12] Assembly optimizations on if available.
[19:11:12] Entering M.D.
[19:11:19] Protein: p3798
[19:11:19] 
[19:11:19] Writing local files
[19:12:37] Extra SSE boost OK.
[19:12:37] Writing local files
[19:12:37] Completed 0 out of 1500 steps  (0%)
[19:13:01] Writing local files
[19:13:01] Completed 500 out of 1500 steps  (33%)
[19:13:25] Writing local files
[19:13:25] Completed 1000 out of 1500 steps  (67%)
[19:13:49] Writing local files
[19:13:49] Completed 1500 out of 1500 steps  (100%)
[19:13:49] Writing final coordinates.
[19:13:49] Past main M.D. loop
[19:14:49] 
[19:14:49] Finished Work Unit:
[19:14:49] - Reading up to 550656 from "work/wudata_03.arc": Read 550656
[19:14:49] - Reading up to 0 from "work/wudata_03.xtc": Read 0
[19:14:49] goefile size: 0
[19:14:49] Leaving Run
[19:14:53] - Writing 573212 bytes of core data to disk...
[19:14:53] Done: 572700 -> 517928 (compressed to 90.4 percent)
[19:14:53]   ... Done.
[19:14:53] - Shutting down core
[19:14:53] 
[19:14:53] Folding@home Core Shutdown: FINISHED_UNIT
[19:14:57] CoreStatus = 64 (100)
[19:14:57] Sending work to server
[19:14:57] Project: 3798 (Run 86, Clone 3, Gen 43)


[19:14:57] + Attempting to send results [November 28 19:14:57 UTC]
[19:15:20] + Results successfully sent
[19:15:20] Thank you for your contribution to Folding@Home.
[19:15:20] + Number of Units Completed: 133

[19:15:24] - Preparing to get new work unit...
[19:15:24] + Attempting to get work packet
[19:15:24] - Connecting to assignment server
[19:15:24] - Successful: assigned to (171.64.122.139).
[19:15:24] + News From Folding@Home: Welcome to Folding@Home
[19:15:25] Loaded queue successfully.
[19:15:27] + Closed connections
[19:15:27] 
[19:15:27] + Processing work unit
[19:15:27] Core required: FahCore_78.exe
[19:15:27] Core found.
[19:15:27] Working on queue slot 04 [November 28 19:15:27 UTC]
[19:15:27] + Working ...
[19:15:28] 
[19:15:28] *------------------------------*
[19:15:28] Folding@Home Gromacs Core
[19:15:28] Version 1.86 (August 28, 2005)
[19:15:28] 
[19:15:28] Preparing to commence simulation
[19:15:28] - Assembly optimizations manually forced on.
[19:15:28] - Not checking prior termination.
[19:15:28] - Expanded 238197 -> 1167708 (decompressed 490.2 percent)
[19:15:28] - Starting from initial work packet
[19:15:28] 
[19:15:28] Project: 3798 (Run 76, Clone 0, Gen 47)
[19:15:28] 
[19:15:28] Assembly optimizations on if available.
[19:15:28] Entering M.D.
[19:15:34] Protein: p3798
[19:15:34] 
[19:15:34] Writing local files
[19:16:52] Extra SSE boost OK.
[19:16:52] Writing local files
[19:16:52] Completed 0 out of 1500 steps  (0%)
[19:17:16] Writing local files
[19:17:16] Completed 500 out of 1500 steps  (33%)
[19:17:41] Writing local files
[19:17:41] Completed 1000 out of 1500 steps  (67%)
[19:18:05] Writing local files
[19:18:05] Completed 1500 out of 1500 steps  (100%)
[19:18:05] Writing final coordinates.
[19:18:05] Past main M.D. loop
[19:19:05] 
[19:19:05] Finished Work Unit:
[19:19:05] - Reading up to 550656 from "work/wudata_04.arc": Read 550656
[19:19:05] - Reading up to 0 from "work/wudata_04.xtc": Read 0
[19:19:05] goefile size: 0
[19:19:05] Leaving Run
[19:19:08] - Writing 573212 bytes of core data to disk...
[19:19:08] Done: 572700 -> 519023 (compressed to 90.6 percent)
[19:19:08]   ... Done.
[19:19:08] - Shutting down core
[19:19:08] 
[19:19:08] Folding@home Core Shutdown: FINISHED_UNIT
[19:19:12] CoreStatus = 64 (100)
[19:19:12] Sending work to server
[19:19:12] Project: 3798 (Run 76, Clone 0, Gen 47)


[19:19:12] + Attempting to send results [November 28 19:19:12 UTC]
[19:19:35] + Results successfully sent
[19:19:35] Thank you for your contribution to Folding@Home.
[19:19:35] + Number of Units Completed: 134
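Incidentally, the "1 min 12 sec" figure can be read straight off the log: each unit runs from 0% to 100% in about 72-73 seconds (18:55:18 to 18:56:30 for the first one). A minimal sketch that extracts this from a log like the above (illustrative Python, not an official FAH tool; the filename is assumed):

Code:

import re
from datetime import datetime

# Matches lines like: [18:55:18] Completed 500 out of 1500 steps  (33%)
PAT = re.compile(r"\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps")

def unit_times(lines):
    # Yield wall-clock seconds from 0% to 100% for each unit.
    start = None
    for line in lines:
        m = PAT.search(line)
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%H:%M:%S")
        done, total = int(m.group(2)), int(m.group(3))
        if done == 0:
            start = t
        elif done == total and start is not None:
            yield (t - start).total_seconds()
            start = None

with open("FAHlog.txt") as f:   # assumed log filename
    print(list(unit_times(f)))  # the log above gives [72.0, 72.0, 73.0, 72.0, 73.0]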

Re: 171.64.65.56 is in Reject status

Posted: Mon Dec 01, 2008 12:30 am
by AgrFan
P5-133XL wrote: Is there a reason that this particular server seems to have an exceptional amount of downtime? And what can be done on a long-term basis ...
This server is low on work and appears to have issues with hard-disk space and high CPU load. The SMP server stat page has been displaying WUs AVAIL < 750 (bold), DL < 2 days (yellow), and high CPU load (red) over the past few weeks.

http://fah-web.stanford.edu/localinfo/contact.SMP.html
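Read as a rule set, those indicators amount to something like the following (a hypothetical reconstruction; the page's exact thresholds and its "high CPU" cutoff are not documented here, so treat every number as an assumption):

Code:

# Hypothetical reconstruction of the stat-page highlighting described
# above; thresholds and the CPU cutoff are assumptions, not the real rules.
def status_flags(wus_avail, days_of_work, cpu_load, high_cpu=4.0):
    flags = []
    if wus_avail < 750:
        flags.append("bold: fewer than 750 WUs available")
    if days_of_work < 2:
        flags.append("yellow: under 2 days of work at current demand")
    if cpu_load > high_cpu:
        flags.append("red: high CPU load")
    return flags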

Earlier in this thread there was a post talking about "leaking jobs" and trying to figure out a way to reclaim them. I never saw an update saying this was fixed. Obviously, this server is overloaded with users running LinSMP for the A2 units.

I've been happily crunching WUs from this server with the notfred disk for quite some time now. The server issues began when the A1 units (e.g. 2605) started to disappear from this server. I highly doubt the recent problems are related to the server code rewrite effort. It does sound like the new server code will eventually help the situation.

Could the low amount of work be related to the pending rollout of the A2 core on WinSMP?

Re: 171.64.65.56 is in Reject status

Posted: Tue Dec 02, 2008 9:16 pm
by ArVee
.65.56 is in Reject mode again. I really have to ask how many times this needs to be reported before it sinks in that there may be a workload or load-balancing problem. I mean, c'mon, this is beyond ineptitude and right into the ridiculous. Why don't you just get to it and address this with an eye to a permanent solution?