Folding Forum

Posted: **Tue Feb 23, 2010 2:32 am**

Looks like there's no need to post a log file, as this seems to be an ongoing issue.
Server in question = 171.64.65.71
On the plus side, this has happened on only one of my four GPU clients.
The question now is whether to let it keep seeking another WU or to turn off the GPU client.

*Edit
Received WU

(decided to leave gpu client running)

[02:03:07] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.

[02:15:02] - Attempt #8  to get work failed, and no other work to do.
Waiting before retry.

[02:25:46] + Attempting to get work packet
[02:25:46] - Will indicate memory of 1023 MB
[02:25:46] - Connecting to assignment server
[02:25:46] Connecting to http://assign-GPU.stanford.edu:8080/
[02:25:47] Posted data.
[02:25:47] Initial: 40AB; - Successful: assigned to (171.64.65.20).
[02:25:47] + News From Folding@Home: Welcome to Folding@Home
[02:25:47] Loaded queue successfully.
[02:25:47] Connecting to http://171.64.65.20:8080/
[02:25:48] Posted data.
[02:25:48] Initial: 0000; - Receiving payload (expected size: 70724)
[02:25:48] Conversation time very short, giving reduced weight in bandwidth avg
[02:25:48] - Downloaded at ~138 kB/s
[02:25:48] - Averaged speed for that direction ~63 kB/s
[02:25:48] + Received work.

Keep up the good work, thx. Joe!

Posted: **Tue Feb 23, 2010 3:36 pm**

Is there any status on the WUs that are still in the state of "could not transmit" but then "server has already received" such as this one here, the first one I saw happen, on 13 Feb 2010 @ 23:06:03 PST:

[07:06:01] Folding@home Core Shutdown: FINISHED_UNIT
[07:06:03] CoreStatus = 64 (100)
[07:06:03] Sending work to server
[07:06:03] Project: 3470 (Run 16, Clone 51, Gen 1)
[07:06:03] - Read packet limit of 540015616... Set to 524286976.


[07:06:03] + Attempting to send results [February 14 07:06:03 UTC]
[07:06:04] - Couldn't send HTTP request to server
[07:06:04] + Could not connect to Work Server (results)
[07:06:04]     (171.67.108.21:8080)
[07:06:04] + Retrying using alternative port
[07:06:25] - Couldn't send HTTP request to server
[07:06:25] + Could not connect to Work Server (results)
[07:06:25]     (171.67.108.21:80)
[07:06:25] - Error: Could not transmit unit 02 (completed February 14) to work server.
[07:06:25]   Keeping unit 02 in queue.
[07:06:25] Project: 3470 (Run 16, Clone 51, Gen 1)
[07:06:25] - Read packet limit of 540015616... Set to 524286976.


[07:06:25] + Attempting to send results [February 14 07:06:25 UTC]
[07:06:25] - Server has already received unit.
[07:06:25] - Preparing to get new work unit...
[07:06:25] + Attempting to get work packet
[07:06:25] - Connecting to assignment server
[07:06:26] - Successful: assigned to (171.67.108.21).
[07:06:26] + News From Folding@Home: Welcome to Folding@Home
[07:06:26] Loaded queue successfully.
[07:06:26] + Closed connections

I have 10 such WUs from Feb 14-15 that will expire at some point soon. By my reasoning, they fall into one of these categories:

1. Not uploaded, and server is broken such that the WUs will simply expire and be reassigned.
2. Not uploaded, and server is broken such that the WUs are marked as done but are not actually done, leading to a loss of science until it is discovered.
3. Uploaded, and my client did not get notified properly and I did not receive the credit for them.
4. Uploaded, and my client did not get notified properly and I received the credit for them.

I think #4 is the least likely as I really didn't see any extra ~3000-4000 points hit my stats. I am still thinking that #2 is the most probable.

Is there any chance that these WUs can be uploaded properly or should I just forget about it and delete my archive of the two GPU clients I have with the 10 WUs?

Here is the full list of project WUs that I have in this state:

P3470, r16, c51,  g1
P3470, r14, c112, g2
P5781, r22, c982, g3
P5781, r28, c199, g3
P5781, r35, c554, g3
P5781, r6,  c539, g4
P5781, r13, c935, g4
P5781, r15, c764, g3
P5781, r22, c303, g3
P5781, r33, c15,  g3
P5781, r9,  c612, g4
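
For anyone who wants to check a list like this against their own records, here is a minimal Python sketch that parses identifiers written in the `P…, r…, c…, g…` shorthand used above and counts them per project. The function name `parse_wu_list` and the three-line sample are made up for illustration; this is not part of the F@H client or any official tool.

```python
import re
from collections import Counter

# Shorthand as posted above: "P<project>, r<run>, c<clone>, g<gen>".
WU_PATTERN = re.compile(r"P(\d+),\s*r(\d+),\s*c(\d+),\s*g(\d+)")

def parse_wu_list(text):
    """Return a list of (project, run, clone, gen) integer tuples (hypothetical helper)."""
    return [tuple(int(x) for x in m.groups()) for m in WU_PATTERN.finditer(text)]

# Three entries copied from the list above, as a sample only.
sample = """
P3470, r16, c51,  g1
P5781, r22, c982, g3
P5781, r9,  c612, g4
"""

units = parse_wu_list(sample)
per_project = Counter(project for project, _run, _clone, _gen in units)
for project, count in sorted(per_project.items()):
    print(f"Project {project}: {count} unit(s)")
```

Pasting the full list into `sample` gives a per-project count that can be compared with the stats pages.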

Posted: **Tue Feb 23, 2010 6:48 pm**

I was able to send my completed WU after 45 minutes of idling, but from the logs it is not clear where my WU ended up: on the original server (171.67.108.21) or on the CS (171.67.108.26).
Needless to say, 45 minutes in the life of this card is ages.

[17:45:33] Folding@home Core Shutdown: FINISHED_UNIT
[17:45:35] CoreStatus = 64 (100)
[17:45:35] Sending work to server
[17:45:35] Project: 5784 (Run 8, Clone 73, Gen 59)
[17:45:35] - Read packet limit of 540015616... Set to 524286976.


[17:45:35] + Attempting to send results [February 23 17:45:35 UTC]
[17:45:40] - Server does not have record of this unit. Will try again later.
[17:45:40] - Error: Could not transmit unit 05 (completed February 23) to work server.
[17:45:40]   Keeping unit 05 in queue.
[17:45:40] Project: 5784 (Run 8, Clone 73, Gen 59)
[17:45:40] - Read packet limit of 540015616... Set to 524286976.


[17:45:40] + Attempting to send results [February 23 17:45:40 UTC]
[17:45:42] - Server does not have record of this unit. Will try again later.
[17:45:42] - Error: Could not transmit unit 05 (completed February 23) to work server.
[17:45:42] - Read packet limit of 540015616... Set to 524286976.


[17:45:42] + Attempting to send results [February 23 17:45:42 UTC]
[18:35:52] + Could not connect to Work Server (results)
[18:35:52]     (171.67.108.26:8080)
[18:35:52] + Retrying using alternative port
[18:35:52] - Couldn't send HTTP request to server
[18:35:52] + Could not connect to Work Server (results)
[18:35:52]     (171.67.108.26:80)
[18:35:52]   Could not transmit unit 05 to Collection server; keeping in queue.
[18:35:52] - Preparing to get new work unit...
[18:35:52] + Attempting to get work packet
[18:35:52] - Connecting to assignment server
[18:35:52] - Successful: assigned to (171.64.65.20).
[18:35:52] + News From Folding@Home: Welcome to Folding@Home
[18:35:52] Loaded queue successfully.
[18:35:53] Project: 5784 (Run 8, Clone 73, Gen 59)
[18:35:53] - Read packet limit of 540015616... Set to 524286976.


[18:35:53] + Attempting to send results [February 23 18:35:53 UTC]
[18:35:56] + Results successfully sent
[18:35:56] Thank you for your contribution to Folding@Home.
[18:35:56] + Number of Units Completed: 344

[18:35:56] + Closed connections
[18:35:56] 
[18:35:56] + Processing work unit
[18:35:56] Core required: FahCore_14.exe
[18:35:56] Core found.
[18:35:56] Working on queue slot 06 [February 23 18:35:56 UTC]
[18:35:56] + Working ...
[18:35:56]
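
To put a number on how long a finished unit sat idle, a rough Python sketch like the one below measures the time from the first "Attempting to send results" line to the "Results successfully sent" line. It assumes the `[HH:MM:SS]` prefix seen in the log above and that the whole sequence spans less than 24 hours; `upload_delay` is a hypothetical helper name, not something the client provides.

```python
import re
from datetime import timedelta

# Log lines look like "[17:45:35] + Attempting to send results [...]".
LINE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s?(.*)$")

def upload_delay(log_text):
    """Time from the first send attempt to the successful send, or None if not found."""
    first_attempt = None
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # continuation lines such as "Waiting before retry."
        h, mi, s, msg = m.groups()
        t = timedelta(hours=int(h), minutes=int(mi), seconds=int(s))
        if "Attempting to send results" in msg and first_attempt is None:
            first_attempt = t
        elif "Results successfully sent" in msg and first_attempt is not None:
            delay = t - first_attempt
            if delay < timedelta(0):
                # The log carries no date in the prefix; assume one midnight rollover.
                delay += timedelta(days=1)
            return delay
    return None

# Four lines copied from the log above.
sample = """\
[17:45:35] + Attempting to send results [February 23 17:45:35 UTC]
[17:45:40] - Server does not have record of this unit. Will try again later.
[18:35:53] + Attempting to send results [February 23 18:35:53 UTC]
[18:35:56] + Results successfully sent
"""

print(upload_delay(sample))
```

Feeding it a full FAHlog.txt would only report the first completed upload; looping per unit is left out to keep the sketch short.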

Posted: **Tue Feb 23, 2010 7:36 pm**

ikerekes wrote: I was able to send my completed WU after 45 minutes of idling, but from the logs it is not clear where my WU ended up: on the original server (171.67.108.21) or on the CS (171.67.108.26).
Needless to say, 45 minutes in the life of this card is ages.

.

Since 171.64.65.20 is the only server to which a connection was made (successfully), IMO the results were sent to that server (too)

171.67.108.26 is still very heavily loaded on its Network ...

.

Posted: **Tue Feb 23, 2010 10:03 pm**

DrSpalding wrote:Is there any status on the WUs that are still in the state of "could not transmit" but then "server has already received" such as this one here, the first one I saw happen, on 13 Feb 2010 @ 23:06:03 PST:

I have 10 such WUs from Feb 14-15 that will expire at some point soon. By my reasoning, they fall into one of these categories:

1. Not uploaded, and server is broken such that the WUs will simply expire and be reassigned.
2. Not uploaded, and server is broken such that the WUs are marked as done but are not actually done, leading to a loss of science until it is discovered.
3. Uploaded, and my client did not get notified properly and I did not receive the credit for them.
4. Uploaded, and my client did not get notified properly and I received the credit for them.

I think #4 is the least likely as I really didn't see any extra ~3000-4000 points hit my stats.  I am still thinking that #2 is the most probable.

Is there any chance that these WUs can be uploaded properly or should I just forget about it and delete my archive of the two GPU clients I have with the 10 WUs?

Here is the full list of project WUs that I have in this state:
P3470, r16, c51,  g1
P3470, r14, c112, g2
P5781, r22, c982, g3
P5781, r28, c199, g3
P5781, r35, c554, g3
P5781, r6,  c539, g4
P5781, r13, c935, g4
P5781, r15, c764, g3
P5781, r22, c303, g3
P5781, r33, c15,  g3
P5781, r9,  c612, g4

I checked on a few of these. I don't see that there's a single consistent answer.

I cannot find a record of P5781, r9, c612, g4 being credited to you but it was assigned to someone else 2010-02-08 19:53:24 PST and completed by them. I'd say that it's a #1, though the Mod database was reactivated recently so there are other reasons why I might not see a record of your WU. Would a reassignment at [03:53] UTC be consistent with either a normal expiration from the original assignment date-time or from some other event in your FAHlog?

Project: 5781, Run 33, Clone 15, Gen 3, / Run 22, Clone 303, Gen 3 / Run 22, Clone 982, Gen 3 - No data back from queries. Could be #1 or #2, depending on how many assignment/timeout cycles it takes to get a result uploaded -- or whether my DB is just incomplete.

Project: 3470, Run 16, Clone 51, Gen 1 reassigned and completed a number of times.

Posted: **Wed Feb 24, 2010 1:11 am**

bruce wrote:
I checked on a few of these. I don't see that there's a single consistent answer.

I cannot find a record of P5781, r9, c612, g4 being credited to you but it was assigned to someone else 2010-02-08 19:53:24 PST and completed by them. I'd say that it's a #1, though the Mod database was reactivated recently so there are other reasons why I might not see a record of your WU. Would a reassignment at [03:53] UTC be consistent with either a normal expiration from the original assignment date-time or from some other event in your FAHlog?

Project: 5781, Run 33, Clone 15, Gen 3, / Run 22, Clone 303, Gen 3 / Run 22, Clone 982, Gen 3 - No data back from queries. Could be #1 or #2, depending on how many assignment/timeout cycles it takes to get a result uploaded -- or whether my DB is just incomplete.

Project: 3470, Run 16, Clone 51, Gen 1 reassigned and completed a number of times.

Sorry, my log files scrolled out of existence. FAHlog-Prev.txt starts on:
--- Opening Log file [February 13 18:25:26 UTC]

so I have no idea about any event like a reassignment or the like.

I guess I shouldn't worry about these units--they will likely get reassigned anyway, if they are not already done, and the trouble to get them uploaded properly with my wuresults_XX.dat files is probably not worth it. I will keep the files for a while longer but if I don't hear anything back from you or Pande Group about it, I'll just consider them toast and move on from here.

Thanks,

Dan

Posted: **Wed Feb 24, 2010 1:32 am**

Here's an update. The WS's look to be in pretty good shape, but the CS code still has known issues and bad behavior that Joe is addressing. For what it's worth, it looks like all of this did expose several problems in the code which Joe has now fixed or is fixing, so I think this has hardened it considerably. I hope Joe will have a CS fix shortly (day or two), but it's too early to guarantee an ETA since he's still making sure he understands the failure mode completely.

Posted: **Wed Feb 24, 2010 3:10 am**

VijayPande wrote:Here's an update. The WS's look to be in pretty good shape, but the CS code still has known issues and bad behavior that Joe is addressing. For what it's worth, it looks like all of this did expose several problems in the code which Joe has now fixed or is fixing, so I think this has hardened it considerably. I hope Joe will have a CS fix shortly (day or two), but it's too early to guarantee an ETA since he's still making sure he understands the failure mode completely.

Thank you for the update Prof. Pande.

I have 7 GPU clients running, and today only two of them hung for 45 minutes, both on 171.67.108.21 (two hours after the WS issued the work unit, it didn't have a record of it; one of the logs is just 3 posts above this one).
I wouldn't call it pretty good shape, but it's definitely better than it was 10 days ago.

Hope for ironing out the last wrinkles, and we all can return to contributing to the science.

Posted: **Wed Feb 24, 2010 3:17 am**

My nvidias have been running better of late, but I still get stuck now and then. Here is something on .71 now.
Waiting before retry.
[02:54:54] + Attempting to get work packet
[02:54:54] - Connecting to assignment server
[02:54:54] - Successful: assigned to (171.64.65.71).
[02:54:54] + News From Folding@Home: Welcome to Folding@Home
[02:54:55] Loaded queue successfully.
[02:54:55] - Couldn't send HTTP request to server
[02:54:55] + Could not connect to Work Server
[02:54:55] - Attempt #8 to get work failed, and no other work to do.
Waiting before retry.
[03:05:38] + Attempting to get work packet
[03:05:38] - Connecting to assignment server
[03:05:39] - Successful: assigned to (171.64.65.71).
[03:05:39] + News From Folding@Home: Welcome to Folding@Home
[03:05:39] Loaded queue successfully.
[03:05:39] - Couldn't send HTTP request to server
[03:05:39] + Could not connect to Work Server
[03:05:39] - Attempt #9 to get work failed, and no other work to do.
Waiting before retry.

I'm shutting down for a while. I'll check back every month to see if it gets straightened out.

David
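
For anyone curious how long the client waits between failed fetches, here is a small Python sketch that pulls the "Attempt #N to get work failed" lines out of a log and prints the gap between consecutive attempts. It assumes the line format shown in the log above; `failed_attempts` and `gaps` are hypothetical helper names, and nothing here reflects how the client itself schedules retries.

```python
import re
from datetime import timedelta

# Matches e.g. "[02:54:55] - Attempt #8 to get work failed, and no other work to do."
ATTEMPT = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\] - Attempt #(\d+)\s+to get work failed")

def failed_attempts(log_text):
    """Return (attempt_number, time_of_day) pairs for each failed work request."""
    out = []
    for line in log_text.splitlines():
        m = ATTEMPT.match(line.strip())
        if m:
            h, mi, s, n = (int(g) for g in m.groups())
            out.append((n, timedelta(hours=h, minutes=mi, seconds=s)))
    return out

def gaps(attempts):
    """Time between consecutive failed attempts, assuming at most one midnight rollover."""
    result = []
    for (_, prev), (_, cur) in zip(attempts, attempts[1:]):
        delta = cur - prev
        if delta < timedelta(0):
            delta += timedelta(days=1)
        result.append(delta)
    return result

# Lines copied from the log above.
sample = """\
[02:54:55] - Attempt #8 to get work failed, and no other work to do.
Waiting before retry.
[03:05:39] - Attempt #9 to get work failed, and no other work to do.
Waiting before retry.
"""

attempts = failed_attempts(sample)
for (n, _), gap in zip(attempts[1:], gaps(attempts)):
    print(f"Attempt #{n} came {gap} after the previous one")
```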

Posted: **Wed Feb 24, 2010 10:23 am**

Just wanted to thank the guys for fixing this, much appreciated.

Posted: **Wed Feb 24, 2010 2:13 pm**

BTW, we've taken down the 171.67.108.26 vsp09a CS until we can get that code fixed. Right now, it looks like it isn't helping, but rather hurting clients (delays them but doesn't take back their WU's). Joe is working on it and will put it back on line when it's working.

Posted: **Wed Feb 24, 2010 2:21 pm**

ikerekes wrote:
VijayPande wrote:Here's an update. The WS's look to be in pretty good shape, but the CS code still has known issues and bad behavior that Joe is addressing. For what it's worth, it looks like all of this did expose several problems in the code which Joe has now fixed or is fixing, so I think this has hardened it considerably. I hope Joe will have a CS fix shortly (day or two), but it's too early to guarantee an ETA since he's still making sure he understands the failure mode completely.
Thank you for the update Prof. Pande.

I have 7 GPU clients running, and today only two of them hung for 45 minutes, both on 171.67.108.21 (two hours after the WS issued the work unit, it didn't have a record of it; one of the logs is just 3 posts above this one).
I wouldn't call it pretty good shape, but it's definitely better than it was 10 days ago.

Hope for ironing out the last wrinkles, and we all can return to contributing to the science.

Thanks, this sounds like progress. It sounds like you're not having problems with the WS but only the CS (.21 is a CS). This and other reports made me decide to take down the CS until we can get it working. Since it's not helping and only slowing down clients, I think we're better off this way until Joe fixes the CS. Moreover, the new WS code talks actively to the CS, so CS problems hurt the WS. Taking down the CS should help the WS's.

The upshot is that (hopefully) the WS's are in reasonable shape. I guess we'll see if that's true over the next day or so.

Posted: **Wed Feb 24, 2010 2:31 pm**

Thanks, Dr. Pande, for these updates. My GPU client is working much better than before, but I noticed something weird in the F@H log: the first few attempts to upload the completed WU give "server has no record...", but later it uploads the WU successfully. What does this mean?

Thanks

[20:26:06] Completed 90%
[20:29:21] Completed 91%
[20:32:28] Completed 92%
[20:35:29] Completed 93%
[20:38:29] Completed 94%
[20:41:32] Completed 95%
[20:44:40] Completed 96%
[20:47:40] Completed 97%
[20:50:42] Completed 98%
[20:53:53] Completed 99%
[20:57:02] Completed 100%
[20:57:02] Successful run
[20:57:02] DynamicWrapper: Finished Work Unit: sleep=10000
[20:57:12] Reserved 146032 bytes for xtc file; Cosm status=0
[20:57:12] Allocated 146032 bytes for xtc file
[20:57:12] - Reading up to 146032 from "work/wudata_06.xtc": Read 146032
[20:57:12] Read 146032 bytes from xtc file; available packet space=786284432
[20:57:12] xtc file hash check passed.
[20:57:12] Reserved 22272 22272 786284432 bytes for arc file=<work/wudata_06.trr> Cosm status=0
[20:57:12] Allocated 22272 bytes for arc file
[20:57:12] - Reading up to 22272 from "work/wudata_06.trr": Read 22272
[20:57:12] Read 22272 bytes from arc file; available packet space=786262160
[20:57:12] trr file hash check passed.
[20:57:12] Allocated 560 bytes for edr file
[20:57:12] Read bedfile
[20:57:12] edr file hash check passed.
[20:57:12] Logfile not read.
[20:57:12] GuardedRun: success in DynamicWrapper
[20:57:12] GuardedRun: done
[20:57:12] Run: GuardedRun completed.
[20:57:13] + Opened results file
[20:57:13] - Writing 169376 bytes of core data to disk...
[20:57:13] Done: 168864 -> 167392 (compressed to 99.1 percent)
[20:57:13]   ... Done.
[20:57:13] DeleteFrameFiles: successfully deleted file=work/wudata_06.ckp
[20:57:13] Shutting down core 
[20:57:13] 
[20:57:13] Folding@home Core Shutdown: FINISHED_UNIT
[20:57:18] CoreStatus = 64 (100)
[20:57:18] Sending work to server
[20:57:18] Project: 5782 (Run 1, Clone 97, Gen 29)
[20:57:18] - Read packet limit of 540015616... Set to 524286976.


[20:57:18] + Attempting to send results [February 23 20:57:18 UTC]
[20:59:52] - Server does not have record of this unit. Will try again later.
[20:59:52] - Error: Could not transmit unit 06 (completed February 23) to work server.
[20:59:52]   Keeping unit 06 in queue.
[20:59:52] Project: 5782 (Run 1, Clone 97, Gen 29)
[20:59:52] - Read packet limit of 540015616... Set to 524286976.


[20:59:52] + Attempting to send results [February 23 20:59:52 UTC]
[21:02:42] - Server does not have record of this unit. Will try again later.
[21:02:42] - Error: Could not transmit unit 06 (completed February 23) to work server.
[21:02:42] - Read packet limit of 540015616... Set to 524286976.


[21:02:42] + Attempting to send results [February 23 21:02:42 UTC]
[21:15:58] + Could not connect to Work Server (results)
[21:15:58]     (171.67.108.26:8080)
[21:15:58] + Retrying using alternative port
[21:15:59] - Couldn't send HTTP request to server
[21:15:59] + Could not connect to Work Server (results)
[21:15:59]     (171.67.108.26:80)
[21:15:59]   Could not transmit unit 06 to Collection server; keeping in queue.
[21:15:59] - Preparing to get new work unit...
[21:15:59] + Attempting to get work packet
[21:15:59] - Connecting to assignment server
[21:16:11] + Could not connect to Assignment Server
[21:16:13] - Successful: assigned to (171.67.108.11).
[21:16:13] + News From Folding@Home: Welcome to Folding@Home
[21:16:13] Loaded queue successfully.
[21:16:16] Project: 5782 (Run 1, Clone 97, Gen 29)
[21:16:16] - Read packet limit of 540015616... Set to 524286976.


[21:16:16] + Attempting to send results [February 23 21:16:16 UTC]
[21:16:45] + Results successfully sent
[21:16:45] Thank you for your contribution to Folding@Home.
[21:16:45] + Number of Units Completed: 432

[21:16:45] + Closed connections
[21:16:45] 
[21:16:45] + Processing work unit
[21:16:45] Core required: FahCore_11.exe
[21:16:45] Core found.
[21:16:45] Working on queue slot 07 [February 23 21:16:45 UTC]
[21:16:45] + Working ...

Posted: **Wed Feb 24, 2010 2:36 pm**

DrSpalding wrote: 2. Not uploaded, and server is broken such that the WUs are marked as done but are not actually done, leading to a loss of science until it is discovered.

Can somebody from the Pande Group tell if the above statement is true or not and if it is true, are they working on a solution for it?

Posted: **Wed Feb 24, 2010 5:06 pm**

VijayPande wrote:Here's an update. The WS's look to be in pretty good shape, but the CS code still has known issues and bad behavior that Joe is addressing. For what it's worth, it looks like all of this did expose several problems in the code which Joe has now fixed or is fixing, so I think this has hardened it considerably. I hope Joe will have a CS fix shortly (day or two), but it's too early to guarantee an ETA since he's still making sure he understands the failure mode completely.

Vijay, thanks for the update and all the hard work that is going into the fix for these problems.

GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26