Project: 2665 (Run 2, Clone 264, Gen 1)
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
I have a machine that completed this work unit and returned it with about 82% time remaining before the deadline. It immediately downloaded the same WU and is crunching away on it. What's up with that?
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
Sure thing. I can also provide the output from qd.exe that shows the WU is identical, if necessary:
[07:24:57] - Preparing to get new work unit...
[07:24:57] + Attempting to get work packet
[07:24:57] - Will indicate memory of 2046 MB
[07:24:57] - Connecting to assignment server
[07:24:57] Connecting to http://assign.stanford.edu:8080/
[07:24:57] Posted data.
[07:24:57] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[07:24:57] + News From Folding@Home: Welcome to Folding@Home
[07:24:58] Loaded queue successfully.
[07:24:58] Connecting to http://171.64.65.64:8080/
[07:25:03] Posted data.
[07:25:03] Initial: 0000; - Receiving payload (expected size: 4812470)
[07:25:11] - Downloaded at ~587 kB/s
[07:25:11] - Averaged speed for that direction ~492 kB/s
[07:25:11] + Received work.
[07:25:11] Trying to send all finished work units
[07:25:11] + No unsent completed units remaining.
[07:25:11] + Closed connections
[07:25:11]
[07:25:11] + Processing work unit
[07:25:11] Core required: FahCore_a1.exe
[07:25:11] Core found.
[07:25:11] Working on Unit 00 [May 24 07:25:11]
[07:25:11] + Working ...
[07:25:11] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 00 -checkpoint 5 -forceasm -verbose -lifeline 2212 -version 591'
[07:25:11]
[07:25:11] *------------------------------*
[07:25:11] Folding@Home Gromacs SMP Core
[07:25:11] Version 1.74 (March 10, 2007)
[07:25:11]
[07:25:11] Preparing to commence simulation
[07:25:11] - Ensuring status. Please wait.
[07:25:28] - Assembly optimizations manually forced on.
[07:25:28] - Not checking prior termination.
[07:25:43] - Expanded 4811958 -> 24810145 (decompressed 515.5 percent)
[07:25:43] - Starting from initial work packet
[07:25:43]
[07:25:43] Project: 2665 (Run 2, Clone 264, Gen 1)
[07:25:43]
[07:25:44] Assembly optimizations on if available.
[07:25:44] Entering M.D.
[07:25:51] Rejecting checkpoint
[07:25:53] Protein: HGG with glycosylations
[07:25:53] Writing local files
[07:26:02] Extra SSE boost OK.
[07:26:03] Writing local files
[07:26:03] Completed 0 out of 250000 steps (0 percent)
[07:31:04] Timered checkpoint triggered.
[07:36:04] Timered checkpoint triggered.
[07:41:04] Timered checkpoint triggered.
[07:41:53] Writing local files
[07:41:53] Completed 2500 out of 250000 steps (1 percent)
[07:46:54] Timered checkpoint triggered.
[07:51:53] Timered checkpoint triggered.
[07:56:54] Timered checkpoint triggered.
[07:57:43] Writing local files
[07:57:43] Completed 5000 out of 250000 steps (2 percent)
[08:02:44] Timered checkpoint triggered.
[08:07:44] Timered checkpoint triggered.
[08:12:45] Timered checkpoint triggered.
[08:13:34] Writing local files
[08:13:34] Completed 7500 out of 250000 steps (3 percent)
[08:18:34] Timered checkpoint triggered.
[08:23:35] Timered checkpoint triggered.
[08:28:36] Timered checkpoint triggered.
[08:29:25] Writing local files
[08:29:25] Completed 10000 out of 250000 steps (4 percent)
[08:34:26] Timered checkpoint triggered.
[08:39:27] Timered checkpoint triggered.
[08:44:28] Timered checkpoint triggered.
[08:45:17] Writing local files
[08:45:17] Completed 12500 out of 250000 steps (5 percent)
.
.
.
[09:23:22] Timered checkpoint triggered.
[09:28:23] Timered checkpoint triggered.
[09:33:23] Timered checkpoint triggered.
[09:34:12] Writing local files
[09:34:12] Completed 247500 out of 250000 steps (99 percent)
[09:39:12] Timered checkpoint triggered.
[09:44:12] Timered checkpoint triggered.
[09:49:12] Timered checkpoint triggered.
[09:50:02] Writing local files
[09:50:02] Completed 250000 out of 250000 steps (100 percent)
[09:50:02] Writing final coordinates.
[09:50:03] Past main M.D. loop
[09:50:03] Will end MPI now
[09:51:03]
[09:51:03] Finished Work Unit:
[09:51:03] - Reading up to 21421872 from "work/wudata_00.arc": Read 21421872
[09:51:03] - Reading up to 591876 from "work/wudata_00.xtc": Read 591876
[09:51:03] goefile size: 0
[09:51:03] logfile size: 203294
[09:51:03] Leaving Run
[09:51:07] - Writing 22223414 bytes of core data to disk...
[09:51:08] ... Done.
[09:51:08] - Failed to delete work/wudata_00.sas
[09:51:08] - Failed to delete work/wudata_00.goe
[09:51:08] Warning: check for stray files
[09:51:08] - Shutting down core
[09:53:08]
[09:53:08] Folding@home Core Shutdown: FINISHED_UNIT
[09:53:08]
[09:53:08] Folding@home Core Shutdown: FINISHED_UNIT
[09:53:12] CoreStatus = 64 (100)
[09:53:12] Unit 0 finished with 82 percent of time to deadline remaining.
[09:53:12] Updated performance fraction: 0.822321
[09:53:12] Sending work to server
[09:53:12] + Attempting to send results
[09:53:12] - Reading file work/wuresults_00.dat from core
[09:53:12] (Read 22223414 bytes from disk)
[09:53:12] Connecting to http://171.64.65.64:8080/
[09:53:13] - Couldn't send HTTP request to server
[09:53:13] + Could not connect to Work Server (results)
[09:53:13] (171.64.65.64:8080)
[09:53:13] - Error: Could not transmit unit 00 (completed May 25) to work server.
[09:53:13] - 1 failed uploads of this unit.
[09:53:13] Keeping unit 00 in queue.
[09:53:13] Trying to send all finished work units
[09:53:13] + Attempting to send results
[09:53:13] - Reading file work/wuresults_00.dat from core
[09:53:13] (Read 22223414 bytes from disk)
[09:53:13] Connecting to http://171.64.65.64:8080/
[09:53:15] - Couldn't send HTTP request to server
[09:53:15] + Could not connect to Work Server (results)
[09:53:15] (171.64.65.64:8080)
[09:53:15] - Error: Could not transmit unit 00 (completed May 25) to work server.
[09:53:15] - 2 failed uploads of this unit.
[09:53:15] + Attempting to send results
[09:53:15] - Reading file work/wuresults_00.dat from core
[09:53:15] (Read 22223414 bytes from disk)
[09:53:15] Connecting to http://171.64.122.86:8080/
[09:57:46] Posted data.
[09:57:46] Initial: 0000; - Uploaded at ~80 kB/s
[09:57:46] - Averaged speed for that direction ~77 kB/s
[09:57:46] + Results successfully sent
[09:57:46] Thank you for your contribution to Folding@Home.
[09:57:46] + Number of Units Completed: 495
[09:57:46] Successfully sent unit 00 to Collection server.
[09:57:46] + Sent 1 of 1 completed units to the server
[09:57:46] - Preparing to get new work unit...
[09:57:46] + Attempting to get work packet
[09:57:46] - Will indicate memory of 2046 MB
[09:57:46] - Connecting to assignment server
[09:57:46] Connecting to http://assign.stanford.edu:8080/
[09:57:46] Posted data.
[09:57:46] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[09:57:46] + News From Folding@Home: Welcome to Folding@Home
[09:57:46] Loaded queue successfully.
[09:57:46] Connecting to http://171.64.65.64:8080/
[09:57:52] Posted data.
[09:57:52] Initial: 0000; - Receiving payload (expected size: 4812470)
[09:58:00] - Downloaded at ~587 kB/s
[09:58:00] - Averaged speed for that direction ~511 kB/s
[09:58:00] + Received work.
[09:58:00] Trying to send all finished work units
[09:58:00] + No unsent completed units remaining.
[09:58:00] + Closed connections
[09:58:00]
[09:58:00] + Processing work unit
[09:58:00] Core required: FahCore_a1.exe
[09:58:00] Core found.
[09:58:00] Working on Unit 01 [May 25 09:58:00]
[09:58:00] + Working ...
[09:58:00] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 5 -forceasm -verbose -lifeline 2212 -version 591'
[09:58:00]
[09:58:00] *------------------------------*
[09:58:00] Folding@Home Gromacs SMP Core
[09:58:00] Version 1.74 (March 10, 2007)
[09:58:00]
[09:58:00] Preparing to commence simulation
[09:58:00] - Ensuring status. Please wait.
[09:58:05] - Starting from initial work packet
[09:58:05]
[09:58:05] Project: 2665 (Run 2, Clone 264, Gen 1)
[09:58:05]
[09:58:06] Assembly optimizations on if available.
[09:58:06] Entering M.D.
[09:58:30] on if available.
[09:58:30] Entering M.D.
[09:58:36] Rejecting checkpoint
[09:58:38] Protein: HGG with glycosylations
[09:58:38] Writing local files
[09:58:48] Extra SSE boost OK.
[09:58:48] Writing local files
[09:58:48] Completed 0 out of 250000 steps (0 percent)
[10:03:48] Timered checkpoint triggered.
[10:08:49] Timered checkpoint triggered.
[10:13:50] Timered checkpoint triggered.
[10:14:38] Writing local files
[10:14:38] Completed 2500 out of 250000 steps (1 percent)
[10:19:39] Timered checkpoint triggered.
[10:24:40] Timered checkpoint triggered.
[10:29:41] Timered checkpoint triggered.
[10:30:30] Writing local files
[10:30:30] Completed 5000 out of 250000 steps (2 percent)
[10:35:31] Timered checkpoint triggered.
[10:40:32] Timered checkpoint triggered.
[10:45:32] Timered checkpoint triggered.
[10:46:20] Writing local files
[10:46:20] Completed 7500 out of 250000 steps (3 percent)
[10:51:21] Timered checkpoint triggered.
[10:56:22] Timered checkpoint triggered.
[11:01:23] Timered checkpoint triggered.
[11:02:12] Writing local files
[11:02:12] Completed 10000 out of 250000 steps (4 percent)
[11:07:13] Timered checkpoint triggered.
[11:12:14] Timered checkpoint triggered.
[11:17:14] Timered checkpoint triggered.
[11:18:03] Writing local files
[11:18:04] Completed 12500 out of 250000 steps (5 percent)
[11:23:05] Timered checkpoint triggered.
[11:28:05] Timered checkpoint triggered.
[11:33:06] Timered checkpoint triggered.
[11:33:54] Writing local files
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
Thanks for your post. We do have some redundancy in our system, so that isn't a problem. I've alerted the researcher in charge of this project so he can double check that everything is okay on our end.
Relly
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
Here's what happened:
It looks like your client returned the WU to the collection server and then contacted the work server for another WU before the collection server and work server had a chance to sync up. So the work server thought you were still working on the same WU and re-issued it to you.
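To make the timing concrete, here is a toy model of that sequence in Python. It is only a sketch: the class names, the `sync` step, and the rule "re-issue whatever the client is believed to hold" are assumptions taken from the description above, not the real server code.

```python
from dataclasses import dataclass, field


@dataclass
class WorkServer:
    """Toy work server: remembers which WU each client still holds."""
    queue: list                                       # WUs not yet assigned
    outstanding: dict = field(default_factory=dict)   # client -> WU

    def assign(self, client):
        # If the server believes the client is still working on a WU,
        # it hands the same one back (the behaviour described above).
        if client in self.outstanding:
            return self.outstanding[client]
        wu = self.queue.pop(0)
        self.outstanding[client] = wu
        return wu

    def mark_returned(self, client):
        self.outstanding.pop(client, None)


@dataclass
class CollectionServer:
    """Toy collection server: holds results until they are synced."""
    pending: list = field(default_factory=list)

    def upload(self, client, wu):
        self.pending.append((client, wu))

    def sync(self, work_server):
        # Tell the work server about every result received so far.
        for client, _wu in self.pending:
            work_server.mark_returned(client)
        self.pending.clear()


def run(sync_before_request):
    """Return (first WU, second WU) for one client."""
    ws = WorkServer(queue=["P2665 R2 C264 G1", "next WU"])
    cs = CollectionServer()
    first = ws.assign("client")
    cs.upload("client", first)   # work server was unreachable, so the
                                 # result went to the collection server
    if sync_before_request:
        cs.sync(ws)
    second = ws.assign("client")
    return first, second
```

In this model `run(False)` hands back the same WU twice, which is the situation in the log, while `run(True)` moves on to the next WU because the sync happened before the new request.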
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
Well, you know, it seems that this could happen regularly if that is what happened. I mean, you turn work in and ask for more. Since I spent another day doing the same WU, do I get credit for it twice, or do I now have to start monitoring my boxes to keep from getting duplicate work? The points on the 2665 series are abysmal anyway without getting dinged by doing the same work twice...but none are worth doing twice and only getting credited once.
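For anyone who does want to watch for this, a small script can scan FAHlog.txt for a unit that starts again after it has already finished. This is a rough sketch, assuming the log uses the same "Project: ... (Run, Clone, Gen)" and "Core Shutdown: FINISHED_UNIT" lines shown in the log above; the function name is made up for illustration.

```python
import re

# Matches the line the core prints when it starts a unit, e.g.
# "[07:25:43] Project: 2665 (Run 2, Clone 264, Gen 1)"
PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def find_rerun_units(lines):
    """Return (project, run, clone, gen) tuples that were started again
    after the log already showed them finishing."""
    finished = set()   # units that reached FINISHED_UNIT
    current = None     # unit the core is working on right now
    reruns = []
    for line in lines:
        m = PRCG.search(line)
        if m:
            current = tuple(int(g) for g in m.groups())
            if current in finished and current not in reruns:
                reruns.append(current)
        elif "Core Shutdown: FINISHED_UNIT" in line and current is not None:
            finished.add(current)
    return reruns
```

Feeding it the lines of a log (for example `find_rerun_units(open("FAHlog.txt"))`) lists any unit that was downloaded again after completing. A client restart in the middle of a unit prints the Project line again too, which is why the check only counts a repeat once FINISHED_UNIT has been seen.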
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
It can only happen when your client can't connect to the work server one moment (when it tries to return work) and then can connect the next (when it downloads a new work unit). If the lag is more substantial, the collection server will be able to transmit the files to the work server in the interval.
We're also thinking about a work-server-side improvement that may help things; however, we'll have to test it out. The fundamental issue of having several fail-safes that might not have time to sync with each other is harder to solve.
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
And, the question about credit for the 2nd iteration remains.
Also, as you can see from this:
[09:57:46] Posted data.
[09:57:46] Initial: 0000; - Uploaded at ~80 kB/s
[09:57:46] - Averaged speed for that direction ~77 kB/s
[09:57:46] + Results successfully sent
[09:57:46] Thank you for your contribution to Folding@Home.
[09:57:46] + Number of Units Completed: 495
[09:57:46] Successfully sent unit 00 to Collection server.
[09:57:46] + Sent 1 of 1 completed units to the server
[09:57:46] - Preparing to get new work unit...
[09:57:46] + Attempting to get work packet
[09:57:46] - Will indicate memory of 2046 MB
[09:57:46] - Connecting to assignment server
[09:57:46] Connecting to http://assign.stanford.edu:8080/
[09:57:46] Posted data.
[09:57:46] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[09:57:46] + News From Folding@Home: Welcome to Folding@Home
[09:57:46] Loaded queue successfully.
It sent the completed WU before connecting to the assignment server.
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
The server side modification Peter talked about will give credit to those for which this happens. Right now, we have several checks to prevent cheating and Sick Willie got caught in one incorrectly.
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
I am confused now. Cheating by asking for another WU before the collection server reported the return of the last one? I've had WU's stay in queue for long periods of time while the machine worked on (a) new one(s). Following the logic in this thread, I should always get the same WU if the collection server hasn't sync'd with the work server or the WU hasn't uploaded for some reason. And, we know that's not the way it works. Or, at least it never has worked that way in the past.
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
It's a little more complicated than that, but nothing has changed recently (other than having a functional collection server for SMP).
Part of the function of this code is to prevent a violently malfunctioning client from sucking up all the available work units on a server (think serial EUE's at step 0). It's not perfect, and part of the imperfection came to light here. We're working on a better solution.
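The thread doesn't say how the server code actually does this, so the following is purely an illustration of the trade-off, not the real check. It assumes a hypothetical guard that keeps handing a client the WU it is believed to hold, and a faulty client that keeps asking for work without ever returning a result.

```python
def units_consumed(requests, reissue_outstanding):
    """Count distinct WUs handed to one client that never returns results.

    requests: number of times the faulty client asks for work.
    reissue_outstanding: if True, the server keeps handing back the WU
    it believes the client still holds (hypothetical guard).
    """
    next_id = 0          # id of the next fresh WU on the server
    outstanding = None   # WU the server thinks the client is holding
    handed_out = set()
    for _ in range(requests):
        if reissue_outstanding and outstanding is not None:
            wu = outstanding        # guard: same WU again
        else:
            wu = next_id            # no guard: a fresh WU every time
            next_id += 1
            outstanding = wu
        handed_out.add(wu)
    return len(handed_out)
```

With the guard off, 100 requests burn through 100 different units; with it on, the same 100 requests touch only one. The cost is the case in this thread: a healthy client whose result hasn't been registered yet looks the same to the server and gets its old WU back.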
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
Cool. So, I guess the question, still unanswered, has to do with the credit for this specific WU (the 2nd one, which used a duplicate 26 hours of computer time and electricity).

-
- Posts: 1037
- Joined: Sun Dec 02, 2007 3:47 pm
- Location: Colorado @ 10,000 feet
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
Hi Sick_Willie (team 734),
Your WU (P2665 R2 C264 G1) was added to the stats database on 2008-05-25 04:16:57 for 1920 points of credit.
Hi Sick_Willie (team 734),
Your WU (P2665 R2 C264 G1) was added to the stats database on 2008-05-26 06:17:13 for 0 points of credit.
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
sick willie wrote: It sent the completed WU before connecting to the assignment server.
How about posting an additional page prior to the part of FAHlog that you did post? What happened when it tried to upload to the Work Server?
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
Re: Project: 2665 (Run 2, Clone 264, Gen 1)
sick willie wrote: It sent the completed WU before connecting to the assignment server.
bruce wrote: How about posting an additional page prior to the part of FAHlog that you did post. What happened when it tried to upload to the Work Server?
See post 3 above. I only excerpted part of it in a later post to show that nothing had been done any differently than on any other WU.
I'd like some clarification from Pande Group as to whether I should expect credit for the 2nd running or not. After all, even once noticing that it was doing the same WU, I let it go to completion. Again.