general questions about WUs

Posted: Tue Feb 24, 2009 10:17 pm
by alpha754293
I have a few rather quick general questions about WUs:

1) How often, or how many times, does a WU get resent? (Or how does the PandeGroup determine if and when they need to send it back out again?)

2) Where would I be able to find a table of all the different core status codes and their meanings?

3) Why is it that the client sometimes automatically downloads the same WU a number of times before it is able to properly start the run? (See the Fahlog below.)

4) When we post problems with WUs here and some of the site admins/moderators report back with the number of times (if any) that particular WU shows up in the database, what is that supposed to mean? Is there any way to tell from within that database whether the returned WU was valid, or does it only indicate that the WU has been returned for (where applicable) the appropriate credit value?

Code:

[16:28:25] - Warning: Could not delete all work unit files (5): Core file absent
[16:28:25] Trying to send all finished work units
[16:28:25] + No unsent completed units remaining.
[16:28:25] - Preparing to get new work unit...
[16:28:25] + Attempting to get work packet
[16:28:25] - Will indicate memory of 16003 MB
[16:28:25] - Connecting to assignment server
[16:28:25] Connecting to http://assign.stanford.edu:8080/
[16:28:25] Posted data.
[16:28:25] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[16:28:25] + News From Folding@Home: Welcome to Folding@Home
[16:28:25] Loaded queue successfully.
[16:28:25] Connecting to http://171.67.108.24:8080/
[16:28:32] Posted data.
[16:28:32] Initial: 0000; - Receiving payload (expected size: 4862689)
[16:28:45] - Downloaded at ~365 kB/s
[16:28:45] - Averaged speed for that direction ~339 kB/s
[16:28:45] + Received work.
[16:28:45] Trying to send all finished work units
[16:28:45] + No unsent completed units remaining.
[16:28:45] + Closed connections
[16:28:45] 
[16:28:45] + Processing work unit
[16:28:45] Core required: FahCore_a2.exe
[16:28:45] Core found.
[16:28:45] Working on queue slot 06 [February 24 16:28:45 UTC]
[16:28:45] + Working ...
[16:28:45] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 18269 -version 624'

[16:28:45] 
[16:28:45] *------------------------------*
[16:28:45] Folding@Home Gromacs SMP Core
[16:28:45] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[16:28:45] 
[16:28:45] Preparing to commence simulation
[16:28:45] - Ensuring status. Please wait.
[16:28:46] Called DecompressByteArray: compressed_data_size=4862177 data_size=24067137, decompressed_data_size=24067137 diff=0
[16:28:46] - Digital signature verified
[16:28:46] 
[16:28:46] Project: 2676 (Run 1, Clone 121, Gen 12)
[16:28:46] 
[16:28:47] Assembly optimizations on if available.
[16:28:47] Entering M.D.
[16:28:53] Will resume from checkpoint file
[16:28:56] ng M.D.
[16:29:02] Will resume from checkpoint file
[16:29:05] fcCheckPointResume: file hashes different -- aborting.
[16:29:09] CoreStatus = FF (255)
[16:29:09] Sending work to server
[16:29:09] Project: 2676 (Run 1, Clone 121, Gen 12)
[16:29:09] - Error: Could not get length of results file work/wuresults_06.dat
[16:29:09] - Error: Could not read unit 06 file. Removing from queue.
[16:29:09] Trying to send all finished work units
[16:29:09] + No unsent completed units remaining.
[16:29:09] - Preparing to get new work unit...
[16:29:09] + Attempting to get work packet
[16:29:09] - Will indicate memory of 16003 MB
[16:29:09] - Connecting to assignment server
[16:29:09] Connecting to http://assign.stanford.edu:8080/
[16:29:10] Posted data.
[16:29:10] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[16:29:10] + News From Folding@Home: Welcome to Folding@Home
[16:29:10] Loaded queue successfully.
[16:29:10] Connecting to http://171.67.108.24:8080/
[16:29:16] Posted data.
[16:29:16] Initial: 0000; - Receiving payload (expected size: 4862689)
[16:29:26] - Downloaded at ~474 kB/s
[16:29:26] - Averaged speed for that direction ~366 kB/s
[16:29:26] + Received work.
[16:29:26] Trying to send all finished work units
[16:29:26] + No unsent completed units remaining.
[16:29:26] + Closed connections
[16:29:31] 
[16:29:31] + Processing work unit
[16:29:31] Core required: FahCore_a2.exe
[16:29:31] Core found.
[16:29:31] Working on queue slot 07 [February 24 16:29:31 UTC]
[16:29:31] + Working ...
[16:29:31] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 18269 -version 624'

[16:29:32] 
[16:29:32] *------------------------------*
[16:29:32] Folding@Home Gromacs SMP Core
[16:29:32] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[16:29:32] 
[16:29:32] Preparing to commence simulation
[16:29:32] - Ensuring status. Please wait.
[16:29:33] Called DecompressByteArray: compressed_data_size=4862177 data_size=24067137, decompressed_data_size=24067137 diff=0
[16:29:33] - Digital signature verified
[16:29:33] 
[16:29:33] Project: 2676 (Run 1, Clone 121, Gen 12)
[16:29:33] 
[16:29:33] Assembly optimizations on if available.
[16:29:33] Entering M.D.
[16:29:39] Will resume from checkpoint file
[16:29:43] ng M.D.
[16:29:49] Will resume from checkpoint file
[16:29:52] fcCheckPointResume: file hashes different -- aborting.
[16:29:56] CoreStatus = FF (255)
[16:29:56] Sending work to server
[16:29:56] Project: 2676 (Run 1, Clone 121, Gen 12)
[16:29:56] - Error: Could not get length of results file work/wuresults_07.dat
[16:29:56] - Error: Could not read unit 07 file. Removing from queue.
[16:29:56] Trying to send all finished work units
[16:29:56] + No unsent completed units remaining.
[16:29:56] - Preparing to get new work unit...
[16:29:56] + Attempting to get work packet
[16:29:56] - Will indicate memory of 16003 MB
[16:29:56] - Connecting to assignment server
[16:29:56] Connecting to http://assign.stanford.edu:8080/
[16:29:56] Posted data.
[16:29:56] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[16:29:56] + News From Folding@Home: Welcome to Folding@Home
[16:29:56] Loaded queue successfully.
[16:29:56] Connecting to http://171.67.108.24:8080/
[16:30:02] Posted data.
[16:30:02] Initial: 0000; - Receiving payload (expected size: 4862689)
[16:30:14] - Downloaded at ~395 kB/s
[16:30:14] - Averaged speed for that direction ~372 kB/s
[16:30:14] + Received work.
[16:30:14] Trying to send all finished work units
[16:30:14] + No unsent completed units remaining.
[16:30:14] + Closed connections
[16:30:19] 
[16:30:19] + Processing work unit
[16:30:19] Core required: FahCore_a2.exe
[16:30:19] Core found.
[16:30:19] Working on queue slot 08 [February 24 16:30:19 UTC]
[16:30:19] + Working ...
[16:30:19] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 08 -checkpoint 15 -verbose -lifeline 18269 -version 624'

[16:30:19] 
[16:30:19] *------------------------------*
[16:30:19] Folding@Home Gromacs SMP Core
[16:30:19] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[16:30:19] 
[16:30:19] Preparing to commence simulation
[16:30:19] - Ensuring status. Please wait.
[16:30:28] - Looking at optimizations...
[16:30:28] - Working with standard loops on this execution.
[16:30:28] - Files status OK
[16:30:29] - Expanded 4862177 -> 24067137 (decompressed 494.9 percent)
[16:30:29] Called DecompressByteArray: compressed_data_size=4862177 data_size=24067137, decompressed_data_size=24067137 diff=0
[16:30:29] - Digital signature verified
[16:30:29] 
[16:30:29] Project: 2676 (Run 1, Clone 121, Gen 12)
[16:30:29] 
[16:30:30] Entering M.D.
[16:39:16] Completed 2500 out of 250000 steps  (1%)
[16:47:54] Completed 5000 out of 250000 steps  (2%)
[16:56:36] Completed 7500 out of 250000 steps  (3%)
[17:02:25] - Autosending finished units... [February 24 17:02:25 UTC]
[17:02:25] Trying to send all finished work units
[17:02:25] + No unsent completed units remaining.
[17:02:25] - Autosend completed
[17:05:18] Completed 10000 out of 250000 steps  (4%)
[17:13:59] Completed 12500 out of 250000 steps  (5%)
[17:22:38] Completed 15000 out of 250000 steps  (6%)
[17:31:19] Completed 17500 out of 250000 steps  (7%)
[17:40:02] Completed 20000 out of 250000 steps  (8%)
[17:48:46] Completed 22500 out of 250000 steps  (9%)
[17:57:31] Completed 25000 out of 250000 steps  (10%)
[18:06:18] Completed 27500 out of 250000 steps  (11%)
[18:15:06] Completed 30000 out of 250000 steps  (12%)
[18:23:54] Completed 32500 out of 250000 steps  (13%)
[18:32:42] Completed 35000 out of 250000 steps  (14%)
[18:41:30] Completed 37500 out of 250000 steps  (15%)
[18:50:18] Completed 40000 out of 250000 steps  (16%)
[18:59:08] Completed 42500 out of 250000 steps  (17%)
[19:07:58] Completed 45000 out of 250000 steps  (18%)
[19:16:47] Completed 47500 out of 250000 steps  (19%)
[19:25:35] Completed 50000 out of 250000 steps  (20%)
[19:34:23] Completed 52500 out of 250000 steps  (21%)
[19:43:08] Completed 55000 out of 250000 steps  (22%)
[19:51:52] Completed 57500 out of 250000 steps  (23%)
[20:00:36] Completed 60000 out of 250000 steps  (24%)
[20:09:20] Completed 62500 out of 250000 steps  (25%)
[20:18:03] Completed 65000 out of 250000 steps  (26%)
[20:26:47] Completed 67500 out of 250000 steps  (27%)
[20:35:32] Completed 70000 out of 250000 steps  (28%)
[20:44:16] Completed 72500 out of 250000 steps  (29%)
[20:53:01] Completed 75000 out of 250000 steps  (30%)
[21:01:44] Completed 77500 out of 250000 steps  (31%)
[21:10:23] Completed 80000 out of 250000 steps  (32%)
[21:19:02] Completed 82500 out of 250000 steps  (33%)
[21:27:42] Completed 85000 out of 250000 steps  (34%)
[21:36:23] Completed 87500 out of 250000 steps  (35%)
[21:45:05] Completed 90000 out of 250000 steps  (36%)
[21:53:48] Completed 92500 out of 250000 steps  (37%)
[22:02:31] Completed 95000 out of 250000 steps  (38%)
[22:11:14] Completed 97500 out of 250000 steps  (39%)
[22:19:57] Completed 100000 out of 250000 steps  (40%)
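
To make the pattern in the log above easier to see, here is a minimal Python sketch that groups log lines by queue slot and reports which WU each attempt was for and how the core exited. The function name `summarize_attempts` and the regular expressions are illustrative assumptions based only on the log lines quoted in this thread; they are not part of the client.

```python
import re

# Patterns written against the log lines quoted above (assumed formats).
SLOT_RE = re.compile(r"Working on queue slot (\d+)")
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize_attempts(log_text):
    """Return one dict per 'Working on queue slot' section of the log."""
    attempts = []
    current = None
    for line in log_text.splitlines():
        slot = SLOT_RE.search(line)
        if slot:
            # A new attempt starts whenever the client picks a queue slot.
            current = {"slot": slot.group(1), "wu": None, "status": None}
            attempts.append(current)
            continue
        if current is None:
            continue
        project = PROJECT_RE.search(line)
        if project and current["wu"] is None:
            # Store (project, run, clone, gen) the first time it appears.
            current["wu"] = tuple(int(g) for g in project.groups())
        status = STATUS_RE.search(line)
        if status:
            current["status"] = status.group(1).upper()
    return attempts


# A shortened excerpt of the log above.
SAMPLE = """\
[16:28:45] Working on queue slot 06 [February 24 16:28:45 UTC]
[16:28:46] Project: 2676 (Run 1, Clone 121, Gen 12)
[16:29:09] CoreStatus = FF (255)
[16:29:09] Project: 2676 (Run 1, Clone 121, Gen 12)
[16:29:31] Working on queue slot 07 [February 24 16:29:31 UTC]
[16:29:33] Project: 2676 (Run 1, Clone 121, Gen 12)
[16:29:56] CoreStatus = FF (255)
[16:30:19] Working on queue slot 08 [February 24 16:30:19 UTC]
[16:30:29] Project: 2676 (Run 1, Clone 121, Gen 12)
"""

for attempt in summarize_attempts(SAMPLE):
    print(attempt["slot"], attempt["wu"], attempt["status"])
```

On the excerpt, slots 06, 07 and 08 all carry Project 2676 (Run 1, Clone 121, Gen 12), and the first two attempts end with status FF while the third has no exit status yet because it is still running.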

Re: general questions about WUs

Posted: Wed Feb 25, 2009 12:11 am
by DanGe
  1. I may have this wrong, but a WU gets sent out once. Servers decide to send it out again if the WU does not finish completely (e.g. EUE) or gets "lost" (e.g. user deletes it). If the WU passes the preferred deadline, the AS will send out the same WU too.
  2. You could check the wiki page http://fahwiki.net/index.php/CoreStatus_codes, but I don't see CoreStatus = FF. :?
  3. The client downloads the same WU a number of times if the AS wants it to. If your client trashes a WU, the AS reassigns that same WU in case you simply lost the WU. In your case, something appears to be corrupting your checkpoint file to cause your client to lose the WU.
  4. When some of the admins/mods report that, usually it is to find out how many times the WU EUE'd. If someone completed the WU, then the WU is valid (the client has ways to check for errors). If no one has completed the WU, it is deemed defective. Of course, if a WU EUE's and is returned, then partial credit is awarded (which you can tell from the database).

Re: general questions about WUs

Posted: Wed Feb 25, 2009 12:12 am
by toTOW
1) Theoretically, each WU is assigned once. Only special situations trigger multiple assignments of a WU:

- preferred deadline has passed
- EUE/UM
- server bug or special configuration

2) Is this what you're looking for: http://fahwiki.net/index.php/Cores :?:

3) This shouldn't happen. It might be a corrupted transfer (network side), a local issue on the machine (bad HDD?) or a poorly generated WU (this is very unlikely to happen, but who knows).

4) We can't tell for sure that the returned results are valid, but when a WU is credited for full credit, it is safe to assume it is valid (there are many safeties in the core and server software that will discard bad WUs). Every other situation (duplicate results from the same machine, EUE, UM, ...) will result in no or partial credit to the donor.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 12:15 am
by toTOW
I think DanGe gave you the right answer for 2) ;)

The FF status is an A2-specific error, and if my memory is correct (and I think it is, because of this message: [16:29:05] fcCheckPointResume: file hashes different -- aborting.), it's a checkpoint error.
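
The log prints each status as a hex byte followed by its decimal value, e.g. "FF (255)" and "64 (100)". A minimal Python sketch of that conversion follows. The lookup table holds only the two codes that appear in this thread, with descriptions taken from the posts and log lines here; the names `KNOWN_CODES` and `describe_core_status` are made up for illustration and this is not an official table.

```python
# Only the two codes seen in this thread. The meanings come from the log
# ("Folding@home Core Shutdown: FINISHED_UNIT") and from toTOW's post,
# not from an official reference.
KNOWN_CODES = {
    0x64: "FINISHED_UNIT (work unit completed normally)",
    0xFF: "A2 checkpoint error, per toTOW's recollection",
}


def describe_core_status(hex_code):
    """Format a CoreStatus hex string the way the client log does."""
    value = int(hex_code, 16)  # "FF" -> 255, "64" -> 100
    meaning = KNOWN_CODES.get(
        value, "not covered here; see the CoreStatus wiki page"
    )
    return f"CoreStatus = {hex_code.upper()} ({value}): {meaning}"


print(describe_core_status("FF"))
print(describe_core_status("64"))
```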

Re: general questions about WUs

Posted: Wed Feb 25, 2009 12:23 am
by alpha754293
1) I'm asking because I see in my logs that my system is sometimes coming across errors that I'm not necessarily picking up on. Although the client seems quite capable of handling those types of errors, and the multiple reassignments between the AS and my client as necessary, I would still prefer that such errors never show up in the first place.

2) Yes, DanGe gave what I was looking for. Hmmm. Bummer that it isn't sorted numerically.

3) If there can be a corrupted WU download, that means it is also quite possible that there could be a corrupted upload, right? From the client side, even though it would state that it has sent the results successfully, wouldn't it also be possible that they were corrupted during the transmission? I would think that data corruption during the transfer can work both ways, and the client doesn't verify the data received on the server before purging the results (unlike when it's downloading, where if the WU is corrupted in some way, it would try to pick it up again, or a different WU as determined by the AS).

4) Like I said, I was just curious because I have been looking into the Fah logs a bit more lately and started seeing some of these errors that are popping up so I was just wondering about them.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 12:29 am
by toTOW
The server also checks the results file integrity. If an upload can't be used, it will answer with an error instead of the usual "[09:04:45] + Results successfully sent [09:04:45] Thank you for your contribution to Folding@Home." messages. For example, I have already seen a "Digital signature don't match" error.

If you get this kind of error, we won't see it in the DB: the server discards the upload and waits for the next upload attempt.
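
As a generic illustration of the kind of integrity check described above, the sketch below has the sender attach a digest to the payload and the receiver recompute and compare it before accepting. The actual Folding@Home upload protocol and signature scheme are not described in this thread, so SHA-256 and the function names `digest` and `receiver_accepts` are assumptions chosen purely for the example.

```python
import hashlib
import hmac


def digest(payload: bytes) -> str:
    """Hex SHA-256 digest of the payload (assumed algorithm, for illustration)."""
    return hashlib.sha256(payload).hexdigest()


def receiver_accepts(payload: bytes, claimed_digest: str) -> bool:
    """Accept the upload only if the recomputed digest matches the claimed one."""
    return hmac.compare_digest(digest(payload), claimed_digest)


results = b"pretend this is the content of a results file"
claimed = digest(results)

# An intact transfer is accepted.
print(receiver_accepts(results, claimed))

# A transfer with one altered byte is rejected, so the sender has to retry.
corrupted = results[:-1] + b"X"
print(receiver_accepts(corrupted, claimed))
```

The point of the design is that the sender can only purge its copy once the receiver has confirmed a match; a mismatch costs a retry, not the data.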

Re: general questions about WUs

Posted: Wed Feb 25, 2009 12:31 am
by alpha754293
Ahh... okay. So it will try to resend the results, or hold them if the number of attempts exceeds some threshold, and then it'll start working on the next WU and autosend them at some point. Okay, thanks.

Is there a way to minimize these errors that I've been getting, with known tested good hardware and network connection? Or is this more along the lines of "nature of the beast", where I let the client take care of it automatically?

Re: general questions about WUs

Posted: Wed Feb 25, 2009 8:26 am
by ^w^ing
For your number 3 and your fahlog, there is something more to it, imo. Your client had just downloaded the WU, yet it tried to resume from a checkpoint straight away. There is a known issue where a WU is trashed somehow and its checkpoint files stay in the work folder. After the client cycles through all the queue slots and downloads a new WU into a slot that still has stray checkpoint files, it tries to resume from them and fails. (There were cases when it actually did resume; I think it had to download a WU from the same project as the one that failed in order to resume from the wrong checkpoint.) And since the current A2 core seems to have problems with checkpoints, often fails to resume, AND leaves the checkpoint of that WU in the work folder when it fails, I would say this is the case.

I suggest you prune the work folder: delete all the files that belong to the currently empty queue slots of your client.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 11:28 am
by MtM
^w^ing wrote:I suggest you prune the work folder: delete all the files that belong to the currently empty queue slots of your client.
+1, slots 6 and 7 are probably unusable until you do. It's also curious to see you got EUEs right up until the point the client drops back to standard loops (i.e. no optimizations); idk what to think about that.

Also, what project finished before your log snippet?
[16:28:25] - Warning: Could not delete all work unit files (5): Core file absent
Might have something to do with the queue corruption issue.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 5:32 pm
by alpha754293
MtM wrote:
Also, what project finished before your log snippet?
[16:28:25] - Warning: Could not delete all work unit files (5): Core file absent
Might have something to do with the queue corruption issue.
Here's the log from the previously fully completed WU:

Code:

[16:21:18] Completed 250000 out of 250000 steps  (100%)
[16:21:20] DynamicWrapper: Finished Work Unit: sleep=1000
[16:21:21] 
[16:21:21] Finished Work Unit:
[16:21:21] - Reading up to 21127248 from "work/wudata_05.trr": Read 21127248
[16:21:21] trr file hash check passed.
[16:21:21] - Reading up to 4502936 from "work/wudata_05.xtc": Read 4502936
[16:21:21] xtc file hash check passed.
[16:21:21] edr file hash check passed.
[16:21:21] logfile size: 175884
[16:21:21] Leaving Run
[16:21:24] Done with run, master node
[16:21:24] - Writing 25987972 bytes of core data to disk...
[16:21:24]   ... Done.
[16:21:28] - Shutting down core
[16:21:28] 
[16:21:28] Folding@home Core Shutdown: FINISHED_UNIT
[16:24:44] CoreStatus = 64 (100)
[16:24:44] Unit 5 finished with 79 percent of time to deadline remaining.
[16:24:44] Updated performance fraction: 0.781968
[16:24:44] Sending work to server
[16:24:44] Project: 2669 (Run 5, Clone 42, Gen 91)


[16:24:44] + Attempting to send results [February 24 16:24:44 UTC]
[16:24:44] - Reading file work/wuresults_05.dat from core
[16:24:44]   (Read 25987972 bytes from disk)
[16:24:44] Connecting to http://171.64.65.56:8080/
[16:28:07] Posted data.
[16:28:07] Initial: 0000; - Uploaded at ~115 kB/s
[16:28:23] - Averaged speed for that direction ~127 kB/s
[16:28:23] + Results successfully sent
[16:28:23] Thank you for your contribution to Folding@Home.
[16:28:23] + Number of Units Completed: 40
I just find this behavior a tad odd because the client runs largely on a dedicated system, and as such, also runs largely uninterrupted and unsupervised. All of the hardware on the system is known good but I'm still getting these errors anyways.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 5:58 pm
by ^w^ing
Just do what I suggested and you should be fine for a long time.

Don't know about the

Code:

[16:28:25] - Warning: Could not delete all work unit files (5): Core file absent
- I think I have that on every WU; it seems like a cosmetic issue to me, as it doesn't cause any serious problems.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 5:59 pm
by alpha754293
^w^ing wrote:Just do what I suggested and you should be fine for a long time then.
I'll probably prune the system once the current WUs are finished.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 6:12 pm
by ^w^ing
I don't think that is really necessary just because of this :D Just delete all the files that don't match the current WU's slot number... Or delete the work folder in between WUs.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 6:15 pm
by alpha754293
^w^ing wrote:I don't think that is really necessary just because of this :D Just delete all the files that don't match the current WU's slot number... Or delete the work folder in between WUs.
Well, see, I don't know which slots are good and which ones are bad. Not until/unless the core rolls over them.

I was just thinking of deleting pretty much the entire client directory and then bringing fah6 and mpiexec back in, so that to the system it'll look like it's starting fresh.

Re: general questions about WUs

Posted: Wed Feb 25, 2009 6:22 pm
by bruce
alpha754293 wrote:Well, see, I don't know which slots are good and which ones are bad. Not until/unless the core rolls over them.
All slots are good or bad depending entirely on which files were not cleaned up after some earlier EUE. There is always one active slot, and when that WU is running, don't mess with it. There MAY also be completed WUs that have not yet been uploaded; those slots contain a file known as "wuresults_*.dat". Any files with other numbers after the underscore are unnecessary.

There's no particular reason to clean out the install folder.
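
The rule above (keep the active slot, keep any slot that still has a wuresults file, treat other numbered files as leftovers) can be sketched in Python as a dry run that only lists candidates and deletes nothing. The helper names `slot_of` and `find_stray_files` are hypothetical, and the assumption that the slot number is the first run of digits after an underscore is inferred from names seen in the logs, such as work/wudata_05.trr and work/wuresults_05.dat. The active slot number has to be read from the "Working on queue slot" line of the current log.

```python
import re
import tempfile
from pathlib import Path


def slot_of(filename):
    """Return the slot number after the underscore, or None if there is none."""
    match = re.search(r"_(\d+)", filename)
    return int(match.group(1)) if match else None


def find_stray_files(work_dir, active_slot):
    """List files that belong to neither the active slot nor an unsent result."""
    files = [p for p in Path(work_dir).iterdir() if p.is_file()]
    # Slots that still hold a finished result waiting to be uploaded.
    unsent = {slot_of(p.name) for p in files if p.name.startswith("wuresults_")}
    keep = unsent | {active_slot}
    return sorted(
        p.name
        for p in files
        if slot_of(p.name) is not None and slot_of(p.name) not in keep
    )


# Demo on a throwaway directory; "other.txt" is a made-up name that shows
# files without a slot number are left alone.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in [
        "wudata_05.trr", "wuresults_05.dat",  # finished, not yet uploaded
        "wudata_06.trr", "wudata_07.xtc",     # leftovers from failed attempts
        "wudata_08.trr",                      # active slot
        "other.txt",
    ]:
        (Path(demo_dir) / name).touch()
    print(find_stray_files(demo_dir, active_slot=8))
```

Reviewing the printed list by hand before removing anything keeps the active WU and any unsent results safe.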