Assignment server goof up?

BrokenWolf · Post by **BrokenWolf** » Tue Oct 06, 2009 1:01 am

I have 2 separate physical systems that are running the exact same WU: p2671, R56/C72/G107. They were assigned this WU within 18 minutes of each other (over 5 hours ago). They have different Client IDs in FAHlog.txt. Should I trash one and let it start on another (if the assignment server lets it)?

Code:

[19:07:38] Initial: 0000; - Receiving payload (expected size: 4833821)
[19:07:40] - Downloaded at ~2360 kB/s
[19:07:40] - Averaged speed for that direction ~3363 kB/s
[19:07:40] + Received work.
[19:07:40] Trying to send all finished work units
[19:07:40] + No unsent completed units remaining.
[19:07:40] + Closed connections
[19:07:40] 
[19:07:40] + Processing work unit
[19:07:40] Core required: FahCore_a2.exe
[19:07:40] Core found.
[19:07:40] Working on queue slot 07 [October 5 19:07:40 UTC]
[19:07:40] + Working ...
[19:07:40] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 07 -priority 96 -checkpoint 15 -verbose -lifeline 7874 -version 624'

[19:07:40] 
[19:07:40] *------------------------------*
[19:07:40] Folding@Home Gromacs SMP Core
[19:07:40] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[19:07:40] 
[19:07:40] Preparing to commence simulation
[19:07:40] - Ensuring status. Please wait.
[19:07:41] Called DecompressByteArray: compressed_data_size=4833309 data_size=24033973, decompressed_data_size=24033973 diff=0
[19:07:41] - Digital signature verified
[19:07:41] 
[19:07:41] Project: 2671 (Run 56, Clone 72, Gen 107)
[19:07:41] 
[19:07:41] Assembly optimizations on if available.
[19:07:41] Entering M.D.
[19:07:51] un 56, Clone 72, Gen 107)
[19:07:51] 
[19:07:51] Entering M.D.
[19:12:29] pleted 2500 out of 250000 steps  (1%)

From the other system:

Code:

[19:23:11] Initial: 0000; - Receiving payload (expected size: 4833821)
[19:23:14] - Downloaded at ~1573 kB/s
[19:23:14] - Averaged speed for that direction ~1423 kB/s
[19:23:14] + Received work.
[19:23:14] Trying to send all finished work units
[19:23:14] + No unsent completed units remaining.
[19:23:14] + Closed connections
[19:23:14] 
[19:23:14] + Processing work unit
[19:23:14] Core required: FahCore_a2.exe
[19:23:14] Core found.
[19:23:14] Working on queue slot 06 [October 5 19:23:14 UTC]
[19:23:14] + Working ...
[19:23:14] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -priority 96 -checkpoint 15 -verbose -lifeline 24175 -version 624'

[19:23:14] 
[19:23:14] *------------------------------*
[19:23:14] Folding@Home Gromacs SMP Core
[19:23:14] Version 2.09 (Sun Aug 30 03:43:28 CEST 2009)
[19:23:14] 
[19:23:14] Preparing to commence simulation
[19:23:14] - Ensuring status. Please wait.
[19:23:14] Files status OK
[19:23:15] - Expanded 4833309 -> 24033973 (decompressed 497.2 percent)
[19:23:15] Called DecompressByteArray: compressed_data_size=4833309 data_size=24033973, decompressed_data_size=24033973 diff=0
[19:23:15] - Digital signature verified
[19:23:15] 
[19:23:15] Project: 2671 (Run 56, Clone 72, Gen 107)
[19:23:15] 
[19:23:15] Assembly optimizations on if available.
[19:23:15] Entering M.D.
[19:23:25] un 56, Clone 72, Gen 107)
[19:23:25] 
[19:23:25] Entering M.D.
[19:27:56] pleted 2500 out of 250000 steps  (1%)
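One way to check this without eyeballing the logs is to pull the PRCG out of each FAHlog.txt and compare. Below is a minimal sketch in Python, assuming the `Project: N (Run N, Clone N, Gen N)` line format seen in the logs above. The function names are hypothetical and not part of the client.

```python
import re

# Matches log lines such as:
#   [19:07:41] Project: 2671 (Run 56, Clone 72, Gen 107)
# Truncated echoes like "un 56, Clone 72, Gen 107)" are ignored on purpose,
# because the pattern requires the leading "Project:".
PRCG_LINE = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)

def prcgs_in_log(log_text):
    """Return the set of (project, run, clone, gen) tuples found in a log."""
    return {tuple(int(n) for n in m.groups()) for m in PRCG_LINE.finditer(log_text)}

def shared_prcgs(log_a, log_b):
    """Return the PRCG tuples that appear in both logs."""
    return prcgs_in_log(log_a) & prcgs_in_log(log_b)

if __name__ == "__main__":
    log_a = "[19:07:41] Project: 2671 (Run 56, Clone 72, Gen 107)\n"
    log_b = "[19:23:15] Project: 2671 (Run 56, Clone 72, Gen 107)\n"
    print(shared_prcgs(log_a, log_b))
```

In practice you would read each machine's FAHlog.txt into a string and pass the two strings to `shared_prcgs`; a non-empty result means both clients were handed the same WU.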

BW

7im · Post by **7im** » Tue Oct 06, 2009 5:15 am

To answer your thread title question, probably not, no. When you get credit for both work units, you will know for sure.

max · Post by **max** » Tue Oct 06, 2009 1:29 pm

I have 2 Linux SMP clients crunching the same WU (P2671 (R6, C31, G108)) and 2 GPU clients with the same WU (P5771 (R5, C54, G372)) currently. The SMP clients are on completely different machines, GPUs have different machine IDs. User IDs as reported by the log file are completely different. No cloning involved. I've seen similar behavior before. What gives?

7im · Post by **7im** » Tue Oct 06, 2009 4:38 pm

Under rare circumstances (about which I know very little), work units with the same PRCG info ARE sent out to more than one client.

One example I have seen in the past (and this is my interpretation of the events): they needed some results back urgently for a deadline, so they sent out the work units more than once to make sure they got the results back, and got them back quickly. With the WU folding on more than one client, the chance of it landing on a faster computer, and thus of a faster return, increases.

But as a general rule, the project does not send out the same work unit multiple times. I would guess, though, that if a server goes down, the records of which work units were sent out and/or received might be lost, and so the WUs are sent out again. Naturally, this doesn't explain the above examples, IMO, but this is not unexpected, just like an occasional EUE is not unexpected.

And as long as you get credit for both work units, rare examples shouldn't be a big concern (IMO).

bollix47 · Post by **bollix47** » Tue Oct 06, 2009 6:02 pm

Same situation here on two different computers:

Code:

[13:18:43] Connecting to http://171.67.108.24:8080/
[13:18:49] Posted data.
[13:18:49] Initial: 0000; - Receiving payload (expected size: 4838995)
[13:18:54] - Downloaded at ~945 kB/s
[13:18:54] - Averaged speed for that direction ~1048 kB/s
[13:18:54] + Received work.
[13:18:54] Trying to send all finished work units
[13:18:54] + No unsent completed units remaining.
[13:18:54] + Closed connections
[13:18:54] 
[13:18:54] + Processing work unit
[13:18:54] At least 4 processors must be requested.Core required: FahCore_a2.exe
[13:18:54] Core found.
[13:18:54] Working on queue slot 00 [October 6 13:18:54 UTC]
[13:18:54] + Working ...
[13:18:54] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -checkpoint 30 -verbose -lifeline 6596 -version 624'

[13:18:54] 
[13:18:54] *------------------------------*
[13:18:54] Folding@Home Gromacs SMP Core
[13:18:54] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:18:54] 
[13:18:54] Preparing to commence simulation
[13:18:54] - Ensuring status. Please wait.
[13:18:55] Called DecompressByteArray: compressed_data_size=4838483 data_size=24038369, decompressed_data_size=24038369 diff=0
[13:18:55] - Digital signature verified
[13:18:55] 
[13:18:55] Project: 2671 (Run 11, Clone 74, Gen 108)
[13:18:55] 
[13:18:55] Assembly optimizations on if available.
[13:18:55] Entering M.D.
[13:19:05] un 11, Clone 74, Gen 108)
[13:19:05] 
[13:19:05] Entering M.D.

Code:

[13:27:35] - Successful: assigned to (171.67.108.24).
[13:27:35] + News From Folding@Home: Welcome to Folding@Home
[13:27:35] Loaded queue successfully.
[13:27:46] + Closed connections
[13:27:46] 
[13:27:46] + Processing work unit
[13:27:46] Core required: FahCore_a2.exe
[13:27:46] Core found.
[13:27:46] Working on queue slot 04 [October 6 13:27:46 UTC]
[13:27:46] + Working ...
[13:27:46] 
[13:27:46] *------------------------------*
[13:27:46] Folding@Home Gromacs SMP Core
[13:27:46] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:27:46] 
[13:27:46] Preparing to commence simulation
[13:27:46] - Ensuring status. Please wait.
[13:27:56] - Assembly optimizations manually forced on.
[13:27:56] - Not checking prior termination.
[13:27:57] - Expanded 4838483 -> 24038369 (decompressed 496.8 percent)
[13:27:57] Called DecompressByteArray: compressed_data_size=4838483 data_size=24038369, decompressed_data_size=24038369 diff=0
[13:27:57] - Digital signature verified
[13:27:57] 
[13:27:57] Project: 2671 (Run 11, Clone 74, Gen 108)
[13:27:57] 
[13:27:57] Assembly optimizations on if available.
[13:27:57] Entering M.D.

FaaR · Post by **FaaR** » Fri Oct 09, 2009 4:27 pm

7im wrote:But as a general rule, the project does not send out the same work unit multiple times.

For reasons of error correction/redundancy, WUs should be processed at least twice.

SETI@Home has always sent out multiple copies of their packets for this very reason, to minimize the risk of calculation errors creeping in. After all, you have no control over the client's PC in distributed computing. It could have faulty RAM, be way too overclocked, get zapped by a cosmic ray that randomly flips bits in a register etc.

John Naylor · Post by **John Naylor** » Fri Oct 09, 2009 5:08 pm

FaaR wrote:
7im wrote:But as a general rule, the project does not send out the same work unit multiple times.
For reasons of error correction/redundancy, WUs should be processed at least twice.

SETI@Home has always sent out multiple copies of their packets for this very reason, to minimize the risk of calculation errors creeping in. After all, you have no control over the client's PC in distributed computing. It could have faulty RAM, be way too overclocked, get zapped by a cosmic ray that randomly flips bits in a register etc.

The Folding@home project values accuracy above all else, but the cores have so many checks in them to ensure that work units are accurate the first time (nearly all of us have experienced work unit restarts when checksums don't match) that the Pande Group is confident that results processed the first time are correct. Things like logic errors due to memory failure are probably accounted for, otherwise they would not feel confident enough to send out units only once. SETI@home has open source cores and allows modification, which is why they need to obtain a quorum before they are sure of the results. Folding@home stays closed source and has built-in checks to negate this requirement, which speeds up the research because all (most) units only need to be run once.

FaaR · Post by **FaaR** » Fri Oct 09, 2009 6:34 pm

It may have error checks during calculation, but once a frame's been processed, then what?

Data in RAM or on disk is still vulnerable - Ars Technica did a piece yesterday or the day before on a recent study showing that RAM errors are orders of magnitude more common than previously believed, and with tens, if not hundreds, of thousands of CPUs processing folding data... Almost all of those are consumer systems that lack parity RAM modules and ECC protection for buses, registers and caches.

While it's still statistically very unlikely that any given result contains an error, it's not wise to trust the laws of probability that something WON'T happen. For example, we don't get in a car accident every time we get behind the wheel, or even every thousandth time, but most of us still put on our seatbelts anyway. Most cars today have airbags.

With folding@home having processed data in the tera/petaflop range for years now, it is pretty much a total certainty that errors WILL occur, or already have.
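To put rough numbers on that argument: if each WU independently has probability p of being silently corrupted, the chance that at least one of n WUs is corrupted is 1 - (1 - p)^n. The sketch below uses made-up values for p and n purely for illustration. Neither is a measured figure for Folding@home, and the independence assumption is a simplification.

```python
import math

def prob_at_least_one_error(p_per_wu, n_wus):
    """P(at least one corrupted WU) = 1 - (1 - p)^n, assuming independence."""
    # log1p/expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(n_wus * math.log1p(-p_per_wu))

if __name__ == "__main__":
    # Both the per-WU error rate and the WU counts are illustrative assumptions.
    for n in (1_000, 1_000_000, 100_000_000):
        print(n, prob_at_least_one_error(1e-6, n))
```

Even with a very small per-WU error rate, the probability climbs toward 1 as the number of WUs grows, which is the point being made about long-running projects.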

ChasR · Post by **ChasR** » Fri Oct 09, 2009 7:44 pm

Interestingly, I had p2671 r56/c72/g106 assigned twice to different machines. They downloaded on different days, 10/1 and 10/2, and were both returned on 10/3, one at 12:30 EDT and the other at 12:46 EDT.

Looking through the completed WUs (data from HFM), I see that duplicate assignments of p2671 aren't all that rare.
p2671 (9/16/107) on 10/2
p2671 (14/29/107) on 10/3
p2671 (16/15/106) on 9/29
p2671 (19/6/107) on 10/3
p2671 (2/73/105) on 9/24
p2671 (27/13/108) on 10/8
p2671 (31/99/105) on 9/25
p2671 (51/99/106) on 10/1
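A list like this can be produced mechanically by counting how often each PRCG shows up among completed WUs. Here is a minimal sketch, assuming the completed WUs are already available as (project, run, clone, gen) tuples. The input list is hypothetical and does not reflect HFM's actual export format.

```python
from collections import Counter

def duplicate_assignments(completed):
    """Return PRCG tuples that appear more than once, with their counts."""
    counts = Counter(completed)
    return {prcg: n for prcg, n in counts.items() if n > 1}

if __name__ == "__main__":
    # Hypothetical sample: one WU returned by two different machines.
    completed = [
        (2671, 9, 16, 107),
        (2671, 14, 29, 107),
        (2671, 9, 16, 107),
    ]
    print(duplicate_assignments(completed))
```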

7im · Post by **7im** » Fri Oct 09, 2009 8:31 pm

FaaR wrote:It may have error checks during calculation, but once a frame's been processed, then what?

Data in RAM or on disk is still vulnerable - Ars Technica did a piece yesterday or the day before on a recent study showing that RAM errors are orders of magnitude more common than previously believed, and with tens, if not hundreds, of thousands of CPUs processing folding data... Almost all of those are consumer systems that lack parity RAM modules and ECC protection for buses, registers and caches.

While it's still statistically very unlikely that any given result contains an error, it's not wise to trust the laws of probability that something WON'T happen. For example, we don't get in a car accident every time we get behind the wheel, or even every thousandth time, but most of us still put on our seatbelts anyway. Most cars today have airbags.

With folding@home having processed data in the tera/petaflop range for years now, it is pretty much a total certainty that errors WILL occur, or already have.

Again, Stanford does NOT need to process the same work unit more than once for the sake of data error checking. Stanford uses other methods, such as statistical overlap. Because of the Run, Clone, and Gen variations in the work units, they know how the results are supposed to match up with each other. If one doesn't match up with its neighbors as expected, they can dump the result and re-run that individual work unit, or use the statistical overlap from its neighbors to fill in the picture.

Sorry to PG if I butchered this explanation. I haven't discussed or reread this old topic in quite a while.

@ChasR, et al. - I don't know why the duplicates have increased lately. You would probably need to ask the Principal Investigator for that particular project number.

codysluder · Post by **codysluder** » Fri Oct 16, 2009 10:03 pm

John Naylor wrote:SETI@home has open source cores and allows modification, which is why they need to obtain a quorum before they are sure of the results.

IMHO SETI is stuck with being wrong in that area. They've had real problems with people submitting bogus results just to earn points and the only way they can combat it is to waste people's time reprocessing the same WU over and over until they're confident they've got a consensus about which answer is believable.
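For illustration only, the quorum idea can be sketched as follows: a result is accepted once enough independent returns agree on it. This is a toy majority vote with hypothetical names and a hypothetical `min_agree` threshold. It is not SETI@home's actual validator.

```python
from collections import Counter

def quorum_result(results, min_agree=2):
    """Return the value reported by at least min_agree clients, else None."""
    if not results:
        return None
    # most_common(1) gives the single most frequent value and its count.
    value, count = Counter(results).most_common(1)[0]
    return value if count >= min_agree else None

if __name__ == "__main__":
    print(quorum_result(["a1f3", "a1f3", "ffff"]))  # two clients agree
    print(quorum_result(["a1f3", "ffff"]))          # no agreement yet
```

The cost is visible in the second call: with no agreement, the same WU has to be issued again, which is the reprocessing overhead described above.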

When SETI does run out of work, they have no qualms about reissuing work that has already been done. The FAH servers have also run out of work occasionally (often during school holiday times when few researchers may be working) but there always have been plenty of new projects waiting in the wings and we only have to wait while someone sets up a new project or extends an existing one. There is no danger of FAH "finishing" in the foreseeable future, which also means they don't want to waste your time duplicating work unnecessarily.
