Assignment server goof up?

Moderators: Site Moderators, FAHC Science Team

BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Assignment server goof up?

Post by BrokenWolf »

I have 2 separate physical systems that are running the exact same WU: p2671, R56/C72/G107. They were assigned this WU within 18 minutes of each other (over 5 hours ago). They have different Client IDs in FAHlog.txt. Should I trash one and let it start on another (if the assignment server lets it)?

Code: Select all

[19:07:38] Initial: 0000; - Receiving payload (expected size: 4833821)
[19:07:40] - Downloaded at ~2360 kB/s
[19:07:40] - Averaged speed for that direction ~3363 kB/s
[19:07:40] + Received work.
[19:07:40] Trying to send all finished work units
[19:07:40] + No unsent completed units remaining.
[19:07:40] + Closed connections
[19:07:40] 
[19:07:40] + Processing work unit
[19:07:40] Core required: FahCore_a2.exe
[19:07:40] Core found.
[19:07:40] Working on queue slot 07 [October 5 19:07:40 UTC]
[19:07:40] + Working ...
[19:07:40] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 07 -priority 96 -checkpoint 15 -verbose -lifeline 7874 -version 624'

[19:07:40] 
[19:07:40] *------------------------------*
[19:07:40] Folding@Home Gromacs SMP Core
[19:07:40] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[19:07:40] 
[19:07:40] Preparing to commence simulation
[19:07:40] - Ensuring status. Please wait.
[19:07:41] Called DecompressByteArray: compressed_data_size=4833309 data_size=24033973, decompressed_data_size=24033973 diff=0
[19:07:41] - Digital signature verified
[19:07:41] 
[19:07:41] Project: 2671 (Run 56, Clone 72, Gen 107)
[19:07:41] 
[19:07:41] Assembly optimizations on if available.
[19:07:41] Entering M.D.
[19:07:51] un 56, Clone 72, Gen 107)
[19:07:51] 
[19:07:51] Entering M.D.
[19:12:29] pleted 2500 out of 250000 steps  (1%)
From the other system:

Code: Select all

[19:23:11] Initial: 0000; - Receiving payload (expected size: 4833821)
[19:23:14] - Downloaded at ~1573 kB/s
[19:23:14] - Averaged speed for that direction ~1423 kB/s
[19:23:14] + Received work.
[19:23:14] Trying to send all finished work units
[19:23:14] + No unsent completed units remaining.
[19:23:14] + Closed connections
[19:23:14] 
[19:23:14] + Processing work unit
[19:23:14] Core required: FahCore_a2.exe
[19:23:14] Core found.
[19:23:14] Working on queue slot 06 [October 5 19:23:14 UTC]
[19:23:14] + Working ...
[19:23:14] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -priority 96 -checkpoint 15 -verbose -lifeline 24175 -version 624'

[19:23:14] 
[19:23:14] *------------------------------*
[19:23:14] Folding@Home Gromacs SMP Core
[19:23:14] Version 2.09 (Sun Aug 30 03:43:28 CEST 2009)
[19:23:14] 
[19:23:14] Preparing to commence simulation
[19:23:14] - Ensuring status. Please wait.
[19:23:14] Files status OK
[19:23:15] - Expanded 4833309 -> 24033973 (decompressed 497.2 percent)
[19:23:15] Called DecompressByteArray: compressed_data_size=4833309 data_size=24033973, decompressed_data_size=24033973 diff=0
[19:23:15] - Digital signature verified
[19:23:15] 
[19:23:15] Project: 2671 (Run 56, Clone 72, Gen 107)
[19:23:15] 
[19:23:15] Assembly optimizations on if available.
[19:23:15] Entering M.D.
[19:23:25] un 56, Clone 72, Gen 107)
[19:23:25] 
[19:23:25] Entering M.D.
[19:27:56] pleted 2500 out of 250000 steps  (1%)
BW
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760W Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Assignment server goof up?

Post by 7im »

To answer your thread title question, probably not, no. When you get credit for both work units, you will know for sure. ;)
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
max
Posts: 4
Joined: Thu Jul 17, 2008 10:27 pm

Re: Assignment server goof up?

Post by max »

I have 2 Linux SMP clients crunching the same WU (P2671 (R6, C31, G108)) and 2 GPU clients with the same WU (P5771 (R5, C54, G372)) currently. The SMP clients are on completely different machines, GPUs have different machine IDs. User IDs as reported by the log file are completely different. No cloning involved. I've seen similar behavior before. What gives?
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Assignment server goof up?

Post by 7im »

Under rare circumstances (about which I know very little), work units with the same PRCG info ARE sent out to more than one client.

One example I have seen in the past (and this is my interpretation of the events): they needed some results back urgently for a deadline, so they sent out the work units more than once to make sure they got the results back, and got them back quickly. Folding the WU on more than one client increases the chance that one of them is a faster computer, and thus the chance of a faster return.

But as a general rule, the project does not send out the same work unit multiple times. I would guess that at times, if a server goes down, the records of which work units were sent out and/or received might be lost, and so the WUs are sent out again. Naturally, this doesn't explain the above examples, IMO, but this is not unexpected, just like an occasional EUE is not unexpected. ;) And as long as you get credit for both work units, rare examples shouldn't be a big concern (IMO).
bollix47
Posts: 2959
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Assignment server goof up?

Post by bollix47 »

Same situation here on two different computers:

Code: Select all

[13:18:43] Connecting to http://171.67.108.24:8080/
[13:18:49] Posted data.
[13:18:49] Initial: 0000; - Receiving payload (expected size: 4838995)
[13:18:54] - Downloaded at ~945 kB/s
[13:18:54] - Averaged speed for that direction ~1048 kB/s
[13:18:54] + Received work.
[13:18:54] Trying to send all finished work units
[13:18:54] + No unsent completed units remaining.
[13:18:54] + Closed connections
[13:18:54] 
[13:18:54] + Processing work unit
[13:18:54] At least 4 processors must be requested.Core required: FahCore_a2.exe
[13:18:54] Core found.
[13:18:54] Working on queue slot 00 [October 6 13:18:54 UTC]
[13:18:54] + Working ...
[13:18:54] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -checkpoint 30 -verbose -lifeline 6596 -version 624'

[13:18:54] 
[13:18:54] *------------------------------*
[13:18:54] Folding@Home Gromacs SMP Core
[13:18:54] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:18:54] 
[13:18:54] Preparing to commence simulation
[13:18:54] - Ensuring status. Please wait.
[13:18:55] Called DecompressByteArray: compressed_data_size=4838483 data_size=24038369, decompressed_data_size=24038369 diff=0
[13:18:55] - Digital signature verified
[13:18:55] 
[13:18:55] Project: 2671 (Run 11, Clone 74, Gen 108)
[13:18:55] 
[13:18:55] Assembly optimizations on if available.
[13:18:55] Entering M.D.
[13:19:05] un 11, Clone 74, Gen 108)
[13:19:05] 
[13:19:05] Entering M.D.

Code: Select all

[13:27:35] - Successful: assigned to (171.67.108.24).
[13:27:35] + News From Folding@Home: Welcome to Folding@Home
[13:27:35] Loaded queue successfully.
[13:27:46] + Closed connections
[13:27:46] 
[13:27:46] + Processing work unit
[13:27:46] Core required: FahCore_a2.exe
[13:27:46] Core found.
[13:27:46] Working on queue slot 04 [October 6 13:27:46 UTC]
[13:27:46] + Working ...
[13:27:46] 
[13:27:46] *------------------------------*
[13:27:46] Folding@Home Gromacs SMP Core
[13:27:46] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:27:46] 
[13:27:46] Preparing to commence simulation
[13:27:46] - Ensuring status. Please wait.
[13:27:56] - Assembly optimizations manually forced on.
[13:27:56] - Not checking prior termination.
[13:27:57] - Expanded 4838483 -> 24038369 (decompressed 496.8 percent)
[13:27:57] Called DecompressByteArray: compressed_data_size=4838483 data_size=24038369, decompressed_data_size=24038369 diff=0
[13:27:57] - Digital signature verified
[13:27:57] 
[13:27:57] Project: 2671 (Run 11, Clone 74, Gen 108)
[13:27:57] 
[13:27:57] Assembly optimizations on if available.
[13:27:57] Entering M.D.
FaaR
Posts: 66
Joined: Tue Aug 19, 2008 1:32 am

Re: Assignment server goof up?

Post by FaaR »

7im wrote: But as a general rule, the project does not send out the same work unit multiple times.
For reasons of error correction/redundancy, WUs should be processed at least twice.

SETI@Home has always sent out multiple copies of their packets for this very reason, to minimize the risk of calculation errors creeping in. After all, you have no control over the client's PC in distributed computing. It could have faulty RAM, be way too overclocked, get zapped by a cosmic ray that randomly flips bits in a register etc.
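
To make the redundancy idea concrete, here is a minimal sketch of quorum checking in Python. It is not SETI@home's actual validator: the function name `quorum_result`, the `quorum` parameter, and the idea of comparing results as short checksum strings are all assumptions made for illustration.

```python
from collections import Counter

def quorum_result(results, quorum=2):
    """Return the value reported by at least `quorum` clients, else None.

    `results` is a list of result fingerprints (hypothetical checksum
    strings), one per client that returned the same work unit.
    """
    if not results:
        return None
    # Pick the most frequently reported fingerprint and its count.
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

# Two clients agree, one disagrees (e.g. bad RAM): the majority is accepted.
print(quorum_result(["a1f3", "a1f3", "ffff"]))  # a1f3
# No agreement yet: the work unit would have to be sent out again.
print(quorum_result(["a1f3", "ffff"]))          # None
```

The cost is visible in the second call: every accepted result needs at least `quorum` full runs of the same work unit.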
John Naylor
Posts: 357
Joined: Mon Dec 03, 2007 4:36 pm
Hardware configuration: Q9450 OC @ 3.2GHz (Win7 Home Premium) - SMP2
E7500 OC @ 3.66GHz (Windows Home Server) - SMP2
i5-3750k @ 3.8GHz (Win7 Pro) - SMP2
Location: University of Birmingham, UK

Re: Assignment server goof up?

Post by John Naylor »

FaaR wrote:
7im wrote: But as a general rule, the project does not send out the same work unit multiple times.
For reasons of error correction/redundancy, WUs should be processed at least twice.

SETI@Home has always sent out multiple copies of their packets for this very reason, to minimize the risk of calculation errors creeping in. After all, you have no control over the client's PC in distributed computing. It could have faulty RAM, be way too overclocked, get zapped by a cosmic ray that randomly flips bits in a register etc.
The Folding@home project values accuracy above all else, but the cores have so many checks in them to ensure that work units are accurate the first time (nearly all of us have experienced work unit restarts when checksums don't match) that the Pande Group is confident that results processed once are correct. Things like logic errors due to memory failure are probably accounted for, otherwise they would not feel confident enough to send out units only once. SETI@home has open source cores and allows modification, which is why they need to obtain a quorum before they are sure of the results. Folding@home stays closed source and has built-in checks to negate this requirement, which speeds the research because all (most) units only need to be run once.
Folding whatever I'm sent since March 2006 :) Beta testing since October 2006. www.FAH-Addict.net Administrator since August 2009.
FaaR
Posts: 66
Joined: Tue Aug 19, 2008 1:32 am

Re: Assignment server goof up?

Post by FaaR »

It may have error checks during calculation, but once a frame's been processed, then what?

Data in RAM or on disk is still vulnerable - Ars Technica did a piece yesterday or the day before on a recent study that shows RAM errors are orders of magnitude more common than previously believed, and with tens, if not hundreds of thousands of CPUs processing folding data... Almost all of those are consumer systems that lack parity RAM modules and lack ECC protection for buses, registers and caches.

While it's still statistically very unlikely there will be an error, it's not wise to trust in the law of probabilities that something WON'T happen. For example, we don't get in a car accident every time we get behind the wheel, or even every thousandth time, but most of us still put on our seatbelts anyway. Most cars today have airbags.

With folding@home having processed data in the tera/petaflop range for years now, the chance that errors WILL - or already have - occurred is pretty much a total certainty. :)
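
As a rough back-of-the-envelope version of that argument, assume each returned work unit is silently corrupted with some small, independent probability p, and that N work units have been returned. Both p and N below are illustrative placeholders, not measured Folding@home figures.

```latex
\[
P(\text{at least one corrupted result}) = 1 - (1 - p)^{N} \approx 1 - e^{-pN}
\]
\[
p = 10^{-6},\; N = 10^{7} \;\Rightarrow\; pN = 10, \quad 1 - e^{-10} \approx 0.99995
\]
```

Under those assumed numbers, at least one bad result is all but certain, even though any single work unit is almost surely fine.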
ChasR
Posts: 402
Joined: Sun Dec 02, 2007 5:36 am
Location: Atlanta, GA

Re: Assignment server goof up?

Post by ChasR »

Interestingly I had p2671 r56/c72/g106 assigned twice to different machines. They downloaded on different days, 10/1 and 10/2 and were both returned on 10/3 one at 12:30 EDT and the other at 12:46 EDT.

Looking through the completed WUs (data from HFM), I see that duplicate assignments of p2671 aren't all that rare.
p2671 (9/16/107) on 10/2
p2671 (14/29/107) on 10/3
p2671 (16/15/106) on 9/29
p2671 (19/6/107) on 10/3
p2671 (2/73/105) on 9/24
p2671 (27/13/108) on 10/8
p2671 (31/99/105) on 9/25
p2671 (51/99/106) on 10/1
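
For anyone who wants to check their own machines without a monitoring tool, the following is a minimal Python sketch that scans FAHlog.txt contents for "Project: ... (Run ..., Clone ..., Gen ...)" lines and reports any PRCG seen on more than one machine. The function names and the machine labels (`box_a` and so on) are made up for illustration; the regular expression is based only on the log lines quoted in this thread.

```python
import re
from collections import defaultdict

# Matches lines such as: "[19:07:41] Project: 2671 (Run 56, Clone 72, Gen 107)"
PRCG_RE = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)

def extract_prcgs(log_text):
    """Return the set of (project, run, clone, gen) tuples found in one log."""
    return {tuple(int(g) for g in m.groups()) for m in PRCG_RE.finditer(log_text)}

def find_duplicates(logs_by_machine):
    """Map each PRCG seen on more than one machine to the sorted machine names."""
    seen = defaultdict(set)
    for machine, text in logs_by_machine.items():
        for prcg in extract_prcgs(text):
            seen[prcg].add(machine)
    return {prcg: sorted(ms) for prcg, ms in seen.items() if len(ms) > 1}

# Hypothetical machine labels; the log lines are copied from posts above.
logs = {
    "box_a": "[19:07:41] Project: 2671 (Run 56, Clone 72, Gen 107)\n",
    "box_b": "[19:23:15] Project: 2671 (Run 56, Clone 72, Gen 107)\n",
    "box_c": "[13:18:55] Project: 2671 (Run 11, Clone 74, Gen 108)\n",
}
print(find_duplicates(logs))  # {(2671, 56, 72, 107): ['box_a', 'box_b']}
```

In practice you would read each machine's FAHlog.txt into the dictionary instead of using inline strings. Truncated lines such as "un 56, Clone 72, Gen 107)" are ignored because they lack the "Project:" prefix.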
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Assignment server goof up?

Post by 7im »

FaaR wrote: It may have error checks during calculation, but once a frame's been processed, then what?

Data in RAM or on disk is still vulnerable - Ars Technica did a piece yesterday or the day before on a recent study that shows RAM errors are orders of magnitude more common than previously believed, and with tens, if not hundreds of thousands of CPUs processing folding data... Almost all of those are consumer systems that lack parity RAM modules and lack ECC protection for buses, registers and caches.

While it's still statistically very unlikely there will be an error, it's not wise to trust in the law of probabilities that something WON'T happen. For example, we don't get in a car accident every time we get behind the wheel, or even every thousandth time, but most of us still put on our seatbelts anyway. Most cars today have airbags.

With folding@home having processed data in the tera/petaflop range for years now, the chance that errors WILL - or already have - occurred is pretty much a total certainty. :)
Again, Stanford does NOT need to process the same work unit more than once for the sake of data error checking. Stanford uses other methods, such as statistical overlap. Because of the Run, Clone, and Gen variations in the work units, they know how the results are supposed to match up with each other. If one doesn't match up with its neighbors as expected, they can dump the result and re-run that individual work unit, or use the statistical overlap from its neighbors to fill in the picture.

Sorry to PG if I butchered this explanation. I haven't discussed or reread this old topic in quite a while.

@ChasR, et al. - I don't know why the duplicates have increased as of late. You probably need to ask the Principal Investigator for that particular project #.
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: Assignment server goof up?

Post by codysluder »

John Naylor wrote: SETI@home has open source cores and allows modification, which is why they need to obtain a quorum before they are sure of the results.
IMHO SETI is stuck with being wrong in that area. They've had real problems with people submitting bogus results just to earn points and the only way they can combat it is to waste people's time reprocessing the same WU over and over until they're confident they've got a consensus about which answer is believable.

When SETI does run out of work, they have no qualms about reissuing work that has already been done. The FAH servers have also run out of work occasionally (often during school holiday times when few researchers may be working) but there always have been plenty of new projects waiting in the wings and we only have to wait while someone sets up a new project or extends an existing one. There is no danger of FAH "finishing" in the foreseeable future, which also means they don't want to waste your time duplicating work unnecessarily.