Project: 2975 (Run 184, Clone 0, Gen 3)

bollix47
Posts: 2982
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bollix47 »

Possible bad WU.

Previous results for this project gave a TPF of ~36 min.

This one was taking over 3 hours per frame and wasn't going to make the deadline.

Code:

[11:49:33] Project: 2975 (Run 184, Clone 0, Gen 3)
[11:49:33] 
[11:49:33] Assembly optimizations on if available.
[11:49:33] Entering M.D.
[11:49:39] Using Gromacs checkpoints
[11:49:39] Mapping NT from 1 to 1 
[11:49:40] Resuming from checkpoint
[11:49:40] Verified work/wudata_06.log
[11:49:40] Verified work/wudata_06.trr
[11:49:40] Verified work/wudata_06.xtc
[11:49:40] Verified work/wudata_06.edr
[11:49:43] Completed 16510 out of 2500000 steps  (0%)
[12:51:38] Completed 25000 out of 2500000 steps  (1%)
[15:53:39] Completed 50000 out of 2500000 steps  (2%)
[16:47:08] ***** Got a SIGTERM signal (2)
[16:47:08] Killing all core threads
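The TPF can be read straight off the "Completed" lines in the log above. The following is a minimal sketch, not part of the client: it assumes one frame equals 1% of the total steps (hence the hypothetical `frames=100` parameter) and that the log timestamps are all on the same day or wrap past midnight at most once between consecutive lines.

```python
import re
from datetime import datetime, timedelta

# "Completed" lines copied from the log above.
LOG = """\
[11:49:43] Completed 16510 out of 2500000 steps  (0%)
[12:51:38] Completed 25000 out of 2500000 steps  (1%)
[15:53:39] Completed 50000 out of 2500000 steps  (2%)
"""

PATTERN = re.compile(r"\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps")


def parse(log):
    """Return (timestamp, steps_done, total_steps) for each progress line."""
    points = []
    for match in PATTERN.finditer(log):
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        points.append((stamp, int(match.group(2)), int(match.group(3))))
    return points


def time_per_frame(points, frames=100):
    """Estimate TPF for each pair of consecutive progress lines.

    Assumption: a frame is total_steps / frames steps (1% by default).
    The elapsed time is scaled to a full frame, so a partial interval
    (such as the one after resuming from a checkpoint) is still usable.
    """
    estimates = []
    for (t0, s0, total), (t1, s1, _) in zip(points, points[1:]):
        elapsed = (t1 - t0) % timedelta(days=1)  # tolerate a midnight wrap
        steps_per_frame = total / frames
        estimates.append(elapsed * (steps_per_frame / (s1 - s0)))
    return estimates


tpfs = time_per_frame(parse(LOG))
for tpf in tpfs:
    print(tpf)
```

Both intervals come out at roughly 3 hours per frame, which suggests the slowdown was steady rather than a one-off stall after the checkpoint resume.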
Jesse_V
Site Moderator
Posts: 2850
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by Jesse_V »

Hmm, it seems unusual to me that a bad WU would manifest itself as taking an exceptional amount of time to complete. My first thought is that there is something else going on here. What's your hardware? Are there background processes running? It just seems to me that the WU is fine and is only taking a long time on your computer because of something at your end. I'd like to figure it out, so more information please.
bollix47
Posts: 2982
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bollix47 »

Jesse, thanks for your reply. I always check the things that you mentioned prior to making a report but probably should have mentioned that in my OP. As for a bad unit not manifesting itself this way, there have been other reports of work units behaving in exactly this manner, hence my report. I've moved on to a different WU and it's progressing normally.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bruce »

There's a chance that it's a bad WU, but that's really difficult to determine.

At this point, nobody has completed that WU. Unfortunately we have no way to determine when WUs are assigned (or reissued) and that would be very useful information if everybody who processes the WU sees the same characteristics that you're reporting.

What was the date-time that the WU was assigned to you?
bollix47
Posts: 2982
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bollix47 »

November 1st, in the early morning hours here in the Eastern time zone.

Code:

[09:39:06] + Attempting to get work packet
[09:39:06] - Will indicate memory of 6132 MB
[09:39:06] - Connecting to assignment server
[09:39:06] Connecting to http://assign.stanford.edu:8080/
[09:39:07] Posted data.
[09:39:07] Initial: 598F; - Successful: assigned to (143.89.28.72).
[09:39:07] + News From Folding@Home: Welcome to Folding@Home
[09:39:07] Loaded queue successfully.
[09:39:07] Connecting to http://143.89.28.72:8080/
[09:39:09] Posted data.
[09:39:09] Initial: 0000; - Receiving payload (expected size: 28517)
[09:39:09] Conversation time very short, giving reduced weight in bandwidth avg
[09:39:09] - Downloaded at ~55 kB/s
[09:39:09] - Averaged speed for that direction ~180 kB/s
[09:39:09] + Received work.
[09:39:09] Trying to send all finished work units
[09:39:09] + No unsent completed units remaining.
[09:39:09] + Closed connections
[09:39:09] 
[09:39:09] + Processing work unit
[09:39:09] Core required: FahCore_a4.exe
[09:39:09] Core found.
[09:39:09] Working on queue slot 06 [November 1 09:39:09 UTC]
[09:39:09] + Working ...
[09:39:09] - Calling '.\FahCore_a4.exe -dir work/ -suffix 06 -priority 96 -checkpoint 30 -verbose -lifeline 4052 -version 623'

[09:39:09] 
[09:39:09] *------------------------------*
[09:39:09] Folding@Home Gromacs GB Core
[09:39:09] Version 2.27 (Dec. 15, 2010)
[09:39:09] 
[09:39:09] Preparing to commence simulation
[09:39:09] - Looking at optimizations...
[09:39:09] - Created dyn
[09:39:09] - Files status OK
[09:39:09] - Expanded 28005 -> 1490352 (decompressed 5321.7 percent)
[09:39:09] Called DecompressByteArray: compressed_data_size=28005 data_size=1490352, decompressed_data_size=1490352 diff=0
[09:39:09] - Digital signature verified
[09:39:09] 
[09:39:09] Project: 2975 (Run 184, Clone 0, Gen 3)
[09:39:09] 
[09:39:09] Assembly optimizations on if available.
[09:39:09] Entering M.D.
[09:39:15] Mapping NT from 1 to 1 
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bruce »

OK. November 1 09:39:09 UTC makes sense ... or in Stanford's timezone: November 1 02:39:09 PDT. I note that the previous WU, Gen 2, was returned to the server twice:

WU assigned to donor1 at: 2011-10-03 08:33:53 PDT
Past the Preferred deadline 13.30 days later at 2011-10-16 15:45:53 PDT
WU assigned to donor2 at: 2011-10-16 16:08:46 PDT <--- 23 minutes later.
Logged as returned by donor1 at: 2011-10-18 19:03:48 PDT
Logged as returned by donor2 at: 2011-10-22 09:04:57 PDT

I don't know how long after a Gen is returned before the next Gen is assigned, but the earliest that Gen 3 could have been assigned was 2011-10-18 19:03:48 PDT. That WU would be scheduled to be reissued 13.30 days later at 2011-11-01 02:15:31 PDT, which is 2011-11-01 09:15:31 UTC. That leaves about 24 minutes during which the WU was waiting on the server to be reassigned. I conclude that the WU was assigned to someone else before you got it and either they're still working on it or they've dumped it.

[BTW, these calculations are not easy, even when I have access to much of the data. I learned something by doing it and it's also instructive because it shows that both Donor1 and Donor2 got credit for the same WU. Reissuing Gen 2 made sense at 2011-10-16 15:45:53 PDT because the WU was presumed lost and that made sure it would be returned by 2011-10-22 09:04:57 PDT. The server didn't know that Donor1 was still working on it. The fact that donor1 did return it at 2011-10-18 19:03:48 PDT helped the project along since that was still 3+ days before donor2's result was turned in.]
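For anyone who wants to retrace the date arithmetic, here is a minimal sketch. It assumes the Preferred deadline is exactly 13.30 days (13 days, 7 hours, 12 minutes) and that PDT is a fixed UTC-7 offset for these dates; the helper names `pdt` and `reissue_time` are made up for illustration and are not part of any FAH tool.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT treated as a fixed UTC-7 offset (valid for Oct 3 to Nov 1, 2011).
PDT = timezone(timedelta(hours=-7), "PDT")

# Assumption: the Preferred deadline is exactly 13.30 days.
PREFERRED_DEADLINE = timedelta(days=13, hours=7, minutes=12)


def pdt(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' server timestamp as PDT."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=PDT)


def reissue_time(assigned):
    """Earliest moment a WU assigned at `assigned` becomes eligible for reissue."""
    return assigned + PREFERRED_DEADLINE


# Gen 2: donor1's deadline, and how long the WU waited before going to donor2.
gen2_reissue = reissue_time(pdt("2011-10-03 08:33:53"))
gen2_wait = pdt("2011-10-16 16:08:46") - gen2_reissue

# Gen 3: earliest possible first assignment is donor1's Gen 2 return time.
gen3_reissue = reissue_time(pdt("2011-10-18 19:03:48"))
gen3_reissue_utc = gen3_reissue.astimezone(timezone.utc)

# Start time reported in the client log: "November 1 09:39:09 UTC".
observed = datetime(2011, 11, 1, 9, 39, 9, tzinfo=timezone.utc)
gen3_wait = observed - gen3_reissue_utc

print("Gen 2 eligible for reissue:", gen2_reissue, "| waited", gen2_wait)
print("Gen 3 eligible for reissue:", gen3_reissue_utc, "| waited", gen3_wait)
```

With these assumptions the Gen 2 deadline reproduces the 15:45:53 PDT figure exactly, and the Gen 3 reissue time lands within a few seconds of the one quoted above, leaving a wait of roughly 23 to 24 minutes in both cases.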

What kind of hardware are you running?
bollix47
Posts: 2982
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bollix47 »

The WU was running on one core of an i7 940 at its stock 2.93 GHz with 6 GB of RAM.

The computer also has an SMP client using -smp 6 and a GPU (GTX 285) client.

All clients are using the v6 console running on Windows Vista 64-bit.

If I understand what you're saying about Gen 2 being done twice, does that mean that two Gen 3s would be created, or is the system set to disallow the second creation?

Sorry for the delay in answering Bruce ... you must have added that question after I read your initial response. :ewink:
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bruce »

No.

When a WU is completed, it is processed to create Gen (N+1) and that WU is issued. Extra copies are avoided whenever possible, and the duplication of Gen 2 only happened because it was assumed to be lost. That assumption later proved to be incorrect but the whole point of the duplication of the WU is to get a result from Gen 2 so that Gen 3 can be generated.

My calculations seem to imply that Gen 3 was issued "immediately" (well, within maybe 12 minutes) and it expired, too. You got the second copy of Gen 3 but only after 13.3 days (plus maybe 12 minutes).

The server-based process of creating Gen 3 from Gen 2 would not have run at 2011-10-22 09:04:57 PDT because Gen 3 already existed at that time.

For anyone familiar with BOINC projects, FAH is very, very different in this respect.
sortofageek
Site Admin
Posts: 3110
Joined: Fri Nov 30, 2007 8:06 pm
Location: Team Helix

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by sortofageek »

Just a followup on this work unit. Database info now indicates three other folders have completed Project: 2975 (Run 184, Clone 0, Gen 3) successfully.

Folder 1.
Your WU (P2975 R184 C0 G3) was added to the stats database on 2011-11-12 13:08:56 for 1681.7 points of credit.

Folder 2.
Your WU (P2975 R184 C0 G3) was added to the stats database on 2011-11-13 03:06:38 for 895 points of credit.

Folder 3.
Your WU (P2975 R184 C0 G3) was added to the stats database on 2011-11-13 23:06:38 for 1823.11 points of credit.
bollix47
Posts: 2982
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bollix47 »

It would be of interest to know how long they took to complete this WU. Were they also looking at TPFs of over 3 hours, or did they complete it with sub-1-hour TPFs? The download and upload times should be enough to estimate which TPF they were closer to, assuming they ran 24/7.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bruce »

Days taken to complete WU: 4.72
Days taken to complete WU: 18.62
Days taken to complete WU: 4.01
bollix47
Posts: 2982
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bollix47 »

Interesting ... two got a TPF similar to my previous experience with the project (a different WU where I was getting ~36 minutes TPF using a single core), and one got a TPF similar to my experience on this same WU of ~3 hours. It's almost as if there are two different WUs with the same PRCG, but if I understood what Bruce said earlier, that's not possible. If I get another from this project I'll just let it run no matter how long it takes.

Thanks for checking.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2975 (Run 184, Clone 0, Gen 3)

Post by bruce »

I think the assumption that Donor #2 runs 24x7 is false. We have no data about how many hours (s)he was actually folding during those 18.62 days, and without that information, guessing what TPF (s)he got during those processing hours is meaningless.

... and then, too, there's the probability of different hardware.
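To put a number on that caveat: the days-to-complete figures only give an upper bound on TPF, reached if the machine folded around the clock and nothing else. A minimal sketch, assuming 100 frames per WU (one per percent, as in the client log earlier in the thread); `max_tpf_minutes` is a made-up helper name.

```python
def max_tpf_minutes(days_to_complete, frames=100):
    """Upper bound on time per frame, in minutes.

    Assumptions: the WU has `frames` frames and the donor folded 24/7
    from download to upload. Any idle time, pause, or queueing delay
    means the real TPF was lower than this figure.
    """
    return days_to_complete * 24 * 60 / frames


# Days taken to complete, as reported for the three donors.
for days in (4.72, 18.62, 4.01):
    print(f"{days:5.2f} days -> TPF of at most {max_tpf_minutes(days):6.1f} min")
```

The bounds work out to about 68, 268, and 58 minutes. The middle one says nothing about whether that donor saw a slow WU or simply folded part-time on different hardware.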