
P3906 Gen 1 all failing

Posted: Thu Jan 17, 2008 11:59 am
by Pette Broad
Just had a whole rash of failures, all on P3906: several WUs, on different machines.


MACHINE 1

Code:

[10:15:26] Project: 3906 (Run 16, Clone 4, Gen 1)
[10:15:26] 
[10:15:26] Assembly optimizations on if available.
[10:15:26] Entering M.D.
[10:15:40] CoreStatus = C000000D (-1073741811)
[10:15:40] Client-core communications error: ERROR 0xc000000d
[10:15:40] Deleting current work unit & continuing...
[10:15:44] - Preparing to get new work unit...
[10:15:44] + Attempting to get work packet
[10:15:44] - Connecting to assignment server
[10:15:44] - Successful: assigned to (171.64.122.88).
[10:15:44] + News From Folding@Home: Welcome to Folding@Home
[10:15:45] Loaded queue successfully.
[10:15:48] + Closed connections
[10:15:53] 
[10:15:53] + Processing work unit
[10:15:53] Core required: FahCore_7b.exe
[10:15:53] Core found.
[10:15:53] Working on Unit 04 [January 17 10:15:53]
[10:15:53] + Working ...
[10:15:53] 
[10:15:53] *------------------------------*
[10:15:53] Folding@Home Double Gromacs Core B
[10:15:53] Version 1.04 (Fri Aug 10 16:46:39 PDT 2007)
[10:15:53] 
[10:15:53] Preparing to commence simulation
[10:15:53] - Files status OK
[10:15:53] - Expanded 341797 -> 1169393 (decompressed 342.1 percent)
[10:15:53] 
[10:15:53] Project: 3906 (Run 16, Clone 4, Gen 1)
[10:15:53] 
[10:15:54] Assembly optimizations on if available.
[10:15:54] Entering M.D.
[10:16:05] CoreStatus = C000000D (-1073741811)
[10:16:05] Client-core communications error: ERROR 0xc000000d
[10:16:05] Deleting current work unit & continuing...
[10:16:09] - Preparing to get new work unit...
[10:16:09] + Attempting to get work packet
[10:16:09] - Connecting to assignment server
[10:16:10] - Successful: assigned to (171.64.122.88).
[10:16:10] + News From Folding@Home: Welcome to Folding@Home
[10:16:10] Loaded queue successfully.
[10:16:15] + Closed connections
MACHINE 2

Code:

[09:13:51] Project: 3906 (Run 14, Clone 2, Gen 1)
[09:13:51] 
[09:13:51] Assembly optimizations on if available.
[09:13:51] Entering M.D.
[09:58:12] CoreStatus = C000000D (-1073741811)
[09:58:12] Client-core communications error: ERROR 0xc000000d
[09:58:12] Deleting current work unit & continuing...
[09:58:24] Trying to send all finished work units
[09:58:24] + No unsent completed units remaining.
[09:58:24] - Preparing to get new work unit...
[09:58:24] + Attempting to get work packet
[09:58:24] - Will indicate memory of 2047 MB.
[09:58:24] - Connecting to assignment server
[09:58:25] - Successful: assigned to (171.64.122.88).
Carried on like this for 6 or 7 attempts.


MACHINE 3

Code:

[09:58:22] Project: 3906 (Run 23, Clone 0, Gen 1)
[09:58:22] 
[09:58:22] Assembly optimizations on if available.
[09:58:22] Entering M.D.
[09:58:34] CoreStatus = C000000D (-1073741811)
[09:58:34] Client-core communications error: ERROR 0xc000000d
[09:58:34] Deleting current work unit & continuing...
[09:58:38] - Preparing to get new work unit...
[09:58:38] + Attempting to get work packet
[09:58:38] - Connecting to assignment server
[09:58:39] - Successful: assigned to (171.64.122.88).
[09:58:39] + News From Folding@Home: Welcome to Folding@Home
[09:58:39] Loaded queue successfully.
[09:58:43] + Closed connections
[09:58:48] 
[09:58:48] + Processing work unit
[09:58:48] Core required: FahCore_7b.exe
[09:58:48] Core found.
[09:58:48] Working on Unit 05 [January 17 09:58:48]
[09:58:48] + Working ...
[09:58:48] 
[09:58:48] *------------------------------*
[09:58:48] Folding@Home Double Gromacs Core B
[09:58:48] Version 1.04 (Fri Aug 10 16:46:39 PDT 2007)
[09:58:48] 
[09:58:48] Preparing to commence simulation
[09:58:48] - Files status OK
[09:58:48] - Expanded 330340 -> 1132781 (decompressed 342.9 percent)
[09:58:48] 
[09:58:48] Project: 3906 (Run 23, Clone 0, Gen 1)
[09:58:48] 
[09:58:49] Assembly optimizations on if available.
[09:58:49] Entering M.D.
[09:59:00] CoreStatus = C000000D (-1073741811)
[09:59:00] Client-core communications error: ERROR 0xc000000d
[09:59:00] - Attempting to download new core...
[09:59:00] + Downloading new core: FahCore_7b.exe
[09:59:01] + 10240 bytes downloaded
--------------------------------------------
[09:59:05] + 724218 bytes downloaded
[09:59:05] Verifying core Core_7b.fah...
[09:59:05] Signature is VALID
[09:59:05] 
[09:59:05] Trying to unzip core FahCore_7b.exe
[09:59:05] Decompressed FahCore_7b.exe (2101248 bytes) successfully
[09:59:05] + Core successfully engaged
[09:59:05] Deleting current work unit & continuing...
MACHINE 4

Code:

[11:29:49] Project: 3906 (Run 54, Clone 1, Gen 1)
[11:29:49] 
[11:29:50] Assembly optimizations on if available.
[11:29:50] Entering M.D.
[11:48:15] CoreStatus = C000000D (-1073741811)
[11:48:15] Client-core communications error: ERROR 0xc000000d
[11:48:15] Deleting current work unit & continuing...
The main thing is that these errors throw up a core error on the desktop, and the client then needs to be restarted manually :(

EDIT: Just checked and found that I have another seven P3906 units in progress. They are all Gen 0, have Run numbers in the 4xxx range, and are progressing O.K. The 3 machines whose logs are posted above eventually got allocated other work.

Pete
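
For anyone reading the logs above: the two numbers on the CoreStatus line are the same 32-bit value, printed once in hex and once as a signed decimal. A minimal Python sketch of that conversion follows; the function name is made up for illustration and is not part of the client. On Windows, 0xC000000D is the NTSTATUS code STATUS_INVALID_PARAMETER, though the logs alone don't show what the core was doing when it was raised.

```python
def corestatus_to_signed(hex_text: str) -> int:
    """Interpret a hex status code as a signed 32-bit integer.

    Hypothetical helper for illustration only.
    """
    value = int(hex_text, 16) & 0xFFFFFFFF
    # With the top bit set, the value is negative in two's complement.
    return value - (1 << 32) if value & 0x80000000 else value


print(corestatus_to_signed("C000000D"))  # -1073741811
print(corestatus_to_signed("72"))        # 114
```

The second call covers the "CoreStatus = 72 (114)" form that shows up with the EARLY_UNIT_END failures later in the thread.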

Re: P3906 Gen 1 all failing

Posted: Thu Jan 17, 2008 1:05 pm
by [Fight]Gor
I'll second that. The same issue here.

Re: P3906 Gen 1 all failing

Posted: Thu Jan 17, 2008 1:17 pm
by Pette Broad
2 More in the last 10 minutes :shock:

Code:

[13:07:21] Project: 3906 (Run 102, Clone 4, Gen 1)
[13:07:21] 
[13:07:22] Assembly optimizations on if available.
[13:07:22] Entering M.D.
[13:07:34] CoreStatus = C000000D (-1073741811)
----------------------------------------------------------
[13:10:56] Project: 3906 (Run 96, Clone 2, Gen 1)
[13:10:56] 
[13:10:56] Assembly optimizations on if available.
[13:10:56] Entering M.D.
[13:11:08] CoreStatus = C000000D (-1073741811)

Re: P3906 Gen 1 all failing

Posted: Thu Jan 17, 2008 2:11 pm
by Jeannie
Same problem

[10:40:36] Project: 3906 (Run 45, Clone 2, Gen 1)
[10:40:36]
[10:40:36] Assembly optimizations on if available.
[10:40:36] Entering M.D.
[10:40:42] mdrun returned -1
[10:40:42] Going to send back what have done.
[10:40:42] logfile size: 858
[10:40:42] - Writing 1396 bytes of core data to disk...
[10:40:42] Done: 884 -> 584 (compressed to 66.0 percent)
[10:40:42] ... Done.
[10:40:42]
[10:40:42] Folding@home Core Shutdown: EARLY_UNIT_END
[10:40:46] CoreStatus = 72 (114)
<snip>
[10:41:02] Project: 3906 (Run 48, Clone 2, Gen 1)
[10:41:02]
[10:41:02] Assembly optimizations on if available.
[10:41:02] Entering M.D.
[10:41:08] mdrun returned -1
[10:41:08] Going to send back what have done.
[10:41:08] logfile size: 858
[10:41:08] - Writing 1396 bytes of core data to disk...
[10:41:08] Done: 884 -> 590 (compressed to 66.7 percent)
[10:41:08] ... Done.
[10:41:08]
[10:41:08] Folding@home Core Shutdown: EARLY_UNIT_END
[10:41:11] CoreStatus = 72 (114)
[10:41:11] Sending work to server

Re: P3906 Gen 1 all failing

Posted: Thu Jan 17, 2008 2:17 pm
by Pette Broad
Those have failed in a slightly different way.......but I've had one like that too :)

Code:

[09:55:16] Project: 3906 (Run 9, Clone 3, Gen 1)
[09:55:16] 
[09:55:16] Assembly optimizations on if available.
[09:55:16] Entering M.D.
[09:55:22] mdrun returned -1
[09:55:22] Going to send back what have done.
[09:55:22] logfile size: 858
[09:55:22] - Writing 1396 bytes of core data to disk...
[09:55:22] Done: 884 -> 585 (compressed to 66.1 percent)
[09:55:22]   ... Done.
[09:55:22] 
[09:55:22] Folding@home Core Shutdown: EARLY_UNIT_END
[09:55:26] CoreStatus = 72 (114)
[09:55:26] Sending work to server

Re: P3906 Gen 1 all failing

Posted: Thu Jan 17, 2008 3:51 pm
by daveb
I have had 3 of these in a row fail the same way. I get a Windows message box saying that core7b has been killed and then a client-core communication error. Everything was then deleted and the machine tried to get another unit. One thing I noticed is that the download size of the wudata_0x.dat file is ~340 k on all 3 of the failed units. The earlier units of p3906 and p3907 I have seen were all around 240 k.

3906 (Run 132, Clone 4, Gen 1) payload 342246
3906 (Run 171, Clone 1, Gen 1) payload 330471
3906 (Run 183, Clone 3, Gen 1) payload 329464

Actually, the last of these did not generate the Windows error box, just a message in the console window saying mdrun returned -1 followed by a standard EUE message.

Dave
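
A rough way to check the size observation against a saved log is to pull the byte counts out of the "Expanded" lines. The sketch below is Python and assumes the first figure on that line tracks the download size of the work unit (the numbers in this thread are close to, but not identical with, the payload sizes listed above). The function name and the sample text are illustrative only.

```python
import re

# Matches lines such as:
# [10:15:53] - Expanded 341797 -> 1169393 (decompressed 342.1 percent)
EXPANDED = re.compile(r"- Expanded (\d+) -> (\d+)")


def payload_sizes(log_text: str) -> list[tuple[int, int]]:
    """Return (compressed, expanded) byte counts for each 'Expanded' line."""
    return [(int(a), int(b)) for a, b in EXPANDED.findall(log_text)]


sample = """\
[10:15:53] - Expanded 341797 -> 1169393 (decompressed 342.1 percent)
[09:58:48] - Expanded 330340 -> 1132781 (decompressed 342.9 percent)
"""

for compressed, expanded in payload_sizes(sample):
    print(f"{compressed} bytes compressed, {expanded} bytes expanded")
```

Running the same function over a log from a unit that completed normally would show whether the ~240 k versus ~340 k split holds up.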

Re: P3906 Gen 1 all failing

Posted: Thu Jan 17, 2008 5:49 pm
by nzcarrick
Yip Same Here

Tried deleting the 7b core, thinking it was corrupt. It downloaded again and then threw up the same fault.


Any ideas?


Code:

*------------------------------*
[17:38:51] Folding@Home Double Gromacs Core B
[17:38:51] Version 1.04 (Fri Aug 10 16:46:39 PDT 2007)
[17:38:51] 
[17:38:51] Preparing to commence simulation
[17:38:51] - Files status OK
[17:38:51] - Expanded 341674 -> 1169293 (decompressed 342.2 percent)
[17:38:51] 
[17:38:51] Project: 3906 (Run 98, Clone 4, Gen 1)
[17:38:51] 
[17:38:51] Assembly optimizations on if available.
[17:38:51] Entering M.D.
[17:39:05] CoreStatus = C000000D (-1073741811)
[17:39:05] Client-core communications error: ERROR 0xc000000d
[17:39:05] Deleting current work unit & continuing...
[17:39:09] Trying to send all finished work units

Re: P3906 Gen 1 all failing

Posted: Thu Jan 17, 2008 7:30 pm
by 7im
PM sent to kasson.

Re: P3906 Gen 1 all failing

Posted: Fri Jan 18, 2008 12:34 am
by Wonder Bread
Had the same problem myself, registered specifically to report it. First error I've had since I started folding a few years ago. Windows pops up and says 'FahCore_7b.exe' has crashed. I'm using the 6.0 beta 1 client if that matters.
[00:24:43] Core required: FahCore_7b.exe
[00:24:43] Core found.
[00:24:43] Working on Unit 08 [January 18 00:24:43]
[00:24:43] + Working ...
[00:24:43] - Calling 'FahCore_7b.exe -dir work/ -suffix 08 -checkpoint 15 -forceasm -verbose -lifeline 2804 -version 600'
[00:24:43] *------------------------------*
[00:24:43] Folding@Home Double Gromacs Core B
[00:24:43] Version 1.04 (Fri Aug 10 16:46:39 PDT 2007)
[00:24:43]
[00:24:43] Preparing to commence simulation
[00:24:43] - Ensuring status. Please wait.
[00:24:52] - Assembly optimizations manually forced on.
[00:24:52] - Not checking prior termination.
[00:24:52] - Expanded 329392 -> 1131813 (decompressed 343.6 percent)
[00:24:52]
[00:24:52] Project: 3906 (Run 230, Clone 3, Gen 1)
[00:24:52]
[00:24:53] Assembly optimizations on if available.
[00:24:53] Entering M.D.
[00:25:11] CoreStatus = C000000D (-1073741811)
[00:25:11] Client-core communications error: ERROR 0xc000000d

Re: P3906 Gen 1 all failing

Posted: Fri Jan 18, 2008 12:38 am
by 7im
7im wrote: PM sent to kasson.
Kasson said he would take a look at the problem, as soon as he got back in to the office. Unless everyone starts having problems with Gen 2 as well, I think we have enough reports. Thanks.

Re: P3906 Gen 1 all failing

Posted: Fri Jan 18, 2008 4:40 am
by MDCRL
for what it's worth.... and maybe the info can help narrow down the problem.....

I've burned through 16 of those WU's in the last week or so... using 5.04 console, moving really fast on a xeon processor and an AMD 64.... 10-15 minutes per step - been lucky to get a bunch of good ones I guess....
- good points for the time spent

only had one early end unit recently on a 3903

Re: P3906 Gen 1 all failing

Posted: Fri Jan 18, 2008 5:29 am
by 7im
MDCRL wrote:...
I've burned through 16 of those WU's in the last week or so...

only had one early end unit recently on a 3903
I think we have it narrowed down quite a bit. Only work units from Project 3906, and only work units from Generation 1 (Run xx, Clone xx, Gen 01)

MDCRL, how many Gen 1 WUs in project 3906 were in those 16 WUs you worked on this last week or so?
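
For anyone wanting to answer that from their own logfiles, here is a minimal Python sketch that tallies failed attempts by project and generation. It assumes the log uses the same "Project: ... (Run, Clone, Gen)" lines and the two failure markers seen in this thread; the function name and sample are made up for illustration.

```python
import re
from collections import Counter

PROJECT = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
# The two failure signatures reported in this thread.
FAILURE_MARKERS = ("CoreStatus = C000000D", "EARLY_UNIT_END")


def failures_by_project_gen(log_text: str) -> Counter:
    """Count failed attempts, keyed by (project, gen) of the preceding Project line."""
    counts = Counter()
    current = None
    for line in log_text.splitlines():
        match = PROJECT.search(line)
        if match:
            current = (int(match.group(1)), int(match.group(4)))
        elif current and any(marker in line for marker in FAILURE_MARKERS):
            counts[current] += 1
            current = None  # count each attempt at most once
    return counts


sample = """\
[10:15:26] Project: 3906 (Run 16, Clone 4, Gen 1)
[10:15:40] CoreStatus = C000000D (-1073741811)
[09:55:16] Project: 3906 (Run 9, Clone 3, Gen 1)
[09:55:22] Folding@home Core Shutdown: EARLY_UNIT_END
"""

print(failures_by_project_gen(sample))
```

Note that a unit the client retries shows up as more than one failed attempt, so the totals count attempts rather than distinct work units.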

Re: P3906 Gen 1 all failing

Posted: Fri Jan 18, 2008 8:04 am
by MDCRL
I'll have to check through the logfiles after work today - I should have some #'s for you tonight.
I have about 5 or 6 currently active 3906 WU's, all progressing well
- they are all Gen 0 though....

I am noticing something else while going through these active WU's..

the Actual % complete is not matching the reported % complete in FahMonitor.
I use FahMon 2.3.1 - it is usually accurate, but not on these it seems

- any relevance in that relating to "client-core communication error"?

Let me know if you want the info on these current active WU also.....

Re: P3906 Gen 1 all failing

Posted: Fri Jan 18, 2008 8:36 am
by KWSN_Dagger
MDCRL wrote: I am noticing something else while going through these active WU's..

the Actual % complete is not matching the reported % complete in FahMonitor.
I use FahMon 2.3.1 - it is usually accurate, but not on these it seems

- any relevance in that relating to "client-core communication error"?

Let me know if you want the info on these current active WU also.....
I use FahMon as well. These WUs are only 50 frames long, so FahMon will work up to 50%, then after that reports it as % minus 50%. It's weird, I know, but maybe it will be sorted out when the new version 2.3.2 comes out. Uncle Fungus will know more than I.

Re: P3906 Gen 1 all failing

Posted: Fri Jan 18, 2008 4:32 pm
by kasson
Thanks for the reports. I stopped P3906 assigning last night until we can have a proper look at the problem. Hopefully we can iron this out rapidly and have them back running successfully in the near future.