
782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Wed Sep 03, 2008 10:55 pm
by arfyness
So, I've got two more errors since I went to bed last night... Still on the same work unit.

This machine has been folding for almost 3 weeks and still has not submitted a single work unit; they always fail. Before this, my trouble was with project 781, which seems to be a similar simulation and uses the same core a0.

--- Opening Log file [September 1 14:49:09] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /usr/local/folding
Executable: ./fah6


[14:49:09] - Ask before connecting: No
[14:49:09] - User name: Arfyness (Team 45104)
[14:49:09] - User ID: <removed>
[14:49:09] - Machine ID: 1
[14:49:09] 
[14:49:09] Loaded queue successfully.
[14:49:09] 
[14:49:09] + Processing work unit
[14:49:09] Core required: FahCore_a0.exe
[14:49:09] Core found.
[14:49:09] Working on Unit 05 [September 1 14:49:09]
[14:49:09] + Working ...
[14:49:09] 
[14:49:09] *------------------------------*
[14:49:09] Folding@Home Gromacs 3.3 Core
[14:49:09] Version 1.92 (April 17. 2007)
[14:49:09] 
[14:49:09] Preparing to commence simulation
[14:49:09] - Looking at optimizations...
[14:49:09] - Files status OK
[14:49:10] - Expanded 1168013 -> 6252409 (decompressed 535.3 percent)
[14:49:10] 
[14:49:10] Project: 782 (Run 0, Clone 77, Gen 3)
[14:49:10] 
[14:49:10] Assembly optimizations on if available.
[14:49:10] Entering M.D.
No option -tpi
(single precision)
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[14:49:31] (Starting from checkpoint)
[14:49:31] Protein: Mini chaperonin
[14:49:31] Writing local files
[14:49:31] Completed 18850 out of 500000 steps  (3%)
[14:49:32] Extra 3DNow boost OK.
[14:49:32] Extra SSE boost OK.
[15:08:44] Writing local files
[15:08:44] Completed 20000 out of 500000 steps  (4 percent)

     < --- SNIP --- >

[00:02:29] Completed 140000 out of 500000 steps  (28 percent)
[01:30:49] Writing local files
[01:30:49] Completed 145000 out of 500000 steps  (29 percent)
-------------------------------------------------------
Program Core_A0.exe, VERSION 3.3
Source code file: fatal.c, line: 342

Fatal error:
NaN detected: (ener[25])

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[02:53:48] Gromacs error.
[02:53:48] 
[02:53:48] Folding@home Core Shutdown: UNKNOWN_ERROR
[02:53:49] CoreStatus = 79 (121)
[02:53:49] Client-core communications error: ERROR 0x79
[02:53:49] Deleting current work unit & continuing...
[02:54:06] - Preparing to get new work unit...
[02:54:06] + Attempting to get work packet
[02:54:06] - Connecting to assignment server
[02:54:07] - Successful: assigned to (171.64.122.138).
[02:54:07] + News From Folding@Home: Welcome to Folding@Home
[02:54:07] Loaded queue successfully.
[02:54:12] + Closed connections
[02:54:17] 
[02:54:17] + Processing work unit
[02:54:17] Core required: FahCore_a0.exe
[02:54:17] Core found.
[02:54:17] Working on Unit 06 [September 3 02:54:17]
[02:54:17] + Working ...
[02:54:17] 
[02:54:17] *------------------------------*
[02:54:17] Folding@Home Gromacs 3.3 Core
[02:54:17] Version 1.92 (April 17. 2007)
[02:54:17] 
[02:54:17] Preparing to commence simulation
[02:54:17] - Looking at optimizations...
[02:54:17] - Created dyn
[02:54:17] - Files status OK
[02:54:17] - Expanded 1168013 -> 6252409 (decompressed 535.3 percent)
[02:54:17] - Starting from initial work packet
[02:54:17] 
[02:54:17] Project: 782 (Run 0, Clone 77, Gen 3)
[02:54:17] 
[02:54:17] Assembly optimizations on if available.
[02:54:17] Entering M.D.
No option -tpi
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[02:54:24] Protein: Mini chaperonin
[02:54:24] Writing local files
[02:54:24] Extra 3DNow boost OK.
[02:54:24] Extra SSE boost OK.
[02:54:25] Writing local files
[02:54:26] Completed 0 out of 500000 steps  (0 percent)
[04:54:43] Writing local files
[04:54:43] Completed 5000 out of 500000 steps  (1 percent)
[06:37:47] Writing local files
[06:37:47] Completed 10000 out of 500000 steps  (2 percent)
-------------------------------------------------------
Program Core_A0.exe, VERSION 3.3
Source code file: fatal.c, line: 342

Fatal error:
NaN detected: (ener[20])

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[06:42:32] Gromacs error.
[06:42:32] 
[06:42:32] Folding@home Core Shutdown: UNKNOWN_ERROR
[06:42:33] CoreStatus = 79 (121)
[06:42:33] Client-core communications error: ERROR 0x79
[06:42:33] Deleting current work unit & continuing...
[06:42:50] - Preparing to get new work unit...
[06:42:50] + Attempting to get work packet
[06:42:50] - Connecting to assignment server
[06:42:50] - Successful: assigned to (171.64.122.138).
[06:42:50] + News From Folding@Home: Welcome to Folding@Home
[06:42:50] Loaded queue successfully.
[06:42:55] + Closed connections
[06:43:00] 
[06:43:00] + Processing work unit
[06:43:00] Core required: FahCore_a0.exe
[06:43:00] Core found.
[06:43:00] Working on Unit 07 [September 3 06:43:00]
[06:43:00] + Working ...
[06:43:00] 
[06:43:00] *------------------------------*
[06:43:00] Folding@Home Gromacs 3.3 Core
[06:43:00] Version 1.92 (April 17. 2007)
[06:43:00] 
[06:43:00] Preparing to commence simulation
[06:43:00] - Looking at optimizations...
[06:43:00] - Created dyn
[06:43:00] - Files status OK
[06:43:01] - Expanded 1168013 -> 6252409 (decompressed 535.3 percent)
[06:43:01] - Starting from initial work packet
[06:43:01] 
[06:43:01] Project: 782 (Run 0, Clone 77, Gen 3)
[06:43:01] 
[06:43:01] Assembly optimizations on if available.
[06:43:01] Entering M.D.
No option -tpi
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[06:43:07] Protein: Mini chaperonin
[06:43:07] Writing local files
[06:43:08] Extra 3DNow boost OK.
[06:43:08] Extra SSE boost OK.
[06:43:09] Writing local files
[06:43:09] Completed 0 out of 500000 steps  (0 percent)
[08:13:16] Writing local files
[08:13:16] Completed 5000 out of 500000 steps  (1 percent)
[09:43:27] Writing local files
[09:43:28] Completed 10000 out of 500000 steps  (2 percent)

I wonder when automatic crash reporting will be part of the process. That doesn't seem like something that should be overlooked for this long.

For what it's worth, I booted into memtest86 the other day (which Ubuntu provides in the grub menu) and all the tests finished without errors.

Getting frustrated,

- Nate

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Thu Sep 04, 2008 12:07 am
by VijayPande
I agree it would be important to get better reporting here. However, please note a couple of items. Our non-SMP clients do crash reporting. Here's the problem for SMP: SMP has to use MPI, which means that there's a program in between our client and the core (mpirun). When the core crashes, mpirun doesn't give any useful info and so our client can't know that something bad has happened.

We're looking into what we need to do to work around this, but that's the situation. It's not that it's not been considered or that it's trivial. It's something very much on our minds. Hopefully we can have this resolved ASAP.

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Thu Sep 04, 2008 2:36 am
by P5-133XL
Perhaps what you need is an error file into which the cores dump messages detailing what has gone wrong, along with any other strange events. The client then sends the contents of the error file whenever it finds one while getting or sending a WU, and clears the file after sending.

Of course, that will require a program or person at the server level to analyze these error files and make appropriate decisions about what to do. At least you will get data that way.
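
The error-file idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the F@H client or cores actually work: the file name `core_errors.log`, the JSON-lines format, and the functions `record_core_error` and `drain_errors` are all hypothetical, and `send` stands in for whatever upload step the client would use.

```python
import json
import tempfile
from pathlib import Path


def record_core_error(spool: Path, project: str, message: str) -> None:
    """Core side (hypothetical): append one JSON line per error to the spool file."""
    with spool.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"project": project, "message": message}) + "\n")


def drain_errors(spool: Path, send) -> int:
    """Client side (hypothetical): send every recorded error, then clear the file.

    `send` is any callable taking one error record. The file is removed only
    after every record has been handed to `send`, so a failed upload keeps
    the data for the next attempt.
    """
    if not spool.exists():
        return 0
    lines = spool.read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    for record in records:
        send(record)
    spool.unlink()
    return len(records)


if __name__ == "__main__":
    spool = Path(tempfile.mkdtemp()) / "core_errors.log"  # assumed location
    record_core_error(spool, "782 (Run 0, Clone 77, Gen 3)", "NaN detected: (ener[25])")
    sent = []
    count = drain_errors(spool, sent.append)  # sent.append stands in for an upload
    print(count, sent)
```

Because the core writes straight to the file, nothing here depends on what the layer between client and core reports; the client would only need to call something like `drain_errors` whenever it contacts a server.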

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Thu Sep 04, 2008 3:27 am
by Baowoulf
You're forgetting that the program (mpirun) isn't F@H's program. So they wouldn't be able to change how it runs (what info it gives), at least not easily.

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Thu Sep 04, 2008 5:15 am
by P5-133XL
No, I'm not forgetting it -- I'm suggesting bypassing it by writing errors directly to a file from the cores. Errors do not have to travel from the cores, through mpirun, to the clients; instead they go straight to the file, and the clients then read the file and transmit the error. Perhaps I'm misinterpreting, but isn't what Dr. Pande said that when a core crashes, the mpirun layer does not pass the core's error data on to the clients? Doesn't this solve that specific problem?

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Thu Sep 04, 2008 9:06 pm
by arfyness
I have a uniprocessor machine ... Is there a different (previous?) Linux uniprocessor version of the client that I might try?

I keep getting assigned work units for the mini-chaperonin projects 781 and 782 (core a0)... and they have failed every single time, usually with ERROR 0x79 and sometimes with ERROR 0x0. The errors do not occur at the same percentage done, so it seems reasonable to conclude that the fault is with the client or the core (not the WU run itself).
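
To check where each failure happened, the log can be scanned for the last "Completed N out of M steps" line before each "NaN detected" message. The snippet below is a minimal sketch that assumes the log format shown in the first post; the function name `failure_points` and the sample constant are made up for illustration.

```python
import re

# Patterns matching the client log lines quoted earlier in this thread.
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")
NAN_RE = re.compile(r"NaN detected: \((\w+\[\d+\])\)")


def failure_points(log_text):
    """Return (last_completed_step, energy_term) for each NaN error in the log.

    last_completed_step is None when no progress line preceded the error.
    """
    failures = []
    last_step = None
    for line in log_text.splitlines():
        step = STEP_RE.search(line)
        if step:
            last_step = int(step.group(1))
            continue
        nan = NAN_RE.search(line)
        if nan:
            failures.append((last_step, nan.group(1)))
            last_step = None  # the client starts a fresh attempt after an error
    return failures


SAMPLE = """\
[01:30:49] Completed 145000 out of 500000 steps  (29 percent)
NaN detected: (ener[25])
[06:37:47] Completed 10000 out of 500000 steps  (2 percent)
NaN detected: (ener[20])
"""

if __name__ == "__main__":
    print(failure_points(SAMPLE))
    # prints [(145000, 'ener[25]'), (10000, 'ener[20]')]
```

For the log in the first post this gives 145000 steps for the first failure and 10000 for the second, which matches the observation that the errors do not occur at a fixed point in the run.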

At this point, I might as well not be running the client at all on this machine. A pity, since this one is the only one I leave running 24/7.

-- Nate

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Thu Sep 04, 2008 9:13 pm
by John Naylor
Try changing the WU download size settings; if you have WUs set to small, set it to normal, or normal -> big. This should give you a different selection of work. You can also change it downwards if you wish, if the p781 and p782 are big WUs.

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Sat Sep 06, 2008 10:50 pm
by arfyness
I tried different combinations of big/small/normal, with different memory reporting (64 MB is the minimum, I found out). Finally I disabled -advmethods, and at first it couldn't find any work to do (normal / 220 MB). After a while it picked up project 2170, which uses core 82 (and which I can't find any info about).

We'll see what happens...

-- Nate

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Posted: Sun Sep 07, 2008 1:09 am
by Leoslocks
arfyness wrote:
After a while, it's using core 82 on project 2170 (which I can't find any info about).

2170 171.65.103.160 p2170_lambda_obc_300K 1258 45.00 66.00 234.00 100 AMBER Description

The Project Summary page can be found in the links in the Folding Forum page header.