782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Moderators: Site Moderators, FAHC Science Team

arfyness
Posts: 13
Joined: Sun Aug 31, 2008 3:13 pm
Hardware configuration: 5 x WinXP (console version installed as service)
1 x Linux (manually, until I figure out what's wrong)
Asus nForce2 Mobo (A7N8X-E) w/ AMD AthlonXP 3200+ :: 1.0GB RAM :: Ubuntu Hardy 8.04.1
Location: Columbus, Ohio, USA

782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by arfyness »

So, I've got two more errors since I went to bed last night... Still on the same work unit.

This machine has been folding for almost 3 weeks and still has not submitted a single work unit. They always fail. Before this, my trouble was with project 781, which seems to be a similar simulation using the same core a0.


--- Opening Log file [September 1 14:49:09] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /usr/local/folding
Executable: ./fah6


[14:49:09] - Ask before connecting: No
[14:49:09] - User name: Arfyness (Team 45104)
[14:49:09] - User ID: <removed>
[14:49:09] - Machine ID: 1
[14:49:09] 
[14:49:09] Loaded queue successfully.
[14:49:09] 
[14:49:09] + Processing work unit
[14:49:09] Core required: FahCore_a0.exe
[14:49:09] Core found.
[14:49:09] Working on Unit 05 [September 1 14:49:09]
[14:49:09] + Working ...
[14:49:09] 
[14:49:09] *------------------------------*
[14:49:09] Folding@Home Gromacs 3.3 Core
[14:49:09] Version 1.92 (April 17. 2007)
[14:49:09] 
[14:49:09] Preparing to commence simulation
[14:49:09] - Looking at optimizations...
[14:49:09] - Files status OK
[14:49:10] - Expanded 1168013 -> 6252409 (decompressed 535.3 percent)
[14:49:10] 
[14:49:10] Project: 782 (Run 0, Clone 77, Gen 3)
[14:49:10] 
[14:49:10] Assembly optimizations on if available.
[14:49:10] Entering M.D.
No option -tpi
(single precision)
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[14:49:31] (Starting from checkpoint)
[14:49:31] Protein: Mini chaperonin
[14:49:31] Writing local files
[14:49:31] Completed 18850 out of 500000 steps  (3%)
[14:49:32] Extra 3DNow boost OK.
[14:49:32] Extra SSE boost OK.
[15:08:44] Writing local files
[15:08:44] Completed 20000 out of 500000 steps  (4 percent)

     < --- SNIP --- >

[00:02:29] Completed 140000 out of 500000 steps  (28 percent)
[01:30:49] Writing local files
[01:30:49] Completed 145000 out of 500000 steps  (29 percent)
-------------------------------------------------------
Program Core_A0.exe, VERSION 3.3
Source code file: fatal.c, line: 342

Fatal error:
NaN detected: (ener[25])

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[02:53:48] Gromacs error.
[02:53:48] 
[02:53:48] Folding@home Core Shutdown: UNKNOWN_ERROR
[02:53:49] CoreStatus = 79 (121)
[02:53:49] Client-core communications error: ERROR 0x79
[02:53:49] Deleting current work unit & continuing...
[02:54:06] - Preparing to get new work unit...
[02:54:06] + Attempting to get work packet
[02:54:06] - Connecting to assignment server
[02:54:07] - Successful: assigned to (171.64.122.138).
[02:54:07] + News From Folding@Home: Welcome to Folding@Home
[02:54:07] Loaded queue successfully.
[02:54:12] + Closed connections
[02:54:17] 
[02:54:17] + Processing work unit
[02:54:17] Core required: FahCore_a0.exe
[02:54:17] Core found.
[02:54:17] Working on Unit 06 [September 3 02:54:17]
[02:54:17] + Working ...
[02:54:17] 
[02:54:17] *------------------------------*
[02:54:17] Folding@Home Gromacs 3.3 Core
[02:54:17] Version 1.92 (April 17. 2007)
[02:54:17] 
[02:54:17] Preparing to commence simulation
[02:54:17] - Looking at optimizations...
[02:54:17] - Created dyn
[02:54:17] - Files status OK
[02:54:17] - Expanded 1168013 -> 6252409 (decompressed 535.3 percent)
[02:54:17] - Starting from initial work packet
[02:54:17] 
[02:54:17] Project: 782 (Run 0, Clone 77, Gen 3)
[02:54:17] 
[02:54:17] Assembly optimizations on if available.
[02:54:17] Entering M.D.
No option -tpi
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[02:54:24] Protein: Mini chaperonin
[02:54:24] Writing local files
[02:54:24] Extra 3DNow boost OK.
[02:54:24] Extra SSE boost OK.
[02:54:25] Writing local files
[02:54:26] Completed 0 out of 500000 steps  (0 percent)
[04:54:43] Writing local files
[04:54:43] Completed 5000 out of 500000 steps  (1 percent)
[06:37:47] Writing local files
[06:37:47] Completed 10000 out of 500000 steps  (2 percent)
-------------------------------------------------------
Program Core_A0.exe, VERSION 3.3
Source code file: fatal.c, line: 342

Fatal error:
NaN detected: (ener[20])

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[06:42:32] Gromacs error.
[06:42:32] 
[06:42:32] Folding@home Core Shutdown: UNKNOWN_ERROR
[06:42:33] CoreStatus = 79 (121)
[06:42:33] Client-core communications error: ERROR 0x79
[06:42:33] Deleting current work unit & continuing...
[06:42:50] - Preparing to get new work unit...
[06:42:50] + Attempting to get work packet
[06:42:50] - Connecting to assignment server
[06:42:50] - Successful: assigned to (171.64.122.138).
[06:42:50] + News From Folding@Home: Welcome to Folding@Home
[06:42:50] Loaded queue successfully.
[06:42:55] + Closed connections
[06:43:00] 
[06:43:00] + Processing work unit
[06:43:00] Core required: FahCore_a0.exe
[06:43:00] Core found.
[06:43:00] Working on Unit 07 [September 3 06:43:00]
[06:43:00] + Working ...
[06:43:00] 
[06:43:00] *------------------------------*
[06:43:00] Folding@Home Gromacs 3.3 Core
[06:43:00] Version 1.92 (April 17. 2007)
[06:43:00] 
[06:43:00] Preparing to commence simulation
[06:43:00] - Looking at optimizations...
[06:43:00] - Created dyn
[06:43:00] - Files status OK
[06:43:01] - Expanded 1168013 -> 6252409 (decompressed 535.3 percent)
[06:43:01] - Starting from initial work packet
[06:43:01] 
[06:43:01] Project: 782 (Run 0, Clone 77, Gen 3)
[06:43:01] 
[06:43:01] Assembly optimizations on if available.
[06:43:01] Entering M.D.
No option -tpi
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[06:43:07] Protein: Mini chaperonin
[06:43:07] Writing local files
[06:43:08] Extra 3DNow boost OK.
[06:43:08] Extra SSE boost OK.
[06:43:09] Writing local files
[06:43:09] Completed 0 out of 500000 steps  (0 percent)
[08:13:16] Writing local files
[08:13:16] Completed 5000 out of 500000 steps  (1 percent)
[09:43:27] Writing local files
[09:43:28] Completed 10000 out of 500000 steps  (2 percent)

I wonder when automatic crash reporting will be part of the process. That doesn't seem like something that should be overlooked for this long.

For what it's worth, I booted into memtest86 the other day (which Ubuntu provides in the grub menu) and all the tests finished without errors.

Getting frustrated,

- Nate
:: ./fah6 v6.02 :: Ubuntu Hardy 8.04.1 :: Asus A7N8X-E (nForce2) :: AMD AthlonXP 3200+ :: 1.0GB RAM :: 1 cpu ::

VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by VijayPande »

I agree it would be important to get better reporting here. However, please note a couple of items. Our non-SMP clients do crash reporting. Here's the problem for SMP: SMP has to use MPI, which means that there's a program in between our client and the core (mpirun). When the core crashes, mpirun doesn't give any useful info and so our client can't know that something bad has happened.

We're looking into what we need to do to work around this, but that's the situation. It's not that it's not been considered or that it's trivial. It's something very much on our minds. Hopefully we can have this resolved ASAP.
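
To make the "program in between" problem concrete, here is a minimal Python sketch. It does not reproduce how mpirun or the F@H client actually behave; `CORE`, `OPAQUE_LAUNCHER`, and `FORWARDING_LAUNCHER` are hypothetical stand-ins invented for illustration. It only shows the general point that a parent process sees the exit status of whatever it launched directly, so a launcher that does not forward its child's status leaves the parent unaware of a crash.

```python
import subprocess
import sys

# Hypothetical stand-in for a core that dies with status 121 (0x79).
CORE = [sys.executable, "-c", "import sys; sys.exit(121)"]

# Hypothetical launcher that runs the core but discards its exit status.
OPAQUE_LAUNCHER = (
    "import subprocess, sys; "
    "subprocess.run(sys.argv[1:]); "
    "sys.exit(0)"
)

# Hypothetical launcher that forwards the core's exit status to its parent.
FORWARDING_LAUNCHER = (
    "import subprocess, sys; "
    "sys.exit(subprocess.run(sys.argv[1:]).returncode)"
)


def run_direct() -> int:
    """The 'client' starts the core itself and sees its status directly."""
    return subprocess.run(CORE).returncode


def run_via(launcher_source: str) -> int:
    """The 'client' starts a launcher, which in turn starts the core."""
    cmd = [sys.executable, "-c", launcher_source, *CORE]
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    print("direct:", hex(run_direct()))
    print("opaque launcher:", hex(run_via(OPAQUE_LAUNCHER)))
    print("forwarding launcher:", hex(run_via(FORWARDING_LAUNCHER)))
```

In this toy setup only the direct and forwarding cases let the caller learn that the core failed; with the opaque launcher the failure is invisible unless the core reports it through some other channel.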
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by P5-133XL »

Perhaps what you need is an error file where the cores dump messages giving details of what has gone wrong, along with any other strange events. The client then simply sends whatever is in the error file whenever it sees one while fetching or sending a WU. After sending, it clears the file away.

Of course, that will require a program or person at the server level to analyze these error files and make appropriate decisions about what to do. At least you will get data that way.
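
The error-file idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anything from the actual F@H client or cores: the function names, the `core_errors.log` file name, and the `send` callback (standing in for whatever upload mechanism the client would use) are all hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def core_report_error(error_file: Path, message: str) -> None:
    """Core side: append one line describing the failure and force it to disk."""
    with error_file.open("a", encoding="utf-8") as fh:
        fh.write(message.rstrip("\n") + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def client_flush_errors(error_file: Path, send: Callable[[str], bool]) -> bool:
    """Client side: if an error file exists, send its contents and clear it.

    Returns True if a report was sent. The file is kept when sending fails,
    so the next connection to the server can retry.
    """
    if not error_file.exists():
        return False
    report = error_file.read_text(encoding="utf-8")
    if not send(report):
        return False
    error_file.unlink()
    return True


if __name__ == "__main__":
    work_dir = Path(tempfile.mkdtemp())
    error_file = work_dir / "core_errors.log"  # hypothetical file name

    # The core writes the error itself, bypassing any launcher in between.
    core_report_error(error_file, "NaN detected: (ener[25])")

    # The client picks the file up the next time it talks to the server.
    uploaded = []
    client_flush_errors(error_file, lambda report: uploaded.append(report) or True)
    print(uploaded)
```

Because the file is written by the core and read by the client, nothing in this path depends on what the launcher in between reports.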
Baowoulf
Posts: 208
Joined: Wed Dec 12, 2007 8:44 pm
Hardware configuration: Pentium 4 2.8 GHz, 512MB DDR Ram, 128MB Radeon 9800, Creative Soundblaster Audigy 4 Pro
Location: Jupiter 6

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by Baowoulf »

You're forgetting that the program (mpirun) isn't F@H's program, so they wouldn't be able to change how it runs (what info it gives), at least not easily.
P5-133XL

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by P5-133XL »

No, I'm not forgetting it. I'm suggesting bypassing it by having the cores write errors directly to a file. Errors do not have to travel from the cores, through mpirun, to the client; they go straight to the file, and the client then reads the file and transmits the error. Perhaps I'm misinterpreting, but isn't what Dr. Pande said that when a core crashes, the mpirun layer does not pass the core's error data on to the client? Doesn't this solve that specific problem?
arfyness

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by arfyness »

I have a uniprocessor machine ... Is there a different (previous?) Linux uniprocessor version of the client that I might try?

I keep getting assigned work units for the mini-chaperonin projects 781 and 782 (core a0), and they have failed every single time, usually with ERROR 0x79 and sometimes with ERROR 0x0. The errors do not occur at the same percentage done, so it seems reasonable that the fault is with the client or the core (not the WU run itself).
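
For anyone who wants to check where each failure happened without reading the whole log, here is a rough Python sketch. It assumes the log looks like the excerpt posted earlier in this thread (`Completed N out of M steps`, a `Fatal error:` line followed by the message, and `CoreStatus = 79 (121)`); the `Failure` class and `find_failures` function are names made up for this example, not part of any F@H tool. Note that the two numbers on the `CoreStatus` line are the same value, hexadecimal 0x79 and decimal 121.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

COMPLETED = re.compile(r"Completed (\d+) out of (\d+) steps")
CORE_STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


@dataclass
class Failure:
    last_step: int                 # last progress line seen before the crash
    total_steps: int
    fatal_message: Optional[str]   # line following "Fatal error:", if any
    status_hex: str
    status_dec: int


def find_failures(log_text: str) -> List[Failure]:
    """Collect one Failure record per 'CoreStatus = ...' line in the log."""
    failures: List[Failure] = []
    last_step, total_steps = 0, 0
    fatal_message: Optional[str] = None
    expect_fatal = False

    for raw in log_text.splitlines():
        line = raw.strip()
        if "Entering M.D." in line:
            # A new attempt starts; forget progress from the previous one.
            last_step, total_steps, fatal_message = 0, 0, None
            continue
        match = COMPLETED.search(line)
        if match:
            last_step, total_steps = int(match.group(1)), int(match.group(2))
            continue
        if line.endswith("Fatal error:"):
            expect_fatal = True
            continue
        if expect_fatal and line:
            fatal_message = line
            expect_fatal = False
            continue
        match = CORE_STATUS.search(line)
        if match:
            failures.append(
                Failure(last_step, total_steps, fatal_message,
                        match.group(1), int(match.group(2)))
            )
    return failures


if __name__ == "__main__":
    with open("FAHlog.txt", encoding="utf-8", errors="replace") as fh:
        for failure in find_failures(fh.read()):
            print(failure)
```

Run against the log above, this should list each crash with the last step count logged before it, which makes it easy to see that the failures land at different points in the run.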

At this point, I might as well not be running the client at all on this machine. A pity, since this one is the only one I leave running 24/7.

-- Nate

John Naylor
Posts: 357
Joined: Mon Dec 03, 2007 4:36 pm
Hardware configuration: Q9450 OC @ 3.2GHz (Win7 Home Premium) - SMP2
E7500 OC @ 3.66GHz (Windows Home Server) - SMP2
i5-3750k @ 3.8GHz (Win7 Pro) - SMP2
Location: University of Birmingham, UK

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by John Naylor »

Try changing the WU download size setting: if you have WUs set to small, set them to normal, or go from normal to big. This should give you a different selection of work. You can also change it downwards if p781 and p782 turn out to be big WUs.
Folding whatever I'm sent since March 2006 :) Beta testing since October 2006. www.FAH-Addict.net Administrator since August 2009.
arfyness

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by arfyness »

I tried different combinations of big/small/normal, with different memory reporting (64 MB is the minimum, I found out). Finally I disabled -advmethods, and at first it couldn't find any work to do (normal / 220 MB). After a while, it's using core 82 on project 2170 (which I can't find any info about).

We'll see what happens...

-- Nate

Leoslocks
Posts: 120
Joined: Fri Jan 25, 2008 3:20 am
Hardware configuration: Q6600 | P35-DQ6 | Crucial 2 x 1 GB ram | VisionTek 3870
GPU2 Version 6.20| CPU three 6.20 Clients

Re: 782 (Run 0, Clone 77, Gen 3) - Gives ERROR 0x79's - Core a0

Post by Leoslocks »

arfyness wrote: After a while, it's using core 82 on project 2170 (which I can't find any info about).
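Project: 2170 | Server IP: 171.65.103.160 | Work unit name: p2170_lambda_obc_300K | Atoms: 1258 | Preferred: 45.00 days | Final deadline: 66.00 days | Credit: 234.00 | Frames: 100 | Code: AMBER | Description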
Project Summary page is found in the links at the Folding Forum page header.