Project 781 -- FahCore_a0.exe ERRORs -- Deletes WUs

Moderators: Site Moderators, FAHC Science Team

arfyness
Posts: 13
Joined: Sun Aug 31, 2008 3:13 pm
Hardware configuration: 5 x WinXP (console version installed as service)
1 x Linux (manually, until I figure out what's wrong)
Asus nForce2 Mobo (A7N8X-E) w/ AMD AthlonXP 3200+ :: 1.0GB RAM :: Ubuntu Hardy 8.04.1
Location: Columbus, Ohio, USA

Post by arfyness »

I'm not sure what's going on here, but I have yet to successfully submit a work unit on my Ubuntu 8.04.1 machine.

To the point: pasted below, after the config details, are the errors I've been getting on every single work unit on the Gromacs 3.3 core (FahCore_a0.exe). I am using the latest available Linux client. There were a few errors before the work units pasted below, but the logs don't go back that far. This is my second attempt to get fah6 working properly on this computer: I had completely erased all the folding files (I don't have those logs either) and started from scratch.

As I recall, there were at least two previous instances of "ERROR 0x0", whatever that is, plus one or two with different hex codes. It also seems I'm only getting assigned core_a0 work units, none of which will finish...

I've checked the file permissions, and the user running the client has full permissions on the containing directory and everything below it. I don't see anything at the OS level that would cause trouble for the client. There are several GB free on that partition, and RAM use hovers around 50% (I have 1GB, plus 1.4GB swap).
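For anyone who wants to repeat those OS-level checks, here is a minimal sketch. The function name `check_client_dir` and the 1 GiB free-space threshold are my own assumptions for illustration, not anything the client requires; `/usr/local/folding` is the install directory shown in the config paste.

```python
import os
import shutil


def check_client_dir(path, min_free_bytes=1 * 1024**3):
    """Report whether the current user can use `path` and how much space is free.

    `min_free_bytes` is an arbitrary illustrative threshold (1 GiB by default).
    """
    usage = shutil.disk_usage(path)
    return {
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
        "executable": os.access(path, os.X_OK),
        "free_bytes": usage.free,
        "enough_space": usage.free >= min_free_bytes,
    }


if __name__ == "__main__":
    # Fall back to the current directory if the folding directory does not exist.
    target = "/usr/local/folding" if os.path.isdir("/usr/local/folding") else "."
    for key, value in check_client_dir(target).items():
        print(f"{key}: {value}")
```

Run it as the same user that runs the client, since `os.access` answers for the current user only.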

I have yet to successfully submit *any* results with this computer, and it's been running now for about three weeks. I'd really like to have this machine contribute to the project. Here are some details about my configuration:

CONFIG STUFF

Code: Select all

nate@Redtail:/usr/local/folding> cat client.cfg
[settings]
username=Arfyness
team=45104
passkey=<<REMOVED>>
asknet=no
machineid=1
local=10

[http]
active=no
host=localhost
port=8080

[clienttype]
memory=200
type=3

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 
nate@Redtail:/usr/local/folding> uname -a
Linux Redtail 2.6.24-19-generic #1 SMP Wed Aug 20 22:56:21 UTC 2008 i686 GNU/Linux

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 
nate@Redtail:/usr/local/folding> ./fah6 -queueinfo

[--- SNIP --- ]

CURRENT QUEUE:
00  EMPTY   
01  EMPTY   
02  EMPTY   
03  EMPTY   
04 *READY     a0 171.64.122.138:8080  August 30 12:54 | January 31 12:54
[      P781R0C83F2 ]
05  EMPTY   
06  EMPTY   
07  EMPTY   
08  EMPTY   
09  EMPTY   

Folding@Home Client Shutdown.
And here are pastes from the log file for two of the errors (the only ones I still have in the logs):

Unit 02 - August 26
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Code: Select all

--- Opening Log file [August 25 16:58:05] 
     < --- SNIP --- >
[16:58:06] Loaded queue successfully.
[16:58:06] 
[16:58:06] + Processing work unit
[16:58:06] Core required: FahCore_a0.exe
[16:58:06] Core found.
[16:58:06] Working on Unit 02 [August 25 16:58:06]
[16:58:06] + Working ...
[16:58:06] 
[16:58:06] *------------------------------*
[16:58:06] Folding@Home Gromacs 3.3 Core
[16:58:06] Version 1.92 (April 17. 2007)
[16:58:06] 
[16:58:06] Preparing to commence simulation
[16:58:06] - Looking at optimizations...
[16:58:06] - Files status OK
[16:58:06] - Expanded 1169210 -> 6252409 (decompressed 534.7 percent)
[16:58:06] 
[16:58:06] Project: 781 (Run 0, Clone 83, Gen 2)
[16:58:06] 
[16:58:06] Assembly optimizations on if available.
[16:58:06] Entering M.D.
[16:58:28] (Starting from checkpoint)
[16:58:28] Protein: Mini chaperonin
[16:58:28] Writing local files
[16:58:28] Completed 116429 out of 500000 steps  (23%)
[16:58:28] Extra 3DNow boost OK.
[16:58:28] Extra SSE boost OK.
[18:00:50] Writing local files
[18:00:51] Completed 120000 out of 500000 steps  (24 percent)
[19:23:53] Writing local files
     < --- SNIP --- >
[14:30:01] Completed 190000 out of 500000 steps  (38 percent)
[16:00:06] Writing local files
[16:00:06] Completed 195000 out of 500000 steps  (39 percent)
[16:09:25] CoreStatus = 0 (0)
[16:09:25] Client-core communications error: ERROR 0x0
[16:09:25] Deleting current work unit & continuing...
[16:09:43] - Preparing to get new work unit...
[16:09:43] + Attempting to get work packet
[16:09:43] - Connecting to assignment server
[16:09:44] - Successful: assigned to (171.64.122.138).
[16:09:44] + News From Folding@Home: Welcome to Folding@Home
[16:09:44] Loaded queue successfully.
[16:09:49] + Closed connections
[16:09:54]
[16:09:54] + Processing work unit
[16:09:54] Core required: FahCore_a0.exe
[16:09:54] Core found.
[16:09:54] Working on Unit 03 [August 26 16:09:54]
[16:09:54] + Working ...
[16:09:54]
[16:09:54] *------------------------------*
[16:09:54] Folding@Home Gromacs 3.3 Core
[16:09:54] Version 1.92 (April 17. 2007)
[16:09:54]
[16:09:54] Preparing to commence simulation
[16:09:54] - Looking at optimizations...
[16:09:54] - Created dyn
[16:09:54] - Files status OK
[16:09:55] - Expanded 1169210 -> 6252409 (decompressed 534.7 percent)
[16:09:55] - Starting from initial work packet
[16:09:55]
[16:09:55] Project: 781 (Run 0, Clone 83, Gen 2)
[16:09:55]
[16:09:55] Assembly optimizations on if available.
[16:09:55] Entering M.D.
[16:10:01] Protein: Mini chaperonin
[16:10:01] Writing local files
[16:10:02] Extra 3DNow boost OK.
[16:10:02] Extra SSE boost OK.
[16:10:03] Writing local files
[16:10:03] Completed 0 out of 500000 steps  (0 percent)
[17:38:44] Writing local files
[17:38:44] Completed 5000 out of 500000 steps  (1 percent)
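To see at a glance how far each unit got before it was deleted, the log can be scanned for the error line and paired with the last progress line before it. This is a rough sketch that assumes the line formats shown in the paste above; `find_core_errors` is a hypothetical helper, not part of the client.

```python
import re

PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
ERROR = re.compile(r"Client-core communications error: ERROR (0x[0-9a-fA-F]+)")


def find_core_errors(log_lines):
    """Return a list of (error_code, last_completed_step, total_steps) tuples."""
    errors = []
    last_progress = None
    for line in log_lines:
        match = PROGRESS.search(line)
        if match:
            last_progress = (int(match.group(1)), int(match.group(2)))
            continue
        match = ERROR.search(line)
        if match:
            done, total = last_progress if last_progress else (None, None)
            errors.append((match.group(1), done, total))
            last_progress = None  # the next work unit starts its own count
    return errors


sample = [
    "[16:00:06] Completed 195000 out of 500000 steps  (39 percent)",
    "[16:09:25] CoreStatus = 0 (0)",
    "[16:09:25] Client-core communications error: ERROR 0x0",
    "[16:09:25] Deleting current work unit & continuing...",
]
print(find_core_errors(sample))  # [('0x0', 195000, 500000)]
```

To run it over a whole log, pass an open file object instead of `sample`.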


Unit 03 - August 30
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Code: Select all

--- Opening Log file [August 27 16:38:17]
     < --- SNIP --- >
[16:38:17] Loaded queue successfully.
[16:38:17]
[16:38:17] + Processing work unit
[16:38:17] Core required: FahCore_a0.exe
[16:38:17] Core found.
[16:38:17] Working on Unit 03 [August 27 16:38:17]
[16:38:17] + Working ...
[16:38:18]
[16:38:18] *------------------------------*
[16:38:18] Folding@Home Gromacs 3.3 Core
[16:38:18] Version 1.92 (April 17. 2007)
[16:38:18]
[16:38:18] Preparing to commence simulation
[16:38:18] - Looking at optimizations...
[16:38:18] - Files status OK
[16:38:20] - Expanded 1169210 -> 6252409 (decompressed 534.7 percent)
[16:38:20]
[16:38:20] Project: 781 (Run 0, Clone 83, Gen 2)
[16:38:20]
[16:38:20] Assembly optimizations on if available.
[16:38:20] Entering M.D.
[16:38:43] (Starting from checkpoint)
[16:38:43] Protein: Mini chaperonin
[16:38:43] Writing local files
[16:38:43] Completed 64865 out of 500000 steps  (12%)
[16:38:44] Extra 3DNow boost OK.
[16:38:44] Extra SSE boost OK.
[16:42:29] Writing local files
[16:42:29] Completed 65000 out of 500000 steps  (13 percent)
     < --- SNIP --- >
[10:59:17] Completed 290000 out of 500000 steps  (58 percent)
[12:30:02] Writing local files
[12:30:02] Completed 295000 out of 500000 steps  (59 percent)
-------------------------------------------------------
Program Core_A0.exe, VERSION 3.3
Source code file: fatal.c, line: 342

Fatal error:
NaN detected: (ener[20])

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[12:54:12] Gromacs error.
[12:54:12]
[12:54:12] Folding@home Core Shutdown: UNKNOWN_ERROR
[12:54:13] CoreStatus = 79 (121)
[12:54:13] Client-core communications error: ERROR 0x79
[12:54:13] Deleting current work unit & continuing...
[12:54:30] - Preparing to get new work unit...
[12:54:30] + Attempting to get work packet
[12:54:30] - Connecting to assignment server
[12:54:31] - Successful: assigned to (171.64.122.138).
[12:54:31] + News From Folding@Home: Welcome to Folding@Home
[12:54:31] Loaded queue successfully.
[12:54:36] + Closed connections
[12:54:41]
[12:54:41] + Processing work unit
[12:54:41] Core required: FahCore_a0.exe
[12:54:41] Core found.
[12:54:41] Working on Unit 04 [August 30 12:54:41]
[12:54:41] + Working ...
[12:54:41]
[12:54:41] *------------------------------*
[12:54:41] Folding@Home Gromacs 3.3 Core
[12:54:41] Version 1.92 (April 17. 2007)
[12:54:41]
[12:54:41] Preparing to commence simulation
[12:54:41] - Looking at optimizations...
[12:54:41] - Created dyn
[12:54:41] - Files status OK
[12:54:42] - Expanded 1169210 -> 6252409 (decompressed 534.7 percent)
[12:54:42] - Starting from initial work packet
[12:54:42]
[12:54:42] Project: 781 (Run 0, Clone 83, Gen 2)
[12:54:42]
[12:54:42] Assembly optimizations on if available.
[12:54:42] Entering M.D.
No option -tpi
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[12:54:48] Protein: Mini chaperonin
[12:54:48] Writing local files
[12:54:49] Extra 3DNow boost OK.
[12:54:49] Extra SSE boost OK.
[12:54:50] Writing local files
[12:54:50] Completed 0 out of 500000 steps  (0 percent)
[14:25:38] Writing local files

Like I said, I've had at least two prior cases of "ERROR 0x0" and another one or two with different hex codes, and all are with FahCore_a0.exe. Any guesses whether these are just flukes? Is anyone else having these "results?"
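On the hex codes: in the two errors I still have, the `CoreStatus` line appears to show the same value twice, first in hex and then in decimal (`79 (121)`, and 0x79 is 121). That reading is my assumption from these two lines only; I don't know what the codes themselves mean. A small sketch that checks it:

```python
import re

CORE_STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def parse_core_status(line):
    """Return (value_read_as_hex, value_in_parentheses), or None if no match."""
    match = CORE_STATUS.search(line)
    if match is None:
        return None
    hex_text, dec_text = match.groups()
    return int(hex_text, 16), int(dec_text)


for line in ("[16:09:25] CoreStatus = 0 (0)", "[12:54:13] CoreStatus = 79 (121)"):
    as_hex, as_dec = parse_core_status(line)
    # If the assumption holds, both numbers are equal.
    print(hex(as_hex), as_dec, as_hex == as_dec)
```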

Right now I'm running this from xterm so I can monitor the goings on of it all. CTRL-C seems the best way to end the task before shutting down / rebooting. That doesn't seem to pose any problems.

I hope to have this resolved because, as you see in the logs, this is a HUGE work unit, and takes DAYS before it crashes / restarts.

Thanks,
-- Nate
:: ./fah6 v6.02 :: Ubuntu Hardy 8.04.1 :: Asus A7N8X-E (nForce2) :: AMD AthlonXP 3200+ :: 1.0GB RAM :: 1 cpu ::


Re: Project 781 -- FahCore_a0.exe ERRORs -- Deletes WUs

Post by arfyness »

UPDATE:

I'm not sure if this will have any effect whatsoever, but it crossed my mind to delete the core file FahCore_a0.exe so that the client would have to download the latest core, in case anything has changed. I have seen the notes about the A2 core. Of course, I stopped the client, deleted the file, and restarted. (Under Linux, the order shouldn't really matter, as a deleted file stays available to anything using it until its handle is closed.) Anyway, it downloaded the core and picked right back up where it was.
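The point about deleted files can be shown with a short sketch. It uses a temporary file rather than the real core, and it relies on POSIX unlink behaviour, so it is meant for Linux; `read_after_unlink` is just an illustrative name.

```python
import os
import tempfile


def read_after_unlink(payload=b"core bytes"):
    """Delete a file while it is still open, then read it through the handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(payload)
        handle.flush()
        os.unlink(path)                      # remove the directory entry
        still_listed = os.path.exists(path)  # the name is gone...
        handle.seek(0)
        data = handle.read()                 # ...but the open handle still works
    return still_listed, data


print(read_after_unlink())
```

The space is only released once the last open handle is closed, which is why a running process is not disturbed when its file is deleted.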

BTW, It's still working the same mini-chaperonin unit as the failures above, now in slot 04 - Project: 781 (Run 0, Clone 83, Gen 2)

I'll post anything further here, feel free to comment / speculate / help / etc.

--Nate

Re: Project 781 -- FahCore_a0.exe ERRORs -- Deletes WUs

Post by arfyness »

Well, okay, that was fast. It crashed again with ERROR 0x0.

Code: Select all

[08:14:51] Completed 115000 out of 500000 steps  (23 percent)
[09:32:58] Writing local files
[09:32:58] Completed 120000 out of 500000 steps  (24 percent)
[09:38:29] CoreStatus = 0 (0)
[09:38:29] Client-core communications error: ERROR 0x0
[09:38:29] Deleting current work unit & continuing...
[09:38:46] - Preparing to get new work unit...
[09:38:46] + Attempting to get work packet
[09:38:46] - Connecting to assignment server
[09:38:46] - Successful: assigned to (171.64.122.138).
[09:38:46] + News From Folding@Home: Welcome to Folding@Home
[09:38:47] Loaded queue successfully.
[09:38:47] - Attempt #1  to get work failed, and no other work to do.
             Waiting before retry.
[09:39:04] + Attempting to get work packet
[09:39:04] - Connecting to assignment server
[09:39:04] - Successful: assigned to (171.64.122.138).
[09:39:04] + News From Folding@Home: Welcome to Folding@Home
[09:39:05] Loaded queue successfully.
[09:39:09] + Closed connections
[09:39:14] 
[09:39:14] + Processing work unit
[09:39:14] Core required: FahCore_a0.exe
[09:39:14] Core found.
[09:39:14] Working on Unit 05 [September 1 09:39:14]
[09:39:14] + Working ...
[09:39:15] 
[09:39:15] *------------------------------*
[09:39:15] Folding@Home Gromacs 3.3 Core
[09:39:15] Version 1.92 (April 17. 2007)
[09:39:15] 
[09:39:15] Preparing to commence simulation
[09:39:15] - Looking at optimizations...
[09:39:15] - Created dyn
[09:39:15] - Files status OK
[09:39:15] - Expanded 1168013 -> 6252409 (decompressed 535.3 percent)
[09:39:15] - Starting from initial work packet
[09:39:15] 
[09:39:15] Project: 782 (Run 0, Clone 77, Gen 3)
[09:39:15] 
[09:39:15] Assembly optimizations on if available.
[09:39:15] Entering M.D.
No option -tpi
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[09:39:22] Protein: Mini chaperonin
[09:39:22] Writing local files
[09:39:22] Extra 3DNow boost OK.
[09:39:22] Extra SSE boost OK.
[09:39:23] Writing local files
[09:39:23] Completed 0 out of 500000 steps  (0 percent)
[10:57:19] Writing local files
[10:57:19] Completed 5000 out of 500000 steps  (1 percent)
At least this time it was only 24% (only about 3 days into this WU). And thankfully I've been assigned a different one this time, although it's still a mini-chaperonin one. - Project: 782 (Run 0, Clone 77, Gen 3). From the look of it, this one takes about 1h12m per percent, whereas that last one took more than 2h40m per percent.
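For reference, the time per percent can be worked out from any two "Completed" lines. This is a minimal sketch using timestamps from the log above; the helper names are hypothetical, and because the log lines carry no date it assumes each interval is shorter than 24 hours.

```python
from datetime import datetime, timedelta


def elapsed(start, end):
    """Time between two HH:MM:SS log stamps, allowing one midnight rollover."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)  # the end stamp is on the following day
    return delta


def per_percent(start, end, percent_gained):
    """Average time per percent of progress between two log lines."""
    return elapsed(start, end) / percent_gained


print(per_percent("09:39:23", "10:57:19", 1))  # Project 782, 0 -> 1 percent
print(per_percent("08:14:51", "09:32:58", 1))  # Project 781, 23 -> 24 percent
```

On the excerpts quoted in this post, both intervals come out at roughly 1h18m. These stamps only cover time while the client was actually running, so figures taken over longer stretches, or ones that include downtime, can differ.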

I decided to shut down the core and use qfix. Here are my results:

Code: Select all

root@Redtail:/usr/local/folding# ./qfix
entry 6, status 0, address 0.0.0.0
entry 7, status 0, address 0.0.0.0
entry 8, status 0, address 0.0.0.0
entry 9, status 0, address 0.0.0.0
entry 0, status 0, address 0.0.0.0
entry 1, status 0, address 169.230.26.30:8080
entry 2, status 0, address 171.64.122.138:8080
entry 3, status 0, address 171.64.122.138:8080
entry 4, status 0, address 171.64.122.138:8080
entry 5, status 1, address 171.64.122.138:8080
File is OK
root@Redtail:/usr/local/folding#
Then I restarted ./fah6 and it looks normal...

Code: Select all

--- Opening Log file [September 1 14:49:09] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /usr/local/folding
Executable: ./fah6


[14:49:09] - Ask before connecting: No
[14:49:09] - User name: Arfyness (Team 45104)
[14:49:09] - User ID: ###
[14:49:09] - Machine ID: 1
[14:49:09] 
[14:49:09] Loaded queue successfully.
[14:49:09] 
[14:49:09] + Processing work unit
[14:49:09] Core required: FahCore_a0.exe
[14:49:09] Core found.
[14:49:09] Working on Unit 05 [September 1 14:49:09]
[14:49:09] + Working ...
[14:49:09] 
[14:49:09] *------------------------------*
[14:49:09] Folding@Home Gromacs 3.3 Core
[14:49:09] Version 1.92 (April 17. 2007)
[14:49:09] 
[14:49:09] Preparing to commence simulation
[14:49:09] - Looking at optimizations...
[14:49:09] - Files status OK
[14:49:10] - Expanded 1168013 -> 6252409 (decompressed 535.3 percent)
[14:49:10] 
[14:49:10] Project: 782 (Run 0, Clone 77, Gen 3)
[14:49:10] 
[14:49:10] Assembly optimizations on if available.
[14:49:10] Entering M.D.
No option -tpi
(single precision)
starting mdrun 'Mini chaperonin'
500000 steps,   1000.0 ps.

[14:49:31] (Starting from checkpoint)
[14:49:31] Protein: Mini chaperonin
[14:49:31] Writing local files
[14:49:31] Completed 18850 out of 500000 steps  (3%)
[14:49:32] Extra 3DNow boost OK.
[14:49:32] Extra SSE boost OK.
WTF ... I thought it was going to do something. I guess I'll just let it run ... If it happens again, I'm going to change some options, maybe indicating less memory available, or small units or something.

-- Nate