2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

Post by parkut »

This particular work unit simply won't start; the client immediately exits with a SIGTERM.

model name : Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz
cpu MHz : 1998.000
cache size : 4096 KB
Memory: 975.99 MB physical, 1.94 GB virtual
...
Current Work Unit
-----------------
Name: Protein in POPC
Tag: P2605R0C540G46
Download time: May 21 22:54:05
Due time: May 25 22:54:05
Progress: 0% [__________]

./fah6 -verbosity 9 -smp

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [May 21 23:48:51] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/fah6
Executable: ./fah6
Arguments: -verbosity 9 -smp 

[23:48:51] - Ask before connecting: No
[23:48:51] - User name: parkut (Team 4)
[23:48:51] - User ID: 3DAEED787A87E4F8
[23:48:51] - Machine ID: 1
[23:48:51] 
[23:48:51] Loaded queue successfully.
[23:48:51] 
[23:48:51] + Processing work unit
[23:48:51] Core required: FahCore_a1.exe
[23:48:51] Core found.
[23:48:51] - Autosending finished units...
[23:48:51] Trying to send all finished work units
[23:48:51] + No unsent completed units remaining.
[23:48:51] - Autosend completed
[23:48:51] Working on Unit 06 [May 21 23:48:51]
[23:48:51] + Working ...
[23:48:51] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 10178 -version 602'

[23:48:51] 
[23:48:51] *------------------------------*
[23:48:51] Folding@Home Gromacs SMP Core
[23:48:51] Version 1.74 (November 27, 2006)
[23:48:51] 
[23:48:51] Preparing to commence simulation
[23:48:51] - Ensuring status. Please wait.
[23:48:51] 
[23:48:51] Project: 2605 (Run 0, Clone 540, Gen 46)
[23:48:51] 
[23:48:51] Assembly optimizations on if available.
[23:48:51] Entering M.D.
[23:49:08] - Expanded 2420849 -> 12854153 (decompressed 530.9 percent)
[23:49:08] 
[23:49:08] Project: 2605 (Run 0, Clone 540, Gen 46)
[23:49:08] 
[23:49:08] Entering M.D.
NNODES=4, MYRANK=3, HOSTNAME=conroe7.parkut.com
NNODES=4, MYRANK=0, HOSTNAME=conroe7.parkut.com
NNODES=4, MYRANK=2, HOSTNAME=conroe7.parkut.com
NNODES=4, MYRANK=1, HOSTNAME=conroe7.parkut.com
NODEID=2 argc=15
NODEID=0 argc=15
NODEID=1 argc=15
NODEID=3 argc=15
      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2004, The GROMACS development team,
            check out http://www.gromacs.org for more information.

        This inclusion of Gromacs code in the Folding@Home Core is under
        a special license (see http://folding.stanford.edu/gromacs.html)
         specially granted to Stanford by the copyright holders. If you
          are interested in using Gromacs, visit www.gromacs.org where
                you can download a free version of Gromacs under
         the terms of the GNU General Public License (GPL) as published
       by the Free Software Foundation; either version 2 of the License,
                     or (at your option) any later version.

(single precision)
starting mdrun 'Protein in POPC'
500000 steps,   1000.0 ps.

[23:49:16] s
[23:49:16] otein: ProteExtra SSE boost OK.
[23:49:16] ocal files
[23:49:16] Extra SSE boost OK.
[23:49:16] Finalizing output
[23:49:16] nt)
[23:49:16] 
[23:49:16] Folding@home Core Shutdown: INTERRUPTED
[0]0:Return code = 102
[0]1:Return code = 0, signaled with Segmentation fault
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 0, signaled with Segmentation fault
[23:49:20] CoreStatus = 66 (102)
[23:49:20] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[23:49:20] Killing all core threads

Folding@Home Client Shutdown.


parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: 2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

Post by parkut »

After 16 restart attempts, each ending in the same immediate SIGTERM,
I deleted the work files, queue.dat, and machinedependent.dat,
then restarted and picked up a different 2605, which is now running normally.
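
For anyone who wants to script the cleanup described in this post, here is a minimal Python sketch. It is an illustration under stated assumptions, not an official procedure: the function name `reset_client_state` is made up, the file names follow the post (using the client's spelling, machinedependent.dat), and the `work` directory name is taken from the `-dir work/` argument in the log. Stop the client first; any queued progress is discarded.

```python
import shutil
from pathlib import Path


def reset_client_state(client_dir,
                       state_files=("queue.dat", "machinedependent.dat"),
                       work_dir="work"):
    """Delete the client's state files and empty its work directory.

    Hypothetical helper: names and defaults are assumptions based on the
    post and log above. Returns the list of paths that were removed.
    """
    root = Path(client_dir)
    removed = []

    # Remove the queue and machine-state files if they exist.
    for name in state_files:
        path = root / name
        if path.is_file():
            path.unlink()
            removed.append(path)

    # Empty the work directory but keep the directory itself.
    work = root / work_dir
    if work.is_dir():
        for child in work.iterdir():
            if child.is_dir():
                shutil.rmtree(child)
            else:
                child.unlink()
            removed.append(child)

    return removed
```

Calling `reset_client_state("/root/fah6")` would target the launch directory shown in the log; other files in that directory, such as the client configuration, are left alone.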
sortofageek
Site Admin
Posts: 3110
Joined: Fri Nov 30, 2007 8:06 pm
Location: Team Helix

Re: 2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

Post by sortofageek »

That WU has not been returned by anyone as of this time.
Xilikon
Posts: 155
Joined: Sun Dec 02, 2007 1:34 pm

Re: 2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

Post by Xilikon »

Another report of that WU, from one box, where the client even shut down after 3 tries (each try stopped with a 0x0 error):

[21:50:25] Writing local files
[21:50:25] Completed 410000 out of 500000 steps  (82 percent)
[22:02:07] Writing local files
[22:02:07] Completed 415000 out of 500000 steps  (83 percent)
[22:13:48] Writing local files
[22:13:48] Completed 420000 out of 500000 steps  (84 percent)
[22:25:30] Writing local files
[22:25:30] Completed 425000 out of 500000 steps  (85 percent)
[22:37:18] Writing local files
[22:37:18] Completed 430000 out of 500000 steps  (86 percent)
[22:49:06] Writing local files
[22:49:06] Completed 435000 out of 500000 steps  (87 percent)
[23:00:54] Writing local files
[23:00:54] Completed 440000 out of 500000 steps  (88 percent)
[23:12:41] Writing local files
[23:12:41] Completed 445000 out of 500000 steps  (89 percent)
[23:24:20] Writing local files
[23:24:20] Completed 450000 out of 500000 steps  (90 percent)
[23:36:09] Writing local files
[23:36:09] Completed 455000 out of 500000 steps  (91 percent)
[23:47:59] Writing local files
[23:47:59] Completed 460000 out of 500000 steps  (92 percent)
[23:59:49] Writing local files
[23:59:49] Completed 465000 out of 500000 steps  (93 percent)
[00:11:39] Writing local files
[00:11:39] Completed 470000 out of 500000 steps  (94 percent)
[00:23:29] Writing local files
[00:23:29] Completed 475000 out of 500000 steps  (95 percent)
[00:35:18] Writing local files
[00:35:19] Completed 480000 out of 500000 steps  (96 percent)
[00:44:21] - Autosending finished units...
[00:44:21] Trying to send all finished work units
[00:44:21] + No unsent completed units remaining.
[00:44:21] - Autosend completed
[00:47:08] Writing local files
[00:47:08] Completed 485000 out of 500000 steps  (97 percent)
[00:58:56] Writing local files
[00:58:56] Completed 490000 out of 500000 steps  (98 percent)
[01:10:45] Writing local files
[01:10:45] Completed 495000 out of 500000 steps  (99 percent)
[01:22:35] Writing local files
[01:22:35] Completed 500000 out of 500000 steps  (100 percent)
[01:22:35] Writing final coordinates.
[01:22:35] Past main M.D. loop
[01:22:35] Will end MPI now
[01:23:35] 
[01:23:35] Finished Work Unit:
[01:23:35] - Reading up to 3718704 from "work/wudata_09.arc": Read 3718704
[01:23:35] - Reading up to 1770828 from "work/wudata_09.xtc": Read 1770828
[01:23:35] goefile size: 0
[01:23:35] logfile size: 16912
[01:23:35] Leaving Run
[01:23:40] - Writing 5510844 bytes of core data to disk...
[01:23:40]   ... Done.
[01:23:46] - Shutting down core
[01:23:46] 
[01:23:46] Folding@home Core Shutdown: FINISHED_UNIT
[01:24:01] CoreStatus = 64 (100)
[01:24:01] Unit 9 finished with 79 percent of time to deadline remaining.
[01:24:01] Updated performance fraction: 0.792123
[01:24:01] Sending work to server


[01:24:01] + Attempting to send results
[01:24:01] - Reading file work/wuresults_09.dat from core
[01:24:01]   (Read 5510844 bytes from disk)
[01:24:01] Connecting to http://171.64.65.56:8080/
[01:25:00] Posted data.
[01:25:00] Initial: 0000; - Uploaded at ~89 kB/s
[01:25:01] - Averaged speed for that direction ~89 kB/s
[01:25:01] + Results successfully sent
[01:25:01] Thank you for your contribution to Folding@Home.
[01:25:01] + Number of Units Completed: 9

[01:29:08] - Warning: Could not delete all work unit files (9): Core returned invalid code
[01:29:08] Trying to send all finished work units
[01:29:08] + No unsent completed units remaining.
[01:29:08] - Preparing to get new work unit...
[01:29:08] + Attempting to get work packet
[01:29:08] - Will indicate memory of 768 MB
[01:29:08] - Connecting to assignment server
[01:29:08] Connecting to http://assign.stanford.edu:8080/
[01:29:08] Posted data.
[01:29:08] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[01:29:08] + News From Folding@Home: Welcome to Folding@Home
[01:29:08] Loaded queue successfully.
[01:29:08] Connecting to http://171.64.65.56:8080/
[01:29:11] Posted data.
[01:29:11] Initial: 0000; - Receiving payload (expected size: 2421361)
[01:29:14] - Downloaded at ~788 kB/s
[01:29:14] - Averaged speed for that direction ~729 kB/s
[01:29:14] + Received work.
[01:29:14] Trying to send all finished work units
[01:29:14] + No unsent completed units remaining.
[01:29:14] + Closed connections
[01:29:14] 
[01:29:14] + Processing work unit
[01:29:14] Core required: FahCore_a1.exe
[01:29:14] Core found.
[01:29:14] Working on Unit 00 [July 2 01:29:14]
[01:29:14] + Working ...
[01:29:14] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 00 -priority 96 -checkpoint 20 -verbose -lifeline 5410 -version 602'

[01:29:14] 
[01:29:14] *------------------------------*
[01:29:14] Folding@Home Gromacs SMP Core
[01:29:14] Version 1.74 (November 27, 2006)
[01:29:14] 
[01:29:14] Preparing to commence simulation
[01:29:14] - Ensuring status. Please wait.
[01:29:15] - Starting from initial work packet
[01:29:15] 
[01:29:15] Project: 2605 (Run 0, Clone 540, Gen 46)
[01:29:15] 
[01:29:15] Assembly optimizations on if available.
[01:29:15] Entering M.D.
[01:29:32]  percent)
[01:29:32] - Starting from initial work packet
[01:29:32] 
[01:29:32] Project: 2605 (Run 0, Clone 540, Gen 46)
[01:29:32] 
[01:29:32] Entering M.D.
[01:29:39] Protein: Protein in POPC
[01:29:39] Writing local files
[01:29:39] Extra SSE boost OK.
[01:29:44] CoreStatus = 0 (0)
[01:29:44] Client-core communications error: ERROR 0x0
[01:29:44] Deleting current work unit & continuing...
[01:34:08] - Warning: Could not delete all work unit files (0): Core returned invalid code
[01:34:08] Trying to send all finished work units
[01:34:08] + No unsent completed units remaining.
[01:34:08] - Preparing to get new work unit...
[01:34:08] + Attempting to get work packet
[01:34:08] - Will indicate memory of 768 MB
[01:34:08] - Connecting to assignment server
[01:34:08] Connecting to http://assign.stanford.edu:8080/
[01:34:08] Posted data.
[01:34:08] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[01:34:08] + News From Folding@Home: Welcome to Folding@Home
[01:34:08] Loaded queue successfully.
[01:34:08] Connecting to http://171.64.65.56:8080/
[01:34:11] Posted data.
[01:34:11] Initial: 0000; - Receiving payload (expected size: 2421361)
[01:34:14] - Downloaded at ~788 kB/s
[01:34:14] - Averaged speed for that direction ~741 kB/s
[01:34:14] + Received work.
[01:34:14] + Closed connections
[01:34:19] 
[01:34:19] + Processing work unit
[01:34:19] Core required: FahCore_a1.exe
[01:34:19] Core found.
[01:34:19] Working on Unit 01 [July 2 01:34:19]
[01:34:19] + Working ...
[01:34:19] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -priority 96 -checkpoint 20 -verbose -lifeline 5410 -version 602'

[01:34:19] 
[01:34:19] *------------------------------*
[01:34:19] Folding@Home Gromacs SMP Core
[01:34:19] Version 1.74 (November 27, 2006)
[01:34:19] 
[01:34:19] Preparing to commence simulation
[01:34:19] - Ensuring status. Please wait.
[01:34:36] - Looking at optimizations...
[01:34:36] - Working with standard loops on this execution.
[01:34:36] - Previous termination of core was improper.
[01:34:36] - Going to use standard loops.
[01:34:36] - Files status OK
[01:34:37] (decompressed 530.9 percent)
[01:34:37] - Starting from initial work packet
[01:34:37] 
[01:34:37] Project: 2605 (Run 0, Clone 540, Gen 46)
[01:34:37] 
[01:34:37] Entering M.D.
[01:34:37] ne 540, Gen 46)
[01:34:37] 
[01:34:37] Entering M.D.
[01:34:44] g local files
[01:34:44] boost OK.
[01:34:44] boost OK.
[01:34:44] ocal files
[01:34:44] Extra SSE boost OK.
[01:34:45] Finalizing output
[01:34:45] cent)
[01:34:49] CoreStatus = 0 (0)
[01:34:49] Client-core communications error: ERROR 0x0
[01:34:49] Deleting current work unit & continuing...
[01:39:13] - Warning: Could not delete all work unit files (1): Core returned invalid code
[01:39:13] Trying to send all finished work units
[01:39:13] + No unsent completed units remaining.
[01:39:13] - Preparing to get new work unit...
[01:39:13] + Attempting to get work packet
[01:39:13] - Will indicate memory of 768 MB
[01:39:13] - Connecting to assignment server
[01:39:13] Connecting to http://assign.stanford.edu:8080/
[01:39:13] Posted data.
[01:39:13] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[01:39:13] + News From Folding@Home: Welcome to Folding@Home
[01:39:13] Loaded queue successfully.
[01:39:13] Connecting to http://171.64.65.56:8080/
[01:39:16] Posted data.
[01:39:16] Initial: 0000; - Receiving payload (expected size: 2421361)
[01:39:19] - Downloaded at ~788 kB/s
[01:39:19] - Averaged speed for that direction ~750 kB/s
[01:39:19] + Received work.
[01:39:19] + Closed connections
[01:39:24] 
[01:39:24] + Processing work unit
[01:39:24] Core required: FahCore_a1.exe
[01:39:24] Core found.
[01:39:24] Working on Unit 02 [July 2 01:39:24]
[01:39:24] + Working ...
[01:39:24] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 02 -priority 96 -checkpoint 20 -verbose -lifeline 5410 -version 602'

[01:39:24] 
[01:39:24] *------------------------------*
[01:39:24] Folding@Home Gromacs SMP Core
[01:39:24] Version 1.74 (November 27, 2006)
[01:39:24] 
[01:39:24] Preparing to commence simulation
[01:39:24] - Ensuring status. Please wait.
[01:39:41] - Looking at optimizations...
[01:39:41] - Working with standard loops on this execution.
[01:39:41] - Previous termination of core was improper.
[01:39:41] - Going to use standard loops.
[01:39:41] - Files status OK
[01:39:42] (decompressed 530.9 percent)
[01:39:42] - Starting from initial work pa- Starting from initial work packet
[01:39:42] 
[01:39:42] Project: 2605 (Run 0, Clone 540, Gen 46)
[01:39:42] 
[01:39:42] Entering M.D.
[01:39:49] g local files
[01:39:49] in in POPC
[01:39:49] Writing local files
[01:39:49] Extra SSE boost OK.
[01:39:50] 0000 steps  (0 percent)
[01:39:50] 
[01:39:50] Folding@home Core Shutdown: INTERRUPTED
[01:39:55] CoreStatus = 66 (102)
[01:39:55] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[01:39:55] Killing all core threads

Folding@Home Client Shutdown.
This is on a known stable box.
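
To confirm that repeated failures in a log like this all come from the same work unit, the `Project:` and `CoreStatus` lines can be pulled out with a short script. The following is a minimal Python sketch; the function name `summarize` and the regular expressions are illustrative and not part of the client. In the lines shown in this thread, the first CoreStatus number read as hexadecimal equals the number in parentheses (64 and 100, 66 and 102), so the sketch keeps the decimal value.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize(log_text):
    """Return (Counter of work-unit mentions, list of decimal status codes)."""
    units = Counter()
    statuses = []
    for line in log_text.splitlines():
        match = PROJECT_RE.search(line)
        if match:
            # Key is (project, run, clone, gen); a single attempt may print
            # the Project line more than once, so this counts mentions.
            units[tuple(int(g) for g in match.groups())] += 1
        match = STATUS_RE.search(line)
        if match:
            statuses.append(int(match.group(2)))
    return units, statuses


# A few lines excerpted from the log above.
sample = """\
[01:24:01] CoreStatus = 64 (100)
[01:29:15] Project: 2605 (Run 0, Clone 540, Gen 46)
[01:29:44] CoreStatus = 0 (0)
[01:34:37] Project: 2605 (Run 0, Clone 540, Gen 46)
[01:34:49] CoreStatus = 0 (0)
[01:39:42] Project: 2605 (Run 0, Clone 540, Gen 46)
[01:39:55] CoreStatus = 66 (102)
"""

units, statuses = summarize(sample)
print(units)
print(statuses)
```

For this excerpt the only key in `units` is (2605, 0, 540, 46), and `statuses` is 100, 0, 0, 102: one finished unit followed by three failed attempts at the same WU.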
rbrandman
Pande Group Member
Posts: 22
Joined: Wed May 14, 2008 4:11 pm

Re: 2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

Post by rbrandman »

Thanks for your post. I'll alert the researcher in charge of this project, Peter Kasson, so he can check into it.

Relly
Flathead74
Posts: 266
Joined: Sun Dec 02, 2007 6:08 pm
Location: Central New York

2605 (Run 0, Clone 540, Gen 46)

Post by Flathead74 »

I have received this WU multiple times, and on different machines, in the past few days.

The result is always the same:

[20:09:16] Project: 2605 (Run 0, Clone 540, Gen 46)
[20:09:16] 
[20:09:16] Assembly optimizations on if available.
[20:09:16] Entering M.D.
[20:09:22] Rejecting checkpoint
[20:09:23] OPC
[20:09:23] Writing local files
[20:09:23] Extra SSE boost OK.
[20:09:23] 
[20:09:23] Extra SSE boost OK.
[20:09:23] Writing local files
[20:09:23] Completed 0 out of 500000 steps  (0 percent)
[20:09:28] CoreStatus = 0 (0)
[20:09:28] Client-core communications error: ERROR 0x0
[20:09:28] Deleting current work unit & continuing...
As this WU has been tried by many, and with a great variety of hardware,
perhaps the time is right to remove this time waster from circulation.
rbrandman
Pande Group Member
Posts: 22
Joined: Wed May 14, 2008 4:11 pm

Re: 2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

Post by rbrandman »

Thanks, I'll let Peter know that there are still issues with the WU.

Relly
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: 2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

Post by kasson »

WU stopped--shouldn't be assigned any further.
Flathead74
Posts: 266
Joined: Sun Dec 02, 2007 6:08 pm
Location: Central New York

Re: 2605 (Run 0, Clone 540, Gen 46) CoreStatus = 66 (102)

Post by Flathead74 »

kasson wrote: WU stopped--shouldn't be assigned any further.
Thank you, ever so much.