
Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Sat Jun 06, 2009 5:10 pm
by tear
Water molecule cannot be settled, 100% reproducible; see the two terminal output snippets below.

Snippet A (side note: the client hung and did not continue until ^C):

Code:

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

8 cores detected


--- Opening Log file [June 6 11:58:42 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /fah/clients/fah
Executable: ./fah6
Arguments: -oneunit -verbosity 9 -forceasm -smp

[11:58:42] - Ask before connecting: No
[11:58:42] - User name: tear (Team 100259)
[11:58:42] - User ID: 1FD229A605CD6A27
[11:58:42] - Machine ID: 3
[11:58:42]
[11:58:42] Loaded queue successfully.
[11:58:42] - Preparing to get new work unit...
[11:58:42] + Attempting to get work packet
[11:58:42] - Will indicate memory of 2013 MB
[11:58:42] - Connecting to assignment server
[11:58:42] Connecting to http://assign.stanford.edu:8080/
[11:58:42] - Autosending finished units... [11:58:42]
[11:58:42] Trying to send all finished work units
[11:58:42] + No unsent completed units remaining.
[11:58:42] - Autosend completed
[11:58:44] Posted data.
[11:58:44] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[11:58:44] + News From Folding@Home: Welcome to Folding@Home
[11:58:44] Loaded queue successfully.
[11:58:44] Connecting to http://171.67.108.24:8080/
[11:58:50] Posted data.
[11:58:50] Initial: 0000; - Receiving payload (expected size: 4842125)
[11:59:05] - Downloaded at ~315 kB/s
[11:59:05] - Averaged speed for that direction ~1264 kB/s
[11:59:05] + Received work.
[11:59:05] + Closed connections
[11:59:05]
[11:59:05] + Processing work unit
[11:59:05] At least 4 processors must be requested.Core required: FahCore_a2.exe
[11:59:05] Core found.
[11:59:05] Working on queue slot 05 [June 6 11:59:05 UTC]
[11:59:05] + Working ...
[11:59:05] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 05 -nocpulock -checkpoint 15 -forceasm -verbose -lifeline 27519 -version 624'

Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
[11:59:05]
[11:59:05] *------------------------------*
[11:59:05] Folding@Home Gromacs SMP Core
[11:59:05] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[11:59:05]
[11:59:05] Preparing to commence simulation
[11:59:05] - Ensuring status. Please wait.
[11:59:14] - Assembly optimizations manually forced on.
[11:59:14] - Not checking prior termination.
[11:59:15] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[11:59:15] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[11:59:16] - Digital signature verified
[11:59:16]
[11:59:16] Project: 2671 (Run 3, Clone 82, Gen 42)
[11:59:16]
[11:59:16] Assembly optimizations on if available.
[11:59:16] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=octopus
NNODES=4, MYRANK=1, HOSTNAME=octopus
NNODES=4, MYRANK=2, HOSTNAME=octopus
NNODES=4, MYRANK=3, HOSTNAME=octopus
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_05.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
10750000 steps,  21500.0 ps (continuing from step 10500000,  21000.0 ps).
[11:59:25] Completed 0 out of 250000 steps  (0%)

t = 21000.005 ps: Water molecule starting at atom 95476 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 21000.007 ps: Water molecule starting at atom 46285 can not be settled.
Check for bad contacts and/or reduce the timestep.

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483503. It should have been within [ 0 .. 2312 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel pro
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483519. It should have been within [ 0 .. 1800 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483527. It should have been within [ 0 .. 1568 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
gram mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
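
For context on the nsgrid.c abort above: during neighbor searching the core maps each atom's coordinates onto a grid cell, and a coordinate that has blown up to NaN or +/-Infinity produces a garbage cell index, which is what the huge negative "ci" values in the error text are. Below is a minimal sketch of that kind of range check, illustrative only (not GROMACS's actual code; the box length is made up, and 2312 is simply the bound reported in the log above):

Code:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Map a coordinate x onto one of ncells neighbor-search grid cells and
 * abort if the index is out of range -- the same kind of check that
 * fires in nsgrid.c in the snippets above. */
static int coord_to_cell(float x, float box, int ncells)
{
    int ci = (int)floorf(x * (float)ncells / box);

    if (ci < 0 || ci >= ncells) {
        fprintf(stderr,
                "Range checking error:\n"
                "Variable ci has value %d. It should have been within [ 0 .. %d ]\n",
                ci, ncells - 1);
        exit(EXIT_FAILURE);
    }
    return ci;
}

int main(void)
{
    const float box    = 10.0f; /* hypothetical box length (nm) */
    const int   ncells = 2312;  /* cell count taken from the log above */

    printf("ci = %d\n", coord_to_cell(1.5f, box, ncells)); /* normal coordinate */

    /* A NaN (or +/-Inf) coordinate converts to a garbage integer --
     * typically INT_MIN on x86 -- which is why the log reports huge
     * negative ci values such as -2147483503. */
    printf("ci = %d\n", coord_to_cell(NAN, box, ncells));

    return 0;
}
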
Snippet B (the client did not hang):

Code:

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

8 cores detected


--- Opening Log file [June 6 16:27:21 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /fah/clients/fah
Executable: ./fah6
Arguments: -oneunit -verbosity 9 -forceasm -smp

[16:27:21] - Ask before connecting: No
[16:27:21] - User name: tear (Team 100259)
[16:27:21] - User ID: 1FD229A605CD6A27
[16:27:21] - Machine ID: 3
[16:27:21]
[16:27:22] Loaded queue successfully.
[16:27:22]
[16:27:22] + Processing work unit
[16:27:22] At least 4 processors must be requested.Core required: FahCore_a2.exe
[16:27:22] Core found.
[16:27:22] - Autosending finished units... [June 6 16:27:22 UTC]
[16:27:22] Working on queue slot 05 [June 6 16:27:22 UTC]
[16:27:22] Trying to send all finished work units
[16:27:22] + Working ...
[16:27:22] + No unsent completed units remaining.
[16:27:22] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 05 -nocpulock -checkpoint 15 -forceasm -verbose -lifeline 5785 -version 624'

[16:27:22] - Autosend completed
Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
[16:27:22]
[16:27:22] *------------------------------*
[16:27:22] Folding@Home Gromacs SMP Core
[16:27:22] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[16:27:22]
[16:27:22] Preparing to commence simulation
[16:27:22] - Ensuring status. Please wait.
[16:27:31] - Assembly optimizations manually forced on.
[16:27:31] - Not checking prior termination.
[16:27:32] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[16:27:32] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[16:27:33] - Digital signature verified
[16:27:33]
[16:27:33] Project: 2671 (Run 3, Clone 82, Gen 42)
[16:27:33]
[16:27:33] Assembly optimizations on if available.
[16:27:33] Entering M.D.
NNODES=4, MYRANK=1, HOSTNAME=octopus
NNODES=4, MYRANK=0, HOSTNAME=octopus
NNODES=4, MYRANK=2, HOSTNAME=octopus
NNODES=4, MYRANK=3, HOSTNAME=octopus
NODEID=2 argc=20
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=3 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_05.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
10750000 steps,  21500.0 ps (continuing from step 10500000,  21000.0 ps).
[16:27:42] Completed 0 out of 250000 steps  (0%)

t = 21000.005 ps: Water molecule starting at atom 95476 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 21000.007 ps: Water molecule starting at atom 46285 can not be settled.
Check for bad contacts and/or reduce the timestep.

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483503. It should have been within [ 0 .. 2312 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel pro
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483519. It should have been within [ 0 .. 1800 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
gram mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483527. It should have been within [ 0 .. 1568 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 255
[0]3:Return code = 255
[16:27:48] CoreStatus = FF (255)
[16:27:48] Sending work to server
[16:27:48] Project: 2671 (Run 3, Clone 82, Gen 42)
[16:27:48] - Error: Could not get length of results file work/wuresults_05.dat
[16:27:48] - Error: Could not read unit 05 file. Removing from queue.
[16:27:48] Trying to send all finished work units
[16:27:48] + No unsent completed units remaining.
[16:27:48] + -oneunit flag given and have now finished a unit. Exiting.- Preparing to get new work unit...
[16:27:48] + Attempting to get work packet
[16:27:48] - Will indicate memory of 2013 MB
[16:27:48] - Connecting to assignment server
[16:27:48] ***** Got a SIGTERM signal (15)
[16:27:48] Connecting to http://assign.stanford.edu:8080/
[16:27:48] Killing all core threads

Folding@Home Client Shutdown.

tear

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Mon Jun 08, 2009 3:04 am
by susato
Tear, obviously this one isn't going to succeed on your equipment. It certainly looks like something is wrong with the WU itself. Feel free to trash your work folder and queue and try to get another WU.

We'll wait a few days to see if anyone else reports trouble with this one. Thank you for posting.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Mon Jun 08, 2009 5:33 am
by tear
No problem.

I reproduced the problem to make sure the fault is not on my end.

tear

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Mon Jun 08, 2009 6:27 am
by bruce
tear wrote:No problem.

I reproduced the problem to make sure fault is not mine.

tear
Reproduced the problem on a different computer?

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Mon Jun 08, 2009 12:43 pm
by tear
"Another computer" bit is not relevant because problem occurs at the exactly same simulation step every single time.

tear

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Wed Jun 10, 2009 2:21 pm
by tear
Got it assigned to another machine. Same thing.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Wed Jun 10, 2009 10:55 pm
by klasseng
Got assigned to one of my machines . . . same thing.

Bad WU!?

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Wed Jun 10, 2009 11:00 pm
by bruce
Reported as a bad WU.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Wed Jun 10, 2009 11:16 pm
by Foxbat
Bruce, I guess I was too fast... I picked it up on my 2.66 GHz Mac Pro Quad:

Code:

[22:07:46] Completed 242500 out of 250000 steps  (97%)
[22:15:03] Completed 245000 out of 250000 steps  (98%)
[22:22:20] Completed 247500 out of 250000 steps  (99%)
[22:29:38] Completed 250000 out of 250000 steps  (100%)
[22:29:39] DynamicWrapper: Finished Work Unit: sleep=10000
[22:29:49] 
[22:29:49] Finished Work Unit:
[22:29:49] - Reading up to 21220128 from "work/wudata_05.trr": Read 21220128
[22:29:50] trr file hash check passed.
[22:29:50] - Reading up to 4399256 from "work/wudata_05.xtc": Read 4399256
[22:29:50] xtc file hash check passed.
[22:29:50] edr file hash check passed.
[22:29:50] logfile size: 181807
[22:29:50] Leaving Run
[22:29:54] - Writing 25946287 bytes of core data to disk...
[22:29:54]   ... Done.
[22:29:58] - Shutting down core
[22:29:58] 
[22:29:58] Folding@home Core Shutdown: FINISHED_UNIT
[22:33:13] CoreStatus = 64 (100)
[22:33:13] Unit 5 finished with 83 percent of time to deadline remaining.
[22:33:13] Updated performance fraction: 0.829133
[22:33:13] Sending work to server
[22:33:13] Project: 2676 (Run 3, Clone 179, Gen 140)


[22:33:13] + Attempting to send results [June 10 22:33:13 UTC]
[22:33:13] - Reading file work/wuresults_05.dat from core
[22:33:13]   (Read 25946287 bytes from disk)
[22:33:13] Connecting to http://171.67.108.24:8080/
[22:38:47] Posted data.
[22:38:47] Initial: 0000; - Uploaded at ~74 kB/s
[22:38:51] - Averaged speed for that direction ~73 kB/s
[22:38:51] + Results successfully sent
[22:38:51] Thank you for your contribution to Folding@Home.
[22:38:51] + Number of Units Completed: 998

[22:38:52] - Warning: Could not delete all work unit files (5): Core file absent
[22:38:52] Trying to send all finished work units
[22:38:52] + No unsent completed units remaining.
[22:38:52] - Preparing to get new work unit...
[22:38:52] Cleaning up work directory
[22:38:53] + Attempting to get work packet
[22:38:53] - Will indicate memory of 4096 MB
[22:38:53] - Connecting to assignment server
[22:38:53] Connecting to http://assign.stanford.edu:8080/
[22:38:53] Posted data.
[22:38:53] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[22:38:53] + News From Folding@Home: Welcome to Folding@Home
[22:38:53] Loaded queue successfully.
[22:38:53] Connecting to http://171.67.108.24:8080/
[22:38:59] Posted data.
[22:38:59] Initial: 0000; - Receiving payload (expected size: 4842125)
[22:39:08] - Downloaded at ~525 kB/s
[22:39:08] - Averaged speed for that direction ~424 kB/s
[22:39:08] + Received work.
[22:39:08] Trying to send all finished work units
[22:39:08] + No unsent completed units remaining.
[22:39:08] + Closed connections
[22:39:08] 
[22:39:08] + Processing work unit
[22:39:08] At least 4 processors must be requested; read 1.
[22:39:08] Core required: FahCore_a2.exe
[22:39:08] Core found.
[22:39:08] - Using generic ./mpiexec
[22:39:08] Working on queue slot 06 [June 10 22:39:08 UTC]
[22:39:08] + Working ...
[22:39:08] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -priority 96 -checkpoint 8 -forceasm -verbose -lifeline 424 -version 624'

[22:39:09] 
[22:39:09] *------------------------------*
[22:39:09] Folding@Home Gromacs SMP Core
[22:39:09] Version 2.07 (Sun Apr 19 14:29:51 PDT 2009)
[22:39:09] 
[22:39:09] Preparing to commence simulation
[22:39:09] - Ensuring status. Please wait.
[22:39:18] - Assembly optimizations manually forced on.
[22:39:18] - Not checking prior termination.
[22:39:19] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[22:39:19] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[22:39:20] - Digital signature verified
[22:39:20] 
[22:39:20] Project: 2671 (Run 3, Clone 82, Gen 42)
[22:39:20] 
[22:39:20] Assembly optimizations on if available.
[22:39:20] Entering M.D.
[22:39:30] Completed 0 out of 250000 steps  (0%)
[22:39:32] 
[22:39:32] Folding@home Core Shutdown: INTERRUPTED
[22:39:36] CoreStatus = 66 (102)
[22:39:36] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[22:39:36] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [June 10 22:40:06 UTC] 


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.24R1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/Foxbat/Library/FAH-SMP-Term1
Executable: /Users/Foxbat/Library/FAH-SMP-Term1/fah6
Arguments: -local -advmethods -forceasm -verbosity 9 -smp 

[22:40:06] - Ask before connecting: No
[22:40:06] - User name: Foxbat (Team 55236)
[22:40:06] - User ID: 3DA6459B38FDAE1E
[22:40:06] - Machine ID: 1
[22:40:06] 
[22:40:06] Loaded queue successfully.
[22:40:06] - Autosending finished units... [June 10 22:40:06 UTC]
[22:40:06] 
[22:40:06] Trying to send all finished work units
[22:40:06] + Processing work unit
[22:40:06] + No unsent completed units remaining.
[22:40:06] At least 4 processors must be requested; read 1.
[22:40:06] - Autosend completed
[22:40:06] Core required: FahCore_a2.exe
[22:40:06] Core found.
[22:40:06] - Using generic ./mpiexec
[22:40:06] Working on queue slot 06 [June 10 22:40:06 UTC]
[22:40:06] + Working ...
[22:40:06] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -priority 96 -checkpoint 8 -forceasm -verbose -lifeline 1553 -version 624'

[22:40:06] 
[22:40:06] *------------------------------*
[22:40:06] Folding@Home Gromacs SMP Core
[22:40:06] Version 2.07 (Sun Apr 19 14:29:51 PDT 2009)
[22:40:06] 
[22:40:06] Preparing to commence simulation
[22:40:06] - Ensuring status. Please wait.
[22:40:06] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[22:40:07] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[22:40:08] - Digital signature verified
[22:40:08] 
[22:40:08] Project: 2671 (Run 3, Clone 82, Gen 42)
[22:40:08] 
[22:40:08] Assembly optimizations on if available.
[22:40:08] Entering M.D.
[22:40:19]  on if available.
[22:40:19] Entering M.D.
[22:40:28]  (0%)
[22:40:30] 
[22:40:30] Folding@home Core Shutdown: INTERRUPTED
[22:40:34] CoreStatus = 66 (102)
[22:40:34] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[22:40:34] Killing all core threads

Folding@Home Client Shutdown.
It continues to fail to start. I'm going to clean out the Folding directory, apply Apple updates, and try again.

Update: Yep, that fixed it. Thanks, Bruce!

Project: 2671 (Run 3, Clone 82, Gen 42) seg fault

Posted: Thu Jun 11, 2009 3:55 am
by alpha754293
Seg fault on a different computer: dual AMD Opteron 2220 on a Tyan S2915WA2NRF.

log:

Code:

[01:12:34] 
[01:12:34] *------------------------------*
[01:12:34] Folding@Home Gromacs SMP Core
[01:12:34] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[01:12:34] 
[01:12:34] Preparing to commence simulation
[01:12:34] - Ensuring status. Please wait.
[01:12:35] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[01:12:35] - Digital signature verified
[01:12:35] 
[01:12:35] Project: 2671 (Run 3, Clone 82, Gen 42)
[01:12:35] 
[01:12:35] Assembly optimizations on if available.
[01:12:35] Entering M.D.
[01:12:42] Multi-core optimizations on
[01:12:45] ntering M.D.
NNODES=4, MYRANK=0, HOSTNAME=opteron3
NNODES=4, MYRANK=2, HOSTNAME=opteron3
NNODES=4, MYRANK=3, HOSTNAME=opteron3
NNODES=4, MYRANK=1, HOSTNAME=opteron3
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_06.tpr, VERSION 3.3.99_development_20070618 (single precision)
[01:12:52] Multi-core optimizations on
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
10750000 steps,  21500.0 ps (continuing from step 10500000,  21000.0 ps).
[01:12:54] Completed 0 out of 250000 steps  (0%)

t = 21000.005 ps: Water molecule starting at atom 95476 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 21000.007 ps: Water molecule starting at atom 46285 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 21000.009 ps: Water molecule starting at atom 95476 can not be settled.
Check for bad contacts and/or reduce the timestep.
[01:12:55] 
[01:12:55] Folding@home Core Shutdown: INTERRUPTED
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[cli_1]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[cli_2]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[0]0:Return code = 102
[0]1:Return code = 1
[0]2:Return code = 1
[0]3:Return code = 0, signaled with Segmentation fault
[01:12:59] CoreStatus = 66 (102)
[01:12:59] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[01:12:59] Killing all core threads

Folding@Home Client Shutdown.
restarting client...

Re: Project: 2671 (Run 3, Clone 82, Gen 42) seg fault

Posted: Thu Jun 11, 2009 4:12 am
by alpha754293
No matter how many times I restart the client, I get the same error.

I deleted queue.dat to move the client along.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Fri Jun 12, 2009 4:08 am
by Foxbat
bruce wrote:Reported as a bad WU.
:( It's still in the system. I got it again today and it idled my Mac for most of the day.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Fri Jun 12, 2009 4:27 am
by bruce
Foxbat wrote:
bruce wrote:Reported as a bad WU.
:( It's still in the system. I got it again today and it idled my Mac for most of the day.
All I can do is report it. Somebody from the Pande Group has to remove it from circulation.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Fri Jun 12, 2009 7:06 pm
by bollix47
FYI

Code:

[18:52:09] + Processing work unit
[18:52:09] At least 4 processors must be requested.Core required: FahCore_a2.exe
[18:52:09] Core found.
[18:52:09] Working on queue slot 00 [June 12 18:52:09 UTC]
[18:52:09] + Working ...
[18:52:09] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -checkpoint 30 -verbose -lifeline 22373 -version 624'

[18:52:09] 
[18:52:09] *------------------------------*
[18:52:09] Folding@Home Gromacs SMP Core
[18:52:09] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[18:52:09] 
[18:52:09] Preparing to commence simulation
[18:52:09] - Ensuring status. Please wait.
[18:52:10] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[18:52:10] - Digital signature verified
[18:52:10] 
[18:52:10] Project: 2671 (Run 3, Clone 82, Gen 42)
[18:52:10] 
[18:52:10] Assembly optimizations on if available.
[18:52:10] Entering M.D.
[18:52:20] (Run 3, Clone 82, Gen 42)
[18:52:20] 
[18:52:20] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=challenger
NNODES=4, MYRANK=1, HOSTNAME=challenger
NNODES=4, MYRANK=2, HOSTNAME=challenger
NNODES=4, MYRANK=3, HOSTNAME=challenger
NODEID=0 argc=20
NODEID=1 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_00.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=2 argc=20
NODEID=3 argc=20
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
10750000 steps,  21500.0 ps (continuing from step 10500000,  21000.0 ps).

t = 21000.005 ps: Water molecule starting at atom 95476 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 21000.007 ps: Water molecule starting at atom 46285 can not be settled.
Check for bad contacts and/or reduce the timestep.

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483519. It should have been within [ 0 .. 1800 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483503. It should have been within [ 0 .. 2312 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483527. It should have been within [ 0 .. 1568 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 255
[0]3:Return code = 255
[18:52:35] CoreStatus = FF (255)
[18:52:35] Sending work to server
[18:52:35] Project: 2671 (Run 3, Clone 82, Gen 42)
[18:52:35] - Error: Could not get length of results file work/wuresults_00.dat
[18:52:35] - Error: Could not read unit 00 file. Removing from queue.
[18:52:35] Trying to send all finished work units
[18:52:35] + No unsent completed units remaining.
The client did not abort or hang; it carried on with a different WU and appears to be running fine with the new one.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Fri Jun 12, 2009 8:05 pm
by kasson
We re-generated the work unit. Hopefully it will run successfully now. Thanks for the error reports.