
Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Sep 02, 2009 5:02 pm
by slugbug
I've been getting a few of these on my notfred's VMware clients too.
Project: 2677 (Run 18, Clone 83, Gen 37)

Code:

[08:09:43] - Preparing to get new work unit...
[08:09:43] + Attempting to get work packet
[08:09:43] - Connecting to assignment server
[08:09:43] - Successful: assigned to (171.64.65.56).
[08:09:43] + News From Folding@Home: Welcome to Folding@Home
[08:09:43] Loaded queue successfully.
[08:09:57] + Closed connections
[08:09:57] 
[08:09:57] + Processing work unit
[08:09:57] At least 4 processors must be requested.
[08:09:57] Core required: FahCore_a2.exe
[08:09:57] Core found.
[08:09:57] Working on Unit 08 [September 2 08:09:57]
[08:09:57] + Working ...
[08:09:57] 
[08:09:57] *------------------------------*
[08:09:57] Folding@Home Gromacs SMP Core
[08:09:57] Version 2.08 (Mon May 18 14:47:42 PDT 2009)
[08:09:57] 
[08:09:57] Preparing to commence simulation
[08:09:57] - Ensuring status. Please wait.
[08:10:06] - Assembly optimizations manually forced on.
[08:10:06] - Not checking prior termination.
[08:10:07] - Expanded 1493702 -> 24048413 (decompressed 1609.9 percent)
[08:10:08] Called DecompressByteArray: compressed_data_size=1493702 data_size=24048413, decompressed_data_size=24048413 diff=0
[08:10:08] - Digital signature verified
[08:10:08] 
[08:10:08] Project: 2677 (Run 18, Clone 83, Gen 37)
[08:10:08] 
[08:10:08] Assembly optimizations on if available.
[08:10:08] Entering M.D.
[08:10:15] Multi-core optimizations on
[08:10:31] Completed 0 out of 250000 steps  (0%)
[09:19:14] Completed 2500 out of 250000 steps  (1%)
[10:28:24] Completed 5000 out of 250000 steps  (2%)
[11:37:07] Completed 7500 out of 250000 steps  (3%)
[12:45:52] Completed 10000 out of 250000 steps  (4%)
Add another one: Project: 2677 (Run 11, Clone 88, Gen 31)
Redid my notfred's client and got another bugged WU right away. It's been sitting at 0% for nearly 30 minutes, when it should be taking about 10 minutes per frame.
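
If you want a quick way to spot one of these before it eats a whole day, the frame times can be pulled straight out of FAHlog.txt. Here's a minimal sketch (not part of any client; the log filename and the "Completed N out of M steps" line format are assumed from the excerpts in this thread) that prints the minutes between consecutive frame lines:

Code:

# Minimal frame-time check (a sketch; FAHlog.txt path and format assumed from this thread).
import re
from datetime import datetime, timedelta

FRAME_RE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of (\d+) steps")

def print_frame_times(log_path="FAHlog.txt"):
    """Print minutes elapsed between consecutive 'Completed N out of M steps' lines."""
    prev = None
    for line in open(log_path, errors="replace"):
        m = FRAME_RE.search(line)
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%H:%M:%S")
        done = int(m.group(2))
        if prev is not None:
            prev_t, prev_done = prev
            dt = t - prev_t
            if dt < timedelta(0):      # the log's timestamps wrapped past midnight
                dt += timedelta(days=1)
            print(f"{done - prev_done} steps in {dt.total_seconds() / 60:.1f} min")
        prev = (t, done)

if __name__ == "__main__":
    print_frame_times()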

Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Sep 02, 2009 6:07 pm
by road-runner
Project: 2671 (Run 52, Clone 43, Gen 82)

Code:

[10:26:16] Project: 2671 (Run 52, Clone 43, Gen 82)
[10:26:16] 
[10:26:16] Assembly optimizations on if available.
[10:26:16] Entering M.D.
[10:26:25] Run 52, Clone 43, Gen 82)
[10:26:25] 
[10:26:25] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=Apollo-Quad-Office
NNODES=4, MYRANK=1, HOSTNAME=Apollo-Quad-Office
NNODES=4, MYRANK=3, HOSTNAME=Apollo-Quad-Office
NNODES=4, MYRANK=2, HOSTNAME=Apollo-Quad-Office
NODEID=0 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22887 system in water'
20750000 steps,  41500.0 ps (continuing from step 20500000,  41000.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very hi[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 255
[10:26:38] CoreStatus = FF (255)
[10:26:38] Sending work to server
[10:26:38] Project: 2671 (Run 52, Clone 43, Gen 82)
[10:26:38] - Error: Could not get length of results file work/wuresults_02.dat
[10:26:38] - Error: Could not read unit 02 file. Removing from queue.
[10:26:38] - Preparing to get new work unit...
[10:26:38] Cleaning up work directory
[10:26:38] + Attempting to get work packet
[10:26:38] - Connecting to assignment server
[10:26:40] - Successful: assigned to (171.67.108.24).
[10:26:40] + News From Folding@Home: Welcome to Folding@Home
[10:26:40] Loaded queue successfully.
[10:26:50] + Closed connections
[10:26:55] 
[10:26:55] + Processing work unit
[10:26:55] At least 4 processors must be requested; read 1.
[10:26:55] Core required: FahCore_a2.exe
[10:26:55] Core found.
[10:26:55] Working on queue slot 03 [September 2 10:26:55 UTC]
[10:26:55] + Working ...
[10:26:55] 
[10:26:55] *------------------------------*
[10:26:55] Folding@Home Gromacs SMP Core
[10:26:55] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[10:26:55] 
[10:26:55] Preparing to commence simulation
[10:26:55] - Ensuring status. Please wait.
[10:27:05] - Looking at optimizations...
[10:27:05] - Working with standard loops on this execution.
[10:27:05] - Files status OK
[10:27:05] - Expanded 1506342 -> 24008597 (decompressed 1593.8 percent)
[10:27:05] Called DecompressByteArray: compressed_data_size=1506342 data_size=24008597, decompressed_data_size=24008597 diff=0
[10:27:06] - Digital signature verified
[10:27:06] 
[10:27:06] Project: 2671 (Run 52, Clone 43, Gen 82)
[10:27:06] 
[10:27:06] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=Apollo-Quad-Office
NNODES=4, MYRANK=2, HOSTNAME=Apollo-Quad-Office
NNODES=4, MYRANK=3, HOSTNAME=Apollo-Quad-Office
NODEID=2 argc=20
NODEID=3 argc=20
NNODES=4, MYRANK=1, HOSTNAME=Apollo-Quad-Office
NODEID=0 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_03.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=1 argc=20
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22887 system in water'
20750000 steps,  41500.0 ps (continuing from step 20500000,  41000.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483611. It should have been within [ 0 .. 256 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
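
Side note on those errors: the huge negative "ci" values are what you typically get when a NaN or infinite coordinate is converted to a 32-bit grid index, which is exactly the situation the GROMACS range check quoted above is meant to catch. A small illustration (not GROMACS code; the result of the cast is platform-dependent, shown here for a typical x86 build):

Code:

# Illustration only (not GROMACS code): a NaN/Inf coordinate cast to a 32-bit
# grid index typically collapses to INT_MIN, which the range check then rejects.
import numpy as np

cell = 0.5                                       # arbitrary grid cell size
coords = np.array([1.2, 3.7, np.nan, np.inf], dtype=np.float32)

ci = np.floor(coords / cell).astype(np.int32)    # may emit a RuntimeWarning on NaN/Inf
print(ci)                                        # e.g. [ 2  7  -2147483648  -2147483648 ] on x86
print(np.iinfo(np.int32).min)                    # -2147483648, the neighbourhood of the "ci" values in the logs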



Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Sep 02, 2009 6:12 pm
by road-runner
And another... Project: 2671 (Run 37, Clone 79, Gen 78)

Code:

[04:15:30] + Attempting to send results [September 2 04:15:30 UTC]
[04:37:32] + Results successfully sent
[04:37:32] Thank you for your contribution to Folding@Home.
[04:37:32] + Number of Units Completed: 206

[04:37:35] - Preparing to get new work unit...
[04:37:35] Cleaning up work directory
[04:37:35] + Attempting to get work packet
[04:37:35] - Connecting to assignment server
[04:37:37] - Successful: assigned to (171.67.108.24).
[04:37:37] + News From Folding@Home: Welcome to Folding@Home
[04:37:37] Loaded queue successfully.
[04:37:50] + Closed connections
[04:37:50] 
[04:37:50] + Processing work unit
[04:37:50] At least 4 processors must be requested; read 1.
[04:37:50] Core required: FahCore_a2.exe
[04:37:50] Core found.
[04:37:50] Working on queue slot 08 [September 2 04:37:50 UTC]
[04:37:50] + Working ...
[04:37:50] 
[04:37:50] *------------------------------*
[04:37:50] Folding@Home Gromacs SMP Core
[04:37:50] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[04:37:50] 
[04:37:50] Preparing to commence simulation
[04:37:50] - Ensuring status. Please wait.
[04:37:51] Called DecompressByteArray: compressed_data_size=1513330 data_size=24038109, decompressed_data_size=24038109 diff=0
[04:37:51] - Digital signature verified
[04:37:51] 
[04:37:51] Project: 2671 (Run 37, Clone 79, Gen 78)
[04:37:51] 
[04:37:51] Assembly optimizations on if available.
[04:37:51] Entering M.D.
[04:38:00] Run 37, Clone 79, Gen 78)
[04:38:00] 
[04:38:01] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=Office-Quad1
NNODES=4, MYRANK=1, HOSTNAME=Office-Quad1
NNODES=4, MYRANK=2, HOSTNAME=Office-Quad1
NODEID=0 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

NODEID=1 argc=20
Reading file work/wudata_08.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=4, MYRANK=3, HOSTNAME=Office-Quad1
NODEID=2 argc=20
NODEID=3 argc=20
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22908 system in water'
19750002 steps,  39500.0 ps (continuing from step 19500002,  39000.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483611. It should have been within [ 0 .. 256 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3



Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Sep 02, 2009 6:20 pm
by slugbug
Whew, finally got a good one :)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Sep 02, 2009 7:45 pm
by 58Enfield
Two More:

Project: 2671 (R51, C50, G88)

size 1,498,239

Project: 2677 (R3, C20, G41)

size 1,503,814

Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Sep 02, 2009 9:04 pm
by uncle fuzzy
3 more

Project: 2677 (Run 3, Clone 75, Gen 38)
Project: 2677 (Run 25, Clone 35, Gen 32)
Project: 2677 (Run 17, Clone 42, Gen 34)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 12:29 am
by ChasR
Project: 2671 (Run 49, Clone 98, Gen 84)
Project: 2677 (Run 17, Clone 42, Gen 34)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 1:13 am
by AgrFan
Project: 2677 (Run 33, Clone 47, Gen 41)

Update: I just rebooted my notfred box to clear the bad WU and received the new core (2.10) in the process. The offending WU appears to be running OK now. It hung at 0% before and now it's at 2%. It looks like the frame times have increased ~25% though. I still may have the "1 core usage" problem with this WU. Let's see what happens tonight ...

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 1:25 am
by HayesK
Caught three more 1-core A2 WUs today:
p2677 (R1-C89-G43)
p2677 (R3-C79-G33)
p2677 (R39-C20-G41)

Good news. Received the new core 2.10 after deleting the old core. The first rig received a good WU. The second rig initially got a bad WU, but errored immediately, then downloaded the same WU and errored twice more before getting a good WU on the fourth attempt. On the third rig, I only deleted the old core and received the new core, which errored on the current bad WU, then downloaded the same WU and errored four more times before getting a good WU on the fifth attempt.

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 1:42 am
by bruce
HayesK wrote: Good news. Received the new core 2.10 after deleting the old core. The first rig received a good WU. The second rig initially got a bad WU, but errored immediately, then downloaded the same WU and errored twice more before getting a good WU on the fourth attempt. On the third rig, I only deleted the old core and received the new core, which errored on the current bad WU, then downloaded the same WU and errored four more times before getting a good WU on the fifth attempt.
Yes, this is what is expected to happen. For those of you who delete an old core, it will be replaced with 2.10, which is a big step toward solving this problem. I think you'll agree that downloading a bad WU several times is a lot better than processing that same WU at ~30% speed and perhaps missing the deadline. I expect a formal announcement soon -- perhaps with more information.

Not everyone will be reading this discussion, and for them, the Pande Group will probably force the download of 2.10 soon unless some other unpleasantries show up.
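
For anyone who hasn't done the swap yet: the manual step people are describing is just removing FahCore_a2.exe from the client's folder and restarting, after which the client downloads the current core (2.10). A rough sketch is below -- the install path is an assumption and differs per setup (notfred's VM, Windows/Linux console clients), and the client should be stopped first:

Code:

# Sketch of the manual core swap (stop the client first; the path is an assumption).
import os

CLIENT_DIR = "/path/to/fah"                 # adjust for your own install
core = os.path.join(CLIENT_DIR, "FahCore_a2.exe")

if os.path.exists(core):
    os.remove(core)                         # client downloads the current core on restart
    print("Removed old FahCore_a2.exe; restart the client to fetch core 2.10.")
else:
    print("No FahCore_a2.exe found in", CLIENT_DIR)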

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 2:15 am
by toTOW
Announced on FAH-Addict 8-)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 3:04 am
by uncle fuzzy
Picked up a 2677 30/64/26. Running VMware and notfred's. Deleted the folding folder, installed a fresh copy, and it downloaded a 2.10 core. Back up and folding. I'll kill the rest when I can and get the new core on all of them.

edit- It got Project: 2677 (Run 2, Clone 50, Gen 41) and tried to run it 3 times (CoreStatus = FF (255)), downloaded the core again and got another WU. Once I remembered to check the priority and affinity, the new one is running fine.

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 3:21 am
by DanGe
Seems like I was lucky. I picked up my first bad WU, 2671 (Run 37, Clone 79, Gen 78), and happened to receive the new 2.10 core. The core immediately errored with status FF a couple of times before getting a good WU. The strange thing I've noticed is that the new core has bumped my CPU usage higher.

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 3:38 am
by road-runner
I deleted mine and got the 2.10, hope things go better...

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 03, 2009 4:12 am
by HayesK
New core 2.10
It appears the new core 2.10 is not compatible with the work files written by the old core 2.08. I deleted the old core on a partially completed WU and lost the work in progress when the new core errored; the same WU was downloaded again and started over at 0%. Fortunately, I chose a WU that was only at 2% to try the core change on.
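
If anyone wants to try the core swap on a WU that is further along, it may be worth copying the work files aside first, so the progress isn't necessarily gone if you put the old core back. A minimal sketch, assuming a stopped client and the work/ directory and queue.dat seen in the logs above (the install path is an assumption):

Code:

# Sketch: copy the work files aside before swapping cores (stop the client first).
import os, shutil, time

CLIENT_DIR = "/path/to/fah"                 # assumed install directory
backup = os.path.join(CLIENT_DIR, "backup-" + time.strftime("%Y%m%d-%H%M%S"))

os.makedirs(backup)
shutil.copytree(os.path.join(CLIENT_DIR, "work"), os.path.join(backup, "work"))
queue = os.path.join(CLIENT_DIR, "queue.dat")
if os.path.exists(queue):
    shutil.copy2(queue, backup)             # per-client queue state
print("Backed up work/ and queue.dat to", backup)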