failed to pick up a WU

alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

failed to pick up a WU

Post by alpha754293 »

computenode is running a single client instance with "-smp 8".

Trying to pick up a WU.

315 instances of errors of the following nature:

Code: Select all

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 147054 atoms, while the current system consists of 147009 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
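
A count like the 315 above can be pulled straight out of the client log with grep; a minimal sketch, assuming the default FAHlog.txt name in the client directory:

Code: Select all

# Count every fatal error recorded in the log
grep -c "Fatal error:" FAHlog.txt

# Or count only the checkpoint atom-count mismatches
grep -c "Checkpoint file is for a system of" FAHlog.txt
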
This has happened on various WUs since June 1, 14:26:20 UTC.

So far I've cleared queue.dat twice and removed the entire work/ directory once.
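
For anyone else hitting this, those clearing steps were along these lines; a minimal sketch, assuming the client is stopped first and the commands are run from the client's install directory:

Code: Select all

# Stop the client before touching these; it rewrites them while running.
rm -f queue.dat    # discard the work-unit queue (the client rebuilds it)
rm -rf work/       # discard all WU data, including stale checkpoints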

Haven't been able to work on a WU for 28 hours.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: failed to pick up a WU

Post by bruce »

You didn't post enough of FAHlog.txt to figure out what your problem is.

Maybe you restarted FAH before the previous version shut down completely. Maybe it's a permissions problem. Maybe you have more than one copy of /work. Maybe your downloads are being corrupted. Maybe ....

I'm going to guess that the title is wrong. I suspect that you have successfully picked up quite a few WUs, you just can't get them to run. Again, we need to see FAHlog.txt.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: failed to pick up a WU

Post by alpha754293 »

bruce wrote:You didn't post enough of FAHlog.txt to figure out what your problem is.

Maybe you restarted FAH before the previous version shut down completely. Maybe it's a permissions problem. Maybe you have more than one copy of /work. Maybe your downloads are being corrupted. Maybe ....

I'm going to guess that the title is wrong. I suspect that you have successfully picked up quite a few WUs, you just can't get them to run. Again, we need to see FAHlog.txt.
The log itself is 1.5 MB.

There'd be no way for me to post it.

There are no restarts, no permission changes, and only a single copy of work/. The downloads COULD be getting corrupted; I don't know, and I don't remember whether a checksum is verified, but the digital signature is verified and there were no errors on that.

Yes, the WUs do download, but they end up in an error similar to the one that I posted.

log snippet:

Code: Select all

[22:09:59] 
[22:09:59] *------------------------------*
[22:09:59] Folding@Home Gromacs SMP Core
[22:09:59] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[22:09:59] 
[22:09:59] Preparing to commence simulation
[22:09:59] - Ensuring status. Please wait.
[22:10:00] Called DecompressByteArray: compressed_data_size=4839349 data_size=24032753, decompressed_data_size=24032753 diff=0
[22:10:00] - Digital signature verified
[22:10:00] 
[22:10:00] Project: 2671 (Run 7, Clone 61, Gen 40)
[22:10:00] 
[22:10:00] Assembly optimizations on if available.
[22:10:00] Entering M.D.
[22:10:06] Using Gromacs checkpoints
[22:10:09] 
[22:10:10] Entering M.D.
[22:10:16] Using Gromacs checkpoints
NNODES=8, MYRANK=2, HOSTNAME=computenode
NNODES=8, MYRANK=0, HOSTNAME=computenode
NNODES=8, MYRANK=3, HOSTNAME=computenode
NNODES=8, MYRANK=4, HOSTNAME=computenode
NNODES=8, MYRANK=5, HOSTNAME=computenode
NNODES=8, MYRANK=6, HOSTNAME=computenode
NNODES=8, MYRANK=7, HOSTNAME=computenode
NNODES=8, MYRANK=1, HOSTNAME=computenode
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=4 argc=23
NODEID=6 argc=23
NODEID=7 argc=23
NODEID=3 argc=23
NODEID=5 argc=23
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_07.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_07.cpt generated: Mon May 11 21:42:38 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 146928 atoms, while the current system consists of 147081 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[0]4:Return code = 0, signaled with Quit
[0]5:Return code = 0, signaled with Quit
[0]6:Return code = 0, signaled with Quit
[0]7:Return code = 0, signaled with Quit
[22:10:27] CoreStatus = FF (255)
[22:10:27] Sending work to server
[22:10:27] Project: 2671 (Run 7, Clone 61, Gen 40)
[22:10:27] - Error: Could not get length of results file work/wuresults_07.dat
[22:10:27] - Error: Could not read unit 07 file. Removing from queue.
[22:10:27] Trying to send all finished work units
[22:10:27] + No unsent completed units remaining.
[22:10:27] - Preparing to get new work unit...
[22:10:27] + Attempting to get work packet
[22:10:27] - Will indicate memory of 16003 MB
[22:10:27] - Connecting to assignment server
[22:10:27] Connecting to http://assign.stanford.edu:8080/
[22:13:36] - Couldn't send HTTP request to server
[22:13:36] + Could not connect to Assignment Server
[22:13:36] Connecting to http://assign2.stanford.edu:80/
[22:13:42] Posted data.
[22:13:42] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[22:13:42] + News From Folding@Home: Welcome to Folding@Home
[22:13:42] Loaded queue successfully.
[22:13:42] Connecting to http://171.64.65.56:80/
[22:13:50] Posted data.
[22:13:50] Initial: 0000; - Receiving payload (expected size: 4845033)
[22:14:19] - Downloaded at ~163 kB/s
[22:14:19] - Averaged speed for that direction ~284 kB/s
[22:14:19] + Received work.
[22:14:19] Trying to send all finished work units
[22:14:19] + No unsent completed units remaining.
[22:14:19] + Closed connections
[22:14:24] 
[22:14:24] + Processing work unit
[22:14:24] Core required: FahCore_a2.exe
[22:14:24] Core found.
[22:14:24] Working on queue slot 08 [June 1 22:14:24 UTC]
[22:14:24] + Working ...
[22:14:24] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 08 -checkpoint 15 -verbose -lifeline 16939 -version 624'

[22:14:24] 
[22:14:24] *------------------------------*
[22:14:24] Folding@Home Gromacs SMP Core
[22:14:24] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[22:14:24] 
[22:14:24] Preparing to commence simulation
[22:14:24] - Ensuring status. Please wait.
[22:14:34] - Looking at optimizations...
[22:14:34] - Working with standard loops on this execution.
[22:14:34] - Files status OK
[22:14:35] - Expanded 4844521 -> 24003985 (decompressed 495.4 percent)
[22:14:35] Called DecompressByteArray: compressed_data_size=4844521 data_size=24003985, decompressed_data_size=24003985 diff=0
[22:14:35] - Digital signature verified
[22:14:35] 
[22:14:35] Project: 2672 (Run 0, Clone 144, Gen 139)
[22:14:35] 
[22:14:35] Entering M.D.
[22:14:41] Using Gromacs checkpoints
NNODES=8, MYRANK=0, HOSTNAME=computenode
NNODES=8, MYRANK=1, HOSTNAME=computenode
NNODES=8, MYRANK=2, HOSTNAME=computenode
NNODES=8, MYRANK=3, HOSTNAME=computenode
NNODES=8, MYRANK=4, HOSTNAME=computenode
NNODES=8, MYRANK=6, HOSTNAME=computenode
NNODES=8, MYRANK=7, HOSTNAME=computenode
NNODES=8, MYRANK=5, HOSTNAME=computenode
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
NODEID=4 argc=23
NODEID=5 argc=23
NODEID=6 argc=23
NODEID=7 argc=23
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_08.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_08.cpt generated: Fri May  8 00:31:55 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 147168 atoms, while the current system consists of 146859 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[0]4:Return code = 0, signaled with Quit
[0]5:Return code = 0, signaled with Quit
[0]6:Return code = 0, signaled with Quit
[0]7:Return code = 0, signaled with Quit
[22:14:52] CoreStatus = FF (255)
[22:14:52] Sending work to server
[22:14:52] Project: 2672 (Run 0, Clone 144, Gen 139)
[22:14:52] - Error: Could not get length of results file work/wuresults_08.dat
[22:14:52] - Error: Could not read unit 08 file. Removing from queue.
[22:14:52] Trying to send all finished work units
[22:14:52] + No unsent completed units remaining.
[22:14:52] - Preparing to get new work unit...
[22:14:52] + Attempting to get work packet
[22:14:52] - Will indicate memory of 16003 MB
[22:14:52] - Connecting to assignment server
[22:14:52] Connecting to http://assign.stanford.edu:8080/
[22:14:53] Posted data.
[22:14:53] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[22:14:53] + News From Folding@Home: Welcome to Folding@Home
[22:14:53] Loaded queue successfully.
[22:14:53] Connecting to http://171.67.108.24:8080/
[22:14:59] Posted data.
[22:14:59] Initial: 0000; - Receiving payload (expected size: 4839861)
[22:15:25] - Downloaded at ~181 kB/s
[22:15:25] - Averaged speed for that direction ~264 kB/s
[22:15:25] + Received work.
[22:15:25] Trying to send all finished work units
[22:15:25] + No unsent completed units remaining.
[22:15:25] + Closed connections
[22:15:30] 
[22:15:30] + Processing work unit
[22:15:30] Core required: FahCore_a2.exe
[22:15:30] Core found.
[22:15:30] Working on queue slot 09 [June 1 22:15:30 UTC]
[22:15:30] + Working ...
[22:15:30] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 09 -checkpoint 15 -verbose -lifeline 16939 -version 624'

[22:15:30] 
[22:15:30] *------------------------------*
[22:15:30] Folding@Home Gromacs SMP Core
[22:15:30] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[22:15:30] 
[22:15:30] Preparing to commence simulation
[22:15:30] - Ensuring status. Please wait.
[22:15:40] - Looking at optimizations...
[22:15:40] - Working with standard loops on this execution.
[22:15:40] - Files status OK
[22:15:41] - Expanded 4839349 -> 24032753 (decompressed 496.6 percent)
[22:15:41] Called DecompressByteArray: compressed_data_size=4839349 data_size=24032753, decompressed_data_size=24032753 diff=0
[22:15:41] - Digital signature verified
[22:15:41] 
[22:15:41] Project: 2671 (Run 7, Clone 61, Gen 40)
[22:15:41] 
[22:15:41] Entering M.D.
[22:15:47] Using Gromacs checkpoints
NNODES=8, MYRANK=0, HOSTNAME=computenode
NNODES=8, MYRANK=1, HOSTNAME=computenode
NNODES=8, MYRANK=2, HOSTNAME=computenode
NNODES=8, MYRANK=3, HOSTNAME=computenode
NNODES=8, MYRANK=4, HOSTNAME=computenode
NNODES=8, MYRANK=5, HOSTNAME=computenode
NNODES=8, MYRANK=6, HOSTNAME=computenode
NNODES=8, MYRANK=7, HOSTNAME=computenode
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
NODEID=4 argc=23
NODEID=5 argc=23
NODEID=6 argc=23
NODEID=7 argc=23
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_09.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_09.cpt generated: Sat May 16 04:17:44 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 147219 atoms, while the current system consists of 147081 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[0]4:Return code = 0, signaled with Quit
[0]5:Return code = 0, signaled with Quit
[0]6:Return code = 0, signaled with Quit
[0]7:Return code = 0, signaled with Quit
[22:15:59] CoreStatus = FF (255)
[22:15:59] Sending work to server
[22:15:59] Project: 2671 (Run 7, Clone 61, Gen 40)
[22:15:59] - Error: Could not get length of results file work/wuresults_09.dat
[22:15:59] - Error: Could not read unit 09 file. Removing from queue.
[22:15:59] Trying to send all finished work units
[22:15:59] + No unsent completed units remaining.
[22:15:59] - Preparing to get new work unit...
[22:15:59] + Attempting to get work packet
[22:15:59] - Will indicate memory of 16003 MB
[22:15:59] - Connecting to assignment server
[22:15:59] Connecting to http://assign.stanford.edu:8080/
[22:15:59] Posted data.
[22:15:59] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[22:15:59] + News From Folding@Home: Welcome to Folding@Home
[22:15:59] Loaded queue successfully.
[22:15:59] Connecting to http://171.67.108.24:8080/
[22:16:14] Posted data.
[22:16:14] Initial: 0000; - Receiving payload (expected size: 4839861)
[22:16:30] - Downloaded at ~295 kB/s
[22:16:30] - Averaged speed for that direction ~270 kB/s
[22:16:30] + Received work.
[22:16:30] Trying to send all finished work units
[22:16:30] + No unsent completed units remaining.
[22:16:30] + Closed connections
[22:16:35] 
[22:16:35] + Processing work unit
[22:16:35] Core required: FahCore_a2.exe
[22:16:35] Core found.
[22:16:35] Working on queue slot 00 [June 1 22:16:35 UTC]
[22:16:35] + Working ...
[22:16:35] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -checkpoint 15 -verbose -lifeline 16939 -version 624'

[22:16:35] 
[22:16:35] *------------------------------*
[22:16:35] Folding@Home Gromacs SMP Core
[22:16:35] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[22:16:35] 
[22:16:35] Preparing to commence simulation
[22:16:35] - Ensuring status. Please wait.
[22:16:44] - Looking at optimizations...
[22:16:44] - Working with standard loops on this execution.
[22:16:44] - Files status OK
[22:16:45] - Expanded 4839349 -> 24032753 (decompressed 496.6 percent)
[22:16:45] Called DecompressByteArray: compressed_data_size=4839349 data_size=24032753, decompressed_data_size=24032753 diff=0
[22:16:45] - Digital signature verified
[22:16:45] 
[22:16:45] Project: 2671 (Run 7, Clone 61, Gen 40)
[22:16:45] 
[22:16:46] Entering M.D.
[22:16:52] Using Gromacs checkpoints
NNODES=8, MYRANK=0, HOSTNAME=computenode
NNODES=8, MYRANK=1, HOSTNAME=computenode
NNODES=8, MYRANK=2, HOSTNAME=computenode
NNODES=8, MYRANK=3, HOSTNAME=computenode
NNODES=8, MYRANK=5, HOSTNAME=computenode
NNODES=8, MYRANK=6, HOSTNAME=computenode
NNODES=8, MYRANK=7, HOSTNAME=computenode
NNODES=8, MYRANK=4, HOSTNAME=computenode
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
NODEID=6 argc=23
NODEID=7 argc=23
NODEID=4 argc=23
NODEID=5 argc=23
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_00.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_00.cpt generated: Sun May 31 21:12:19 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 147258 atoms, while the current system consists of 147081 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[0]4:Return code = 0, signaled with Quit
[0]5:Return code = 0, signaled with Quit
[0]6:Return code = 0, signaled with Quit
[0]7:Return code = 0, signaled with Quit
[22:17:03] CoreStatus = FF (255)
[22:17:03] Sending work to server
[22:17:03] Project: 2671 (Run 7, Clone 61, Gen 40)
[22:17:03] - Error: Could not get length of results file work/wuresults_00.dat
[22:17:03] - Error: Could not read unit 00 file. Removing from queue.
[22:17:03] Trying to send all finished work units
[22:17:03] + No unsent completed units remaining.
[22:17:03] - Preparing to get new work unit...
[22:17:03] + Attempting to get work packet
[22:17:03] - Will indicate memory of 16003 MB
[22:17:03] - Connecting to assignment server
[22:17:03] Connecting to http://assign.stanford.edu:8080/
[22:17:08] Posted data.
[22:17:08] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[22:17:08] + News From Folding@Home: Welcome to Folding@Home
[22:17:09] Loaded queue successfully.
[22:17:09] Connecting to http://171.67.108.24:8080/
[22:17:30] Posted data.
[22:17:30] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[22:17:30] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[22:17:43] + Attempting to get work packet
[22:17:43] - Will indicate memory of 16003 MB
[22:17:43] - Connecting to assignment server
[22:17:43] Connecting to http://assign.stanford.edu:8080/
[22:17:48] Posted data.
[22:17:48] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[22:17:48] + News From Folding@Home: Welcome to Folding@Home
[22:17:48] Loaded queue successfully.
[22:17:48] Connecting to http://171.67.108.24:8080/
[22:18:39] Posted data.
[22:18:39] Initial: 0000; - Receiving payload (expected size: 4830168)
[22:19:08] - Downloaded at ~162 kB/s
[22:19:08] - Averaged speed for that direction ~248 kB/s
[22:19:08] + Received work.
[22:19:08] Trying to send all finished work units
[22:19:08] + No unsent completed units remaining.
[22:19:08] + Closed connections
[22:19:13] 
[22:19:13] + Processing work unit
[22:19:13] Core required: FahCore_a2.exe
[22:19:13] Core found.
[22:19:13] Working on queue slot 01 [June 1 22:19:13 UTC]
[22:19:13] + Working ...
[22:19:13] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 16939 -version 624'

[22:19:13] 
[22:19:13] *------------------------------*
[22:19:13] Folding@Home Gromacs SMP Core
[22:19:13] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[22:19:13] 
[22:19:13] Preparing to commence simulation
[22:19:13] - Ensuring status. Please wait.
[22:19:23] - Looking at optimizations...
[22:19:23] - Working with standard loops on this execution.
[22:19:23] - Files status OK
[22:19:24] - Expanded 4829656 -> 24057089 (decompressed 498.1 percent)
[22:19:24] Called DecompressByteArray: compressed_data_size=4829656 data_size=24057089, decompressed_data_size=24057089 diff=0
[22:19:24] - Digital signature verified
[22:19:24] 
[22:19:24] Project: 2671 (Run 18, Clone 43, Gen 41)
[22:19:24] 
[22:19:24] Entering M.D.
[22:19:30] Using Gromacs checkpoints
NNODES=8, MYRANK=0, HOSTNAME=computenode
NNODES=8, MYRANK=1, HOSTNAME=computenode
NNODES=8, MYRANK=2, HOSTNAME=computenode
NNODES=8, MYRANK=3, HOSTNAME=computenode
NNODES=8, MYRANK=4, HOSTNAME=computenode
NNODES=8, MYRANK=5, HOSTNAME=computenode
NNODES=8, MYRANK=6, HOSTNAME=computenode
NNODES=8, MYRANK=7, HOSTNAME=computenode
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
NODEID=4 argc=23
NODEID=5 argc=23
NODEID=6 argc=23
NODEID=7 argc=23
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_01.cpt generated: Tue May 26 03:19:25 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 147210 atoms, while the current system consists of 147246 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[0]4:Return code = 0, signaled with Quit
[0]5:Return code = 0, signaled with Quit
[0]6:Return code = 0, signaled with Quit
[0]7:Return code = 0, signaled with Quit
[22:19:41] CoreStatus = FF (255)
[22:19:41] Sending work to server
[22:19:41] Project: 2671 (Run 18, Clone 43, Gen 41)
[22:19:41] - Error: Could not get length of results file work/wuresults_01.dat
[22:19:41] - Error: Could not read unit 01 file. Removing from queue.
[22:19:41] Trying to send all finished work units
[22:19:41] + No unsent completed units remaining.
[22:19:41] - Preparing to get new work unit...
[22:19:41] + Attempting to get work packet
[22:19:41] - Will indicate memory of 16003 MB
[22:19:41] - Connecting to assignment server
[22:19:41] Connecting to http://assign.stanford.edu:8080/
[22:19:42] Posted data.
[22:19:42] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[22:19:42] + News From Folding@Home: Welcome to Folding@Home
[22:19:42] Loaded queue successfully.
[22:19:42] Connecting to http://171.67.108.24:8080/
[22:19:48] Posted data.
[22:19:48] Initial: 0000; - Receiving payload (expected size: 4830168)
[22:20:01] - Downloaded at ~362 kB/s
[22:20:01] - Averaged speed for that direction ~271 kB/s
[22:20:01] + Received work.
[22:20:01] Trying to send all finished work units
[22:20:01] + No unsent completed units remaining.
[22:20:01] + Closed connections
[22:20:06] 
[22:20:06] + Processing work unit
[22:20:06] Core required: FahCore_a2.exe
[22:20:06] Core found.
[22:20:06] Working on queue slot 02 [June 1 22:20:06 UTC]
[22:20:06] + Working ...
[22:20:06] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 16939 -version 624'
So while you're technically correct in saying that I am able to download a WU, if the WU can't run, the distinction is only a technical one.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: failed to pick up a WU

Post by bruce »

No, it's not just a technical distinction if the server runs out of WUs. In the log you posted (the second time), it does show that nothing is returned, so the server may have to wait for the timeout before it can reissue the WU to someone else. Those delays are 'expensive' in terms of project schedules if they can be avoided.

The bug you're seeing is the failure of FahCore_a2 to clean up after itself when it has certain errors. There have been a number of discussions of that issue and supposedly it was fixed but apparently not completely. (There's another active discussion of that same issue.)

You can start by manually discarding everything in /work EXCEPT wuresults_*, if the number embedded in the name is DIFFERENT from the active WU number (00, 01, ... through 09).
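
A minimal sketch of that cleanup, assuming the client is stopped and, purely as an example, that the active WU is in slot 02:

Code: Select all

# Run from the client directory, with the client stopped.
cd work/
# Keep wuresults_* from OTHER slots (they may still be sendable);
# delete every other file (GNU find syntax).
find . -type f ! -name 'wuresults_*' -delete
# Also drop the result file for the ACTIVE slot -- 02 in this example.
rm -f wuresults_02.dat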
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: failed to pick up a WU

Post by alpha754293 »

bruce wrote:No, it's not just a technical distinction if the server runs out of WUs. In the log you posted (the second time), it does show that nothing is returned, so the server may have to wait for the timeout before it can reissue the WU to someone else. Those delays are 'expensive' in terms of project schedules if they can be avoided.

The bug you're seeing is the failure of FahCore_a2 to clean up after itself when it has certain errors. There have been a number of discussions of that issue and supposedly it was fixed but apparently not completely. (There's another active discussion of that same issue.)

You can start by manually discarding everything in /work EXCEPT wuresults_*, if the number embedded in the name is DIFFERENT from the active WU number (00, 01, ... through 09).
It appears as though any outstanding results had already been sent to the server, and also that the slots were changing each time it tried to pick up a new WU.

It also appears that the server kept trying to give my system the same WU over and over again, until I cleared queue.dat (twice) and work/ (once).
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: failed to pick up a WU

Post by bruce »

Yes, every WU is assigned to the next queue position.
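
That matches the log above, where the slots advance 07, 08, 09, 00, 01, 02; a tiny illustration of the 10-slot wraparound (plain shell arithmetic, purely illustrative):

Code: Select all

# queue.dat holds 10 slots; each new WU takes the next one,
# wrapping from 09 back to 00.
slot=9
next=$(( (slot + 1) % 10 ))
printf 'next queue slot: %02d\n' "$next"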
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: failed to pick up a WU

Post by alpha754293 »

bruce wrote:Yes, every WU is assigned to the next queue position.
Removing the queue.dat file wasn't sufficient, at least not the first time through anyway.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: failed to pick up a WU

Post by bruce »

I didn't say a word about removing queue.dat. I spoke specifically about cleaning out /work. You brought up the queue position, and I was responding to this:
...also that the slots were changing each time it tried to pick up a new WU.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: failed to pick up a WU

Post by alpha754293 »

bruce wrote:I didn't say a word about removing queue.dat. I spoke specifically about cleaning out /work. You brought up the queue position, and I was responding to this:
...also that the slots were changing each time it tried to pick up a new WU.
It wasn't meant as a dis, an insult, or a monologue. My apologies.

Just reporting what I did in order to get the client moving along again.