Project: 2669 (Run 6, Clone 43, Gen 94) exploded

Moderators: Site Moderators, FAHC Science Team

terabytes
Posts: 5
Joined: Thu Sep 04, 2008 2:08 pm

Project: 2669 (Run 6, Clone 43, Gen 94) exploded

Post by terabytes »

This work unit crashed at 60%:

Code:

[16:23:38] *------------------------------*
[16:23:38] Folding@Home Gromacs SMP Core
[16:23:38] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[16:23:38] 
[16:23:38] Preparing to commence simulation
[16:23:38] - Ensuring status. Please wait.
[16:23:38] Files status OK
[16:23:39] - Expanded 4841648 -> 23982741 (decompressed 495.3 percent)
[16:23:39] Called DecompressByteArray: compressed_data_size=4841648 data_size=23982741, decompressed_data_size=23982741 diff=0
[16:23:40] - Digital signature verified
[16:23:40] 
[16:23:40] Project: 2669 (Run 6, Clone 43, Gen 94)
[16:23:40] 
[16:23:40] Assembly optimizations on if available.
[16:23:40] Entering M.D.
[16:23:49] (Run 6, Clone 43, Gen 94)
[16:23:49] 
[16:23:49] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=BigBlue
NNODES=4, MYRANK=3, HOSTNAME=BigBlue
NNODES=4, MYRANK=0, HOSTNAME=BigBlue
NNODES=4, MYRANK=1, HOSTNAME=BigBlue
NODEID=0 argc=19
NODEID=1 argc=19
NODEID=2 argc=19
NODEID=3 argc=19
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 3.3.99_development_200800503  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_08.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 56
Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22896 system'
250000 steps,    500.0 ps.

Writing checkpoint, step 23503330 at Fri Mar  6 11:38:59 2009
[16:46:30] Completed 5000 out of 250000 steps  (2%)

Writing checkpoint, step 23506670 at Fri Mar  6 11:53:59 2009
[16:57:42] Completed 7500 out of 250000 steps  (3%)
[17:08:57] Completed 10000 out of 250000 steps  (4%)

Writing checkpoint, step 23510010 at Fri Mar  6 12:09:00 2009
[17:20:10] Completed 12500 out of 250000 steps  (5%)

Writing checkpoint, step 23513350 at Fri Mar  6 12:24:00 2009
[17:31:23] Completed 15000 out of 250000 steps  (6%)

Writing checkpoint, step 23516690 at Fri Mar  6 12:38:59 2009
[17:42:37] Completed 17500 out of 250000 steps  (7%)
[17:46:21] - Autosending finished units...
[17:46:21] Trying to send all finished work units
[17:46:21] + No unsent completed units remaining.
[17:46:21] - Autosend completed
[17:53:50] Completed 20000 out of 250000 steps  (8%)

Writing checkpoint, step 23520030 at Fri Mar  6 12:53:58 2009
[18:05:04] Completed 22500 out of 250000 steps  (9%)

Writing checkpoint, step 23523370 at Fri Mar  6 13:08:58 2009
[18:16:17] Completed 25000 out of 250000 steps  (10%)

Writing checkpoint, step 23526720 at Fri Mar  6 13:23:59 2009
[18:27:28] Completed 27500 out of 250000 steps  (11%)
[18:38:41] Completed 30000 out of 250000 steps  (12%)

Writing checkpoint, step 23530070 at Fri Mar  6 13:39:00 2009
[18:49:54] Completed 32500 out of 250000 steps  (13%)

Writing checkpoint, step 23533410 at Fri Mar  6 13:53:59 2009
[19:01:06] Completed 35000 out of 250000 steps  (14%)

Writing checkpoint, step 23536760 at Fri Mar  6 14:08:59 2009
[19:12:18] Completed 37500 out of 250000 steps  (15%)
[19:23:29] Completed 40000 out of 250000 steps  (16%)

Writing checkpoint, step 23540110 at Fri Mar  6 14:24:00 2009
[19:34:43] Completed 42500 out of 250000 steps  (17%)

Writing checkpoint, step 23543450 at Fri Mar  6 14:39:00 2009
[19:45:57] Completed 45000 out of 250000 steps  (18%)

Writing checkpoint, step 23546790 at Fri Mar  6 14:53:58 2009
[19:57:09] Completed 47500 out of 250000 steps  (19%)
[20:08:23] Completed 50000 out of 250000 steps  (20%)

Writing checkpoint, step 23550130 at Fri Mar  6 15:08:59 2009
[20:19:35] Completed 52500 out of 250000 steps  (21%)

Writing checkpoint, step 23553480 at Fri Mar  6 15:24:00 2009
[20:30:49] Completed 55000 out of 250000 steps  (22%)

Writing checkpoint, step 23556820 at Fri Mar  6 15:39:00 2009
[20:42:04] Completed 57500 out of 250000 steps  (23%)
[20:53:15] Completed 60000 out of 250000 steps  (24%)

Writing checkpoint, step 23560160 at Fri Mar  6 15:53:58 2009
[21:04:27] Completed 62500 out of 250000 steps  (25%)

Writing checkpoint, step 23563510 at Fri Mar  6 16:08:59 2009
[21:15:39] Completed 65000 out of 250000 steps  (26%)

Writing checkpoint, step 23566860 at Fri Mar  6 16:24:00 2009
[21:26:53] Completed 67500 out of 250000 steps  (27%)
[21:38:04] Completed 70000 out of 250000 steps  (28%)

Writing checkpoint, step 23570200 at Fri Mar  6 16:38:58 2009
[21:49:18] Completed 72500 out of 250000 steps  (29%)

Writing checkpoint, step 23573540 at Fri Mar  6 16:53:59 2009
[22:00:31] Completed 75000 out of 250000 steps  (30%)

Writing checkpoint, step 23576880 at Fri Mar  6 17:08:58 2009
[22:11:45] Completed 77500 out of 250000 steps  (31%)
[22:22:57] Completed 80000 out of 250000 steps  (32%)

Writing checkpoint, step 23580230 at Fri Mar  6 17:23:59 2009
[22:34:10] Completed 82500 out of 250000 steps  (33%)

Writing checkpoint, step 23583580 at Fri Mar  6 17:39:00 2009
[22:45:22] Completed 85000 out of 250000 steps  (34%)

Writing checkpoint, step 23586930 at Fri Mar  6 17:54:00 2009
[22:56:34] Completed 87500 out of 250000 steps  (35%)
[23:07:47] Completed 90000 out of 250000 steps  (36%)

Writing checkpoint, step 23590270 at Fri Mar  6 18:09:01 2009
[23:19:00] Completed 92500 out of 250000 steps  (37%)

Writing checkpoint, step 23593610 at Fri Mar  6 18:23:59 2009
[23:30:13] Completed 95000 out of 250000 steps  (38%)

Writing checkpoint, step 23596960 at Fri Mar  6 18:39:00 2009
[23:41:25] Completed 97500 out of 250000 steps  (39%)
[23:46:21] - Autosending finished units...
[23:46:21] Trying to send all finished work units
[23:46:21] + No unsent completed units remaining.
[23:46:21] - Autosend completed
[23:52:37] Completed 100000 out of 250000 steps  (40%)

Writing checkpoint, step 23600300 at Fri Mar  6 18:53:59 2009
[00:04:00] Completed 102500 out of 250000 steps  (41%)

Writing checkpoint, step 23603610 at Fri Mar  6 19:09:00 2009
[00:15:18] Completed 105000 out of 250000 steps  (42%)

Writing checkpoint, step 23606920 at Fri Mar  6 19:24:00 2009
[00:26:38] Completed 107500 out of 250000 steps  (43%)
[00:37:52] Completed 110000 out of 250000 steps  (44%)

Writing checkpoint, step 23610250 at Fri Mar  6 19:39:00 2009
[00:49:13] Completed 112500 out of 250000 steps  (45%)

Writing checkpoint, step 23613550 at Fri Mar  6 19:53:59 2009
[01:00:43] Completed 115000 out of 250000 steps  (46%)

Writing checkpoint, step 23616830 at Fri Mar  6 20:09:00 2009
[01:12:01] Completed 117500 out of 250000 steps  (47%)
[01:23:21] Completed 120000 out of 250000 steps  (48%)

Writing checkpoint, step 23620140 at Fri Mar  6 20:23:59 2009
[01:34:35] Completed 122500 out of 250000 steps  (49%)

Writing checkpoint, step 23623480 at Fri Mar  6 20:38:58 2009
[01:45:49] Completed 125000 out of 250000 steps  (50%)

Writing checkpoint, step 23626820 at Fri Mar  6 20:53:58 2009
[01:57:01] Completed 127500 out of 250000 steps  (51%)
[02:08:12] Completed 130000 out of 250000 steps  (52%)

Writing checkpoint, step 23630180 at Fri Mar  6 21:09:01 2009
[02:19:24] Completed 132500 out of 250000 steps  (53%)

Writing checkpoint, step 23633500 at Fri Mar  6 21:24:01 2009
[02:30:47] Completed 135000 out of 250000 steps  (54%)

Writing checkpoint, step 23636820 at Fri Mar  6 21:38:58 2009
[02:42:01] Completed 137500 out of 250000 steps  (55%)
[02:53:14] Completed 140000 out of 250000 steps  (56%)

Writing checkpoint, step 23640170 at Fri Mar  6 21:54:00 2009
[03:04:25] Completed 142500 out of 250000 steps  (57%)

Writing checkpoint, step 23643520 at Fri Mar  6 22:09:00 2009
[03:15:38] Completed 145000 out of 250000 steps  (58%)

Writing checkpoint, step 23646860 at Fri Mar  6 22:23:59 2009
[03:26:50] Completed 147500 out of 250000 steps  (59%)
[03:38:03] Completed 150000 out of 250000 steps  (60%)

Writing checkpoint, step 23650210 at Fri Mar  6 22:39:00 2009

Step 23650858, time 47301.7 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.415465, max 22.216923 (between atoms 16527 and 16529)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
  16530  16531   90.0    0.1007   0.1232      0.1010
  16530  16533   90.0    0.1011   2.0636      0.1010
Warning: 1-4 interaction between 16525 and 16529 at distance 2.426 which is larger than the 1-4 table size 2.200 nm
These are ignored for the rest of the simulation
This usually means your system is exploding,
if not, you should increase table-extension in your mdp file
or with user tables increase the table size
Warning: 1-4 interaction between 16521 and 16529 at distance 2.299 which is larger than the 1-4 table size 2.200 nm
These are ignored for the rest of the simulation
This usually means your system is exploding,
if not, you should increase table-extension in your mdp file
or with user tables increase the table size

Step 23650859, time 47301.7 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms nan, max inf (between atoms 16107 and 16109)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
  16107  16108   90.0    0.1090 8711409736148320256.0000      0.1090
  16107  16109   90.0    0.1090      inf      0.1090
  16599  16600   90.0    0.1010      inf      0.1010
  16599  16601   90.0    0.1010      inf      0.1010
  16093  16094   90.0    0.1090      inf      0.1090
  16095  16096   90.0    0.1090      inf      0.1090
  16095  16097   90.0    0.1090      inf      0.1090
  16103  16104   90.0    0.1010      inf      0.1010
  16105  16106   90.0    0.1090 360954624.0000      0.1090
  16266  16267   90.0    0.1090      inf      0.1090
  16266  16268   90.0    0.1090      inf      0.1090
  16266  16269   90.0    0.1090      inf      0.1090
   5469   5470   90.0    0.1010      inf      0.1010
   5473   5474   90.0    0.1090      inf      0.1090
   5473   5475   90.0    0.1090      inf      0.1090
   5476   5477   90.0    0.1090      inf      0.1090
   5476   5478   90.0    0.1090      inf      0.1090
   5481   5483   90.0    0.1010 4046722309459804160.0000      0.1010
   5440   5441   90.0    0.1010 50803440091136.0000      0.1010
   5486   5487   90.0    0.1010 156705744.0000      0.1010
   5488   5490   90.0    0.1090 2047592329421783040.0000      0.1090
   5495   5496   90.0    0.1090 283089184.0000      0.1090
   5497   5498   90.0    0.1090      inf      0.1090
   5497   5499   90.0    0.1090      inf      0.1090
   7288   7290   90.0    0.1090      inf      0.1090
   7300   7301   90.0    0.1010 371685698043904.0000      0.1010
   5444   5445   90.0    0.1090 6894299645481058304.0000      0.1090
   5444   5446   90.0    0.1090 87687237726109696.0000      0.1090

t = 47301.723 ps: Water molecule starting at atom 89350 can not be settled.
Check for bad contacts and/or reduce the timestep.

Step 23650859, time 47301.7 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms nan, max inf (between atoms 16118 and 16119)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
  16118  16119   90.0    0.1090      inf      0.1090
  16118  16120   90.0    0.1090      inf      0.1090
  16506  16507   90.0    0.1090      inf      0.1090
  16506  16508   90.0    0.1090      inf      0.1090
  16564  16565   90.0    0.1090      inf      0.1090
  16564  16566   90.0    0.1090      inf      0.1090
  16567  16568   90.0    0.1090 3182772251841789952.0000      0.1090
  16567  16569   90.0    0.1090 15470105513064136704.0000      0.1090
  16518  16520   90.0    0.1090 4206424999004733440.0000      0.1090
  16540  16541   90.0    0.1090      inf      0.1090
  16540  16542   90.0    0.1090      inf      0.1090
  16540  16543   90.0    0.1090      inf      0.1090
  16530  16531   90.0    0.1232      inf      0.1010
  16530  16532   90.0    0.1161      inf      0.1010
  16530  16533   90.0    2.0636      inf      0.1010
  19498  19499   90.0    0.1090      inf      0.1090
  19498  19500   90.0    0.1090      inf      0.1090

t = 47301.723 ps: Water molecule starting at atom 95137 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
Wrote pdb files with previous and current coordinates

Step 23650860:
The charge group starting at atom 16118 moved than the distance allowed by the domain decomposition (1.200000) in direction Z
distance out of cell -11666785377985849458688.000000
Old coordinates:    6.620    2.619    9.675
New coordinates: -14400475981299274743808.000 -9074503299372230377472.000 -11666785377985849458688.000

Step 23650860:
The charge group starting at atom 16106 moved than the distance allowed by the domain decomposition (1.200000) in direction Z
distance out of cell -221947904.000000
Old coordinates:    6.875    2.926    9.406
New coordinates: 420947104.000 -829908288.000 -221947904.000
Old cell boundaries in direction Z:    6.108    9.469
New cell boundaries in direction Z:    6.081    9.477

-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: domdec.c, line: 2644

Fatal error:
A charge group move too far between two domain decomposition steps
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[cli_0]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
Old cell boundaries in direction Z:    9.469   11.574
New cell boundaries in direction Z:    9.477   11.578

-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: domdec.c, line: 2644

Fatal error:
A charge group move too far between two domain decomposition steps
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[0]0:Return code = 1
[0]1:Return code = 255
[0]2:Return code = 255
[0]3:Return code = 0, signaled with Quit
[03:41:59] CoreStatus = FF (255)
[03:41:59] Client-core communications error: ERROR 0xff
[03:41:59] Deleting current work unit & continuing...
[0]1:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[03:42:13] - Warning: Could not delete all work unit files (8): Core file absent
[03:42:13] Trying to send all finished work units
[03:42:13] + No unsent completed units remaining.
[03:42:13] - Preparing to get new work unit...
[03:42:13] + Attempting to get work packet
[03:42:13] - Connecting to assignment server
[03:42:13] Connecting to http://assign.stanford.edu:8080/
[03:42:13] Posted data.
[03:42:13] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[03:42:13] + News From Folding@Home: Welcome to Folding@Home
[03:42:13] Loaded queue successfully.
[03:42:13] Connecting to http://171.64.65.64:8080/
[03:42:16] Posted data.
[03:42:16] Initial: 0000; - Receiving payload (expected size: 2445630)
[03:42:43] - Downloaded at ~88 kB/s
[03:42:43] - Averaged speed for that direction ~76 kB/s
[03:42:43] + Received work.
[03:42:43] + Closed connections
 
Not having a nice day.

Running Folding@Home Client Version 6.02 on Ubuntu Linux 8.10, Q6600 stock speed.
Last edited by terabytes on Sat Mar 07, 2009 1:18 pm, edited 1 time in total.
terabytes
Posts: 5
Joined: Thu Sep 04, 2008 2:08 pm

Re: Project: 2669 (Run 6, Clone 43, Gen 94) exploded

Post by terabytes »

Am I able to get credit for this work unit? If so, how?
Thanks.
Flathead74
Posts: 266
Joined: Sun Dec 02, 2007 6:08 pm
Location: Central New York

Re: Project: 2669 (Run 6, Clone 43, Gen 94) exploded

Post by Flathead74 »

[03:41:59] CoreStatus = FF (255)
[03:41:59] Client-core communications error: ERROR 0xff
[03:41:59] Deleting current work unit & continuing...

Seeing this in your Fahlog makes me think that you will not get any credit because nothing was returned to Stanford.

But hey, I could be wrong...
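For anyone who wants to check their own log for the same pattern, here is a minimal sketch. It assumes the log lines look exactly like the ones quoted in this thread; the function name `deleted_after_core_error` and the `window` parameter are made up for illustration and are not part of any Folding@Home tool. It only reads text, so it cannot tell you what the servers actually received.

```python
import re

# Matches lines such as "[03:41:59] CoreStatus = FF (255)".
# Assumption: the format is the one shown in the log quoted in this thread.
CORE_STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def deleted_after_core_error(log_text, window=3):
    """Return (hex_code, decimal_code) for each CoreStatus line that is
    followed, within `window` lines, by a 'Deleting current work unit' line."""
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        m = CORE_STATUS.search(line)
        if not m:
            continue
        following = lines[i + 1 : i + 1 + window]
        if any("Deleting current work unit" in nxt for nxt in following):
            hits.append((m.group(1).upper(), int(m.group(2))))
    return hits


sample = """\
[03:41:59] CoreStatus = FF (255)
[03:41:59] Client-core communications error: ERROR 0xff
[03:41:59] Deleting current work unit & continuing...
"""
print(deleted_after_core_error(sample))  # prints [('FF', 255)]
```

A non-empty result only means the client logged a core error and then deleted the unit locally; whether credit was awarded still has to be confirmed on the server side.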
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: Project: 2669 (Run 6, Clone 43, Gen 94) exploded

Post by MtM »

Flathead74 wrote:
[03:41:59] CoreStatus = FF (255)
[03:41:59] Client-core communications error: ERROR 0xff
[03:41:59] Deleting current work unit & continuing...

Seeing this in your Fahlog makes me think that you will not get any credit because nothing was returned to Stanford.

But hey, I could be wrong...
I think that is correct (but hey, I could be wrong as well ;) )

I'm going to send a PM to the project owner so that he can perhaps answer your question, and also to inform him about the issue with the work unit itself.
susato
Site Moderator
Posts: 511
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project: 2669 (Run 6, Clone 43, Gen 94) exploded

Post by susato »

A quick database check shows that no one has so far returned this WU for credit.
The previous generation was "Entered into logs at: 2009-03-06 09:22:09", shortly before you received it, so there is no evidence of a chain of failures on this unit before it reached you.

I'm sorry to say that you will not get points for this unit, as it was deleted rather than self-reporting its problems to Stanford. Error handling on failed WUs is a known issue with the Linux and OSX Folding software, which will be addressed in future versions of the cores and clients. I feel your pain - it has happened to all of us folding Linux/OSX units.

If you are assigned the same WU again, you can try stopping the WU a few frames before the crash point, waiting 5 minutes, then restarting. Often that will help get past a crash point. If you get similar errors on a regular basis, try a memtest.
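To know where "a few frames before the crash point" is, it helps to pull the last progress line out of the log. The following is a rough sketch that assumes the "Completed N out of M steps (P%)" format shown in the log at the top of this thread; `last_progress` is a hypothetical helper name, not something the client provides.

```python
import re

# Matches lines such as
# "[03:38:03] Completed 150000 out of 250000 steps  (60%)".
# Assumption: the format is the one shown in the log quoted in this thread.
PROGRESS = re.compile(
    r"^\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps\s+\((\d+)%\)"
)


def last_progress(log_text):
    """Return (timestamp, steps_done, total_steps, percent) for the last
    progress line in the log, or None if there is no such line."""
    last = None
    for line in log_text.splitlines():
        m = PROGRESS.match(line.strip())
        if m:
            last = (m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
    return last


sample = """\
[03:26:50] Completed 147500 out of 250000 steps  (59%)
[03:38:03] Completed 150000 out of 250000 steps  (60%)
[03:41:59] CoreStatus = FF (255)
"""
print(last_progress(sample))  # prints ('03:38:03', 150000, 250000, 60)
```

In the log above, the last progress line before the failure is 60%, so on a repeat assignment of the same WU the stop-and-restart would be tried a few percent earlier than that.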
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project: 2669 (Run 6, Clone 43, Gen 94) exploded

Post by kasson »

Try upgrading your client to 6.23 or 6.24. Those clients report a much wider range of errors to our servers.
terabytes
Posts: 5
Joined: Thu Sep 04, 2008 2:08 pm

Re: Project: 2669 (Run 6, Clone 43, Gen 94) exploded

Post by terabytes »

Thanks everyone. I'll upgrade my client as soon as I can.