Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Post by alpha754293 »

log:

[17:32:36] 
[17:32:36] *------------------------------*
[17:32:36] Folding@Home Gromacs SMP Core
[17:32:36] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[17:32:36] 
[17:32:36] Preparing to commence simulation
[17:32:36] - Ensuring status. Please wait.
[17:32:37] Called DecompressByteArray: compressed_data_size=4836660 data_size=24035457, decompressed_data_size=24035457 diff=0
[17:32:37] - Digital signature verified
[17:32:37] 
[17:32:37] Project: 2671 (Run 19, Clone 69, Gen 25)
[17:32:37] 
[17:32:37] Assembly optimizations on if available.
[17:32:37] Entering M.D.
[17:32:46] Project: 2671 (Run 19, Clone 69, Gen 25)
[17:32:46] 
[17:32:47] Entering M.D.
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=0, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NODEID=0 argc=20
NODEID=1 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_05.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=2 argc=20
NODEID=3 argc=20
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system in water'
6500000 steps,  13000.0 ps (continuing from step 6250000,  12500.0 ps).
[17:41:42] Completed 2500 out of 250000 steps  (1%)
[17:50:25] Completed 5000 out of 250000 steps  (2%)
[17:59:07] Completed 7500 out of 250000 steps  (3%)
[18:07:49] Completed 10000 out of 250000 steps  (4%)
[18:16:31] Completed 12500 out of 250000 steps  (5%)
[18:25:14] Completed 15000 out of 250000 steps  (6%)
[18:33:56] Completed 17500 out of 250000 steps  (7%)
[18:42:41] Completed 20000 out of 250000 steps  (8%)
[18:51:28] Completed 22500 out of 250000 steps  (9%)
[19:00:15] Completed 25000 out of 250000 steps  (10%)
[19:09:01] Completed 27500 out of 250000 steps  (11%)
[19:17:48] Completed 30000 out of 250000 steps  (12%)
[19:26:35] Completed 32500 out of 250000 steps  (13%)
[19:35:24] Completed 35000 out of 250000 steps  (14%)
[19:44:14] Completed 37500 out of 250000 steps  (15%)
[19:53:03] Completed 40000 out of 250000 steps  (16%)
[20:01:54] Completed 42500 out of 250000 steps  (17%)
[20:10:43] Completed 45000 out of 250000 steps  (18%)
[20:19:32] Completed 47500 out of 250000 steps  (19%)
[20:28:21] Completed 50000 out of 250000 steps  (20%)
[20:37:08] Completed 52500 out of 250000 steps  (21%)
[20:45:56] Completed 55000 out of 250000 steps  (22%)
[20:54:44] Completed 57500 out of 250000 steps  (23%)
[21:03:32] Completed 60000 out of 250000 steps  (24%)
[21:12:19] Completed 62500 out of 250000 steps  (25%)
[21:21:06] Completed 65000 out of 250000 steps  (26%)
[21:29:54] Completed 67500 out of 250000 steps  (27%)
[21:38:44] Completed 70000 out of 250000 steps  (28%)
[21:47:15] - Autosending finished units... [May 9 21:47:15 UTC]
[21:47:15] Trying to send all finished work units
[21:47:15] + No unsent completed units remaining.
[21:47:15] - Autosend completed
[21:47:35] Completed 72500 out of 250000 steps  (29%)
[21:56:25] Completed 75000 out of 250000 steps  (30%)
[22:05:13] Completed 77500 out of 250000 steps  (31%)
[22:14:01] Completed 80000 out of 250000 steps  (32%)
[22:22:48] Completed 82500 out of 250000 steps  (33%)
[22:31:36] Completed 85000 out of 250000 steps  (34%)
[22:40:25] Completed 87500 out of 250000 steps  (35%)
[22:49:15] Completed 90000 out of 250000 steps  (36%)
[22:58:03] Completed 92500 out of 250000 steps  (37%)
[23:06:50] Completed 95000 out of 250000 steps  (38%)
[23:15:38] Completed 97500 out of 250000 steps  (39%)
[23:24:26] Completed 100000 out of 250000 steps  (40%)
[23:33:17] Completed 102500 out of 250000 steps  (41%)
[23:42:07] Completed 105000 out of 250000 steps  (42%)
[23:50:57] Completed 107500 out of 250000 steps  (43%)
[23:59:46] Completed 110000 out of 250000 steps  (44%)
[00:08:34] Completed 112500 out of 250000 steps  (45%)
[00:17:26] Completed 115000 out of 250000 steps  (46%)
[00:26:15] Completed 117500 out of 250000 steps  (47%)
[00:35:07] Completed 120000 out of 250000 steps  (48%)
[00:43:58] Completed 122500 out of 250000 steps  (49%)
[00:52:50] Completed 125000 out of 250000 steps  (50%)
[01:01:43] Completed 127500 out of 250000 steps  (51%)
[01:10:37] Completed 130000 out of 250000 steps  (52%)
[01:19:32] Completed 132500 out of 250000 steps  (53%)
[01:28:27] Completed 135000 out of 250000 steps  (54%)
[01:37:22] Completed 137500 out of 250000 steps  (55%)
[01:46:17] Completed 140000 out of 250000 steps  (56%)
[01:55:07] Completed 142500 out of 250000 steps  (57%)
[02:03:58] Completed 145000 out of 250000 steps  (58%)
[02:12:49] Completed 147500 out of 250000 steps  (59%)
[02:21:43] Completed 150000 out of 250000 steps  (60%)
[02:30:37] Completed 152500 out of 250000 steps  (61%)
[02:39:22] Completed 155000 out of 250000 steps  (62%)
[02:48:08] Completed 157500 out of 250000 steps  (63%)
[02:56:53] Completed 160000 out of 250000 steps  (64%)

Step 6411275, time 12822.6 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 56981220198.444717, max 3364836081664.000000 (between atoms 1072 and 1074)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length

Step 6411275, time 12822.6 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 266152056.328266, max 15976105984.000000 (between atoms 1119 and 1121)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   1759   1760   91.1    0.1090 986.5942      0.1090
   1759   1761   90.0    0.1090 573.7192      0.1090
   1762   1763   90.0    0.1090   1.1847      0.1090
   1762   1764   90.0    0.1090   1.2364      0.1090
    759    760   90.0    0.1090   0.1624      0.1090
    902    903   90.0    0.1080 647.5130      0.1080
   1048   1049   90.0    0.1090  29.8827      0.1090
   1072   1073   90.0    0.1090 9793312768.0000      0.1090
   1072   1074   90.0    0.1090 366767112192.0000      0.1090
   1072   1075   90.0    0.1090 10120730624.0000      0.1090
   1101   1102   90.0    0.1090   0.4604      0.1090
   1103   1104   90.0    0.1090   1.4689      0.1090
   1105   1107   90.0    0.1090   0.4581      0.1090
   1089   1090   90.0    0.1090   3.6018      0.1090
   1089   1091   99.6    0.1090  20.5103      0.1090
   1094   1095   90.7    0.1010 722.5540      0.1010
    759    760   90.0    0.1090   0.1624      0.1090
   1119   1120   90.0    0.1090 33718968.0000      0.1090
   1119   1121   90.0    0.1090 1741395456.0000      0.1090
   1125   1126   90.0    0.1090   0.9613      0.1090

t = 12822.551 ps: Water molecule starting at atom 141886 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 12822.551 ps: Water molecule starting at atom 146467 can not be settled.
Check for bad contacts and/or reduce the timestep.
[03:01:22] 
[03:01:22] Folding@home Core Shutdown: INTERRUPTED
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[cli_3]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[0]0:Return code = 102
[0]1:Return code = 0, signaled with Segmentation fault
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 1
[03:01:26] CoreStatus = 66 (102)
[03:01:26] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[03:01:26] Killing all core threads

Folding@Home Client Shutdown.
restarting client...
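
A note on reading the LINCS table in the log above: the "max" figure in each warning appears to be the relative deviation of the worst bond, i.e. |current - constraint| / constraint. That definition is an assumption, but it reproduces the reported maxima (about 3.36e12 for atoms 1072 and 1074, about 1.60e10 for atoms 1119 and 1121) to within single-precision rounding. Below is a minimal Python sketch that recomputes it from the table rows. The function name and the regular expression are illustrative only and are not part of any FAH or GROMACS tool.

```python
import re

# One row of the "bonds that rotated more than 90 degrees" table:
# atom 1, atom 2, angle, previous length, current length, constraint length.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def relative_deviations(log_text):
    """Map (atom1, atom2) -> |current - constraint| / constraint per table row.

    Assumed definition of the relative constraint deviation; lines that do
    not look like table rows are ignored.
    """
    result = {}
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m:
            bond = (int(m.group(1)), int(m.group(2)))
            current = float(m.group(5))
            constraint = float(m.group(6))
            result[bond] = abs(current - constraint) / constraint
    return result


# Three rows copied from the log above.
sample = """\
   1072   1074   90.0    0.1090 366767112192.0000      0.1090
   1119   1121   90.0    0.1090 1741395456.0000      0.1090
    759    760   90.0    0.1090   0.1624      0.1090
"""

for bond, dev in relative_deviations(sample).items():
    print(bond, f"{dev:.4g}")
```

Bond lengths in the hundreds of billions against a 0.109 constraint mean the coordinates had already blown up at this step, so the "can not be settled" messages and the seg faults that follow are probably downstream symptoms rather than the root cause.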
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by bruce »

This WU was successfully completed by two people . . . one with MacOS-X and one with Linux.
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by kasson »

How did the WU upload?
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by alpha754293 »

kasson wrote:How did the WU upload?
According to the client output I capture to a log file (i.e. 2>&1 | tee -a fah2.txt), the WU completed successfully on May 10, 2009 at 13:33 UTC and was sent successfully at 13:44 UTC.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by bruce »

alpha754293 wrote:
kasson wrote:How did the WU upload?
According to the client output I capture to a log file (i.e. 2>&1 | tee -a fah2.txt), the WU completed successfully on May 10, 2009 at 13:33 UTC and was sent successfully at 13:44 UTC.
So the error that you're reporting was not reproducible when the WU restarted?
DaveHand
Posts: 11
Joined: Mon Dec 10, 2007 11:56 am
Location: Hastings, England

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by DaveHand »

Don't take it personally, but frequent segmentation faults that cannot be reproduced are a sign of an unstable system.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by alpha754293 »

DaveHand wrote:Don't take it personally, but frequent segmentation faults that cannot be reproduced are a sign of an unstable system.
I'd agree with you, but we haven't been able to find a consistent cause for it.

Without one, it is practically impossible to chase down the source of the problem, let alone resolve it.

Personally, I'm about to ask one of my profs for help writing a simple Fortran program that runs a direct Gaussian elimination in parallel on a random matrix, possibly of order 1 million.

I think that (or an LU decomposition) might be the only way for me to stress the system.

I'm not sure if any of the iterative solvers would do it.

GMRES might, but I want to make it so that it loads the entire matrix into the 16 GB of RAM in the system so that it doesn't swap.

It's already passed memtest86 without ECC being enabled (one of my old workstations had to have ECC enabled in order to stabilize it).

I've also run wPrime and Prime95, and it passed both, although there are people here who've said that they don't stress the CPU enough (hence the idea of using Gaussian elimination).

The only other thing that I can think of besides Gauss elim. would be to do a straight MMUL and MADD and/or some combination thereof. I'm not sure if that would hit the L1 and L2 caches sufficiently though, so I don't know.

Last resort for me would be to use MATLAB and have it run an auto-parallel version of a code or try and see if I can get LINPACK HPL working.

The downside of NOT being a programmer.

P.S. Reproducing the error can be nearly impossible, either because the client picks up from where it left off (as opposed to restarting from scratch) or because the cause is environmental (e.g. high ambient temperature). Since I'm not in an environmentally controlled room (just household central AC/heat), I'd have to log the ambient conditions to be able to reproduce it, which I don't do either.
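
As a rough illustration of the "solve a big linear system and see whether the machine gets it right" idea above, here is a minimal Python sketch. It is not the Fortran or LINPACK test described in the post: pure Python will not load the FPU or caches anywhere near as hard, and the matrix size n=60 is a placeholder. It only shows the checking logic. Solve the same system several times, require bit-identical answers, and verify the residual. All names are hypothetical. For sizing, a dense double-precision matrix of order 1 million would need about 8 TB, so a dense matrix that fits in 16 GB has to stay below roughly order 44,000.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def max_residual(a, x, b):
    """Largest absolute entry of a x - b."""
    return max(
        abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
        for row, bi in zip(a, b)
    )


def consistency_check(n=60, repeats=5, seed=1):
    """Solve one random system repeatedly; report (all identical?, residual)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    for i in range(n):
        a[i][i] += n  # heavy diagonal keeps the system well conditioned
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    reference = solve(a, b)
    identical = all(solve(a, b) == reference for _ in range(repeats))
    return identical, max_residual(a, reference, b)


identical, residual = consistency_check()
print("bit-identical across runs:", identical, "max residual:", residual)
```

On healthy hardware the repeated solves should agree exactly, since the same operations run in the same order each time. A mismatch or a large residual would point at the machine rather than at the work unit.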
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by bruce »

Passing Memtest is essential, but Prime95 tests a different part of your processor than FAH uses (standard integer operations rather than the floating point arithmetic done by FAH). Have you run StressCPU2? It's a lot more representative of what Gromacs does.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by alpha754293 »

bruce wrote: Passing Memtest is essential, but Prime95 tests a different part of your processor than FAH uses (standard integer operations rather than the floating point arithmetic done by FAH). Have you run StressCPU2? It's a lot more representative of what Gromacs does.
No, I have not.

I shall look into that and see what I come up with.

On another side note, I've put it through a 125-hour FEA run, and it completed without any problems (I think it was some 900,000 elements; I forget exactly). I've also run a bunch of CFD runs on it, and those completed without problems as well.

So unless F@H is doing something different (different ratio/combinations of FLOPs than what those codes are using), I can't really think of anything else that would be wrong with the system.

The only thing I can try is, the next time I get a seg fault, to clear the system and hope that it restarts the WU from scratch, to see whether it faults again.

(There have already been two or three times where it got reassigned the same WU that had seg faulted, and the second time through it finished without any problems. In at least one case it seg faulted when I restarted the client from a checkpoint, and then finished the WU without any problems once it was reassigned.)

If the seg faults were consistent, then I would definitely agree that it is a hardware issue. But their sporadic, inconsistent nature would tend to suggest something else altogether, or just random chance.
DaveHand
Posts: 11
Joined: Mon Dec 10, 2007 11:56 am
Location: Hastings, England

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by DaveHand »

I have experienced several segfaults recently, mostly on my q6600. Most of these I have put down to being on the edge of a stable overclock, but I have experienced one or two immediate ones when restarting the client. I've never lost a unit to it though, just fired up the client again and kept my fingers crossed. I have been meaning to drop the o/c a little and give the innards a dust.

My laptop, however, had a segmentation fault/LINCS warning recently, which I thought bizarre as it is underclocked, clean and has many successfully completed work units under its belt. The unit failed again but then completed when I dropped the CPU speed to 1GHz.

I have a hunch (and would like to believe) there is something to this other than hardware instability, but it's very difficult to pinpoint. I would like to see the unit that failed on my laptop crunched by other stock hardware to see if it also fails.

What o/s do you use?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by bruce »

alpha754293 wrote: If the seg faults were consistent, then I would definitely agree that it is a hardware issue. But their sporadic, inconsistent nature would tend to suggest something else altogether, or just random chance.
DaveHand wrote: My laptop, however, had a segmentation fault/LINCS warning recently, which I thought bizarre as it is underclocked, clean and has many successfully completed work units under its belt. The unit failed again but then completed when I dropped the CPU speed to 1GHz.

I have a hunch (and would like to believe) there is something to this other than hardware instability, but it's very difficult to pinpoint. I would like to see the unit that failed on my laptop crunched by other stock hardware to see if it also fails.

What o/s do you use?
One reason SegFaults are inconsistent is that they can have a variety of causes. Marginal hardware stability is difficult to track, but I believe that some SegFaults are due to a problem in the WU or in the code itself, though that's equally difficult to track. One reason the forum Mods are willing to check whether others have completed the same WU that you had trouble with is that we know WU corruption does happen. Personally, I'd guess that your LINCS Warning error might not be hardware related. Did we already look it up for you? If not, what were the PRCG numbers?
DaveHand
Posts: 11
Joined: Mon Dec 10, 2007 11:56 am
Location: Hastings, England

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by DaveHand »

Did we already look it up for you? If not, what were the PRCG numbers?
Yup. It finished fine once underclocked - so I guess it would not have been sent to another. Thanks.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 19, Clone 69, Gen 25) seg fault

Post by alpha754293 »

I think the system that's currently getting the seg faults is running SLES10 SP2 x64.