Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Mon Mar 16, 2009 5:14 pm
by alpha754293
error:
Launch directory: /share/fah1
Executable: ./fah1
Arguments: -smp 4 -verbosity 9
[17:13:15] - Ask before connecting: No
[17:13:15] - User name: alpha754293 (Team 596)
[17:13:15] - User ID: 47FBD1D4056DB49E
[17:13:15] - Machine ID: 1
[17:13:15]
[17:13:15] Loaded queue successfully.
[17:13:15]
[17:13:15] - Autosending finished units... [March 16 17:13:15 UTC]
[17:13:15] + Processing work unit
[17:13:15] Trying to send all finished work units
[17:13:15] Work type a1 not eligible for variable processors
[17:13:15] + No unsent completed units remaining.
[17:13:15] Core required: FahCore_a1.exe
[17:13:15] - Autosend completed
[17:13:15] Core found.
[17:13:15] Working on queue slot 01 [March 16 17:13:15 UTC]
[17:13:15] + Working ...
[17:13:15] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 32238 -version 624'
[17:13:15]
[17:13:15] *------------------------------*
[17:13:15] Folding@Home Gromacs SMP Core
[17:13:15] Version 1.74 (November 27, 2006)
[17:13:15]
[17:13:15] Preparing to commence simulation
[17:13:15] - Ensuring status. Please wait.
[17:13:15]
[17:13:15] Project: 0 (Run 0, Clone 0, Gen 0)
[17:13:15]
[17:13:15] Error: Could not write local file. Exiting.
[17:13:20] - Shutting down core
[17:13:32] put
[17:13:32] - Starting from initial work packet
[17:13:32]
[17:13:32] Project: 0 (Run 0, Clone 0, Gen 0)
[17:13:32]
[17:13:32] Error: Could not write local file. Exiting.
[17:13:37] - Shutting down core
Uh. Help? (I don't even know where to begin on this one).
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Mon Mar 16, 2009 6:22 pm
by 7im
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Mon Mar 16, 2009 6:35 pm
by alpha754293
Problem persisted for about 3 to 3.5 hours. (Not quite the 4 that is mentioned in the wiki page).
I don't know if it assigned any server at all, including 0.0.0.0 or 127.0.0.1.
Here's the full console output since last WU. Note the times.
[08:07:15] Trying to send all finished work units
[08:07:15] + No unsent completed units remaining.
[08:07:15] - Autosend completed
[08:13:04] Timered checkpoint triggered.
[08:14:03] Writing local files
[08:14:03] Completed 450000 out of 2500000 steps (18 percent)
[08:29:03] Timered checkpoint triggered.
[08:30:00] Writing local files
[08:30:00] Completed 475000 out of 2500000 steps (19 percent)
[08:45:00] Timered checkpoint triggered.
[08:45:26] Writing local files
[08:45:26] Completed 500000 out of 2500000 steps (20 percent)
[09:00:25] Timered checkpoint triggered.
[09:00:52] Writing local files
[09:00:52] Completed 525000 out of 2500000 steps (21 percent)
[09:15:52] Timered checkpoint triggered.
[09:16:18] Writing local files
[09:16:18] Completed 550000 out of 2500000 steps (22 percent)
[09:31:18] Timered checkpoint triggered.
[09:32:12] Writing local files
[09:32:12] Completed 575000 out of 2500000 steps (23 percent)
[09:47:11] Timered checkpoint triggered.
[09:48:33] Writing local files
[09:48:33] Completed 600000 out of 2500000 steps (24 percent)
[10:03:33] Timered checkpoint triggered.
[10:04:57] Writing local files
[10:04:57] Completed 625000 out of 2500000 steps (25 percent)
[10:19:56] Timered checkpoint triggered.
[10:21:22] Writing local files
[10:21:22] Completed 650000 out of 2500000 steps (26 percent)
[10:36:22] Timered checkpoint triggered.
[10:37:24] Writing local files
[10:37:24] Completed 675000 out of 2500000 steps (27 percent)
[10:52:23] Timered checkpoint triggered.
[10:53:20] Writing local files
[10:53:20] Completed 700000 out of 2500000 steps (28 percent)
[11:08:19] Timered checkpoint triggered.
[11:09:18] Writing local files
[11:09:18] Completed 725000 out of 2500000 steps (29 percent)
[11:24:18] Timered checkpoint triggered.
[11:25:08] Writing local files
[11:25:08] Completed 750000 out of 2500000 steps (30 percent)
[11:40:07] Timered checkpoint triggered.
[11:40:57] Writing local files
[11:40:57] Completed 775000 out of 2500000 steps (31 percent)
[11:55:57] Timered checkpoint triggered.
[11:56:47] Writing local files
[11:56:47] Completed 800000 out of 2500000 steps (32 percent)
[12:11:47] Timered checkpoint triggered.
[12:12:21] Writing local files
[12:12:21] Completed 825000 out of 2500000 steps (33 percent)
[12:27:20] Timered checkpoint triggered.
[12:27:30] Writing local files
[12:27:30] Completed 850000 out of 2500000 steps (34 percent)
[12:42:30] Timered checkpoint triggered.
[12:42:41] Writing local files
[12:42:41] Completed 875000 out of 2500000 steps (35 percent)
[12:57:41] Timered checkpoint triggered.
[12:58:31] Writing local files
[12:58:31] Completed 900000 out of 2500000 steps (36 percent)
[13:13:31] Timered checkpoint triggered.
[13:14:40] Writing local files
[13:14:40] Completed 925000 out of 2500000 steps (37 percent)
[13:29:40] Timered checkpoint triggered.
[13:30:29] Writing local files
[13:30:29] Completed 950000 out of 2500000 steps (38 percent)
[13:45:29] Timered checkpoint triggered.
[13:46:12] Writing local files
[13:46:12] Completed 975000 out of 2500000 steps (39 percent)
[14:01:12] Timered checkpoint triggered.
[14:01:50] Writing local files
[14:01:50] Completed 1000000 out of 2500000 steps (40 percent)
[14:07:15] - Autosending finished units... [March 15 14:07:15 UTC]
[14:07:15] Trying to send all finished work units
[14:07:15] + No unsent completed units remaining.
[14:07:15] - Autosend completed
[14:16:50] Timered checkpoint triggered.
[14:17:00] Writing local files
[14:17:00] Completed 1025000 out of 2500000 steps (41 percent)
[14:32:00] Timered checkpoint triggered.
[14:32:14] Writing local files
[14:32:14] Completed 1050000 out of 2500000 steps (42 percent)
[14:47:14] Timered checkpoint triggered.
[14:47:26] Writing local files
[14:47:26] Completed 1075000 out of 2500000 steps (43 percent)
[15:02:26] Timered checkpoint triggered.
[15:02:35] Writing local files
[15:02:35] Completed 1100000 out of 2500000 steps (44 percent)
[15:17:35] Timered checkpoint triggered.
[15:18:35] Writing local files
[15:18:36] Completed 1125000 out of 2500000 steps (45 percent)
[15:33:35] Timered checkpoint triggered.
[15:34:29] Writing local files
[15:34:29] Completed 1150000 out of 2500000 steps (46 percent)
[15:49:29] Timered checkpoint triggered.
[15:50:44] Writing local files
[15:50:44] Completed 1175000 out of 2500000 steps (47 percent)
[16:05:44] Timered checkpoint triggered.
[16:06:53] Writing local files
[16:06:53] Completed 1200000 out of 2500000 steps (48 percent)
[16:21:52] Timered checkpoint triggered.
[16:22:27] Writing local files
[16:22:27] Completed 1225000 out of 2500000 steps (49 percent)
[16:37:27] Timered checkpoint triggered.
[16:37:41] Writing local files
[16:37:41] Completed 1250000 out of 2500000 steps (50 percent)
[16:52:41] Timered checkpoint triggered.
[16:53:28] Writing local files
[16:53:28] Completed 1275000 out of 2500000 steps (51 percent)
[17:08:28] Timered checkpoint triggered.
[17:09:18] Writing local files
[17:09:18] Completed 1300000 out of 2500000 steps (52 percent)
[17:24:18] Timered checkpoint triggered.
[17:25:07] Writing local files
[17:25:08] Completed 1325000 out of 2500000 steps (53 percent)
[17:40:07] Timered checkpoint triggered.
[17:40:54] Writing local files
[17:40:54] Completed 1350000 out of 2500000 steps (54 percent)
[17:55:53] Timered checkpoint triggered.
[17:56:45] Writing local files
[17:56:45] Completed 1375000 out of 2500000 steps (55 percent)
[18:11:45] Timered checkpoint triggered.
[18:11:55] Writing local files
[18:11:55] Completed 1400000 out of 2500000 steps (56 percent)
[18:26:54] Timered checkpoint triggered.
[18:27:05] Writing local files
[18:27:05] Completed 1425000 out of 2500000 steps (57 percent)
[18:42:06] Timered checkpoint triggered.
[18:42:13] Writing local files
[18:42:13] Completed 1450000 out of 2500000 steps (58 percent)
[18:57:13] Timered checkpoint triggered.
[18:57:23] Writing local files
[18:57:23] Completed 1475000 out of 2500000 steps (59 percent)
[19:12:23] Timered checkpoint triggered.
[19:13:34] Writing local files
[19:13:34] Completed 1500000 out of 2500000 steps (60 percent)
[19:28:34] Timered checkpoint triggered.
[19:29:38] Writing local files
[19:29:38] Completed 1525000 out of 2500000 steps (61 percent)
[19:44:38] Timered checkpoint triggered.
[19:45:32] Writing local files
[19:45:32] Completed 1550000 out of 2500000 steps (62 percent)
[20:00:32] Timered checkpoint triggered.
[20:01:44] Writing local files
[20:01:45] Completed 1575000 out of 2500000 steps (63 percent)
[20:07:15] - Autosending finished units... [March 15 20:07:15 UTC]
[20:07:15] Trying to send all finished work units
[20:07:15] + No unsent completed units remaining.
[20:07:15] - Autosend completed
[20:16:44] Timered checkpoint triggered.
[20:17:25] Writing local files
[20:17:26] Completed 1600000 out of 2500000 steps (64 percent)
[20:32:25] Timered checkpoint triggered.
[20:32:35] Writing local files
[20:32:35] Completed 1625000 out of 2500000 steps (65 percent)
[20:47:35] Timered checkpoint triggered.
[20:47:52] Writing local files
[20:47:52] Completed 1650000 out of 2500000 steps (66 percent)
[21:02:52] Timered checkpoint triggered.
[21:03:06] Writing local files
[21:03:07] Completed 1675000 out of 2500000 steps (67 percent)
[21:18:07] Timered checkpoint triggered.
[21:18:53] Writing local files
[21:18:53] Completed 1700000 out of 2500000 steps (68 percent)
[21:33:53] Timered checkpoint triggered.
[21:34:46] Writing local files
[21:34:47] Completed 1725000 out of 2500000 steps (69 percent)
[21:49:46] Timered checkpoint triggered.
[21:50:58] Writing local files
[21:50:59] Completed 1750000 out of 2500000 steps (70 percent)
[22:05:58] Timered checkpoint triggered.
[22:07:16] Writing local files
[22:07:16] Completed 1775000 out of 2500000 steps (71 percent)
[22:22:15] Timered checkpoint triggered.
[22:23:02] Writing local files
[22:23:03] Completed 1800000 out of 2500000 steps (72 percent)
[22:38:03] Timered checkpoint triggered.
[22:39:18] Writing local files
[22:39:18] Completed 1825000 out of 2500000 steps (73 percent)
[22:54:18] Timered checkpoint triggered.
[22:55:33] Writing local files
[22:55:33] Completed 1850000 out of 2500000 steps (74 percent)
[23:10:33] Timered checkpoint triggered.
[23:11:44] Writing local files
[23:11:45] Completed 1875000 out of 2500000 steps (75 percent)
[23:26:44] Timered checkpoint triggered.
[23:27:43] Writing local files
[23:27:43] Completed 1900000 out of 2500000 steps (76 percent)
[23:42:43] Timered checkpoint triggered.
[23:42:53] Writing local files
[23:42:53] Completed 1925000 out of 2500000 steps (77 percent)
[23:57:54] Timered checkpoint triggered.
[23:58:04] Writing local files
[23:58:04] Completed 1950000 out of 2500000 steps (78 percent)
[00:13:04] Timered checkpoint triggered.
[00:13:34] Writing local files
[00:13:34] Completed 1975000 out of 2500000 steps (79 percent)
[00:28:34] Timered checkpoint triggered.
[00:29:19] Writing local files
[00:29:19] Completed 2000000 out of 2500000 steps (80 percent)
[00:44:19] Timered checkpoint triggered.
[00:45:06] Writing local files
[00:45:06] Completed 2025000 out of 2500000 steps (81 percent)
[01:00:06] Timered checkpoint triggered.
[01:00:58] Writing local files
[01:00:58] Completed 2050000 out of 2500000 steps (82 percent)
[01:15:59] Timered checkpoint triggered.
[01:16:54] Writing local files
[01:16:54] Completed 2075000 out of 2500000 steps (83 percent)
[01:31:54] Timered checkpoint triggered.
[01:32:56] Writing local files
[01:32:56] Completed 2100000 out of 2500000 steps (84 percent)
[01:47:55] Timered checkpoint triggered.
[01:48:59] Writing local files
[01:48:59] Completed 2125000 out of 2500000 steps (85 percent)
[02:03:59] Timered checkpoint triggered.
[02:04:15] Writing local files
[02:04:15] Completed 2150000 out of 2500000 steps (86 percent)
[02:07:15] - Autosending finished units... [March 16 02:07:15 UTC]
[02:07:15] Trying to send all finished work units
[02:07:15] + No unsent completed units remaining.
[02:07:15] - Autosend completed
[02:19:14] Timered checkpoint triggered.
[02:19:24] Writing local files
[02:19:24] Completed 2175000 out of 2500000 steps (87 percent)
[02:34:25] Timered checkpoint triggered.
[02:35:12] Writing local files
[02:35:13] Completed 2200000 out of 2500000 steps (88 percent)
[02:50:12] Timered checkpoint triggered.
[02:51:07] Writing local files
[02:51:08] Completed 2225000 out of 2500000 steps (89 percent)
[03:06:07] Timered checkpoint triggered.
[03:06:43] Writing local files
[03:06:43] Completed 2250000 out of 2500000 steps (90 percent)
[03:21:43] Timered checkpoint triggered.
[03:22:43] Writing local files
[03:22:43] Completed 2275000 out of 2500000 steps (91 percent)
[03:37:43] Timered checkpoint triggered.
[03:38:33] Writing local files
[03:38:33] Completed 2300000 out of 2500000 steps (92 percent)
[03:53:34] Timered checkpoint triggered.
[03:54:26] Writing local files
[03:54:26] Completed 2325000 out of 2500000 steps (93 percent)
[04:09:26] Timered checkpoint triggered.
[04:10:24] Writing local files
[04:10:24] Completed 2350000 out of 2500000 steps (94 percent)
[04:25:25] Timered checkpoint triggered.
[04:25:45] Writing local files
[04:25:45] Completed 2375000 out of 2500000 steps (95 percent)
[04:40:45] Timered checkpoint triggered.
[04:40:54] Writing local files
[04:40:54] Completed 2400000 out of 2500000 steps (96 percent)
[04:55:53] Timered checkpoint triggered.
[04:56:24] Writing local files
[04:56:24] Completed 2425000 out of 2500000 steps (97 percent)
[05:11:25] Timered checkpoint triggered.
[05:12:01] Writing local files
[05:12:01] Completed 2450000 out of 2500000 steps (98 percent)
[05:27:01] Timered checkpoint triggered.
[05:28:06] Writing local files
[05:28:06] Completed 2475000 out of 2500000 steps (99 percent)
[05:43:05] Timered checkpoint triggered.
M E G A - F L O P S   A C C O U N T I N G
Parallel run - timing based on wallclock.
RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
NF=No Forces
Computing:                        M-Number           M-Flops  % of Flops
-----------------------------------------------------------------------
RF Coul                       63341.807547    2090279.649051         0.4
RF Coul [W3]                    346.776382      33984.085436         0.0
RF Coul + VdW(T)             167922.307839   10914950.009535         2.3
RF Coul + VdW(T) [W3]         30200.857433    3926111.466290         0.8
RF Coul + VdW(T) [W3-W3]    1234505.060554  395041619.377280        83.7
Outer nonbonded loop         295281.347136    2952813.471360         0.6
1,4 nonbonded interactions     8282.503313     745425.298170         0.2
NS-Pairs                     529175.037426   11112675.785946         2.4
Reset In Box                  16682.066728     150138.600552         0.0
Shift-X                      333635.133454    2001810.800724         0.4
CG-CoM                         5770.273081     167337.919349         0.0
Sum Forces                   500460.200184     500460.200184         0.1
Bonds                          1565.000626      67295.026918         0.0
Angles                         5752.502301     937657.875063         0.2
Propers                         582.500233     133392.553357         0.0
RB-Dihedrals                   6610.002644    1632670.653068         0.3
Virial                       167090.066836    3007621.203048         0.6
Ext.ens. Update              166820.066728    9008283.603312         1.9
Stop-CM                       16682.000000     166820.000000         0.0
Calc-Ekin                    166820.133456    4504143.603312         1.0
Shake                          3726.091117     111782.733510         0.0
Constraint-V                 166820.066728    1000920.400368         0.2
Shake-Init                     1600.000640      16000.006400         0.0
Constraint-Vir               165272.566109    3966541.586616         0.8
Settle                        54557.521823   17622079.548829         3.7
-----------------------------------------------------------------------
Total                                       471812815.457678       100.0
-----------------------------------------------------------------------
               NODE (s)   Real (s)        (%)
Time:         94477.000  94477.000      100.0
              1d02h14:37
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     15.838      4.994      4.573      5.249
[05:44:19] Writing local files
[05:44:19] Completed 2500000 out of 2500000 steps (100 percent)
[05:44:19] Writing final coordinates.
[05:44:19] Past main M.D. loop
[05:44:19] Will end MPI now
[05:45:19]
[05:45:19] Finished Work Unit:
[05:45:19] - Reading up to 1601592 from "work/wudata_01.arc": Read 1601592
[05:45:19] - Reading up to 488384 from "work/wudata_01.xtc": Read 488384
[05:45:19] goefile size: 0
[05:45:19] logfile size: 76423
[05:45:19] Leaving Run
[05:45:22] - Writing 2268243 bytes of core data to disk...
[05:45:22] ... Done.
[05:45:23] - Shutting down core
[05:45:23]
[05:45:23] Folding@home Core Shutdown: FINISHED_UNIT
[08:07:15] - Autosending finished units... [March 16 08:07:15 UTC]
[08:07:15] Trying to send all finished work units
[08:07:15] + No unsent completed units remaining.
[08:07:15] - Autosend completed
[14:07:15] - Autosending finished units... [March 16 14:07:15 UTC]
[14:07:15] Trying to send all finished work units
[14:07:15] + No unsent completed units remaining.
[14:07:15] - Autosend completed
[17:12:53] ***** Got an Activate signal (2)
[17:12:53] Killing all core threads
Folding@Home Client Shutdown.
share@computenode:~/fah1> ./fah1 -smp 4 -verbosity 9
Note: Please read the license agreement (fah1 -license). Further
use of this software requires that you have read and accepted this agreement.
8 cores detected
--- Opening Log file [March 16 17:13:15 UTC]
# Linux SMP Console Edition ###################################################
###############################################################################
Folding@Home Client Version 6.24beta
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /share/fah1
Executable: ./fah1
Arguments: -smp 4 -verbosity 9
[17:13:15] - Ask before connecting: No
[17:13:15] - User name: alpha754293 (Team 596)
[17:13:15] - User ID: 47FBD1D4056DB49E
[17:13:15] - Machine ID: 1
[17:13:15]
[17:13:15] Loaded queue successfully.
[17:13:15]
[17:13:15] - Autosending finished units... [March 16 17:13:15 UTC]
[17:13:15] + Processing work unit
[17:13:15] Trying to send all finished work units
[17:13:15] Work type a1 not eligible for variable processors
[17:13:15] + No unsent completed units remaining.
[17:13:15] Core required: FahCore_a1.exe
[17:13:15] - Autosend completed
[17:13:15] Core found.
[17:13:15] Working on queue slot 01 [March 16 17:13:15 UTC]
[17:13:15] + Working ...
[17:13:15] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 32238 -version 624'
[17:13:15]
[17:13:15] *------------------------------*
[17:13:15] Folding@Home Gromacs SMP Core
[17:13:15] Version 1.74 (November 27, 2006)
[17:13:15]
[17:13:15] Preparing to commence simulation
[17:13:15] - Ensuring status. Please wait.
[17:13:15]
[17:13:15] Project: 0 (Run 0, Clone 0, Gen 0)
[17:13:15]
[17:13:15] Error: Could not write local file. Exiting.
[17:13:20] - Shutting down core
[17:13:32] put
[17:13:32] - Starting from initial work packet
[17:13:32]
[17:13:32] Project: 0 (Run 0, Clone 0, Gen 0)
[17:13:32]
[17:13:32] Error: Could not write local file. Exiting.
[17:13:37] - Shutting down core
[0]0:Return code = 18
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[17:15:24] CoreStatus = 12 (18)
[17:15:24] Client-core communications error: ERROR 0x12
[17:15:24] Deleting current work unit & continuing...
[17:19:46] - Warning: Could not delete all work unit files (1): Core returned invalid code
[17:19:46] Trying to send all finished work units
[17:19:46] + No unsent completed units remaining.
[17:19:46] - Preparing to get new work unit...
[17:19:46] + Attempting to get work packet
[17:19:46] - Will indicate memory of 16003 MB
[17:19:46] - Connecting to assignment server
[17:19:46] Connecting to http://assign.stanford.edu:8080/
[17:19:47] Posted data.
[17:19:47] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[17:19:47] + News From Folding@Home: Welcome to Folding@Home
[17:19:47] Loaded queue successfully.
[17:19:47] Connecting to http://171.64.65.64:8080/
[17:19:50] Posted data.
[17:19:50] Initial: 0000; - Receiving payload (expected size: 2438512)
[17:19:57] - Downloaded at ~340 kB/s
[17:19:57] - Averaged speed for that direction ~309 kB/s
[17:19:57] + Received work.
[17:19:57] + Closed connections
[17:20:02]
[17:20:02] + Processing work unit
[17:20:02] Work type a1 not eligible for variable processors
[17:20:02] Core required: FahCore_a1.exe
[17:20:02] Core found.
[17:20:02] Working on queue slot 02 [March 16 17:20:02 UTC]
[17:20:02] + Working ...
[17:20:02] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 32238 -version 624'
[17:20:02]
[17:20:02] *------------------------------*
[17:20:02] Folding@Home Gromacs SMP Core
[17:20:02] Version 1.74 (November 27, 2006)
[17:20:02]
[17:20:02] Preparing to commence simulation
[17:20:02] - Ensuring status. Please wait.
[17:20:19] - Looking at optimizations...
[17:20:19] - Working with standard loops on this execution.
[17:20:19] - Previous termination of core was improper.
[17:20:19] - Going to use standard loops.
[17:20:19] - Files status OK
[17:20:19] Starting from initial work packet
[17:20:19]
[17:20:19] Project: 2653 (Run 36, Cl- Starting from initial work packet
[17:20:19]
[17:20:19] Project: 265Entering M.D.
[17:20:19] ne 17, Gen 134)
[17:20:19]
[17:20:20] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=computenode
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NODEID=0 argc=15
NODEID=1 argc=15
NODEID=3 argc=15
NODEID=2 argc=15
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2004, The GROMACS development team,
check out http://www.gromacs.org for more information.
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.
[17:20:26] Protein: Protein in POPC
[17:20:26] Writing local files
starting mdrun 'Protein in POPC'
500000 steps, 1000.0 ps.
[17:20:27] boost OK.
[17:20:27] boost OK.
[17:20:27] cal files
[17:20:27] Completed 0 out of 500000 steps (0 percent)
[17:33:54] Writing local files
[17:33:54] Completed 5000 out of 500000 steps (1 percent)
[17:47:15] Writing local files
[17:47:15] Completed 10000 out of 500000 steps (2 percent)
[18:00:43] Writing local files
[18:00:43] Completed 15000 out of 500000 steps (3 percent)
[18:14:12] Writing local files
[18:14:12] Completed 20000 out of 500000 steps (4 percent)
[18:27:40] Writing local files
[18:27:40] Completed 25000 out of 500000 steps (5 percent)
I guess my question is what's Project: 0 (Run 0, Clone 0, Gen 0)?
There's no entry for CoreStatus 12 (18).
Note also there's no assignment or attempted connection to assignment server prior to starting P0R0C0G0.
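One way to "note the times" mechanically is to scan the log for long silences between consecutive timestamped lines. The following is only a minimal sketch: it assumes the `[HH:MM:SS]` prefix shown in the log above, and because those prefixes carry no date it treats a timestamp earlier than the previous one as a single midnight rollover. The name `find_gaps` and the 30-minute threshold are illustrative choices, not anything the client provides.

```python
import re
from datetime import timedelta

# Matches the "[HH:MM:SS]" prefix used by the client log lines above.
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")

def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (gap, previous_line, next_line) for silences longer than threshold.

    Assumption: timestamps carry no date, so a time earlier than the
    previous one is treated as one midnight rollover.
    """
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # untimestamped output (e.g. the flops table) is skipped
        h, mi, s = (int(x) for x in m.groups())
        now = timedelta(hours=h, minutes=mi, seconds=s)
        if prev_time is not None:
            delta = now - prev_time
            if delta < timedelta(0):
                delta += timedelta(days=1)  # crossed midnight
            if delta > threshold:
                gaps.append((delta, prev_line, line))
        prev_time, prev_line = now, line
    return gaps

# Lines copied from the log above.
sample = [
    "[05:45:23] Folding@home Core Shutdown: FINISHED_UNIT",
    "[08:07:15] - Autosending finished units... [March 16 08:07:15 UTC]",
    "[08:07:15] Trying to send all finished work units",
    "[14:07:15] - Autosending finished units... [March 16 14:07:15 UTC]",
]
for gap, before, after in find_gaps(sample):
    print(gap, "|", before, "->", after)
```

In the log above, the only lines between 05:45:23 and the manual kill at 17:12:53 are the six-hourly autosend messages, so a scan like this would flag the whole stretch as silence even though the client was still ticking.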
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Mon Mar 16, 2009 6:38 pm
by 7im
0.0.0.0 is nothing. It's a place holder, and a way of indicating there are either no work units available for your configuration, or one of the other reasons listed in the wiki.
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Mon Mar 16, 2009 6:48 pm
by alpha754293
7im wrote:0.0.0.0 is nothing. It's a place holder, and a way of indicating there are either no work units available for your configuration, or one of the other reasons listed in the wiki.
Well, for me, there's a HUGE difference between no address at all and 0.0.0.0 or even 127.0.0.1. Even as a placeholder.
I'm sure that we can debate the semantics some other time, but the point is that I did not see any entries in the log pertaining to assignment prior to its attempt to start working on Project: 0 (Run 0, Clone 0, Gen 0).
Which means either that it is a legitimate WU (however unlikely), or that something went wrong after the last WU finished: the client did nothing for 3 hours and logged no status message saying that it had gone into a holding pattern. That may be indicative of a larger, systemic issue with the hardware (it is possible), with the client, with the server, or with the completion of the previous WU.
That's like saying NaN or Inf. or -Inf. = 1.E-30, or 0. There's a HUGE difference between those. In any case, what's P0R0C0G0? What's core status 12?
*edit*
Where's the line in the log file that says that the assignment server is 0.0.0.0?
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Mon Mar 16, 2009 7:41 pm
by anandhanju
I vaguely remember seeing a report like this sometime last year. Not a Linux user so not really sure of this. I think Project: 0 (Run 0, Clone 0, Gen 0) refers to a corrupt queue entry. In your case, slot 1 happened to contain the WU that got stuck while finalizing results and this may have gummed up the queue. When you restarted the client, the queue entry was found to be invalid and deleted. I'm pretty sure the results at slot 01 were lost.
Edit: Related posts: viewtopic.php?f=44&t=4321 and viewtopic.php?f=19&t=2869
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Mon Mar 16, 2009 7:44 pm
by alpha754293
Hmm...that's weird because it said that the WU finished and then it couldn't send the results and/or the autosend didn't pick up on it. *shrug*
*edit*
Thanks for the links.
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Tue Mar 17, 2009 4:42 am
by Zagen30
This happened to me today, except that I hadn't finished a WU (I had closed in the middle of one which was running normally) and it did this on every attempt to restart the client. Eventually I found my way to the wiki (there is an entry for corestatus 12 (18):
http://fahwiki.net/index.php/CoreStatus_codes#12), saw that that status is due to issues with the queue, and let the client delete the old stuff and download a new WU (just waited a few minutes, didn't delete any files). I lost 64% of a 2653, but I'll live.
BTW, I had not touched the config file, nor had I touched any other files that related to the Linux client. I had even closed the client earlier in the WU and it had restarted just fine. Could this be caused by restarting a virtual Linux box too soon after ctrl-c-ing out of the client?
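On reading the status line itself: the two numbers in a `CoreStatus` line look like the same value written twice, first in hexadecimal and then in decimal (0x12 = 18, 0x64 = 100), which fits the `ERROR 0x12` line in the log. The sketch below parses such a line and checks that reading. `parse_core_status` is a hypothetical helper name, and the hex/decimal interpretation is inferred from the values in this thread, not taken from client documentation.

```python
import re

# "CoreStatus = 12 (18)": assumed to be hex code, then the same value in decimal.
CORE_STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")

def parse_core_status(line):
    """Return the status code as an int, or None if the line has no CoreStatus."""
    m = CORE_STATUS.search(line)
    if m is None:
        return None
    code = int(m.group(1), 16)   # first number read as hexadecimal
    decimal = int(m.group(2))    # number in parentheses read as decimal
    if code != decimal:
        raise ValueError(f"hex and decimal fields disagree in: {line!r}")
    return code

print(parse_core_status("[17:15:24] CoreStatus = 12 (18)"))  # failing unit from the log
print(parse_core_status("CoreStatus = 64 (100)"))            # normal finish
```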
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Tue Mar 17, 2009 4:46 am
by alpha754293
Zagen30 wrote:This happened to me today, except that I hadn't finished a WU (I had closed in the middle of one which was running normally) and it did this on every attempt to restart the client. Eventually I found my way to the wiki (there is an entry for corestatus 12 (18):
http://fahwiki.net/index.php/CoreStatus_codes#12), saw that that status is due to issues with the queue, and let the client delete the old stuff and download a new WU (just waited a few minutes, didn't delete any files). I lost 64% of a 2653, but I'll live.
BTW, I had not touched the config file, nor had I touched any other files that related to the Linux client. I had even closed the client earlier in the WU and it had restarted just fine. Could this be caused by restarting a virtual Linux box too soon after ctrl-c-ing out of the client?
Oh...lol. I must have overlooked it. oops. my bad.
I lost 100% and it went "numb" after 3 hours so I had to CTRL+C outta there and restart the client manually after running it unsupervised for quite some time.
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Tue Mar 17, 2009 5:21 am
by bruce
Project 0, run 0, clone 0, gen 0 is a WU that isn't there. In rare cases when the previous WU has an error, the queue is updated to point to the next position before the WU is downloaded and then some kind of error happens that makes the client believe that something is there that should be processed. The FahCore is unable to process it, of course, and the client moves on to download a new assignment. [You've already figured most of this out yourself.]
The actual cause of this phantom WU has never been clearly identified but if you look at your log, the previous WU never finished and you had to kill FAH before it successfully moved on to the next WU. Upon restart, it found that phantom WU.
@7im:
Sorry, but you're thinking about the server at IP address 0.0.0.0 which has nothing to do with a WU with PRCG = 0, 0, 0, 0.
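For anyone who wants to spot the phantom WU in a long FAHlog without reading it by eye, a rough sketch follows. It assumes the `Project: P (Run R, Clone C, Gen G)` wording seen in the log above; `phantom_units` is a made-up name for illustration. Garbled lines where the MPI ranks wrote over each other (as in the 17:20:19 entries) simply will not match.

```python
import re

PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")

def phantom_units(lines):
    """Return (index, line) pairs for log lines that name PRCG 0, 0, 0, 0."""
    hits = []
    for i, line in enumerate(lines):
        m = PRCG.search(line)
        if m and all(int(g) == 0 for g in m.groups()):
            hits.append((i, line))
    return hits

sample = [
    "[17:13:15] Project: 0 (Run 0, Clone 0, Gen 0)",
    "[17:13:15] Error: Could not write local file. Exiting.",
    # Hypothetical clean form of a real unit's line, for contrast.
    "Project: 2653 (Run 36, Clone 17, Gen 134)",
]
for index, line in phantom_units(sample):
    print(index, line)
```

A hit only says the client tried to process an empty queue slot; as described above, the thing to investigate is whatever happened to the WU before it.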
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Tue Mar 17, 2009 5:29 am
by 7im
Hmmm.... sounds like a new wiki entry coming so I don't confuse them again...

Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Tue Mar 17, 2009 10:11 am
by alpha754293
Do you know why or what would cause this? I had my system running for about 14 days straight without any problems until now. And I only noticed it when I saw that the FahMon didn't seem to be updating that client like it should, so that's when I started checking the logs.
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Tue Mar 17, 2009 4:58 pm
by bruce
The critical message in the FAHlog that you posted is "Shutting down core" That message should always be followed by
"Folding@home Core Shutdown: FINISHED_UNIT
Folding@home Core Shutdown: FINISHED_UNIT
CoreStatus = 64 (100)
Sending work to server"
and in your case, that didn't happen. (You already said as much in an earlier post.) At that point, if you had checked, I suspect that at least one copy of FahCore_a1 was still running. Once your system is hung in that condition, what happens next, including the bogus WU, is a result of the initial problem. In other words, you can ignore WU 0,0,0,0 because it's not the problem; the system hang is the problem.
In the "known bugs" list you'll find several reasons why FahCore_a1 hangs, but most notably it's probably a change in your network, including DHCP renewing an address, a WiFi connection going out of range, etc. Some later versions of the Linux kernel (and Windows Vista, for that matter) contain a new IP stack which fixes this problem, but somebody else will have to tell you which ones. One work-around that MIGHT help is to use a fixed IP address on your LAN, but I can't promise that will work.
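The check described here (a "Shutting down core" line that is never followed by the success status) can be sketched as a small log scan. This is an illustration only: the function name `unconfirmed_shutdowns` and the 10-line window are assumptions, and it only recognizes the successful sequence, so a unit that ended with an error status such as `CoreStatus = 12 (18)` is flagged as well.

```python
def unconfirmed_shutdowns(lines, window=10):
    """Return indices of 'Shutting down core' lines with no success status after them.

    A shutdown counts as confirmed if 'CoreStatus = 64 (100)' appears within
    the next `window` lines and before any further 'Shutting down core' line.
    """
    flagged = []
    for i, line in enumerate(lines):
        if "Shutting down core" not in line:
            continue
        confirmed = False
        for follow in lines[i + 1 : i + 1 + window]:
            if "Shutting down core" in follow:
                break  # a new shutdown started before this one was confirmed
            if "CoreStatus = 64 (100)" in follow:
                confirmed = True
                break
        if not confirmed:
            flagged.append(i)
    return flagged

# Lines copied from the posted log: the status line never arrives.
hung = [
    "[05:45:23] - Shutting down core",
    "[05:45:23]",
    "[05:45:23] Folding@home Core Shutdown: FINISHED_UNIT",
    "[08:07:15] - Autosending finished units... [March 16 08:07:15 UTC]",
    "[08:07:15] Trying to send all finished work units",
]
# The expected sequence as quoted above (timestamps omitted).
healthy = [
    "- Shutting down core",
    "Folding@home Core Shutdown: FINISHED_UNIT",
    "Folding@home Core Shutdown: FINISHED_UNIT",
    "CoreStatus = 64 (100)",
    "Sending work to server",
]
print(unconfirmed_shutdowns(hung))     # index 0 is flagged
print(unconfirmed_shutdowns(healthy))  # nothing flagged
```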
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Tue Mar 17, 2009 5:03 pm
by alpha754293
bruce wrote:The critical message in the FAHlog that you posted is "Shutting down core" That message should always be followed by
"Folding@home Core Shutdown: FINISHED_UNIT
Folding@home Core Shutdown: FINISHED_UNIT
CoreStatus = 64 (100)
Sending work to server"
and in your case, that didn't happen. (You already said as much in an earlier post.) At that point, if you had checked, I suspect that at least one copy of FahCore_a1 was still running. Once your system is hung in that condition, what happens next, including the bogus WU, is a result of the initial problem. In other words, you can ignore WU 0,0,0,0 because it's not the problem; the system hang is the problem.
In the "known bugs" list you'll find several reasons why FahCore_a1 hangs, but most notably it's probably a change in your network, including DHCP renewing an address, a WiFi connection going out of range, etc. Some later versions of the Linux kernel (and Windows Vista, for that matter) contain a new IP stack which fixes this problem, but somebody else will have to tell you which ones. One work-around that MIGHT help is to use a fixed IP address on your LAN, but I can't promise that will work.
Actually, no. I checked it. Wait. correction. I don't know. I only checked it after CTRL+C to make sure that there are no <defunct> processes still lingering.
Ran into it again just a few minutes ago.
There shouldn't be any changes in the network config. If there were, it would require a power outage since the remainder of the system has an uptime of 15 days on the same address.
AFAIK, I don't think that the IP stack has changed nor the DHCP assignments.
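For next time, the check for lingering or `<defunct>` cores could be done before hitting CTRL+C rather than after. Below is a rough sketch that reads the output of `ps -eo pid,stat,comm` on Linux and picks out FahCore processes; a STAT value starting with `Z` marks a zombie, which `ps` labels `<defunct>`. The helper names `parse_ps` and `lingering_cores` are made up for illustration, and the sample output is invented to show the format.

```python
import subprocess

def parse_ps(ps_output, name="FahCore_a1"):
    """Return (pid, stat, is_zombie) for each process whose command contains name.

    Expects the output of `ps -eo pid,stat,comm`, header line included.
    """
    found = []
    for row in ps_output.splitlines()[1:]:  # skip the header row
        parts = row.split(None, 2)
        if len(parts) < 3:
            continue
        pid, stat, comm = parts
        if name in comm:
            # A STAT beginning with "Z" marks a zombie (<defunct>) process.
            found.append((int(pid), stat, stat.startswith("Z")))
    return found

def lingering_cores(name="FahCore_a1"):
    """Run ps and report matching processes (needs a ps that supports -eo)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout, name)

# Invented sample output, only to show what the parser expects.
sample_ps = (
    "  PID STAT COMMAND\n"
    "    1 Ss   init\n"
    " 4321 Sl   FahCore_a1.exe\n"
    " 4322 Z    FahCore_a1.exe <defunct>\n"
)
for pid, stat, zombie in parse_ps(sample_ps):
    print(pid, stat, "defunct" if zombie else "still running")
```

Calling `lingering_cores()` while the client appears idle after a finished WU would show whether any copy of the core is still alive at that point.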
Re: Project: 0 (Run 0, Clone 0, Gen 0)
Posted: Tue Mar 17, 2009 5:10 pm
by alpha754293
Does the a1 core require external network for the MPICH to function?
i.e. for core-to-core communications, does it loop back via the external network (it sees the cores as \\<ip_address>\cpu0 etc., and all communications between cores must go through the IP MAC interface), or is it local distributed MPICH (no external network required for core-to-core communications)?
In traditional HPC applications, and especially in larger installations, all MPICH communications are external to the system. There are probably controls and managers to try and keep as much of it local as possible, but that's also one of the big reasons why IB and Myrinet are so popular: core-to-core communications may not necessarily exist on the local system anymore. If you have a monolithic OS installation, the OS will enumerate all cores, but it does not take into consideration the physical location/gap between cores.
I'm just wondering if the Fah a1 core is similar in that respect.
From your reply, if a change in network configuration is sufficient to cause the core to freeze; without any abort, error, or ABT codes; then I would tend to think that it is coded like the HPC model, which should also mean that the F@H client is actually capable of distributed parallel processing provided that the monolithic OS installation is transparent to F@H.
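One thing the log does show is that the client launches the core with `mpiexec -np 4 -host 127.0.0.1`, so the ranks are at least addressed through the loopback interface. Whether the a1 core also depends on the external interface is not established anywhere in this thread. As a basic sanity check, the sketch below does a TCP round trip over 127.0.0.1 using only the Python standard library; it says nothing about how MPICH itself talks between ranks, and `loopback_round_trip` is just an illustrative name.

```python
import socket

def loopback_round_trip(payload=b"ping"):
    """Send payload to a listener on 127.0.0.1 and return the echoed bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.settimeout(5)
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            conn, _ = server.accept()
            with conn:
                conn.settimeout(5)
                client.sendall(payload)
                received = conn.recv(len(payload))  # small payload: one recv suffices
                conn.sendall(received)
                return client.recv(len(payload))

print(loopback_round_trip())
```

If this works while the LAN cable is unplugged, plain loopback traffic does not need the external network on that machine, though that alone does not settle what the core's MPI layer does when an address changes.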