Project: 0 (Run 0, Clone 0, Gen 0)

Moderators: Site Moderators, FAHC Science Team

alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

error:

Code:

Launch directory: /share/fah1
Executable: ./fah1
Arguments: -smp 4 -verbosity 9

[17:13:15] - Ask before connecting: No
[17:13:15] - User name: alpha754293 (Team 596)
[17:13:15] - User ID: 47FBD1D4056DB49E
[17:13:15] - Machine ID: 1
[17:13:15]
[17:13:15] Loaded queue successfully.
[17:13:15]
[17:13:15] - Autosending finished units... [March 16 17:13:15 UTC]
[17:13:15] + Processing work unit
[17:13:15] Trying to send all finished work units
[17:13:15] Work type a1 not eligible for variable processors
[17:13:15] + No unsent completed units remaining.
[17:13:15] Core required: FahCore_a1.exe
[17:13:15] - Autosend completed
[17:13:15] Core found.
[17:13:15] Working on queue slot 01 [March 16 17:13:15 UTC]
[17:13:15] + Working ...
[17:13:15] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 32238 -version 624'

[17:13:15]
[17:13:15] *------------------------------*
[17:13:15] Folding@Home Gromacs SMP Core
[17:13:15] Version 1.74 (November 27, 2006)
[17:13:15]
[17:13:15] Preparing to commence simulation
[17:13:15] - Ensuring status. Please wait.
[17:13:15]
[17:13:15] Project: 0 (Run 0, Clone 0, Gen 0)
[17:13:15]
[17:13:15] Error: Could not write local file.  Exiting.
[17:13:20] - Shutting down core
[17:13:32] put
[17:13:32] - Starting from initial work packet
[17:13:32]
[17:13:32] Project: 0 (Run 0, Clone 0, Gen 0)
[17:13:32]
[17:13:32] Error: Could not write local file.  Exiting.
[17:13:37] - Shutting down core
Uh. Help? (I don't even know where to begin on this one).
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by 7im »

How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

7im wrote:Try the fah wiki search... ;) http://fahwiki.net/index.php?title=Why_ ... 0.0.0.0%3F
Problem persisted for about 3 to 3.5 hours. (Not quite the 4 that is mentioned in the wiki page).

I don't know if it assigned any server at all, including 0.0.0.0 or 127.0.0.1.

Here's the full console output since last WU. Note the times.

Code:

[08:07:15] Trying to send all finished work units
[08:07:15] + No unsent completed units remaining.
[08:07:15] - Autosend completed
[08:13:04] Timered checkpoint triggered.
[08:14:03] Writing local files
[08:14:03] Completed 450000 out of 2500000 steps  (18 percent)
[08:29:03] Timered checkpoint triggered.
[08:30:00] Writing local files
[08:30:00] Completed 475000 out of 2500000 steps  (19 percent)
[08:45:00] Timered checkpoint triggered.
[08:45:26] Writing local files
[08:45:26] Completed 500000 out of 2500000 steps  (20 percent)
[09:00:25] Timered checkpoint triggered.
[09:00:52] Writing local files
[09:00:52] Completed 525000 out of 2500000 steps  (21 percent)
[09:15:52] Timered checkpoint triggered.
[09:16:18] Writing local files
[09:16:18] Completed 550000 out of 2500000 steps  (22 percent)
[09:31:18] Timered checkpoint triggered.
[09:32:12] Writing local files
[09:32:12] Completed 575000 out of 2500000 steps  (23 percent)
[09:47:11] Timered checkpoint triggered.
[09:48:33] Writing local files
[09:48:33] Completed 600000 out of 2500000 steps  (24 percent)
[10:03:33] Timered checkpoint triggered.
[10:04:57] Writing local files
[10:04:57] Completed 625000 out of 2500000 steps  (25 percent)
[10:19:56] Timered checkpoint triggered.
[10:21:22] Writing local files
[10:21:22] Completed 650000 out of 2500000 steps  (26 percent)
[10:36:22] Timered checkpoint triggered.
[10:37:24] Writing local files
[10:37:24] Completed 675000 out of 2500000 steps  (27 percent)
[10:52:23] Timered checkpoint triggered.
[10:53:20] Writing local files
[10:53:20] Completed 700000 out of 2500000 steps  (28 percent)
[11:08:19] Timered checkpoint triggered.
[11:09:18] Writing local files
[11:09:18] Completed 725000 out of 2500000 steps  (29 percent)
[11:24:18] Timered checkpoint triggered.
[11:25:08] Writing local files
[11:25:08] Completed 750000 out of 2500000 steps  (30 percent)
[11:40:07] Timered checkpoint triggered.
[11:40:57] Writing local files
[11:40:57] Completed 775000 out of 2500000 steps  (31 percent)
[11:55:57] Timered checkpoint triggered.
[11:56:47] Writing local files
[11:56:47] Completed 800000 out of 2500000 steps  (32 percent)
[12:11:47] Timered checkpoint triggered.
[12:12:21] Writing local files
[12:12:21] Completed 825000 out of 2500000 steps  (33 percent)
[12:27:20] Timered checkpoint triggered.
[12:27:30] Writing local files
[12:27:30] Completed 850000 out of 2500000 steps  (34 percent)
[12:42:30] Timered checkpoint triggered.
[12:42:41] Writing local files
[12:42:41] Completed 875000 out of 2500000 steps  (35 percent)
[12:57:41] Timered checkpoint triggered.
[12:58:31] Writing local files
[12:58:31] Completed 900000 out of 2500000 steps  (36 percent)
[13:13:31] Timered checkpoint triggered.
[13:14:40] Writing local files
[13:14:40] Completed 925000 out of 2500000 steps  (37 percent)
[13:29:40] Timered checkpoint triggered.
[13:30:29] Writing local files
[13:30:29] Completed 950000 out of 2500000 steps  (38 percent)
[13:45:29] Timered checkpoint triggered.
[13:46:12] Writing local files
[13:46:12] Completed 975000 out of 2500000 steps  (39 percent)
[14:01:12] Timered checkpoint triggered.
[14:01:50] Writing local files
[14:01:50] Completed 1000000 out of 2500000 steps  (40 percent)
[14:07:15] - Autosending finished units... [March 15 14:07:15 UTC]
[14:07:15] Trying to send all finished work units
[14:07:15] + No unsent completed units remaining.
[14:07:15] - Autosend completed
[14:16:50] Timered checkpoint triggered.
[14:17:00] Writing local files
[14:17:00] Completed 1025000 out of 2500000 steps  (41 percent)
[14:32:00] Timered checkpoint triggered.
[14:32:14] Writing local files
[14:32:14] Completed 1050000 out of 2500000 steps  (42 percent)
[14:47:14] Timered checkpoint triggered.
[14:47:26] Writing local files
[14:47:26] Completed 1075000 out of 2500000 steps  (43 percent)
[15:02:26] Timered checkpoint triggered.
[15:02:35] Writing local files
[15:02:35] Completed 1100000 out of 2500000 steps  (44 percent)
[15:17:35] Timered checkpoint triggered.
[15:18:35] Writing local files
[15:18:36] Completed 1125000 out of 2500000 steps  (45 percent)
[15:33:35] Timered checkpoint triggered.
[15:34:29] Writing local files
[15:34:29] Completed 1150000 out of 2500000 steps  (46 percent)
[15:49:29] Timered checkpoint triggered.
[15:50:44] Writing local files
[15:50:44] Completed 1175000 out of 2500000 steps  (47 percent)
[16:05:44] Timered checkpoint triggered.
[16:06:53] Writing local files
[16:06:53] Completed 1200000 out of 2500000 steps  (48 percent)
[16:21:52] Timered checkpoint triggered.
[16:22:27] Writing local files
[16:22:27] Completed 1225000 out of 2500000 steps  (49 percent)
[16:37:27] Timered checkpoint triggered.
[16:37:41] Writing local files
[16:37:41] Completed 1250000 out of 2500000 steps  (50 percent)
[16:52:41] Timered checkpoint triggered.
[16:53:28] Writing local files
[16:53:28] Completed 1275000 out of 2500000 steps  (51 percent)
[17:08:28] Timered checkpoint triggered.
[17:09:18] Writing local files
[17:09:18] Completed 1300000 out of 2500000 steps  (52 percent)
[17:24:18] Timered checkpoint triggered.
[17:25:07] Writing local files
[17:25:08] Completed 1325000 out of 2500000 steps  (53 percent)
[17:40:07] Timered checkpoint triggered.
[17:40:54] Writing local files
[17:40:54] Completed 1350000 out of 2500000 steps  (54 percent)
[17:55:53] Timered checkpoint triggered.
[17:56:45] Writing local files
[17:56:45] Completed 1375000 out of 2500000 steps  (55 percent)
[18:11:45] Timered checkpoint triggered.
[18:11:55] Writing local files
[18:11:55] Completed 1400000 out of 2500000 steps  (56 percent)
[18:26:54] Timered checkpoint triggered.
[18:27:05] Writing local files
[18:27:05] Completed 1425000 out of 2500000 steps  (57 percent)
[18:42:06] Timered checkpoint triggered.
[18:42:13] Writing local files
[18:42:13] Completed 1450000 out of 2500000 steps  (58 percent)
[18:57:13] Timered checkpoint triggered.
[18:57:23] Writing local files
[18:57:23] Completed 1475000 out of 2500000 steps  (59 percent)
[19:12:23] Timered checkpoint triggered.
[19:13:34] Writing local files
[19:13:34] Completed 1500000 out of 2500000 steps  (60 percent)
[19:28:34] Timered checkpoint triggered.
[19:29:38] Writing local files
[19:29:38] Completed 1525000 out of 2500000 steps  (61 percent)
[19:44:38] Timered checkpoint triggered.
[19:45:32] Writing local files
[19:45:32] Completed 1550000 out of 2500000 steps  (62 percent)
[20:00:32] Timered checkpoint triggered.
[20:01:44] Writing local files
[20:01:45] Completed 1575000 out of 2500000 steps  (63 percent)
[20:07:15] - Autosending finished units... [March 15 20:07:15 UTC]
[20:07:15] Trying to send all finished work units
[20:07:15] + No unsent completed units remaining.
[20:07:15] - Autosend completed
[20:16:44] Timered checkpoint triggered.
[20:17:25] Writing local files
[20:17:26] Completed 1600000 out of 2500000 steps  (64 percent)
[20:32:25] Timered checkpoint triggered.
[20:32:35] Writing local files
[20:32:35] Completed 1625000 out of 2500000 steps  (65 percent)
[20:47:35] Timered checkpoint triggered.
[20:47:52] Writing local files
[20:47:52] Completed 1650000 out of 2500000 steps  (66 percent)
[21:02:52] Timered checkpoint triggered.
[21:03:06] Writing local files
[21:03:07] Completed 1675000 out of 2500000 steps  (67 percent)
[21:18:07] Timered checkpoint triggered.
[21:18:53] Writing local files
[21:18:53] Completed 1700000 out of 2500000 steps  (68 percent)
[21:33:53] Timered checkpoint triggered.
[21:34:46] Writing local files
[21:34:47] Completed 1725000 out of 2500000 steps  (69 percent)
[21:49:46] Timered checkpoint triggered.
[21:50:58] Writing local files
[21:50:59] Completed 1750000 out of 2500000 steps  (70 percent)
[22:05:58] Timered checkpoint triggered.
[22:07:16] Writing local files
[22:07:16] Completed 1775000 out of 2500000 steps  (71 percent)
[22:22:15] Timered checkpoint triggered.
[22:23:02] Writing local files
[22:23:03] Completed 1800000 out of 2500000 steps  (72 percent)
[22:38:03] Timered checkpoint triggered.
[22:39:18] Writing local files
[22:39:18] Completed 1825000 out of 2500000 steps  (73 percent)
[22:54:18] Timered checkpoint triggered.
[22:55:33] Writing local files
[22:55:33] Completed 1850000 out of 2500000 steps  (74 percent)
[23:10:33] Timered checkpoint triggered.
[23:11:44] Writing local files
[23:11:45] Completed 1875000 out of 2500000 steps  (75 percent)
[23:26:44] Timered checkpoint triggered.
[23:27:43] Writing local files
[23:27:43] Completed 1900000 out of 2500000 steps  (76 percent)
[23:42:43] Timered checkpoint triggered.
[23:42:53] Writing local files
[23:42:53] Completed 1925000 out of 2500000 steps  (77 percent)
[23:57:54] Timered checkpoint triggered.
[23:58:04] Writing local files
[23:58:04] Completed 1950000 out of 2500000 steps  (78 percent)
[00:13:04] Timered checkpoint triggered.
[00:13:34] Writing local files
[00:13:34] Completed 1975000 out of 2500000 steps  (79 percent)
[00:28:34] Timered checkpoint triggered.
[00:29:19] Writing local files
[00:29:19] Completed 2000000 out of 2500000 steps  (80 percent)
[00:44:19] Timered checkpoint triggered.
[00:45:06] Writing local files
[00:45:06] Completed 2025000 out of 2500000 steps  (81 percent)
[01:00:06] Timered checkpoint triggered.
[01:00:58] Writing local files
[01:00:58] Completed 2050000 out of 2500000 steps  (82 percent)
[01:15:59] Timered checkpoint triggered.
[01:16:54] Writing local files
[01:16:54] Completed 2075000 out of 2500000 steps  (83 percent)
[01:31:54] Timered checkpoint triggered.
[01:32:56] Writing local files
[01:32:56] Completed 2100000 out of 2500000 steps  (84 percent)
[01:47:55] Timered checkpoint triggered.
[01:48:59] Writing local files
[01:48:59] Completed 2125000 out of 2500000 steps  (85 percent)
[02:03:59] Timered checkpoint triggered.
[02:04:15] Writing local files
[02:04:15] Completed 2150000 out of 2500000 steps  (86 percent)
[02:07:15] - Autosending finished units... [March 16 02:07:15 UTC]
[02:07:15] Trying to send all finished work units
[02:07:15] + No unsent completed units remaining.
[02:07:15] - Autosend completed
[02:19:14] Timered checkpoint triggered.
[02:19:24] Writing local files
[02:19:24] Completed 2175000 out of 2500000 steps  (87 percent)
[02:34:25] Timered checkpoint triggered.
[02:35:12] Writing local files
[02:35:13] Completed 2200000 out of 2500000 steps  (88 percent)
[02:50:12] Timered checkpoint triggered.
[02:51:07] Writing local files
[02:51:08] Completed 2225000 out of 2500000 steps  (89 percent)
[03:06:07] Timered checkpoint triggered.
[03:06:43] Writing local files
[03:06:43] Completed 2250000 out of 2500000 steps  (90 percent)
[03:21:43] Timered checkpoint triggered.
[03:22:43] Writing local files
[03:22:43] Completed 2275000 out of 2500000 steps  (91 percent)
[03:37:43] Timered checkpoint triggered.
[03:38:33] Writing local files
[03:38:33] Completed 2300000 out of 2500000 steps  (92 percent)
[03:53:34] Timered checkpoint triggered.
[03:54:26] Writing local files
[03:54:26] Completed 2325000 out of 2500000 steps  (93 percent)
[04:09:26] Timered checkpoint triggered.
[04:10:24] Writing local files
[04:10:24] Completed 2350000 out of 2500000 steps  (94 percent)
[04:25:25] Timered checkpoint triggered.
[04:25:45] Writing local files
[04:25:45] Completed 2375000 out of 2500000 steps  (95 percent)
[04:40:45] Timered checkpoint triggered.
[04:40:54] Writing local files
[04:40:54] Completed 2400000 out of 2500000 steps  (96 percent)
[04:55:53] Timered checkpoint triggered.
[04:56:24] Writing local files
[04:56:24] Completed 2425000 out of 2500000 steps  (97 percent)
[05:11:25] Timered checkpoint triggered.
[05:12:01] Writing local files
[05:12:01] Completed 2450000 out of 2500000 steps  (98 percent)
[05:27:01] Timered checkpoint triggered.
[05:28:06] Writing local files
[05:28:06] Completed 2475000 out of 2500000 steps  (99 percent)
[05:43:05] Timered checkpoint triggered.



        M E G A - F L O P S   A C C O U N T I N G

        Parallel run - timing based on wallclock.
   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
   T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
   NF=No Forces

 Computing:                        M-Number         M-Flops  % of Flops
-----------------------------------------------------------------------
 RF Coul                       63341.807547  2090279.649051     0.4
 RF Coul [W3]                    346.776382    33984.085436     0.0
 RF Coul + VdW(T)             167922.307839 10914950.009535     2.3
 RF Coul + VdW(T) [W3]         30200.857433  3926111.466290     0.8
 RF Coul + VdW(T) [W3-W3]    1234505.060554 395041619.377280    83.7
 Outer nonbonded loop         295281.347136  2952813.471360     0.6
 1,4 nonbonded interactions     8282.503313   745425.298170     0.2
 NS-Pairs                     529175.037426 11112675.785946     2.4
 Reset In Box                  16682.066728   150138.600552     0.0
 Shift-X                      333635.133454  2001810.800724     0.4
 CG-CoM                         5770.273081   167337.919349     0.0
 Sum Forces                   500460.200184   500460.200184     0.1
 Bonds                          1565.000626    67295.026918     0.0
 Angles                         5752.502301   937657.875063     0.2
 Propers                         582.500233   133392.553357     0.0
 RB-Dihedrals                   6610.002644  1632670.653068     0.3
 Virial                       167090.066836  3007621.203048     0.6
 Ext.ens. Update              166820.066728  9008283.603312     1.9
 Stop-CM                       16682.000000   166820.000000     0.0
 Calc-Ekin                    166820.133456  4504143.603312     1.0
 Shake                          3726.091117   111782.733510     0.0
 Constraint-V                 166820.066728  1000920.400368     0.2
 Shake-Init                     1600.000640    16000.006400     0.0
 Constraint-Vir               165272.566109  3966541.586616     0.8
 Settle                        54557.521823 17622079.548829     3.7
-----------------------------------------------------------------------
 Total                                      471812815.457678   100.0
-----------------------------------------------------------------------

               NODE (s)   Real (s)      (%)
       Time:  94477.000  94477.000    100.0
                       1d02h14:37
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     15.838      4.994      4.573      5.249
[05:44:19] Writing local files
[05:44:19] Completed 2500000 out of 2500000 steps  (100 percent)
[05:44:19] Writing final coordinates.
[05:44:19] Past main M.D. loop
[05:44:19] Will end MPI now
[05:45:19]
[05:45:19] Finished Work Unit:
[05:45:19] - Reading up to 1601592 from "work/wudata_01.arc": Read 1601592
[05:45:19] - Reading up to 488384 from "work/wudata_01.xtc": Read 488384
[05:45:19] goefile size: 0
[05:45:19] logfile size: 76423
[05:45:19] Leaving Run
[05:45:22] - Writing 2268243 bytes of core data to disk...
[05:45:22]   ... Done.
[05:45:23] - Shutting down core
[05:45:23]
[05:45:23] Folding@home Core Shutdown: FINISHED_UNIT
[08:07:15] - Autosending finished units... [March 16 08:07:15 UTC]
[08:07:15] Trying to send all finished work units
[08:07:15] + No unsent completed units remaining.
[08:07:15] - Autosend completed
[14:07:15] - Autosending finished units... [March 16 14:07:15 UTC]
[14:07:15] Trying to send all finished work units
[14:07:15] + No unsent completed units remaining.
[14:07:15] - Autosend completed
[17:12:53] ***** Got an Activate signal (2)
[17:12:53] Killing all core threads

Folding@Home Client Shutdown.
share@computenode:~/fah1> ./fah1 -smp 4 -verbosity 9

Note: Please read the license agreement (fah1 -license). Further
use of this software requires that you have read and accepted this agreement.

8 cores detected


--- Opening Log file [March 16 17:13:15 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /share/fah1
Executable: ./fah1
Arguments: -smp 4 -verbosity 9

[17:13:15] - Ask before connecting: No
[17:13:15] - User name: alpha754293 (Team 596)
[17:13:15] - User ID: 47FBD1D4056DB49E
[17:13:15] - Machine ID: 1
[17:13:15]
[17:13:15] Loaded queue successfully.
[17:13:15]
[17:13:15] - Autosending finished units... [March 16 17:13:15 UTC]
[17:13:15] + Processing work unit
[17:13:15] Trying to send all finished work units
[17:13:15] Work type a1 not eligible for variable processors
[17:13:15] + No unsent completed units remaining.
[17:13:15] Core required: FahCore_a1.exe
[17:13:15] - Autosend completed
[17:13:15] Core found.
[17:13:15] Working on queue slot 01 [March 16 17:13:15 UTC]
[17:13:15] + Working ...
[17:13:15] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 32238 -version 624'

[17:13:15]
[17:13:15] *------------------------------*
[17:13:15] Folding@Home Gromacs SMP Core
[17:13:15] Version 1.74 (November 27, 2006)
[17:13:15]
[17:13:15] Preparing to commence simulation
[17:13:15] - Ensuring status. Please wait.
[17:13:15]
[17:13:15] Project: 0 (Run 0, Clone 0, Gen 0)
[17:13:15]
[17:13:15] Error: Could not write local file.  Exiting.
[17:13:20] - Shutting down core
[17:13:32] put
[17:13:32] - Starting from initial work packet
[17:13:32]
[17:13:32] Project: 0 (Run 0, Clone 0, Gen 0)
[17:13:32]
[17:13:32] Error: Could not write local file.  Exiting.
[17:13:37] - Shutting down core
[0]0:Return code = 18
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[17:15:24] CoreStatus = 12 (18)
[17:15:24] Client-core communications error: ERROR 0x12
[17:15:24] Deleting current work unit & continuing...
[17:19:46] - Warning: Could not delete all work unit files (1): Core returned invalid code
[17:19:46] Trying to send all finished work units
[17:19:46] + No unsent completed units remaining.
[17:19:46] - Preparing to get new work unit...
[17:19:46] + Attempting to get work packet
[17:19:46] - Will indicate memory of 16003 MB
[17:19:46] - Connecting to assignment server
[17:19:46] Connecting to http://assign.stanford.edu:8080/
[17:19:47] Posted data.
[17:19:47] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[17:19:47] + News From Folding@Home: Welcome to Folding@Home
[17:19:47] Loaded queue successfully.
[17:19:47] Connecting to http://171.64.65.64:8080/
[17:19:50] Posted data.
[17:19:50] Initial: 0000; - Receiving payload (expected size: 2438512)
[17:19:57] - Downloaded at ~340 kB/s
[17:19:57] - Averaged speed for that direction ~309 kB/s
[17:19:57] + Received work.
[17:19:57] + Closed connections
[17:20:02]
[17:20:02] + Processing work unit
[17:20:02] Work type a1 not eligible for variable processors
[17:20:02] Core required: FahCore_a1.exe
[17:20:02] Core found.
[17:20:02] Working on queue slot 02 [March 16 17:20:02 UTC]
[17:20:02] + Working ...
[17:20:02] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 32238 -version 624'

[17:20:02]
[17:20:02] *------------------------------*
[17:20:02] Folding@Home Gromacs SMP Core
[17:20:02] Version 1.74 (November 27, 2006)
[17:20:02]
[17:20:02] Preparing to commence simulation
[17:20:02] - Ensuring status. Please wait.
[17:20:19] - Looking at optimizations...
[17:20:19] - Working with standard loops on this execution.
[17:20:19] - Previous termination of core was improper.
[17:20:19] - Going to use standard loops.
[17:20:19] - Files status OK
[17:20:19] Starting from initial work packet
[17:20:19]
[17:20:19] Project: 2653 (Run 36, Cl- Starting from initial work packet
[17:20:19]
[17:20:19] Project: 265Entering M.D.
[17:20:19] ne 17, Gen 134)
[17:20:19]
[17:20:20] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=computenode
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NODEID=0 argc=15
NODEID=1 argc=15
NODEID=3 argc=15
NODEID=2 argc=15
      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2004, The GROMACS development team,
            check out http://www.gromacs.org for more information.

        This inclusion of Gromacs code in the Folding@Home Core is under
        a special license (see http://folding.stanford.edu/gromacs.html)
         specially granted to Stanford by the copyright holders. If you
          are interested in using Gromacs, visit www.gromacs.org where
                you can download a free version of Gromacs under
         the terms of the GNU General Public License (GPL) as published
       by the Free Software Foundation; either version 2 of the License,
                     or (at your option) any later version.

[17:20:26] Protein: Protein in POPC
[17:20:26] Writing local files
starting mdrun 'Protein in POPC'
500000 steps,   1000.0 ps.

[17:20:27] boost OK.
[17:20:27] boost OK.
[17:20:27] cal files
[17:20:27] Completed 0 out of 500000 steps  (0 percent)
[17:33:54] Writing local files
[17:33:54] Completed 5000 out of 500000 steps  (1 percent)
[17:47:15] Writing local files
[17:47:15] Completed 10000 out of 500000 steps  (2 percent)
[18:00:43] Writing local files
[18:00:43] Completed 15000 out of 500000 steps  (3 percent)
[18:14:12] Writing local files
[18:14:12] Completed 20000 out of 500000 steps  (4 percent)
[18:27:40] Writing local files
[18:27:40] Completed 25000 out of 500000 steps  (5 percent)
I guess my question is what's Project: 0 (Run 0, Clone 0, Gen 0)?

There's no entry for Core Status: 12 (18).

Note also there's no assignment or attempted connection to assignment server prior to starting P0R0C0G0.
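
The "note the times" point can be checked mechanically. Below is a minimal sketch, assuming only the `[HH:MM:SS]` prefix visible in the log above; the function name `find_gaps` and the one-hour threshold are illustrative choices, not part of the client. Because the log lines carry no date, a backwards jump in time is treated as a single midnight rollover, which is an assumption.

```python
import re
from datetime import timedelta

# Matches the "[HH:MM:SS]" prefix used by the client log lines in this thread.
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")

def find_gaps(lines, threshold=timedelta(hours=1)):
    """Return (previous_line, next_line, gap) for every pair of consecutive
    timestamped lines that are at least `threshold` apart."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # untimestamped output (e.g. GROMACS banners) is skipped
        hours, minutes, seconds = map(int, match.groups())
        now = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        if prev_time is not None:
            delta = now - prev_time
            if delta < timedelta(0):
                # Assumption: time going backwards means one midnight rollover.
                delta += timedelta(days=1)
            if delta >= threshold:
                gaps.append((prev_line, line, delta))
        prev_time, prev_line = now, line
    return gaps

# Lines copied from the log posted above.
sample = [
    "[05:45:23] Folding@home Core Shutdown: FINISHED_UNIT",
    "[08:07:15] - Autosending finished units... [March 16 08:07:15 UTC]",
    "[08:07:15] - Autosend completed",
    "[14:07:15] - Autosending finished units... [March 16 14:07:15 UTC]",
    "[17:12:53] ***** Got an Activate signal (2)",
]

for before, after, gap in find_gaps(sample):
    print(gap, "|", before, "->", after)
```

Run against the full log, this kind of scan would flag the stretch after the core shutdown in which only the six-hourly autosend entries appear.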
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by 7im »

0.0.0.0 is nothing. It's a place holder, and a way of indicating there are either no work units available for your configuration, or one of the other reasons listed in the wiki.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

7im wrote:0.0.0.0 is nothing. It's a place holder, and a way of indicating there are either no work units available for your configuration, or one of the other reasons listed in the wiki.
Well, for me, there's a HUGE difference between no address at all and 0.0.0.0 or even 127.0.0.1. Even as a placeholder.

I'm sure that we can debate the semantics some other time, but the point is that I did not see any entries in the log pertaining to assignment prior to its attempt to start working on Project: 0 (Run 0, Clone 0, Gen 0).

Which means either that it is a legitimate WU (however unlikely) or that something went wrong after the last WU finished: the client did nothing for 3 hours and logged no status message saying that it had gone into a holding pattern. That may be indicative of a larger, systemic issue with the hardware (which is possible), with the client, with the server, or with the completion of the previous WU.

That's like saying NaN or Inf. or -Inf. = 1.E-30, or 0. There's a HUGE difference between those. In any case, what's P0R0C0G0? What's core status 12?

*edit*
Where's the line in the log file that says that the assignment server is 0.0.0.0?
anandhanju
Posts: 522
Joined: Mon Dec 03, 2007 4:33 am
Location: Australia

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by anandhanju »

I vaguely remember seeing a report like this sometime last year. Not a Linux user so not really sure of this. I think Project: 0 (Run 0, Clone 0, Gen 0) refers to a corrupt queue entry. In your case, slot 1 happened to contain the WU that got stuck while finalizing results and this may have gummed up the queue. When you restarted the client, the queue entry was found to be invalid and deleted. I'm pretty sure the results at slot 01 were lost.

Edit: Related posts: viewtopic.php?f=44&t=4321 and viewtopic.php?f=19&t=2869
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

Hmm...that's weird because it said that the WU finished and then it couldn't send the results and/or the autosend didn't pick up on it. *shrug*

*edit*

Thanks for the links.
Zagen30
Posts: 823
Joined: Tue Mar 25, 2008 12:45 am
Hardware configuration: Core i7 3770K @3.5 GHz (not folding), 8 GB DDR3 @2133 MHz, 2xGTX 780 @1215 MHz, Windows 7 Pro 64-bit running 7.3.6 w/ 1xSMP, 2xGPU

4P E5-4650 @3.1 GHz, 64 GB DDR3 @1333MHz, Ubuntu Desktop 13.10 64-bit

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by Zagen30 »

This happened to me today, except that I hadn't finished a WU (I had closed in the middle of one which was running normally) and it did this on every attempt to restart the client. Eventually I found my way to the wiki (there is an entry for corestatus 12 (18): http://fahwiki.net/index.php/CoreStatus_codes#12), saw that that status is due to issues with the queue, and let the client delete the old stuff and download a new WU (just waited a few minutes, didn't delete any files). I lost 64% of a 2653, but I'll live.

BTW, I had not touched the config file, nor had I touched any other files that relate to the Linux client. I had even closed the client earlier in the WU and it had restarted just fine. Could this be caused by restarting a virtual Linux box too soon after ctrl-c-ing out of the client?
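
On reading "CoreStatus = 12 (18)": the log earlier in the thread prints that line and then "ERROR 0x12", which suggests the first number is hexadecimal and the parenthesised one is the same value in decimal (0x12 = 18, and likewise 0x64 = 100). The sketch below only checks that arithmetic; the helper name `parse_core_status` is made up for illustration, and the hex/decimal reading is an inference from the log, not documented client behaviour.

```python
import re

# "CoreStatus = 12 (18)": assumed to be <hex code> (<decimal code>).
CORE_STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")

def parse_core_status(line):
    """Return (code, consistent) for a CoreStatus log line, or None.

    `code` is the first number read as hexadecimal; `consistent` is True
    when it equals the parenthesised decimal value."""
    match = CORE_STATUS.search(line)
    if match is None:
        return None
    hex_text, dec_text = match.groups()
    code = int(hex_text, 16)
    return code, code == int(dec_text)

print(parse_core_status("[17:15:24] CoreStatus = 12 (18)"))
print(parse_core_status("CoreStatus = 64 (100)"))
```

Both values quoted in this thread are consistent under that reading, so a wiki lookup by either number should land on the same entry.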
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

Zagen30 wrote:This happened to me today, except that I hadn't finished a WU (I had closed in the middle of one which was running normally) and it did this on every attempt to restart the client. Eventually I found my way to the wiki (there is an entry for corestatus 12 (18): http://fahwiki.net/index.php/CoreStatus_codes#12), saw that that status is due to issues with the queue, and let the client delete the old stuff and download a new WU (just waited a few minutes, didn't delete any files). I lost 64% of a 2653, but I'll live.

BTW, I had not touched the config file, nor had I touched any other files that relate to the Linux client. I had even closed the client earlier in the WU and it had restarted just fine. Could this be caused by restarting a virtual Linux box too soon after ctrl-c-ing out of the client?
Oh...lol. I must have overlooked it. oops. my bad.

I lost 100% and it went "numb" after 3 hours so I had to CTRL+C outta there and restart the client manually after running it unsupervised for quite some time.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by bruce »

Project 0, run 0, clone 0, gen 0 is a WU that isn't there. In rare cases when the previous WU has an error, the queue is updated to point to the next position before the WU is downloaded and then some kind of error happens that makes the client believe that something is there that should be processed. The FahCore is unable to process it, of course, and the client moves on to download a new assignment. [You've already figured most of this out yourself.]

The actual cause of this phantom WU has never been clearly identified but if you look at your log, the previous WU never finished and you had to kill FAH before it successfully moved on to the next WU. Upon restart, it found that phantom WU.

@7im:
Sorry, but you're thinking about the server at IP address 0.0.0.0 which has nothing to do with a WU with PRCG = 0, 0, 0, 0.
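
For anyone wanting to spot the phantom WU in a log, a minimal sketch follows. It assumes the "Project: P (Run R, Clone C, Gen G)" line format seen in the logs above; the function name `phantom_lines` is hypothetical. Lines garbled by interleaved core output, as happens in the log above, simply will not match and are skipped.

```python
import re

# Line format as it appears in the client log in this thread.
PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")

def phantom_lines(lines):
    """Return (line_number, line) for each log line whose
    Project/Run/Clone/Gen values are all zero."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PRCG.search(line)
        if match and all(int(value) == 0 for value in match.groups()):
            hits.append((number, line))
    return hits

sample = [
    "[17:13:15] Project: 0 (Run 0, Clone 0, Gen 0)",
    "[17:13:15] Error: Could not write local file.  Exiting.",
    # Cleaned-up version of the interleaved line for the real WU.
    "[17:20:19] Project: 2653 (Run 36, Clone 17, Gen 134)",
]

for number, line in phantom_lines(sample):
    print(number, line)
```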
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by 7im »

Hmmm.... sounds like a new wiki entry coming so I don't confuse them again... ;)
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

Do you know why or what would cause this? I had my system running for about 14 days straight without any problems until now. And I only noticed it when I saw that the FahMon didn't seem to be updating that client like it should, so that's when I started checking the logs.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by bruce »

The critical message in the FAHlog that you posted is "Shutting down core". That message should always be followed by
"Folding@home Core Shutdown: FINISHED_UNIT

Folding@home Core Shutdown: FINISHED_UNIT
CoreStatus = 64 (100)
Sending work to server"

and in your case, that didn't happen. (You already said as much in an earlier post.) At that point, if you had checked, I suspect that at least one copy of FahCore_a1 was still running. Once your system is hung in that condition, what happens next, including the bogus WU, is a result of the initial problem. In other words, you can ignore WU 0,0,0,0 because it's not the problem; the system hang is the problem.
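
The check described above can be roughed out in a few lines. This is a sketch under assumptions: the two message strings are the ones quoted in this post, while the helper name `unconfirmed_shutdowns` and the `lookahead` window of 10 lines are made up for illustration and may need adjusting for a real log.

```python
def unconfirmed_shutdowns(log_text, lookahead=10):
    """Return 1-based line numbers of 'Shutting down core' messages that are
    not followed by a 'Folding@home Core Shutdown:' line within `lookahead`
    lines. The window size is an arbitrary assumption, not a client setting.
    """
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if "Shutting down core" not in line:
            continue
        following = lines[index + 1 : index + 1 + lookahead]
        if not any("Folding@home Core Shutdown:" in nxt for nxt in following):
            hits.append(index + 1)
    return hits


# Lines copied from the log in the first post.
sample = """\
[17:13:15] Error: Could not write local file.  Exiting.
[17:13:20] - Shutting down core
[17:13:32] put
[17:13:32] - Starting from initial work packet
"""

print(unconfirmed_shutdowns(sample))
```

Any line number it reports is a point where the core was told to shut down but never confirmed it. That is the moment to check whether a copy of FahCore_a1 is still running.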

In the "known bugs" list you'll find several reasons why FahCore_a1 hangs, but the most likely one is a change in your network, such as DHCP renewing an address or a WiFi connection going out of range. Some later versions of the Linux kernel (and Windows Vista, for that matter) contain a new IP stack which fixes this problem, but somebody else will have to tell you which ones. One work-around that MIGHT help is to use a fixed IP address on your LAN, but I can't promise that will work.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

bruce wrote:The critical message in the FAHlog that you posted is "Shutting down core". That message should always be followed by
"Folding@home Core Shutdown: FINISHED_UNIT

Folding@home Core Shutdown: FINISHED_UNIT
CoreStatus = 64 (100)
Sending work to server"

and in your case, that didn't happen. (You already said as much in an earlier post.) At that point, if you had checked, I suspect that at least one copy of FahCore_a1 was still running. Once your system is hung in that condition, what happens next, including the bogus WU, is a result of the initial problem. In other words, you can ignore WU 0,0,0,0 because it's not the problem; the system hang is the problem.

In the "known bugs" list you'll find several reasons why FahCore_a1 hangs, but the most likely one is a change in your network, such as DHCP renewing an address or a WiFi connection going out of range. Some later versions of the Linux kernel (and Windows Vista, for that matter) contain a new IP stack which fixes this problem, but somebody else will have to tell you which ones. One work-around that MIGHT help is to use a fixed IP address on your LAN, but I can't promise that will work.
Actually, no. I checked it. Wait. correction. I don't know. I only checked it after CTRL+C to make sure that there are no <defunct> processes still lingering.

Ran into it again just a few minutes ago.

There shouldn't be any changes in the network config. If there were, it would require a power outage since the remainder of the system has an uptime of 15 days on the same address.

AFAIK, neither the IP stack nor the DHCP assignments have changed.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

Does the a1 core require external network for the MPICH to function?

I.e., for core-to-core communication, does it loop back via the external network (it sees the cores as \\<ip_address>\cpu0 etc., and all communication between cores must go through the IP/MAC interface), or is it local MPICH (no external network required for core-to-core communication)?

In traditional HPC applications, and especially in larger installations, all MPICH communications are external to the system. There are probably controls and managers that try to keep as much of it as local as possible, but that's also one of the big reasons why IB and Myrinet are so popular: core-to-core communications may not necessarily stay on the local system anymore. If you have a monolithic OS installation, the OS will enumerate all cores, but it does not take into consideration the physical location/gap between cores.

I'm just wondering if the FAH a1 core is similar in that respect.

From your reply, if a change in network configuration is sufficient to cause the core to freeze, without any abort, error, or ABT codes, then I would tend to think that it is coded like the HPC model, which should also mean that the F@H client is actually capable of distributed parallel processing, provided that the monolithic OS installation is transparent to F@H.
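
One observable hint is the launch line in the log from the first post, where the client calls mpiexec with `-host 127.0.0.1`. Here is a minimal sketch that pulls the `-host` value out of such a line and checks whether it is a literal loopback address. The helper names are made up for illustration. The result only shows where the processes are launched; it does not tell us which transport MPICH actually uses between them.

```python
import ipaddress
import shlex


def mpiexec_host(command_line):
    """Return the value following '-host' in an mpiexec command line, or None.

    Hypothetical helper; assumes the '-host <value>' form seen in the log.
    """
    tokens = shlex.split(command_line)
    for index, token in enumerate(tokens[:-1]):
        if token == "-host":
            return tokens[index + 1]
    return None


def is_loopback(host):
    """True if host is a literal loopback IP address such as 127.0.0.1."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not a literal IP (e.g. a hostname); this sketch does not resolve names.
        return False


# Command line copied from the log in the first post.
cmd = (
    "./mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ "
    "-suffix 01 -checkpoint 15 -verbose -lifeline 32238 -version 624"
)

host = mpiexec_host(cmd)
print(host, is_loopback(host))
```

For the command in my log this reports a loopback host. So all four processes are started on the local machine, though that alone doesn't rule out the communication still passing through the IP stack.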