
Project 10001 - Core b4, V 20 - EARLY_UNIT_END

Posted: Thu Dec 24, 2009 1:49 pm
by Bob8421
Last night my new ProtoMol work units began downloading a new b4 core, v20. All of these work units (3 so far) have failed with the same errors, and none of them has returned its results.

Code:

[06:25:51] Loaded queue successfully.
[06:25:51] Connecting to http://129.74.85.48:8080/
[06:25:51] Posted data.
[06:25:51] Initial: 0000; - Receiving payload (expected size: 77871)
[06:25:51] Conversation time very short, giving reduced weight in bandwidth avg
[06:25:51] - Downloaded at ~152 kB/s
[06:25:51] - Averaged speed for that direction ~470 kB/s
[06:25:51] + Received work.
[06:25:51] + Closed connections
[06:25:56] 
[06:25:56] + Processing work unit
[06:25:56] Core required: FahCore_b4.exe
[06:25:56] Core found.
[06:25:56] Working on queue slot 06 [December 24 06:25:56 UTC]
[06:25:56] + Working ...
[06:25:56] - Calling '.\FahCore_b4.exe -dir work/ -suffix 06 -checkpoint 10 -verbose -lifeline 2128 -version 623'

[06:25:59] *********************** Log Started 24/Dec/2009 06:25:59 ***********************
[06:25:59] ************************** ProtoMol Folding@Home Core **************************
[06:25:59]   Version: 20
[06:25:59]      Type: 180
[06:25:59]      Core: ProtoMol
[06:25:59]   Website: http://folding.stanford.edu/
[06:25:59] Copyright: (c) 2009 Stanford University
[06:25:59]    Author: Joseph Coffland <joseph@cauldrondevelopment.com>
[06:25:59]      Args: -dir work/ -suffix 06 -checkpoint 10 -verbose -lifeline 2128 -version
[06:25:59]            623
[06:25:59] ************************************ Build *************************************
[06:25:59]      Date: Dec 23 2009
[06:25:59]      Time: 15:34:02
[06:25:59]  Revision: 1747
[06:25:59]  Compiler: Intel(R) C++ MSVC 1500 mode 1110
[06:25:59]   Options: /TP /nologo /EHsc /wd4297 /wd4103 /wd1786 /arch:IA32 /Ox
[06:25:59]            /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qrestrict /MT
[06:25:59]  Platform: Windows XP
[06:25:59]      Bits: 32
[06:25:59] ************************************ System ************************************
[06:25:59]        OS: Microsoft Windows XP Home Edition
[06:25:59]       CPU: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
[06:25:59]    CPU ID: GenuineIntel Family 6 Model 15 Stepping 11
[06:25:59]      CPUs: 4 Logical, 1 Physical
[06:25:59]    Memory: 3.50 GB
[06:25:59] ********************************************************************************
[06:25:59] Project: 10001 (Run 408, Clone 4, Gen 3)
[06:25:59] Reading tar file par_all27_prot_lipid.inp
[06:25:59] Reading tar file scpismQuartic.inp
[06:26:00] Reading tar file ww_exteq_nowater1.pdb
[06:26:00] Reading tar file ww_exteq_nowater1.psf
[06:26:00] Reading tar file checkpt
[06:26:00] Reading tar file ww_exteq_nowater1.79.pos
[06:26:00] Reading tar file ww_exteq_nowater1.79.vel
[06:26:00] Reading tar file protomol.conf
[06:26:00] Reading tar file core.xml
[06:26:00] Digital signatures verified
[06:26:01] GUI Server started
[06:26:01] ERROR: Exception in thread 1: @ fah\net\Socket.cpp:139:<unknown> 0: Could not bind socket to 127.0.0.1: No error
[06:26:01] Completed 0 out of 200000 steps (0%)
[06:29:38] Completed 2000 out of 200000 steps (1%)
[06:33:39] Completed 4000 out of 200000 steps (2%)
[06:37:49] Completed 6000 out of 200000 steps (3%)
[06:41:59] Completed 8000 out of 200000 steps (4%)
[06:46:05] Completed 10000 out of 200000 steps (5%)
[06:50:14] Completed 12000 out of 200000 steps (6%)
[06:54:17] Completed 14000 out of 200000 steps (7%)
[06:58:25] Completed 16000 out of 200000 steps (8%)
[07:02:37] Completed 18000 out of 200000 steps (9%)
[07:06:49] Completed 20000 out of 200000 steps (10%)
[07:10:54] Completed 22000 out of 200000 steps (11%)
[07:15:16] Completed 24000 out of 200000 steps (12%)
[07:19:43] Completed 26000 out of 200000 steps (13%)
[07:23:45] Completed 28000 out of 200000 steps (14%)
[07:27:44] Completed 30000 out of 200000 steps (15%)
[07:31:44] Completed 32000 out of 200000 steps (16%)
[07:36:11] Completed 34000 out of 200000 steps (17%)
[07:40:33] Completed 36000 out of 200000 steps (18%)
[07:44:36] Completed 38000 out of 200000 steps (19%)
[07:48:26] Completed 40000 out of 200000 steps (20%)
[07:52:37] Completed 42000 out of 200000 steps (21%)
[07:56:52] Completed 44000 out of 200000 steps (22%)
[08:01:22] Completed 46000 out of 200000 steps (23%)
[08:05:26] Completed 48000 out of 200000 steps (24%)
[08:09:49] Completed 50000 out of 200000 steps (25%)
[08:14:04] Completed 52000 out of 200000 steps (26%)
[08:18:24] Completed 54000 out of 200000 steps (27%)
[08:23:12] Completed 56000 out of 200000 steps (28%)
[08:27:22] Completed 58000 out of 200000 steps (29%)
[08:31:57] Completed 60000 out of 200000 steps (30%)
[08:35:59] Completed 62000 out of 200000 steps (31%)
[08:40:18] Completed 64000 out of 200000 steps (32%)
[08:44:16] Completed 66000 out of 200000 steps (33%)
[08:48:33] Completed 68000 out of 200000 steps (34%)
[08:53:09] Completed 70000 out of 200000 steps (35%)
[08:57:24] Completed 72000 out of 200000 steps (36%)
[09:01:25] Completed 74000 out of 200000 steps (37%)
[09:05:37] Completed 76000 out of 200000 steps (38%)
[09:09:38] Completed 78000 out of 200000 steps (39%)
[09:13:56] Completed 80000 out of 200000 steps (40%)
[09:18:05] Completed 82000 out of 200000 steps (41%)
[09:22:11] Completed 84000 out of 200000 steps (42%)
[09:26:12] Completed 86000 out of 200000 steps (43%)
[09:30:09] Completed 88000 out of 200000 steps (44%)
[09:33:54] Completed 90000 out of 200000 steps (45%)
[09:37:44] Completed 92000 out of 200000 steps (46%)
[09:41:23] Completed 94000 out of 200000 steps (47%)
[09:45:23] Completed 96000 out of 200000 steps (48%)
[09:49:31] Completed 98000 out of 200000 steps (49%)
[09:53:42] Completed 100000 out of 200000 steps (50%)
[09:57:54] Completed 102000 out of 200000 steps (51%)
[10:01:50] Completed 104000 out of 200000 steps (52%)
[10:03:50] - Autosending finished units... [December 24 10:03:50 UTC]
[10:03:50] Trying to send all finished work units
[10:03:50] + No unsent completed units remaining.
[10:03:50] - Autosend completed
[10:05:45] Completed 106000 out of 200000 steps (53%)
[10:09:39] Completed 108000 out of 200000 steps (54%)
[10:13:44] Completed 110000 out of 200000 steps (55%)
[10:17:39] Completed 112000 out of 200000 steps (56%)
[10:21:47] Completed 114000 out of 200000 steps (57%)
[10:25:55] Completed 116000 out of 200000 steps (58%)
[10:29:57] Completed 118000 out of 200000 steps (59%)
[10:33:46] Completed 120000 out of 200000 steps (60%)
[10:37:55] Completed 122000 out of 200000 steps (61%)
[10:41:53] Completed 124000 out of 200000 steps (62%)
[10:45:56] Completed 126000 out of 200000 steps (63%)
[10:49:59] Completed 128000 out of 200000 steps (64%)
[10:54:16] Completed 130000 out of 200000 steps (65%)
[10:58:37] Completed 132000 out of 200000 steps (66%)
[11:03:11] Completed 134000 out of 200000 steps (67%)
[11:07:35] Completed 136000 out of 200000 steps (68%)
[11:11:44] Completed 138000 out of 200000 steps (69%)
[11:15:40] Completed 140000 out of 200000 steps (70%)
[11:19:35] Completed 142000 out of 200000 steps (71%)
[11:23:34] Completed 144000 out of 200000 steps (72%)
[11:27:56] Completed 146000 out of 200000 steps (73%)
[11:32:06] Completed 148000 out of 200000 steps (74%)
[11:35:59] Completed 150000 out of 200000 steps (75%)
[11:40:12] Completed 152000 out of 200000 steps (76%)
[11:43:53] Completed 154000 out of 200000 steps (77%)
[11:47:45] Completed 156000 out of 200000 steps (78%)
[11:51:48] Completed 158000 out of 200000 steps (79%)
[11:55:43] Completed 160000 out of 200000 steps (80%)
[11:59:55] Completed 162000 out of 200000 steps (81%)
[12:03:58] Completed 164000 out of 200000 steps (82%)
[12:08:13] Completed 166000 out of 200000 steps (83%)
[12:12:12] Completed 168000 out of 200000 steps (84%)
[12:14:35] WARNING: UnexpectedExitHandler triggered
[12:14:35] WARNING: Unexpected exit from science code
[12:14:35] Saving result file logfile_06.txt
[12:14:35] Saving result file checkpt
[12:14:35] Saving result file checkpt.crc
[12:14:35] Saving result file log.txt
[12:14:35] Saving result file protomol.conf
[12:14:35] Saving result file ww.dcd
[12:14:36] Saving result file ww_exteq_nowater1.113.pos
[12:14:36] Saving result file ww_exteq_nowater1.113.vel
[12:14:36] Folding@home Core Shutdown: EARLY_UNIT_END
[12:14:38] CoreStatus = 79 (121)
[12:14:38] Client-core communications error: ERROR 0x79
[12:14:38] This is a sign of more serious problems, shutting down.

Re: Project 10001 - Core 20 - EARLY_UNIT_END

Posted: Thu Dec 24, 2009 2:08 pm
by Bob8421
Right after posting the above, I had another ProtoMol work unit complete successfully with b4 core v20. It was the only ProtoMol work unit running on that system, together with 3 GROMACS work units. The above 3 error work units were all running on the same system, together with 1 GROMACS work unit. It seems that ProtoMol plays well with others, but not with itself.
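The "Could not bind socket to 127.0.0.1" line in the log above may be related to this pattern, although nothing in this thread confirms it. As a minimal sketch, assuming (and it is only an assumption) that each ProtoMol core tries to bind the same local port for its GUI server, the Python below shows how a second bind to an address and port that are already taken fails. The helper name `try_bind` is made up for illustration, and the port is picked by the operating system rather than taken from the core.

```python
import socket

def try_bind(port):
    """Try to bind a TCP socket to 127.0.0.1:port.

    Returns the bound socket on success, or None if the bind fails
    (for example because another socket already holds that port).
    Hypothetical helper, not part of any Folding@home software.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return s
    except OSError:
        s.close()
        return None

# Port 0 asks the operating system for any free port.
first = try_bind(0)
port = first.getsockname()[1]

# A second socket asking for the same address and port is refused.
second = try_bind(port)
first_ok = first is not None
second_ok = second is not None
print("first bind ok:", first_ok)
print("second bind ok:", second_ok)

first.close()
```

If the cores do behave this way, one ProtoMol unit per machine would bind cleanly while additional ones would log the error, which would match what I saw. Whether that error is actually connected to the later EARLY_UNIT_END is a separate question.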

Re: Project 10001 - Core 20 - EARLY_UNIT_END

Posted: Thu Dec 24, 2009 3:04 pm
by Tobit
Looks like a bad WU. In the future, could you please format the subject to include the project number and the R/C/G info? For this unit it would look like the following:

Project: 10001 (Run 408, Clone 4, Gen 3) - EARLY_UNIT_END

This makes it much easier on the developers to spot and address. :)
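For anyone who wants to build that subject line straight from the client log, here is a rough Python sketch. It assumes the log contains a "Project: ... (Run ..., Clone ..., Gen ...)" line and a "Folding@home Core Shutdown: ..." line in the same form as the log posted above. The function name `subject_from_log` and the regular expressions are my own, not part of the client.

```python
import re

# Matches the line the core prints at startup, e.g.
# "[06:25:59] Project: 10001 (Run 408, Clone 4, Gen 3)"
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")

# Matches the shutdown line, e.g.
# "[12:14:36] Folding@home Core Shutdown: EARLY_UNIT_END"
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")

def subject_from_log(log_text):
    """Return 'Project: P (Run R, Clone C, Gen G) - STATUS', or None.

    Pairs each shutdown status with the most recent Project line
    seen before it and returns the first such pair.
    """
    prcg = None
    for line in log_text.splitlines():
        m = PRCG_RE.search(line)
        if m:
            prcg = m.groups()
            continue
        m = SHUTDOWN_RE.search(line)
        if m and prcg:
            p, r, c, g = prcg
            return f"Project: {p} (Run {r}, Clone {c}, Gen {g}) - {m.group(1)}"
    return None

# Two lines copied from the log in the first post.
sample = """\
[06:25:59] Project: 10001 (Run 408, Clone 4, Gen 3)
[12:14:36] Folding@home Core Shutdown: EARLY_UNIT_END
"""
print(subject_from_log(sample))
# prints: Project: 10001 (Run 408, Clone 4, Gen 3) - EARLY_UNIT_END
```

The status is taken verbatim from the shutdown line, so the subject always matches what the core actually reported.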

Re: Project 10001 - Core 20 - EARLY_UNIT_END

Posted: Thu Dec 24, 2009 3:25 pm
by Bob8421
I didn't include those details in the subject because there were 3 separate clients, all with the same error. The details are listed in the log file I included for one of the work units, should that prove relevant. But it does not appear to me that the problem is with those particular work units; it appears to be with the new core itself, which is why I put that in the subject instead.

By the way, the subject line was not entirely my creation. I stopped after "Core 20", but the " - EARLY_UNIT_END" showed up all by itself!

Re: Project 10001 - Core 20 - EARLY_UNIT_END

Posted: Thu Dec 24, 2009 3:30 pm
by Tobit
Noted, Bob, thanks for clarifying. For the record, I've done several of them with the new version 20 core without issue, so I didn't think there were core-wide issues with version 20. Definitely worth investigating, however.

Re: Project 10001 - Core 20 - EARLY_UNIT_END

Posted: Thu Dec 24, 2009 4:39 pm
by Bob8421
Following up on the 3 errors with the v20 core...

The EARLY_UNIT_END message led me to believe that the work units had failed and that the partial results should have been returned. I was planning to leave those 3 clients closed rather than taking a chance on getting additional ProtoMol work units that would not finish.

However, a short time ago I decided to restart them anyway just to see what work units would be assigned. None of them requested a new work unit. Instead they all picked up the existing work units that had been in progress at the same points where they were when the clients crashed. I can't wait to see if they can actually run to completion!

Re: Project 10001 - Core 20 - EARLY_UNIT_END

Posted: Fri Dec 25, 2009 1:53 am
by jcoffland
Sorry about the problems. v21 is out, which should fix this.

Re: Project 10001 - Core 20 - EARLY_UNIT_END

Posted: Fri Dec 25, 2009 2:36 pm
by codysluder
Bob8421 wrote: However, a short time ago I decided to restart them anyway just to see what work units would be assigned. None of them requested a new work unit. Instead they all picked up the existing work units that had been in progress at the same points where they were when the clients crashed. I can't wait to see if they can actually run to completion!
I've only had one like that, but it finished without any problems.