
10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 11:17 am
by woschl
This WU seems to have problems communicating at the 50% point. The folding thread seems to either continue or hang while using 50% of the idle CPU cycles; the other 50% seems to go to the communication attempts, which bogs down the computer.

Code:

08:32:57:WU01:FS00:0xa3:Completed 920000 out of 2000000 steps  (46%)
10:35:14:WU01:FS00:0xa3:Completed 940000 out of 2000000 steps  (47%)
13:27:56:WU01:FS00:0xa3:Completed 960000 out of 2000000 steps  (48%)
******************************** Date: 03/03/13 ********************************
15:43:44:WU01:FS00:0xa3:Completed 980000 out of 2000000 steps  (49%)
17:44:36:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
17:44:37:Server connection id=1 ended
18:58:45:WU01:FS00:0xa3:Completed 1000000 out of 2000000 steps  (50%)
19:00:17:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1
19:00:27:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1
19:00:38:Server connection id=5 on 0.0.0.0:36330 from 127.0.0.1
19:00:49:Server connection id=6 on 0.0.0.0:36330 from 127.0.0.1
19:01:00:Server connection id=7 on 0.0.0.0:36330 from 127.0.0.1
19:01:11:Server connection id=8 on 0.0.0.0:36330 from 127.0.0.1
19:01:22:Server connection id=9 on 0.0.0.0:36330 from 127.0.0.1
19:01:33:Server connection id=10 on 0.0.0.0:36330 from 127.0.0.1
19:01:44:Server connection id=11 on 0.0.0.0:36330 from 127.0.0.1
There is nothing unusual going on on this client. I did a reboot.

I did look at the status of the servers, but all seem to be OK.

Later the log switches to an error:

Code:

07:27:35:Server connection id=1505 on 0.0.0.0:36330 from 127.0.0.1
07:27:46:Server connection id=1506 on 0.0.0.0:36330 from 127.0.0.1
07:27:57:Server connection id=1507 on 0.0.0.0:36330 from 127.0.0.1
07:28:08:Server connection id=1508 on 0.0.0.0:36330 from 127.0.0.1
07:28:19:ERROR:Exception: Error creating thread
07:28:24:ERROR:Exception: Error creating thread
07:28:29:ERROR:Exception: Error creating thread
07:28:34:ERROR:Exception: Error creating thread
07:28:39:ERROR:Exception: Error creating thread
I've only completed about 60 WUs and this is my first problem, so if you need more information or I'm in the wrong place, please let me know.

woschl

Configuration:

Code:

*********************** Log Started 2013-03-04T02:54:44Z ***********************
02:54:44:************************* Folding@home Client *************************
02:54:44:      Website: http://folding.stanford.edu/
02:54:44:    Copyright: (c) 2009-2012 Stanford University
02:54:44:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:54:44:         Args: --lifeline 3380 --command-port=36330
02:54:44:       Config: C:/Users/woschl/AppData/Roaming/FAHClient/config.xml
02:54:44:******************************** Build ********************************
02:54:44:      Version: 7.2.9
02:54:44:         Date: Oct 3 2012
02:54:44:         Time: 18:05:48
02:54:44:      SVN Rev: 3578
02:54:44:       Branch: fah/trunk/client
02:54:44:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
02:54:44:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
02:54:44:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
02:54:44:     Platform: win32 XP
02:54:44:         Bits: 32
02:54:44:         Mode: Release
02:54:44:******************************* System ********************************
02:54:44:          CPU: AMD Athlon(tm) II X2 240e Processor
02:54:44:       CPU ID: AuthenticAMD Family 16 Model 6 Stepping 2
02:54:44:         CPUs: 2
02:54:44:       Memory: 3.75GiB
02:54:44:  Free Memory: 2.10GiB
02:54:44:      Threads: WINDOWS_THREADS
02:54:44:   On Battery: false
02:54:44:   UTC offset: 1
02:54:44:          PID: 3628
02:54:44:          CWD: C:/Users/woschl/AppData/Roaming/FAHClient
02:54:44:           OS: Windows 7 Home Premium
02:54:44:      OS Arch: AMD64
02:54:44:         GPUs: 1
02:54:44:        GPU 0: UNSUPPORTED: RS880 [Radeon HD 4200]
02:54:44:         CUDA: Not detected
02:54:44:Win32 Service: false
02:54:44:***********************************************************************
02:54:45:<config>
02:54:45:  <!-- Folding Slot Configuration -->
02:54:45:  <gpu v='true'/>
02:54:45:
02:54:45:  <!-- Network -->
02:54:45:  <proxy v=':8080'/>
02:54:45:
02:54:45:  <!-- User Information -->
02:54:45:  <user v='SemiAnonymousWolf'/>
02:54:45:
02:54:45:  <!-- Folding Slots -->
02:54:45:  <slot id='0' type='SMP'/>
02:54:45:</config>
02:54:45:Trying to access database...
02:54:45:Successfully acquired database lock
02:54:45:Enabled folding slot 00: READY smp:2
02:54:45:WU01:FS00:Starting
02:54:45:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/woschl/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 01 -suffix 01 -version 702 -lifeline 3628 -checkpoint 15 -np 2
02:54:45:WU01:FS00:Started FahCore on PID 3780
02:54:45:WU01:FS00:Core PID:3348
02:54:45:WU01:FS00:FahCore 0xa3 started
02:54:47:Server connection id=1 on 0.0.0.0:36330 from 127.0.0.1
02:54:47:WU01:FS00:0xa3:
02:54:47:WU01:FS00:0xa3:*------------------------------*
02:54:47:WU01:FS00:0xa3:Folding@Home Gromacs SMP Core
02:54:47:WU01:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
02:54:47:WU01:FS00:0xa3:
02:54:47:WU01:FS00:0xa3:Preparing to commence simulation
02:54:47:WU01:FS00:0xa3:- Ensuring status. Please wait.
02:54:56:WU01:FS00:0xa3:- Looking at optimizations...
02:54:56:WU01:FS00:0xa3:- Working with standard loops on this execution.
02:54:56:WU01:FS00:0xa3:- Previous termination of core was improper.
02:54:56:WU01:FS00:0xa3:- Files status OK
02:54:57:WU01:FS00:0xa3:- Expanded 2039537 -> 3061060 (decompressed 150.0 percent)
02:54:57:WU01:FS00:0xa3:Called DecompressByteArray: compressed_data_size=2039537 data_size=3061060, decompressed_data_size=3061060 diff=0
02:54:57:WU01:FS00:0xa3:- Digital signature verified
02:54:57:WU01:FS00:0xa3:
02:54:57:WU01:FS00:0xa3:Project: 10127 (Run 48, Clone 1, Gen 121)
02:54:57:WU01:FS00:0xa3:
02:54:57:WU01:FS00:0xa3:Entering M.D.
02:55:03:WU01:FS00:0xa3:Using Gromacs checkpoints
02:55:03:WU01:FS00:0xa3:Mapping NT from 2 to 2 
02:55:07:WU01:FS00:0xa3:Resuming from checkpoint
02:55:07:WU01:FS00:0xa3:Verified 01/wudata_01.log
02:55:07:WU01:FS00:0xa3:Verified 01/wudata_01.trr
02:55:07:WU01:FS00:0xa3:Verified 01/wudata_01.xtc
02:55:07:WU01:FS00:0xa3:Verified 01/wudata_01.edr
02:55:08:WU01:FS00:0xa3:Completed 1018212 out of 2000000 steps  (50%)
02:55:20:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
02:55:31:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 5:26 pm
by PantherX
Welcome to the F@H Forum woschl,

Please note that this isn't a bad WU, since it has already been completed by another donor. Since this is the first problem you have encountered on your system, it may just be a random event that you can ignore. However, if this happens frequently, it might be an indication that something is wrong with your system.

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 5:54 pm
by Joe_H
I don't know if this is related to the problem you are seeing, but I have seen a similar connection issue when using the V7 client on OS X. The software has a limit on how many connections can be open simultaneously, and once that limit is reached it will not accept any more. In your selection of the error messages I am only seeing connections being made, none being closed. The cause on my system was a bug in a third-party monitoring app: it kept opening new connections and eventually used up all that were available. The only cure I found in my case was to completely shut down folding and restart it. In some cases it took a reboot. So that is what I can suggest to clear this up.
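The observation that connections are being opened but not closed can be checked by tallying the log. The following is only a minimal sketch, assuming the two line formats visible in the log excerpts earlier in this thread ("Server connection id=N on ..." for an open and "Server connection id=N ended" for a close). The function name tally_connections is hypothetical and not part of any F@H tool.

```python
import re

# Patterns assumed from the log excerpts in this thread.
OPENED = re.compile(r"Server connection id=(\d+) on ")
ENDED = re.compile(r"Server connection id=(\d+) ended")


def tally_connections(log_lines):
    """Return (opened_ids, ended_ids, still_open_ids) for the given log lines."""
    opened, ended = set(), set()
    for line in log_lines:
        match = OPENED.search(line)
        if match:
            opened.add(int(match.group(1)))
            continue
        match = ENDED.search(line)
        if match:
            ended.add(int(match.group(1)))
    # Connections that were opened in this excerpt and never reported as ended.
    return opened, ended, opened - ended


sample = [
    "17:44:36:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1",
    "17:44:37:Server connection id=1 ended",
    "19:00:17:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1",
    "19:00:27:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1",
]
opened, ended, still_open = tally_connections(sample)
print(sorted(still_open))
```

Run over a full log.txt, a steadily growing still_open set would support the idea that something is connecting without ever disconnecting.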

As for the source of the problem on your Windows system, there have been some reports in the past of connections being lost between the active components of the F@H software. It does not happen often, so the investigation has had few examples from which to determine where the bug is and how to fix it. There was a ticket on that; if I can find it I will add it to this post.

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 7:12 pm
by woschl
Joe_H wrote: The cause on my system was a bug in a third-party monitoring app: it kept opening new connections and eventually used up all that were available. The only cure I found in my case was to completely shut down folding and restart it.
Good lead. I had to kill the FAH Client and closed MS's ProcessExplorer. After restarting FAH everything runs as expected, for now. Strange that rebooting didn't clear the problem, but maybe starting the monitoring app before FAH prevented something from resetting.

I will keep an eye on it with regard to the monitoring app and communications.

woschl

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 8:09 pm
by bruce
You're running V7.2.9 rather than the latest, which is V7.3.6. I have no idea if that matters, but it is something you can try if the problem comes back.

It does resemble Joe_H's problem on OS-X in that, for some unknown reason, connections are being created at a rapid rate. Based on my experience with older versions, a new connection is created when FAHControl establishes communications with FAHClient. If I connect from both a local FAHControl and a remote FAHControl, I get two connections, and they get closed if I stop the applicable copy of FAHControl. I presume the same is true with WebControl.

If we assume that some 3rd party monitoring tool is also running, it may be logging on to FAHClient and not logging off, thereby eating up all the connections. Unfortunately I don't know any way to prove or disprove that guess. The messages don't provide enough detail to tell.
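To put a number on "rapid rate", the spacing between connection attempts can be read off the timestamps. This is a rough sketch, assuming every line starts with an HH:MM:SS timestamp as in the excerpts earlier in this thread and that the lines fall on the same day. The helper name connection_intervals is made up for illustration.

```python
from datetime import datetime


def connection_intervals(log_lines):
    """Seconds between consecutive connection-open events (same day assumed)."""
    times = [
        datetime.strptime(line[:8], "%H:%M:%S")
        for line in log_lines
        # " on " distinguishes an open event from an "ended" event.
        if "Server connection id=" in line and " on " in line
    ]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


sample = [
    "19:00:17:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1",
    "19:00:27:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1",
    "19:00:38:Server connection id=5 on 0.0.0.0:36330 from 127.0.0.1",
    "19:00:49:Server connection id=6 on 0.0.0.0:36330 from 127.0.0.1",
    "19:01:00:Server connection id=7 on 0.0.0.0:36330 from 127.0.0.1",
]
print(connection_intervals(sample))  # [10, 11, 11, 11]
```

In the posted log a new connection shows up roughly every 10 to 11 seconds, which looks like something retrying on a fixed timer rather than a user action.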

Kevin: Do stale connections time out eventually?

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 8:17 pm
by P5-133XL
Just as an independent question: he has a Core_A3 and is running Windows. How is that possible with V7.x? I thought A3s on Windows needed MPI, which was abandoned after v6? I know that Linux uses Core_A3s, but he's running XP.

Just a peculiarity that I noticed.

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 8:51 pm
by bruce
The Windows MPI code was used initially but it was unreliable. The A3 core was rewritten to use the functionality of Windows Threads.

I don't remember the exact sequence of events nor can I state categorically what happened in Linux/OS-X but I think it probably still uses MPI since that's a fundamental capability of most Distros. That may also be related to the outcome of core A5, but I know even less about it.

The code did get adopted by a later version of Gromacs which then got used in Core A4.

I have no reason to believe this has anything at all to do with the telnet connections being established between different software components.

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 9:38 pm
by calxalot
To the best of my knowledge, the client does not clean up stale connections unless there is a heartbeat/update scheduled. The client doesn't know the connection is gone until it tries to send data on the socket and fails.

If a third-party app disconnects and does not send a 'quit' command or have updates scheduled, then it can use up the available socket connections in the client.
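For anyone writing a monitoring app, one way to avoid leaking connections is to make the 'quit' unconditional, even when the app hits an error. This is only a sketch under assumptions: the port 36330 and address 127.0.0.1 are taken from the log in this thread, the newline terminator after 'quit' is a guess, and polite_connection is a hypothetical helper, not part of any F@H library.

```python
import contextlib
import socket


@contextlib.contextmanager
def polite_connection(host="127.0.0.1", port=36330, timeout=5.0):
    """Open a TCP connection and always send 'quit' before closing it."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        yield sock
    finally:
        try:
            # Assumption: the command is newline-terminated.
            sock.sendall(b"quit\n")
        except OSError:
            # The peer may already be gone; still close our side.
            pass
        sock.close()
```

A monitoring app would do its work inside "with polite_connection() as sock:", so the 'quit' is sent even if an exception is raised partway through.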

It may be possible that FAHControl can open a connection and an error occurs before it can set up the heartbeat. Not sure.

There should probably be a ticket for the client to always close inactive connections. A 1 minute timeout might be reasonable, since apps should typically set up a heartbeat for 5 seconds if they want to keep the connection open.
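The idle-timeout idea could look roughly like the following. To be clear, this is not how FAHClient is implemented; it is a toy sketch of the proposal, with the 60 second timeout taken from the suggestion above and every name (ConnectionTable, touch, reap) invented for illustration.

```python
import time


class ConnectionTable:
    """Track last activity per connection and drop the ones that go quiet."""

    def __init__(self, idle_timeout=60.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # connection id -> time of last activity

    def touch(self, conn_id):
        """Record activity (a command, or a heartbeat going out) on a connection."""
        self.last_seen[conn_id] = self.clock()

    def reap(self):
        """Remove and return ids idle for longer than idle_timeout."""
        now = self.clock()
        stale = [conn_id for conn_id, seen in self.last_seen.items()
                 if now - seen > self.idle_timeout]
        for conn_id in stale:
            del self.last_seen[conn_id]
        return stale
```

With a 5 second heartbeat, a live app touches its entry many times per minute and is never reaped, while a connection abandoned without a 'quit' frees its slot after the timeout.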

Edit: a relevant ticket is 932

It's important to know if a third-party monitoring app is being used, or if it is FAHControl that somehow causes this.

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Mon Mar 04, 2013 10:15 pm
by bruce
In ticket #932, it's apparent that WUdget is being used. The website for WUdget says it is no longer being developed. Nevertheless, WUdget 1.5.7k contains the problem and beta 1.5.7L does not, so the best advice is to have everybody use revision L. Unfortunately, Stanford cannot be responsible for 3rd party apps. Does anybody know how to contact the author and politely ask him to stop distributing revision K?

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Tue Mar 05, 2013 7:34 am
by calxalot
F@H WUdget is for OSX, so not relevant in this case.

The question is if there are any other apps that might have the same issue, or if it is FAHControl that is causing this somehow.

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Wed Mar 06, 2013 9:21 am
by woschl
It happened again around the 66% mark, but this time without any monitoring app running that I'm aware of (except AV). Next I will update to 7.3.6, unless switching versions during an incomplete WU is not recommended.

woschl

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Wed Mar 06, 2013 3:31 pm
by Joe_H
In this case it looks like the problem is within one of the components of the F@H client, and that is triggering the bug that runs out of connections. Updating may help. Most people have not had any problems with switching versions in the middle of a WU, so you should be okay. There are some changes between 7.2.9 and 7.3.6 due to the change in focus on default system usage. I would recommend reading the release notes and the posts on the forum on the newer version if you haven't already.

Re: 10127 (Run 48, Clone 1, Gen 121)

Posted: Thu Mar 07, 2013 6:59 am
by Jesse_V
Joe_H wrote: There are some changes between 7.2.9 and 7.3.6 due to the change in focus on default system usage. I would recommend reading the release notes and the posts on the forum on the newer version if you haven't already.
The software FAQs at http://folding.stanford.edu/English/FAQ would be an excellent place to start too.