7004 (Run 3, Clone 303, Gen 76) Linux client/core reconnects

Posted: Sat Dec 01, 2012 4:29 pm
by Ken_g6
That topic line is way too short.

Anyway, I'm running the v7 client on Linux (and the v6 GPU client separately, if that matters). Since this WU started, my client has been continually trying to reconnect to my core, using 100% of one CPU core, and failing most of the time:

Code:

13:09:15:WU01:FS00:Starting
13:09:15:WU00:FS00:Connecting to 171.67.108.59:8080
13:09:15:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /media/hdd/home/ken/.FAHClient/cores/www.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 01 -suffix 01 -version 702 -lifeline 17065 -checkpoint 15 -np 4
13:09:15:WU01:FS00:Started FahCore on PID 14083
13:09:15:WU01:FS00:Core PID:14087
13:09:15:WU01:FS00:FahCore 0xa4 started
13:09:16:WU01:FS00:0xa4:
13:09:16:WU01:FS00:0xa4:*------------------------------*
13:09:16:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
13:09:16:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
13:09:16:WU01:FS00:0xa4:
13:09:16:WU01:FS00:0xa4:Preparing to commence simulation
13:09:16:WU01:FS00:0xa4:- Looking at optimizations...
13:09:16:WU01:FS00:0xa4:- Created dyn
13:09:16:WU01:FS00:0xa4:- Files status OK
13:09:16:WU01:FS00:0xa4:- Expanded 39691 -> 204900 (decompressed 516.2 percent)
13:09:16:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=39691 data_size=204900, decompressed_data_size=204900 diff=0
13:09:16:WU01:FS00:0xa4:- Digital signature verified
13:09:16:WU01:FS00:0xa4:
13:09:16:WU01:FS00:0xa4:Project: 7004 (Run 3, Clone 303, Gen 76)
13:09:16:WU01:FS00:0xa4:
13:09:16:WU01:FS00:0xa4:Assembly optimizations on if available.
13:09:16:WU01:FS00:0xa4:Entering M.D.
13:09:21:WU00:FS00:Upload 32.61%
13:09:21:WU01:FS00:0xa4:Completed 0 out of 10000000 steps  (0%)
13:09:27:WU00:FS00:Upload 65.22%
13:09:33:WU00:FS00:Upload 97.83%
13:09:35:WU00:FS00:Upload complete
13:09:35:WU00:FS00:Server responded WORK_ACK (400)
13:09:35:WU00:FS00:Final credit estimate, 975.00 points
13:09:35:WU00:FS00:Cleaning up
13:09:46:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
13:09:56:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1
13:10:07:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1
13:10:18:Server connection id=5 on 0.0.0.0:36330 from 127.0.0.1
13:10:28:Server connection id=6 on 0.0.0.0:36330 from 127.0.0.1
13:10:39:Server connection id=7 on 0.0.0.0:36330 from 127.0.0.1
13:10:50:Server connection id=8 on 0.0.0.0:36330 from 127.0.0.1
13:11:00:Server connection id=9 on 0.0.0.0:36330 from 127.0.0.1
13:11:11:Server connection id=10 on 0.0.0.0:36330 from 127.0.0.1
13:11:21:Server connection id=11 on 0.0.0.0:36330 from 127.0.0.1
13:11:32:Server connection id=12 on 0.0.0.0:36330 from 127.0.0.1
13:11:43:Server connection id=13 on 0.0.0.0:36330 from 127.0.0.1
13:11:53:Server connection id=14 on 0.0.0.0:36330 from 127.0.0.1
13:12:04:Server connection id=15 on 0.0.0.0:36330 from 127.0.0.1
13:12:15:Server connection id=16 on 0.0.0.0:36330 from 127.0.0.1
13:12:25:Server connection id=17 on 0.0.0.0:36330 from 127.0.0.1
13:12:36:Server connection id=18 on 0.0.0.0:36330 from 127.0.0.1
13:12:46:Server connection id=19 on 0.0.0.0:36330 from 127.0.0.1
13:12:57:Server connection id=20 on 0.0.0.0:36330 from 127.0.0.1
13:13:08:Server connection id=21 on 0.0.0.0:36330 from 127.0.0.1
13:13:18:Server connection id=22 on 0.0.0.0:36330 from 127.0.0.1
13:13:29:Server connection id=23 on 0.0.0.0:36330 from 127.0.0.1
13:13:40:Server connection id=24 on 0.0.0.0:36330 from 127.0.0.1
13:13:50:Server connection id=25 on 0.0.0.0:36330 from 127.0.0.1
13:14:01:Server connection id=26 on 0.0.0.0:36330 from 127.0.0.1
13:14:11:Server connection id=27 on 0.0.0.0:36330 from 127.0.0.1
13:14:22:Server connection id=28 on 0.0.0.0:36330 from 127.0.0.1
13:14:33:Server connection id=29 on 0.0.0.0:36330 from 127.0.0.1
13:14:43:Server connection id=30 on 0.0.0.0:36330 from 127.0.0.1
13:14:54:Server connection id=31 on 0.0.0.0:36330 from 127.0.0.1
13:15:05:Server connection id=32 on 0.0.0.0:36330 from 127.0.0.1
13:15:15:Server connection id=33 on 0.0.0.0:36330 from 127.0.0.1
13:15:26:Server connection id=34 on 0.0.0.0:36330 from 127.0.0.1
13:15:37:Server connection id=35 on 0.0.0.0:36330 from 127.0.0.1
13:15:47:Server connection id=36 on 0.0.0.0:36330 from 127.0.0.1
13:15:58:Server connection id=37 on 0.0.0.0:36330 from 127.0.0.1
13:16:08:Server connection id=38 on 0.0.0.0:36330 from 127.0.0.1
13:16:08:WU01:FS00:0xa4:Completed 100000 out of 10000000 steps  (1%)
13:16:19:Server connection id=39 on 0.0.0.0:36330 from 127.0.0.1
13:16:30:Server connection id=40 on 0.0.0.0:36330 from 127.0.0.1
13:16:40:Server connection id=41 on 0.0.0.0:36330 from 127.0.0.1
13:16:51:Server connection id=42 on 0.0.0.0:36330 from 127.0.0.1
13:17:02:Server connection id=43 on 0.0.0.0:36330 from 127.0.0.1
13:17:12:Server connection id=44 on 0.0.0.0:36330 from 127.0.0.1
13:17:23:Server connection id=45 on 0.0.0.0:36330 from 127.0.0.1
13:17:33:Server connection id=46 on 0.0.0.0:36330 from 127.0.0.1
13:17:44:Server connection id=47 on 0.0.0.0:36330 from 127.0.0.1
13:17:55:Server connection id=48 on 0.0.0.0:36330 from 127.0.0.1
13:18:05:Server connection id=49 on 0.0.0.0:36330 from 127.0.0.1
13:18:16:Server connection id=50 on 0.0.0.0:36330 from 127.0.0.1
13:18:27:Server connection id=51 on 0.0.0.0:36330 from 127.0.0.1
13:18:38:Server connection id=52 on 0.0.0.0:36330 from 127.0.0.1
13:18:48:Server connection id=53 on 0.0.0.0:36330 from 127.0.0.1
13:18:59:Server connection id=54 on 0.0.0.0:36330 from 127.0.0.1
13:19:10:Server connection id=55 on 0.0.0.0:36330 from 127.0.0.1
13:19:20:Server connection id=56 on 0.0.0.0:36330 from 127.0.0.1
13:19:31:Server connection id=57 on 0.0.0.0:36330 from 127.0.0.1
13:19:42:Server connection id=58 on 0.0.0.0:36330 from 127.0.0.1
13:19:52:Server connection id=59 on 0.0.0.0:36330 from 127.0.0.1
13:20:03:Server connection id=60 on 0.0.0.0:36330 from 127.0.0.1
13:20:14:Server connection id=61 on 0.0.0.0:36330 from 127.0.0.1
13:20:24:Server connection id=62 on 0.0.0.0:36330 from 127.0.0.1
13:20:35:Server connection id=63 on 0.0.0.0:36330 from 127.0.0.1
13:20:46:Server connection id=64 on 0.0.0.0:36330 from 127.0.0.1
13:20:56:Server connection id=65 on 0.0.0.0:36330 from 127.0.0.1
13:21:07:Server connection id=66 on 0.0.0.0:36330 from 127.0.0.1
13:21:17:Server connection id=67 on 0.0.0.0:36330 from 127.0.0.1
13:21:28:Server connection id=68 on 0.0.0.0:36330 from 127.0.0.1
13:21:39:Server connection id=69 on 0.0.0.0:36330 from 127.0.0.1
13:21:49:Server connection id=70 on 0.0.0.0:36330 from 127.0.0.1
13:22:00:Server connection id=71 on 0.0.0.0:36330 from 127.0.0.1
13:22:11:Server connection id=72 on 0.0.0.0:36330 from 127.0.0.1
13:22:21:Server connection id=73 on 0.0.0.0:36330 from 127.0.0.1
13:22:32:Server connection id=74 on 0.0.0.0:36330 from 127.0.0.1
13:22:37:WU01:FS00:0xa4:Completed 200000 out of 10000000 steps  (2%)
13:22:42:Server connection id=75 on 0.0.0.0:36330 from 127.0.0.1
13:22:53:Server connection id=76 on 0.0.0.0:36330 from 127.0.0.1
And so on. While this is happening in the log, the client just says "Updating" all the time.
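To put a number on how often those "Server connection" lines appear, a short script can pull the timestamps out of the log and measure the gaps between them. This is only a rough sketch: the function name `connection_intervals` and the regular expression are my own, written against the line format shown above, and it does not handle a log that crosses midnight.

```python
import re
from datetime import datetime

# Matches lines such as:
# 13:09:46:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
CONNECTION_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):Server connection id=(\d+) on \S+ from \S+"
)


def connection_intervals(log_text):
    """Return the seconds between consecutive 'Server connection' lines.

    Hypothetical helper, not part of FAHClient. Assumes all timestamps
    fall on the same day (no midnight wrap-around).
    """
    times = []
    for line in log_text.splitlines():
        match = CONNECTION_RE.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# A few lines copied from the log above.
sample = """\
13:09:35:WU00:FS00:Cleaning up
13:09:46:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
13:09:56:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1
13:10:07:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1
"""

intervals = connection_intervals(sample)
print(intervals)
```

On the excerpt above the gaps are 10 and 11 seconds, which matches the steady reconnect rhythm visible in the full log.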

For now I've killed the client with STOP to give the core more room to run. But I'd like a better solution.

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Posted: Sat Dec 01, 2012 4:38 pm
by P5-133XL
Something is interfering with port 36330 and preventing the client from properly communicating with the cores. What, I don't know. I cannot see how a specific WU would cause this.

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Posted: Sat Dec 01, 2012 4:44 pm
by Ken_g6
Well, it ran the following projects before this one and didn't have connection issues:

Code:

20:00:35:WU00:FS00:0xa4:Project: 8056 (Run 82, Clone 61, Gen 0)
02:47:30:WU01:FS00:0xa4:Project: 8027 (Run 1609, Clone 1, Gen 8)
08:05:42:WU00:FS00:0xa4:Project: 8056 (Run 76, Clone 31, Gen 57)
13:50:56:WU01:FS00:0xa4:Project: 7611 (Run 4, Clone 76, Gen 192)
15:15:19:WU01:FS00:0xa4:Project: 7611 (Run 4, Clone 76, Gen 192)
15:24:02:WU01:FS00:0xa4:Project: 7611 (Run 4, Clone 76, Gen 192)
17:59:01:WU00:FS00:0xa4:Project: 8069 (Run 0, Clone 94, Gen 26)
18:00:35:WU00:FS00:0xa4:Project: 8069 (Run 0, Clone 94, Gen 26)
18:12:04:WU00:FS00:0xa4:Project: 8069 (Run 0, Clone 94, Gen 26)
01:39:11:WU01:FS00:0xa4:Project: 8056 (Run 16, Clone 38, Gen 62)
07:30:51:WU00:FS00:0xa4:Project: 8056 (Run 16, Clone 43, Gen 63)
Hopefully it will go away after this WU. But I'd like to avoid it happening again.

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Posted: Sat Dec 01, 2012 4:53 pm
by P5-133XL
Not recommended, but you could dump the WU and see if there is a change...

If you choose to do that, document before and after by supplying the logs, so we can make a bug report if that is the cause.

It would also be reasonable to allow others to comment before dumping. Maybe someone else has a solution.

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Posted: Sat Dec 01, 2012 5:58 pm
by Joe_H
First, a couple of questions: are you running any third-party monitoring tools, and how long has your Linux system been running since its last restart? The reason I ask is that I came across a similar issue with the OS X client a few months ago. There, a third-party utility would open network connections to FAHClient and, under some circumstances, eventually use up the open limit. The only cure was to reboot the Mac. The author of that utility did issue an updated version that fixed the problem.

If you are not using another monitoring utility it is possible that FAHControl may have got into a similar state trying to open a connection to FAHClient after a period of uptime. Closing FAHControl will not stop the FAHCore from processing, but killing the FAHClient process should. If you can issue a Restart to the FAHClient, that is what I would recommend as a minimum. Rebooting your system and restarting F@H would also do the same.
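One way to check whether connections to FAHClient are piling up is to count the established sockets on port 36330 (the port shown in the log earlier in this thread). The sketch below assumes the usual Linux `/proc/net/tcp` text layout, where the fourth column is the socket state (`01` is ESTABLISHED, `0A` is LISTEN) and ports are written in hexadecimal. The function name `count_established` and the sample rows are made up for illustration.

```python
def count_established(proc_net_tcp_text, port=36330):
    """Count ESTABLISHED sockets whose local port equals `port`.

    Hypothetical helper. Expects text in the /proc/net/tcp layout:
    a header line, then rows of "sl local_address rem_address st ...".
    """
    port_hex = f"{port:04X}"  # 36330 -> "8DEA"
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        local_port = fields[1].rsplit(":", 1)[1].upper()
        state = fields[3].upper()
        if state == "01" and local_port == port_hex:
            count += 1
    return count


# Made-up sample rows (not taken from a real system).
sample_table = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:8DEA 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 11111
   1: 0100007F:8DEA 0100007F:C350 01 00000000:00000000 00:00000000 00000000  1000        0 22222
   2: 0100007F:C350 0100007F:8DEA 01 00000000:00000000 00:00000000 00000000  1000        0 33333
"""

print(count_established(sample_table))
```

On a real Linux system the input would come from reading `/proc/net/tcp`. A count that keeps growing while the client says "Updating" would point toward the kind of connection leak described above.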

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Posted: Sat Dec 01, 2012 6:07 pm
by bruce
See another report of the same thing here: viewtopic.php?f=85&t=23134

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Posted: Sat Dec 01, 2012 7:39 pm
by Ken_g6
Joe_H wrote: First, a couple of questions: are you running any third-party monitoring tools, and how long has your Linux system been running since its last restart?
No third-party monitoring tools. Uptime is just under 7 days.
Joe_H wrote: If you are not using another monitoring utility it is possible that FAHControl may have got into a similar state trying to open a connection to FAHClient after a period of uptime. Closing FAHControl will not stop the FAHCore from processing, but killing the FAHClient process should.
Well, I did that. First time I did a kill -9; second time I did kill three times. Each time the client stopped, of course, as did the cores. Each time I started over with FAHControl, and launched FAHClient from there. It showed some progress, as did the log file - 25% the first time and 28% the second. But the connection problem came back within a minute.

I'm currently running the core with FAHClient stopped (with "kill -STOP".) I set a timer to run kill -CONT on it near the time when it should be done, and then I plan to run normally from there. Hopefully if I get through this WU the next won't act like this.
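For anyone wanting to script the same workaround, the stop-then-continue step could look roughly like the sketch below. It simply sends SIGSTOP, waits, and sends SIGCONT, which is what `kill -STOP` and `kill -CONT` do. The function name `pause_then_resume` is hypothetical, the PID in the demo is a placeholder, and the `kill` and `sleep` parameters exist only so the demo can run without touching a real process.

```python
import os
import signal
import time


def pause_then_resume(pid, pause_seconds, kill=os.kill, sleep=time.sleep):
    """Suspend a process with SIGSTOP, wait, then resume it with SIGCONT.

    Hypothetical helper (Unix only). `kill` and `sleep` are injectable
    so the behavior can be demonstrated without a real process.
    """
    kill(pid, signal.SIGSTOP)
    try:
        sleep(pause_seconds)
    finally:
        # Always resume, even if the wait is interrupted.
        kill(pid, signal.SIGCONT)


# Dry run with stand-ins: record the signals instead of sending them.
calls = []
pause_then_resume(
    1234,  # placeholder PID, not a real FAHClient process
    0,
    kill=lambda pid, sig: calls.append((pid, sig)),
    sleep=lambda seconds: None,
)
print(calls)
```

With the defaults, `pause_then_resume(client_pid, seconds_until_wu_done)` would pause the real process. Whether suspending FAHClient for that long is a good idea is a separate question; this only mirrors the manual steps described in the post.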

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Posted: Sun Dec 02, 2012 1:57 am
by Ken_g6
Well, the darn thing worked itself out eventually, although I had to restart FAHClient after the WU finished.

Thanks, all!