Page 1 of 1

Work Unit 7610 (12, 0, 224) appears very slow

Posted: Wed Apr 10, 2013 5:02 am
by RLHay
I'm running this work unit on a Win 7 i5 machine that generally produces on the order of 5000 ppd depending on the mix of WU's. A typical range would be 4500-5500.

This work unit is going to take a total of over 5.5 days and generate ~1260 points or ~220 ppd. At 8% completed the TPF is 1hr 21min. Has anyone else seen results like this?

System details:

Code: Select all

02:49:14:******************************** Build ********************************
02:49:14:      Version: 7.3.6
02:49:14:         Date: Feb 18 2013
02:49:14:         Time: 15:25:17
02:49:14:      SVN Rev: 3923
02:49:14:       Branch: fah/trunk/client
02:49:14:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
02:49:14:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
02:49:14:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
02:49:14:     Platform: win32 XP
02:49:14:         Bits: 32
02:49:14:         Mode: Release
02:49:14:******************************* System ********************************
02:49:14:          CPU: Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
02:49:14:       CPU ID: GenuineIntel Family 6 Model 37 Stepping 2
02:49:14:         CPUs: 4
02:49:14:       Memory: 3.80GiB
02:49:14:  Free Memory: 952.01MiB
02:49:14:      Threads: WINDOWS_THREADS
02:49:14:  Has Battery: false
02:49:14:   On Battery: false
02:49:14:   UTC offset: 12
02:49:14:          PID: 8284
02:49:14:          CWD: C:/Users/X/AppData/Roaming/FAHClient
02:49:14:           OS: Windows 7 Professional
02:49:14:      OS Arch: AMD64
02:49:14:         GPUs: 0
02:49:14:         CUDA: Not detected
02:49:14:Win32 Service: false

Re: Work Unit 7610 (12, 0, 224) appears very slow

Posted: Wed Apr 10, 2013 5:47 am
by P5-133XL
Check task manager for non FAHCore_xx CPU usage. Specifically, FAHControl which can spontaneously start using a full core (the solution is to hut it down and restart it). But it does not have to be that, it could be any process that uses a significant amount of CPU usage (10%+ is plenty to cause this syndrome).

Folding is running at a very low Priority so that it does not interfere with your usage of the machine. Thus, it is designed to suspend threads whenever something else requests CPU usage. If one core thread gets suspended by something else running, the other threads, being highly synchronized, sit there idly waiting for that one thread to come back. So if anything else is trying to use the CPU the SMP TPF will drastically increase...

If this happens regularly, you can force SMP:3 in Advanced-Control leaving some CPU resource available for usage by other applications.

Re: Work Unit 7610 (12, 0, 224) appears very slow

Posted: Wed Apr 10, 2013 6:09 am
by RLHay
P5-133XL wrote:Check task manager for non FAHCore_xx CPU usage. Specifically, FAHControl which can spontaneously start using a full core (the solution is to hut it down and restart it). But it does not have to be that, it could be any process that uses a significant amount of CPU usage (10%+ is plenty to cause this syndrome).

Folding is running at a very low Priority so that it does not interfere with your usage of the machine. Thus, it is designed to suspend threads whenever something else requests CPU usage. If one core thread gets suspended by something else running, the other threads, being highly synchronized, sit there idly waiting for that one thread to come back. So if anything else is trying to use the CPU the SMP TPF will drastically increase...

If this happens regularly, you can force SMP:3 in Advanced-Control leaving some CPU resource available for usage by other applications.
Thanks for the tip P5-133XL. This is my work machine and it really only gets used for some light typing, spreadsheets and general officy stuff. FAHControl is behaving itself and FAHcore is currently running at an average of 95-98% of combined core activity. Nothing else pops above 3% and that only for brief moments. I'll keep an eye out for the issue you describe, but I don't think that's it.

Cheers

Re: Work Unit 7610 (12, 0, 224) appears very slow

Posted: Wed Apr 10, 2013 7:18 pm
by Skripka
P5-133XL wrote:Check task manager for non FAHCore_xx CPU usage. Specifically, FAHControl which can spontaneously start using a full core (the solution is to hut it down and restart it). But it does not have to be that, it could be any process that uses a significant amount of CPU usage (10%+ is plenty to cause this syndrome).

Folding is running at a very low Priority so that it does not interfere with your usage of the machine. Thus, it is designed to suspend threads whenever something else requests CPU usage. If one core thread gets suspended by something else running, the other threads, being highly synchronized, sit there idly waiting for that one thread to come back. So if anything else is trying to use the CPU the SMP TPF will drastically increase...

If this happens regularly, you can force SMP:3 in Advanced-Control leaving some CPU resource available for usage by other applications.
I've noticed this symptom increasingly with SMP units, especially the heavier-duty lower PPD units like 761* and 1013x and 764x...where TPF is unstable in WUs. I site the example below in this thread instead of starting a new issue because it isn't limited to just this particular WU in the OP or the example below....I have seen these symptoms sporadically over a dozen times this calendar year alone.

For example I have a 10138 (92,6,0) SMP unit on my e3-1230 (Xeon equivalent of an i7-2600 non-K) Windows7 64 bit machine, v7 client. Machine is not overclocked, and is otherwise perfectly stable, system is watercooled and well withing normal temp ranges. It was folding that same WU fine all last night unattended, check at 7AM still going fine...check again at 1030AM TPF has gone from 12-14min TPF (range due to using the machine in an attempt to keep the unit folding right) to 30 minute TPF. Every time I've seen this on an SMP unit the fix is to just move the mouse and open up resource monitor-or just do something on the machine and F@H will snap back to normal and the TPF will normalize, CPU utilization stays at 99%. Restarting the client doesn't clear it or reduce the symptoms anymore on the WUs that exhibit these symptoms that simply moving the mouse. And before you ask, the screen saver is off and other units like the 80XX WUs don't do this....it is only limited to these really big TPF SMP WUs.

That same SMP unit I just mentioned is still folding now, I left it unattended over lunch after using the machine from 1030-11 or so, and the TPF flipped back to 30 minutes for the last 3 hours. In typing this post, it has gone back to normal.

Code: Select all

01:15:18:WU01:FS00:0xa3:Project: 10138 (Run 92, Clone 6, Gen 0)
01:15:18:WU01:FS00:0xa3:
01:15:18:WU01:FS00:0xa3:Assembly optimizations on if available.
01:15:18:WU01:FS00:0xa3:Entering M.D.
01:15:24:WU00:FS00:Upload 20.88%
01:15:24:WU01:FS00:0xa3:Mapping NT from 8 to 8 
01:15:24:WU01:FS00:0xa3:Completed 0 out of 2000000 steps  (0%)
01:15:30:WU00:FS00:Upload 44.36%
01:15:36:WU00:FS00:Upload 62.63%
01:15:42:WU00:FS00:Upload 83.51%
01:15:47:WU00:FS00:Upload complete
01:15:47:WU00:FS00:Server responded WORK_ACK (400)
01:15:47:WU00:FS00:Final credit estimate, 11461.00 points
01:15:47:WU00:FS00:Cleaning up
01:27:07:WU01:FS00:0xa3:Completed 20000 out of 2000000 steps  (1%)
01:38:45:WU01:FS00:0xa3:Completed 40000 out of 2000000 steps  (2%)
01:50:22:WU01:FS00:0xa3:Completed 60000 out of 2000000 steps  (3%)
02:02:02:WU01:FS00:0xa3:Completed 80000 out of 2000000 steps  (4%)
02:13:38:WU01:FS00:0xa3:Completed 100000 out of 2000000 steps  (5%)
02:25:20:WU01:FS00:0xa3:Completed 120000 out of 2000000 steps  (6%)
02:36:59:WU01:FS00:0xa3:Completed 140000 out of 2000000 steps  (7%)
02:48:41:WU01:FS00:0xa3:Completed 160000 out of 2000000 steps  (8%)
03:00:26:WU01:FS00:0xa3:Completed 180000 out of 2000000 steps  (9%)
03:12:06:WU01:FS00:0xa3:Completed 200000 out of 2000000 steps  (10%)
03:23:53:WU01:FS00:0xa3:Completed 220000 out of 2000000 steps  (11%)
03:35:34:WU01:FS00:0xa3:Completed 240000 out of 2000000 steps  (12%)
******************************** Date: 10/04/13 ********************************
03:47:16:WU01:FS00:0xa3:Completed 260000 out of 2000000 steps  (13%)
03:59:01:WU01:FS00:0xa3:Completed 280000 out of 2000000 steps  (14%)
04:10:40:WU01:FS00:0xa3:Completed 300000 out of 2000000 steps  (15%)
04:22:25:WU01:FS00:0xa3:Completed 320000 out of 2000000 steps  (16%)
04:34:07:WU01:FS00:0xa3:Completed 340000 out of 2000000 steps  (17%)
04:45:47:WU01:FS00:0xa3:Completed 360000 out of 2000000 steps  (18%)
04:57:31:WU01:FS00:0xa3:Completed 380000 out of 2000000 steps  (19%)
05:09:19:WU01:FS00:0xa3:Completed 400000 out of 2000000 steps  (20%)
05:21:01:WU01:FS00:0xa3:Completed 420000 out of 2000000 steps  (21%)
05:34:54:WU01:FS00:0xa3:Completed 440000 out of 2000000 steps  (22%)
05:46:35:WU01:FS00:0xa3:Completed 460000 out of 2000000 steps  (23%)
05:58:21:WU01:FS00:0xa3:Completed 480000 out of 2000000 steps  (24%)
06:10:02:WU01:FS00:0xa3:Completed 500000 out of 2000000 steps  (25%)
06:21:43:WU01:FS00:0xa3:Completed 520000 out of 2000000 steps  (26%)
06:33:32:WU01:FS00:0xa3:Completed 540000 out of 2000000 steps  (27%)
06:45:13:WU01:FS00:0xa3:Completed 560000 out of 2000000 steps  (28%)
06:56:58:WU01:FS00:0xa3:Completed 580000 out of 2000000 steps  (29%)
07:08:42:WU01:FS00:0xa3:Completed 600000 out of 2000000 steps  (30%)
07:20:26:WU01:FS00:0xa3:Completed 620000 out of 2000000 steps  (31%)
07:35:54:WU01:FS00:0xa3:Completed 640000 out of 2000000 steps  (32%)
07:53:24:WU01:FS00:0xa3:Completed 660000 out of 2000000 steps  (33%)
08:05:10:WU01:FS00:0xa3:Completed 680000 out of 2000000 steps  (34%)
08:17:00:WU01:FS00:0xa3:Completed 700000 out of 2000000 steps  (35%)
08:28:54:WU01:FS00:0xa3:Completed 720000 out of 2000000 steps  (36%)
08:40:42:WU01:FS00:0xa3:Completed 740000 out of 2000000 steps  (37%)
08:52:34:WU01:FS00:0xa3:Completed 760000 out of 2000000 steps  (38%)
09:04:26:WU01:FS00:0xa3:Completed 780000 out of 2000000 steps  (39%)
09:16:17:WU01:FS00:0xa3:Completed 800000 out of 2000000 steps  (40%)
09:28:11:WU01:FS00:0xa3:Completed 820000 out of 2000000 steps  (41%)
09:39:58:WU01:FS00:0xa3:Completed 840000 out of 2000000 steps  (42%)
******************************** Date: 10/04/13 ********************************
09:51:49:WU01:FS00:0xa3:Completed 860000 out of 2000000 steps  (43%)
10:03:38:WU01:FS00:0xa3:Completed 880000 out of 2000000 steps  (44%)
10:15:25:WU01:FS00:0xa3:Completed 900000 out of 2000000 steps  (45%)
10:27:16:WU01:FS00:0xa3:Completed 920000 out of 2000000 steps  (46%)
10:39:05:WU01:FS00:0xa3:Completed 940000 out of 2000000 steps  (47%)
10:50:53:WU01:FS00:0xa3:Completed 960000 out of 2000000 steps  (48%)
11:02:42:WU01:FS00:0xa3:Completed 980000 out of 2000000 steps  (49%)
11:14:33:WU01:FS00:0xa3:Completed 1000000 out of 2000000 steps  (50%)
11:26:27:WU01:FS00:0xa3:Completed 1020000 out of 2000000 steps  (51%)
11:38:16:WU01:FS00:0xa3:Completed 1040000 out of 2000000 steps  (52%)
11:50:07:WU01:FS00:0xa3:Completed 1060000 out of 2000000 steps  (53%)
12:02:00:WU01:FS00:0xa3:Completed 1080000 out of 2000000 steps  (54%)
12:13:49:WU01:FS00:0xa3:Completed 1100000 out of 2000000 steps  (55%)
12:25:43:WU01:FS00:0xa3:Completed 1120000 out of 2000000 steps  (56%)
12:37:33:WU01:FS00:0xa3:Completed 1140000 out of 2000000 steps  (57%)
12:49:23:WU01:FS00:0xa3:Completed 1160000 out of 2000000 steps  (58%)
13:01:19:WU01:FS00:0xa3:Completed 1180000 out of 2000000 steps  (59%)
13:14:05:WU01:FS00:0xa3:Completed 1200000 out of 2000000 steps  (60%)
13:26:53:WU01:FS00:0xa3:Completed 1220000 out of 2000000 steps  (61%)
13:40:14:WU01:FS00:0xa3:Completed 1240000 out of 2000000 steps  (62%)
13:53:35:WU01:FS00:0xa3:Completed 1260000 out of 2000000 steps  (63%)
14:05:30:WU01:FS00:0xa3:Completed 1280000 out of 2000000 steps  (64%)
14:17:27:WU01:FS00:0xa3:Completed 1300000 out of 2000000 steps  (65%)
14:31:27:WU01:FS00:0xa3:Completed 1320000 out of 2000000 steps  (66%)
15:01:51:WU01:FS00:0xa3:Completed 1340000 out of 2000000 steps  (67%)
15:30:30:WU01:FS00:0xa3:Completed 1360000 out of 2000000 steps  (68%)
******************************** Date: 10/04/13 ********************************
15:52:22:WU01:FS00:0xa3:Completed 1380000 out of 2000000 steps  (69%)
16:05:40:WU01:FS00:0xa3:Completed 1400000 out of 2000000 steps  (70%)
16:18:05:WU01:FS00:0xa3:Completed 1420000 out of 2000000 steps  (71%)
16:30:15:WU01:FS00:0xa3:Completed 1440000 out of 2000000 steps  (72%)
16:42:15:WU01:FS00:0xa3:Completed 1460000 out of 2000000 steps  (73%)
16:55:15:WU01:FS00:0xa3:Completed 1480000 out of 2000000 steps  (74%)
17:22:38:WU01:FS00:0xa3:Completed 1500000 out of 2000000 steps  (75%)
17:34:39:WU01:FS00:0xa3:Completed 1520000 out of 2000000 steps  (76%)
17:46:47:WU01:FS00:0xa3:Completed 1540000 out of 2000000 steps  (77%)
18:10:02:WU01:FS00:0xa3:Completed 1560000 out of 2000000 steps  (78%)
18:39:30:WU01:FS00:0xa3:Completed 1580000 out of 2000000 steps  (79%)
18:54:12:WU01:FS00:0xa3:Completed 1600000 out of 2000000 steps  (80%)
19:06:54:WU01:FS00:0xa3:Completed 1620000 out of 2000000 steps  (81%)

Re: Work Unit 7610 (12, 0, 224) appears very slow - Updated

Posted: Fri Apr 12, 2013 2:02 am
by RLHay
P5-133XL wrote:Check task manager for non FAHCore_xx CPU usage. Specifically, FAHControl which can spontaneously start using a full core (the solution is to hut it down and restart it). But it does not have to be that, it could be any process that uses a significant amount of CPU usage (10%+ is plenty to cause this syndrome).

Folding is running at a very low Priority so that it does not interfere with your usage of the machine. Thus, it is designed to suspend threads whenever something else requests CPU usage. If one core thread gets suspended by something else running, the other threads, being highly synchronized, sit there idly waiting for that one thread to come back. So if anything else is trying to use the CPU the SMP TPF will drastically increase...

If this happens regularly, you can force SMP:3 in Advanced-Control leaving some CPU resource available for usage by other applications.
So, to update. I just tried setting folding power to medium using the slider. It crashed the WU. Setting the slider back to full resulting in a new WU being downloaded 7611 (3, 30, 353) - one that is giving me the expected ~5000ppd on this machine rather than ~220.

I've appended the relevant part of the log file for anyone who is interested. I note that the 7610 WU is flagged as bad. I'm not sure if this is because of the crash, or because of whatever was causing the very slow folding in the first place.

Code: Select all

******************************* Date: 2013-04-11 *******************************
19:16:24:WU00:FS00:0xa4:Completed 720000 out of 2000000 steps  (36%)
20:36:32:WU00:FS00:0xa4:Completed 740000 out of 2000000 steps  (37%)
21:58:20:WU00:FS00:0xa4:Completed 760000 out of 2000000 steps  (38%)
23:19:33:WU00:FS00:0xa4:Completed 780000 out of 2000000 steps  (39%)
00:41:01:WU00:FS00:0xa4:Completed 800000 out of 2000000 steps  (40%)
******************************* Date: 2013-04-12 *******************************
01:48:17:FS00:Shutting core down
01:48:19:WU00:FS00:0xa4:Client no longer detected. Shutting down core 
01:48:19:WU00:FS00:0xa4:
01:48:19:WU00:FS00:0xa4:Folding@home Core Shutdown: CLIENT_DIED
01:48:19:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
01:48:20:WU00:FS00:Starting
01:48:20:WARNING:WU00:FS00:Changed SMP threads from 4 to 3 this can cause some work units to fail
01:48:20:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/X/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 00 -suffix 01 -version 703 -lifeline 8284 -checkpoint 30 -np 3
01:48:20:WU00:FS00:Started FahCore on PID 7136
01:48:20:WU00:FS00:Core PID:7896
01:48:20:WU00:FS00:FahCore 0xa4 started
01:48:21:WU00:FS00:0xa4:
01:48:21:WU00:FS00:0xa4:*------------------------------*
01:48:21:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
01:48:21:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
01:48:21:WU00:FS00:0xa4:
01:48:21:WU00:FS00:0xa4:Preparing to commence simulation
01:48:21:WU00:FS00:0xa4:- Looking at optimizations...
01:48:21:WU00:FS00:0xa4:- Files status OK
01:48:21:WU00:FS00:0xa4:- Expanded 29813 -> 644556 (decompressed 2161.9 percent)
01:48:21:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=29813 data_size=644556, decompressed_data_size=644556 diff=0
01:48:21:WU00:FS00:0xa4:- Digital signature verified
01:48:21:WU00:FS00:0xa4:
01:48:21:WU00:FS00:0xa4:Project: 7610 (Run 12, Clone 0, Gen 224)
01:48:21:WU00:FS00:0xa4:
01:48:21:WU00:FS00:0xa4:Assembly optimizations on if available.
01:48:21:WU00:FS00:0xa4:Entering M.D.
01:48:27:WU00:FS00:0xa4:Using Gromacs checkpoints
01:48:27:WU00:FS00:0xa4:Mapping NT from 3 to 3 
01:48:27:WU00:FS00:0xa4:Resuming from checkpoint
01:48:27:WU00:FS00:0xa4:Verified 00/wudata_01.log
01:48:27:WU00:FS00:0xa4:Verified 00/wudata_01.trr
01:48:27:WU00:FS00:0xa4:Verified 00/wudata_01.xtc
01:48:27:WU00:FS00:0xa4:Verified 00/wudata_01.edr
01:48:28:WU00:FS00:0xa4:mdrun returned 255
01:48:28:WU00:FS00:0xa4:Going to send back what have done -- stepsTotalG=2000000
01:48:28:WU00:FS00:0xa4:Work fraction=0.0000 steps=2000000.
01:48:32:WU00:FS00:0xa4:logfile size=27635 infoLength=27635 edr=0 trr=25
01:48:32:WU00:FS00:0xa4:logfile size: 27635 info=27635 bed=0 hdr=25
01:48:32:WU00:FS00:0xa4:- Writing 28173 bytes of core data to disk...
01:48:32:WU00:FS00:0xa4:Done: 27661 -> 5596 (compressed to 20.2 percent)
01:48:32:WU00:FS00:0xa4:  ... Done.
01:48:33:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
01:48:33:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:7610 run:12 clone:0 gen:224 core:0xa4 unit:0x0000014f664f2dd04de6d29249ba5a43
01:48:33:WU00:FS00:Uploading 5.96KiB to 171.64.65.104
01:48:33:WU00:FS00:Connecting to 171.64.65.104:8080
01:48:33:WU01:FS00:Connecting to assign3.stanford.edu:8080
01:48:54:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
01:48:54:WU00:FS00:Connecting to 171.64.65.104:80
01:48:54:WU00:FS00:Upload 100.00%
01:48:55:WARNING:WU01:FS00:Failed to get assignment from 'assign3.stanford.edu:8080': Failed to connect to assign3.stanford.edu:8080: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
01:48:55:WU01:FS00:Connecting to assign4.stanford.edu:80
01:48:55:WU00:FS00:Upload complete
01:48:55:WU00:FS00:Server responded WORK_ACK (400)
01:48:55:WU00:FS00:Cleaning up