Page 1 of 4

Project: 7611 (Run 3, Clone 26, Gen 107) Very slow

Posted: Sat Mar 03, 2012 5:39 pm
by shauge
I run this on an I7 2600K which has run as expected up until now.
This WU has a TPF of about 39 min. It is at 11% now and ETA is 2.35 days. PPD for these days are 680.
At first I thought that some of the cores were occupied by other tasks, but I think I have ruled that out now.

I don't have trouble running long lasting WUs, I don't have trouble running low scoreing WUs either, but I don't like low scoreing WUs that take several days to complete.

Is it really necessary to have such a big variance in scoring?

Re: 7611 (3, 26, 107) Very slow

Posted: Sat Mar 03, 2012 5:47 pm
by SomeStones
7611 DOES run slower and the variation in PPD has been reported on several occasions. The last ones I documented got about 75% of the PPD I expected. Other users have reported even slower (or lower PPD) results. PPD varies greatly depending upon the WU and systems it is being run on but once you get comfortable with a system, the slower jobs stand out.

You can find all of the other 7611 comments using the search feature in the top right corner of the page.

Re: 7611 (3, 26, 107) Very slow

Posted: Sat Mar 03, 2012 6:47 pm
by Joe_H
Yes, I have seen 7611 WU's be very variable, but 36 TPF is high. It is pointing to something else taking resources even though you think that is ruled out. Best I can suggest is pausing the unit and restarting your system to make sure nothing else is running. I have also seen these WU's being very sensitive to the slightest amount of thread synchronization issues, so it might be better running at SMP 6 instead of 8.

Edit: Thought came to me later, any chance HT has been disabled on your CPU? OS and system was not mentioned, HT can be disabled with some and would also result in very long frame times when 8 threads try to go on a 4-thread CPU.

Re: 7611 (3, 26, 107) Very slow

Posted: Sat Mar 03, 2012 10:47 pm
by shauge
Thanks Joe_H,

this is Win7 and the v7 client. I do run 2 nvidia 570 clients together with the smp. As I said it has run ok priorly. I checked the log and the wu just prior did run as expected. I stopped the gpu clients with not much difference to the smp client. Reducing from smp 8 to smp 6 only increased the TPF. However, when I rebooted, I lost the smp wu and a new one was started. I don't know why, I closed it properly with the quit command. the new wu runs more as expected again.
It could have been another process not showing in task manager taking cpu power. I'll keep an eye on the system to see if it happens again.

Re: 7611 (3, 26, 107) Very slow

Posted: Sun Mar 04, 2012 4:30 am
by Ravage7779
Seen a couple of these now that have really long frame times for no apparent reason. I dunno, it's normal to get one thats abnormal on occasion with this project?

Re: 7611 (3, 26, 107) Very slow

Posted: Sun Mar 04, 2012 5:31 am
by bruce
Start with the list of projects http://fah-web.stanford.edu/psummary.html

Divide the list into categories: Projects available to the Uniprocessor, Projects available to SMP, Projects available to GPUs.Projects available to bigadv.

Within each category, the BASELINE points should be approximately proportional to the time the WU runs. (That's true on the benchmark machine but it may not be true when there are significant hardware differences.)

Bonus points are NOT included in that statement. The fundamental problem that I see is that everybody assumes that they'll always get a similar bonus, which is simply not true.

The other problem that I see is that many uniprocessor projects can also be assigned to SMP clients. That probably makes a difference too, but I've never researched it.

Re: 7611 (3, 26, 107) Very slow

Posted: Sun Mar 04, 2012 5:49 am
by Ravage7779
True, the bonus is different for every unit, but his time per frame is way out of whack, even for p7611. He's got frame times i normally see on a c2d chip for those units. He should be blowing that chip out of the water, setting it on fire, and reducing it to a set of unrelated molecules. There are some real clunkers in there for some reason that just take foooooooooooooreverrrrrrrr to run. I'd be asking questions too, its certainly far enough out of whack to be asking if something is wrong.

Re: 7611 (3, 26, 107) Very slow

Posted: Sun Mar 04, 2012 6:22 am
by Joe_H
Exactly, something may have been wrong, though unless the unit gets re-assigned to the same machine it may be impossible to find out what. But as an example, on my 2.8 GHz i7-860 the normal TPF for a 7611 varies between just under 8 minutes to around 9 minutes. At the worst I have seen a few 7611's show TPF's of 12-15 minutes, and as best as I can tell nothing else was running as the CPU for the folding core was reported at around 790+%. At 39 minutes on a faster i7, that specific WU is well outside a reasonable expectation based on others in the same project.

Re: 7611 (3, 26, 107) Very slow

Posted: Sun Mar 04, 2012 6:39 am
by bruce
Please post the first few pages from the top of your log (or from the System Info tab).

Also, a screenshot of Task manager with and without SMP running.

Are you trying to run both FAH and BOINC?

Re: 7611 (3, 26, 107) Very slow

Posted: Sun Mar 04, 2012 6:51 pm
by shauge
bruce wrote:Please post the first few pages from the top of your log (or from the System Info tab).

Code: Select all

09:04:13:WU01:FS02:0xa4:Completed 9800000 out of 10000000 steps  (98%)
09:06:19:WU01:FS02:0xa4:Completed 9900000 out of 10000000 steps  (99%)
09:06:20:WU03:FS02:Connecting to assign3.stanford.edu:8080
09:06:21:WU03:FS02:News: Welcome to Folding@Home
09:06:21:WU03:FS02:Assigned to work server 171.64.65.104
09:06:21:WU03:FS02:Requesting new work unit for slot 02: RUNNING smp:8 from 171.64.65.104
09:06:21:WU03:FS02:Connecting to 171.64.65.104:8080
09:06:22:WU03:FS02:Downloading 29.66KiB
09:06:22:WU03:FS02:Download complete
09:06:22:WU03:FS02:Received Unit: id:03 state:DOWNLOAD error:OK project:7611 run:3 clone:26 gen:107 core:0xa4 unit:0x000000a0664f2dd04df0f5b6477c19dd
09:08:26:WU01:FS02:0xa4:Completed 10000000 out of 10000000 steps  (100%)
09:08:26:WU01:FS02:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
09:08:36:WU01:FS02:0xa4:
09:08:36:WU01:FS02:0xa4:Finished Work Unit:
09:08:36:WU01:FS02:0xa4:- Reading up to 2131908 from "01/wudata_01.trr": Read 2131908
09:08:36:WU01:FS02:0xa4:trr file hash check passed.
09:08:36:WU01:FS02:0xa4:- Reading up to 220724 from "01/wudata_01.xtc": Read 220724
09:08:36:WU01:FS02:0xa4:xtc file hash check passed.
09:08:36:WU01:FS02:0xa4:edr file hash check passed.
09:08:36:WU01:FS02:0xa4:logfile size: 79221
09:08:36:WU01:FS02:0xa4:Leaving Run
09:08:39:WU01:FS02:0xa4:- Writing 2456325 bytes of core data to disk...
09:08:39:WU01:FS02:0xa4:Done: 2455813 -> 1716806 (compressed to 69.9 percent)
09:08:39:WU01:FS02:0xa4:  ... Done.
09:08:39:WU01:FS02:0xa4:- Shutting down core
09:08:39:WU01:FS02:0xa4:
09:08:39:WU01:FS02:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
09:08:39:WU01:FS02:FahCore returned: FINISHED_UNIT (100 = 0x64)
09:08:39:WU01:FS02:Sending unit results: id:01 state:SEND error:OK project:7006 run:1 clone:123 gen:54 core:0xa4 unit:0x000000930001329c4dfb8f06f915c638
09:08:39:WU01:FS02:Uploading 1.64MiB to 129.74.85.15
09:08:39:WU03:FS02:Starting
09:08:39:WU01:FS02:Connecting to 129.74.85.15:8080
09:08:39:WU03:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/ProgramData/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 03 -suffix 01 -version 701 -checkpoint 15 -np 8
09:08:39:WU03:FS02:Started FahCore on PID 5720
09:08:39:WU03:FS02:Core PID:6964
09:08:39:WU03:FS02:FahCore 0xa4 started
09:08:40:WU03:FS02:0xa4:
09:08:40:WU03:FS02:0xa4:*------------------------------*
09:08:40:WU03:FS02:0xa4:Folding@Home Gromacs GB Core
09:08:40:WU03:FS02:0xa4:Version 2.27 (Dec. 15, 2010)
09:08:40:WU03:FS02:0xa4:
09:08:40:WU03:FS02:0xa4:Preparing to commence simulation
09:08:40:WU03:FS02:0xa4:- Looking at optimizations...
09:08:40:WU03:FS02:0xa4:- Created dyn
09:08:40:WU03:FS02:0xa4:- Files status OK
09:08:40:WU03:FS02:0xa4:- Expanded 29855 -> 644556 (decompressed 2158.9 percent)
09:08:40:WU03:FS02:0xa4:Called DecompressByteArray: compressed_data_size=29855 data_size=644556, decompressed_data_size=644556 diff=0
09:08:40:WU03:FS02:0xa4:- Digital signature verified
09:08:40:WU03:FS02:0xa4:
09:08:40:WU03:FS02:0xa4:Project: 7611 (Run 3, Clone 26, Gen 107)
09:08:40:WU03:FS02:0xa4:
09:08:40:WU03:FS02:0xa4:Assembly optimizations on if available.
09:08:40:WU03:FS02:0xa4:Entering M.D.
09:08:45:WU01:FS02:Upload 22.90%
09:08:46:WU03:FS02:0xa4:Mapping NT from 8 to 8 
09:08:46:WU03:FS02:0xa4:Completed 0 out of 2000000 steps  (0%)
09:08:51:WU01:FS02:Upload 57.24%
09:08:57:WU01:FS02:Upload 95.40%
09:08:59:WU01:FS02:Upload complete
09:08:59:WU01:FS02:Server responded WORK_ACK (400)
09:08:59:WU01:FS02:Final credit estimate, 4518.00 points
09:08:59:WU01:FS02:Cleaning up
09:47:44:WU03:FS02:0xa4:Completed 20000 out of 2000000 steps  (1%)
10:27:00:WU03:FS02:0xa4:Completed 40000 out of 2000000 steps  (2%)
The log section I chose was from just before the previous finished, to show that the previous looked to be running ok, and to just the first frames after the troubled one started.
bruce wrote:Also, a screenshot of Task manager with and without SMP running.
The task errored out with "mdrun returned 255" when I restarted the machine.
bruce wrote:Are you trying to run both FAH and BOINC?
No, this machine do only folding.

As I posted earlier, I can't rule out that the machine had a problem since I no longer have the task.

Re: 7611 (3, 26, 107) Very slow

Posted: Fri Mar 09, 2012 12:53 am
by PS3hep48
I'm having the same 39 min TPF issue with WU 7610 (Run 501, Clone 0, Gen 54). Received unit around 2100 UTC and it's completed less than 5 frames since.

Re: Project: 7611 (Run 3, Clone 26, Gen 107) Very slow

Posted: Sun Mar 18, 2012 11:47 pm
by HendricksSA
Add me to the list of people with slow 7611 SMP work units. I had a couple last week and didn't report them. I've got another. Here is a log extract from my computer on the 14th processing a 7611. Normal TPF for this machine is about 18 minutes.

Code: Select all

19:09:33:WU01:FS01:Starting
19:09:33:WU01:FS01:Running FahCore: "C:\Program Files\FAHClient/FAHCoreWrapper.exe" "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a4.fah/FahCore_a4.exe" -dir 01 -suffix 01 -version 701 -lifeline 3256 -checkpoint 15 -np 4
19:09:33:WU01:FS01:Started FahCore on PID 1728
19:09:33:Started thread 14 on PID 3256
19:09:33:WU01:FS01:Core PID:3016
19:09:33:WU01:FS01:FahCore 0xa4 started
19:09:34:WU01:FS01:0xa4:
19:09:34:WU01:FS01:0xa4:*------------------------------*
19:09:34:WU01:FS01:0xa4:Folding@Home Gromacs GB Core
19:09:34:WU01:FS01:0xa4:Version 2.27 (Dec. 15, 2010)
19:09:34:WU01:FS01:0xa4:
19:09:34:WU01:FS01:0xa4:Preparing to commence simulation
19:09:34:WU01:FS01:0xa4:- Looking at optimizations...
19:09:34:WU01:FS01:0xa4:- Created dyn
19:09:34:WU01:FS01:0xa4:- Files status OK
19:09:34:WU01:FS01:0xa4:- Expanded 339028 -> 644556 (decompressed 190.1 percent)
19:09:34:WU01:FS01:0xa4:Called DecompressByteArray: compressed_data_size=339028 data_size=644556, decompressed_data_size=644556 diff=0
19:09:34:WU01:FS01:0xa4:- Digital signature verified
19:09:34:WU01:FS01:0xa4:
19:09:34:WU01:FS01:0xa4:Project: 7611 (Run 0, Clone 9, Gen 138)
19:09:34:WU01:FS01:0xa4:
19:09:34:WU01:FS01:0xa4:Assembly optimizations on if available.
19:09:34:WU01:FS01:0xa4:Entering M.D.
19:09:36:WU00:FS01:Upload complete
19:09:36:WU00:FS01:Server responded WORK_ACK (400)
19:09:36:WU00:FS01:Final credit estimate, 3462.00 points
19:09:36:WU00:FS01:Cleaning up
19:09:40:WU01:FS01:0xa4:Mapping NT from 4 to 4 
19:09:40:WU01:FS01:0xa4:Completed 0 out of 2000000 steps  (0%)
******************************** Date: 14/03/12 ********************************
19:27:43:WU01:FS01:0xa4:Completed 20000 out of 2000000 steps  (1%)
19:45:55:WU01:FS01:0xa4:Completed 40000 out of 2000000 steps  (2%)
20:03:54:WU01:FS01:0xa4:Completed 60000 out of 2000000 steps  (3%)
20:22:05:WU01:FS01:0xa4:Completed 80000 out of 2000000 steps  (4%)
20:40:04:WU01:FS01:0xa4:Completed 100000 out of 2000000 steps  (5%)
20:57:30:WU01:FS01:0xa4:Completed 120000 out of 2000000 steps  (6%)
21:14:35:WU01:FS01:0xa4:Completed 140000 out of 2000000 steps  (7%)
21:31:37:WU01:FS01:0xa4:Completed 160000 out of 2000000 steps  (8%)
Now, here is a filtered log from the 16th with one of the 36 1/2 minute TFP 7611 work units on the same machine. It was restarted with no change in TFP. Task manager shows a4 core with a consistent 99% CPU load. PPD for this quad core machine has been pushed down to about 700. This is the first time I've seen such a huge range in TFP within one project. They do seem to share the same Run/Clone series.

Code: Select all

*********************** Log Started 2012-03-16T21:09:54Z ***********************
21:10:14:FS01:Unpaused
21:10:14:WU02:FS01:Starting
21:10:14:WU02:FS01:Running FahCore: "C:\Program Files\FAHClient/FAHCoreWrapper.exe" "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a4.fah/FahCore_a4.exe" -dir 02 -suffix 01 -version 701 -lifeline 3060 -checkpoint 15 -np 4
21:10:15:WU02:FS01:Started FahCore on PID 3384
21:10:15:WU02:FS01:Core PID:3404
21:10:15:WU02:FS01:FahCore 0xa4 started
21:10:15:WU02:FS01:0xa4:
21:10:15:WU02:FS01:0xa4:*------------------------------*
21:10:15:WU02:FS01:0xa4:Folding@Home Gromacs GB Core
21:10:15:WU02:FS01:0xa4:Version 2.27 (Dec. 15, 2010)
21:10:15:WU02:FS01:0xa4:
21:10:15:WU02:FS01:0xa4:Preparing to commence simulation
21:10:15:WU02:FS01:0xa4:- Looking at optimizations...
21:10:15:WU02:FS01:0xa4:- Files status OK
21:10:15:WU02:FS01:0xa4:- Expanded 29855 -> 644556 (decompressed 2158.9 percent)
21:10:15:WU02:FS01:0xa4:Called DecompressByteArray: compressed_data_size=29855 data_size=644556, decompressed_data_size=644556 diff=0
21:10:15:WU02:FS01:0xa4:- Digital signature verified
21:10:15:WU02:FS01:0xa4:
21:10:15:WU02:FS01:0xa4:Project: 7611 (Run 3, Clone 26, Gen 110)
21:10:15:WU02:FS01:0xa4:
21:10:15:WU02:FS01:0xa4:Assembly optimizations on if available.
21:10:15:WU02:FS01:0xa4:Entering M.D.
21:10:21:WU02:FS01:0xa4:Using Gromacs checkpoints
21:10:21:WU02:FS01:0xa4:Mapping NT from 4 to 4 
21:10:22:WU02:FS01:0xa4:Resuming from checkpoint
21:10:22:WU02:FS01:0xa4:Verified 02/wudata_01.log
21:10:22:WU02:FS01:0xa4:Verified 02/wudata_01.trr
21:10:22:WU02:FS01:0xa4:Verified 02/wudata_01.xtc
21:10:22:WU02:FS01:0xa4:Verified 02/wudata_01.edr
21:10:22:WU02:FS01:0xa4:Completed 394730 out of 2000000 steps  (19%)
21:19:58:WU02:FS01:0xa4:Completed 400000 out of 2000000 steps  (20%)
21:56:26:WU02:FS01:0xa4:Completed 420000 out of 2000000 steps  (21%)
22:32:53:WU02:FS01:0xa4:Completed 440000 out of 2000000 steps  (22%)
23:09:20:WU02:FS01:0xa4:Completed 460000 out of 2000000 steps  (23%)
23:45:47:WU02:FS01:0xa4:Completed 480000 out of 2000000 steps  (24%)
00:22:18:WU02:FS01:0xa4:Completed 500000 out of 2000000 steps  (25%)
00:58:48:WU02:FS01:0xa4:Completed 520000 out of 2000000 steps  (26%)
01:35:15:WU02:FS01:0xa4:Completed 540000 out of 2000000 steps  (27%)

Re: Project: 7611 (Run 3, Clone 26, Gen 107) Very slow

Posted: Mon Mar 19, 2012 7:51 am
by PantherX
I have informed the Project owner about this so let's see what happens.

Re: Project: 7611 (Run 3, Clone 26, Gen 107) Very slow

Posted: Mon Mar 19, 2012 6:48 pm
by tjlane
Hi All,

First, thanks for reporting issues with a specific WU. Based on this information, I was able to track down a problem with a very small subset (< 1%) of p7611 WUs, including (3, 26, 107). I believe I have been able to address most of the immediate issues by simply ending these bad WUs. Please continue to report any issues, especially wild deviations from normal WU behavior.

Thanks!

TJ

Re: Project: 7611 (Run 3, Clone 26, Gen 107) Very slow

Posted: Mon Mar 19, 2012 7:21 pm
by bruce
Can you modify the FahCore so that these <1% of WUs issue an error? Tracking down a problem which doesn't produce an error message is a lot harder to do that one which the software can identify as defective. I'm concerned that what happened with this project will happen again on the next project, even if you have been successful in stopping all of WUs that were bad for this project.