Unusually low PPD on project 16927?

Posted: Fri Dec 18, 2020 8:26 am
by Hopfgeist
Hi there,

normally I get some variation in PPD from different projects, and that is expected, because not all CPUs are identical, but on my main CPU folding system (dual Xeon X5675) I almost always get between 85,000 and 115,000 PPD.

However, Project 16927 consistently gives just around 45,000 PPD, which is quite a big outlier.

Are there other people experiencing this, or is the benchmark for this project known to be skewed?

I don't worry too much about it, just curious, because one other machine (single Xeon E5-1428L v2), which is, on all other benchmarks, roughly half as fast, is working on Project 17216 and is achieving 75,000 PPD compared to its normal 45,000--50,000.

Just curious,
HG.

Re: Unusually low PPD on project 16927?

Posted: Fri Dec 18, 2020 8:57 am
by Neil-B
Have you checked your logs? .. not sure if there is a thread limit on that project .. the log will show if the as/client is running a lower thread count wu .. this can happen if the project has a thread limit and there aren't any WUs that will fully use your kit.

There are some variable ppd projects around at the moment which are peaking high or low depending on the kit in play.

Re: Unusually low PPD on project 16927?

Posted: Fri Dec 18, 2020 9:26 am
by Hopfgeist
Neil-B wrote:Have you checked your logs? .. not sure if there is a thread limit on that project .. the log will show if the as/client is running a lower thread count wu .. this can happen if the project has a thread limit and there aren't any WUs that will fully use your kit.

There are some variable ppd projects around at the moment which are peaking high or low depending on the kit in play.
Thanks for the reply. Yes, I checked, and it was running on all 24 threads.

Code:

04:42:14:WU01:FS00:0xa7:Project: 16927 (Run 24, Clone 226, Gen 4)
04:42:14:WU01:FS00:0xa7:Unit: 0x00000000000000000000000000000000
04:42:14:WU01:FS00:0xa7:Reading tar file core.xml
04:42:14:WU01:FS00:0xa7:Reading tar file frame4.tpr
04:42:14:WU01:FS00:0xa7:Digital signatures verified
04:42:14:WU01:FS00:0xa7:Calling: mdrun -s frame4.tpr -o frame4.trr -cpt 15 -nt 24
04:42:15:WU01:FS00:0xa7:Steps: first=2000000 total=500000
04:42:17:WU01:FS00:0xa7:Completed 1 out of 500000 steps (0%)
04:44:23:WU01:FS00:0xa7:Completed 5000 out of 500000 steps (1%)
04:46:29:WU01:FS00:0xa7:Completed 10000 out of 500000 steps (2%)
So it was started with "-nt 24", and CPU monitoring tools confirm that it actually runs 24 threads.
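
As a rough cross-check of the log excerpt above, the progress lines can be turned into a time-per-percent figure. The following is a minimal Python sketch, not part of the FAH client: the function names are made up for illustration, and the regular expression assumes the log line layout shown in the excerpt.

```python
import re
from datetime import datetime

# Progress lines copied from the log excerpt above.
LOG = """\
04:42:17:WU01:FS00:0xa7:Completed 1 out of 500000 steps (0%)
04:44:23:WU01:FS00:0xa7:Completed 5000 out of 500000 steps (1%)
04:46:29:WU01:FS00:0xa7:Completed 10000 out of 500000 steps (2%)
"""

# Assumed layout: HH:MM:SS:WUnn:FSnn:0xNN:Completed <done> out of <total> steps
PATTERN = re.compile(
    r"^(\d\d:\d\d:\d\d):WU\d+:FS\d+:0x[0-9a-f]+:Completed (\d+) out of (\d+) steps"
)


def progress_points(log_text):
    """Return (timestamp, steps_done, steps_total) for every progress line."""
    points = []
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            points.append((stamp, int(match.group(2)), int(match.group(3))))
    return points


def seconds_per_percent(points):
    """Average wall-clock seconds per 1% between the first and last point."""
    (t0, s0, total), (t1, s1, _) = points[0], points[-1]
    elapsed = (t1 - t0).total_seconds()
    if elapsed < 0:
        elapsed += 24 * 3600  # the log only has time of day; handle midnight wrap
    percent_done = (s1 - s0) / total * 100
    return elapsed / percent_done


points = progress_points(LOG)
rate = seconds_per_percent(points)
print(f"{rate:.0f} s per percent, about {rate * 100 / 3600:.1f} h for the whole WU")
```

For the three lines above this works out to roughly 126 seconds per percent, or about 3.5 hours for the whole work unit on this machine.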

I have never seen a WU assigned to me using fewer than the advertised number of cores. However, I have frequently seen a message that no WUs were available for my configuration, but that is expected occasionally.

(As noted before in another post, I get almost identical PPD whether running on all 24 CPU threads, or just one folding thread per physical CPU core, and this WU was no exception when I stopped it and restarted with reduced thread count.)

The specific work unit was Project: 16927 (Run 24, Clone 226, Gen 4), which has since been successfully uploaded, but not yet credited.

As I said, not to worry too much, just making sure nothing is broken, or my machine isn't somehow acting weird after a small kernel upgrade.


Cheers,
HG.

Re: Unusually low PPD on project 16927?

Posted: Fri Dec 18, 2020 1:07 pm
by Maddog
"Are there other people experiencing this, or is the benchmark for this project known to be skewed?"

Yes, had a couple of those, my 8 thread Intel CPU needs over 4 hours to complete these work units. PPD is under half the normal average of 60,000 PPD.

Just checked the one that finished and uploaded earlier this morning (14. 684. 4) : not found.

Re: Unusually low PPD on project 16927?

Posted: Sat Dec 19, 2020 6:01 pm
by DrBB1
I just came to the forum to check on this very issue. Am running on a 10-year old PC, where I usually earn about 8000-9000 PPD on my WUs. This WU (project:16927 run:7 clone:956 gen:3) earned about 2500 PPD, and took over a day to complete (total points: 2793).

Re: Unusually low PPD on project 16927?

Posted: Sat Dec 19, 2020 6:04 pm
by Joe_H
This should be fixed now for WUs assigned today and onwards. A setting was changed a couple days ago as part of the project being moved to a new server and has been corrected.

Re: Unusually low PPD on project 16927?

Posted: Sat Dec 19, 2020 7:22 pm
by bruce
Hopfgeist wrote: (As noted before in another post, I get almost identical PPD whether running on all 24 CPU threads, or just one folding thread per physical CPU core, and this WU was no exception when I stopped it and restarted with reduced thread count.)
This is not surprising. Modern CPUs typically are designed so that two IPUs share the resources of one FPU. This reduces the cost of the chip and makes use of the fact that for "normal" computer use, the FPU is under-utilized. FAH isn't "normal" since it depends mostly on the throughput of the FPU.

A second factor: You may also be experiencing thermal limiting. If your typical clock rate is below the rated boost speed, you MIGHT get more throughput by upgrading your cooling subsystem. Processors often are designed to accommodate brief speed excursions above the average clock rate as long as the temperature rise is brief enough. Then they can advertise a speed that's above what can actually be achieved long-term.

Re: Unusually low PPD on project 16927?

Posted: Sun Dec 20, 2020 7:22 am
by DrBB1
Joe_H wrote:This should be fixed now for WUs assigned today and onwards. A setting was changed a couple days ago as part of the project being moved to a new server and has been corrected.
Got another one 5 hours ago. Currently running at estimated 2805 PPD, about one-third of what I normally earn. Problem is not fixed. [Project: 16927 (Run 17, Clone 52, Gen 48)]

UPDATE: About halfway through this WU, the time per frame was cut in half; estimated PPD now back to a reasonable (though still below average) 7000+ PPD. After 16 hours, it's still only 85% finished, but progressing.

Re: Unusually low PPD on project 16927?

Posted: Mon Dec 21, 2020 4:54 am
by psaam0001
I also had a few work units cross my path, with an unusually short completion due by time (less than others that I have had before).

They are:

16927 (19, 311, 3)
16927 (29, 358, 4)
16927 (28, 444, 4)
16927 (20, 324, 3)
16927 (30, 333, 3)

My intent is to let them expire, as I did get a different WU from this project that was more consistent with the previous completion due by time frames.

Paul

Re: Unusually low PPD on project 16927?

Posted: Mon Dec 21, 2020 2:00 pm
by Hopfgeist
psaam0001 wrote:I also had a few work units cross my path, with an unusually short completion due by time (less than others that I have had before).

They are:

16927 (19, 311, 3)
16927 (29, 358, 4)
16927 (28, 444, 4)
16927 (20, 324, 3)
16927 (30, 333, 3)

My intent is to let them expire, as I did get a different WU from this project that was more consistent with the previous completion due by time frames.

Paul
You mean these work units have a shorter due-by time than other work units from the same project? That would be highly unusual, and indicative of an error.

At least as currently listed on the summary page, project 16927 has a reasonable timeout (2 days), and an unusually long deadline (20 days). The latter presumably because it is not a disease-related project, and thus not considered urgent or otherwise high-priority.

Otherwise, if your machine can handle it within that timeframe, I strongly recommend letting your machine work on them. Work units within one project/run/clone combination are sequential, and the next "gen" work unit depends on the previous one. Letting them expire blocks progress on that chain until the deadline passes.

Cheers,
HG.

Re: Unusually low PPD on project 16927?

Posted: Mon Dec 21, 2020 4:34 pm
by Joe_H
For a short time the 16927 WUs were getting a timeout of 3 days and a final deadline of 5. That happened to be the settings for some other projects being relocated to the new servers. The change in final deadline resulted in lower PPD and final bonus credit. The project dates back a while to when final deadlines for lower priority projects were longer, and was benchmarked as such. To get comparable credit the project would need to be benchmarked again with a different deadline.
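
To illustrate why a shorter final deadline lowers credit, here is a small Python sketch. It assumes the quick-return bonus formula as described in the public Folding@home points documentation, final points = base points × max(1, sqrt(k × deadline / elapsed)). The k value, base points and elapsed time below are made-up illustration values, not the actual settings of project 16927.

```python
import math


def bonus_factor(k, deadline_days, elapsed_days):
    """Quick-return bonus multiplier (assumed formula, see the note above)."""
    return max(1.0, math.sqrt(k * deadline_days / elapsed_days))


# Hypothetical values, chosen only to make the arithmetic easy to follow.
K_FACTOR = 0.75       # not the real k of project 16927
BASE_POINTS = 1000    # not the real base credit
ELAPSED_DAYS = 0.15   # roughly 3.5 hours of wall-clock time

for deadline_days in (20, 5):
    factor = bonus_factor(K_FACTOR, deadline_days, ELAPSED_DAYS)
    print(f"deadline {deadline_days:>2} days: "
          f"bonus x{factor:.2f}, credit {BASE_POINTS * factor:.0f}")

# As long as the bonus applies in both cases, the ratio does not depend on k,
# base points or elapsed time: it is sqrt(20 / 5).
ratio = bonus_factor(K_FACTOR, 20, ELAPSED_DAYS) / bonus_factor(K_FACTOR, 5, ELAPSED_DAYS)
print(f"credit ratio, 20-day vs 5-day deadline: {ratio:.2f}")
```

Under that assumption, cutting the final deadline from 20 days to 5 roughly halves the credit for the same work, which is in line with the "about half the usual PPD" reports earlier in the thread.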

Re: Unusually low PPD on project 16927?

Posted: Mon Dec 21, 2020 4:59 pm
by psaam0001
The majority of those work units I had been getting had an expiration time of 20 days from when I received them to start folding. However, the ones I mentioned had a much shorter time to complete.

It's not that I was complaining about the points; I was trying to see if I just caught the last batch of WUs that were sent out before the server change.

Paul

Re: Unusually low PPD on project 16927?

Posted: Sat Feb 13, 2021 4:53 pm
by zotric
Update following my original report (below)
I've stopped using the CPU now anyway so I'm leaving this here just in case it's useful.

Summary (following the original report in which the thread count seemed to be ignored for project 16927 (7, 1513, 27)):
1. When the PC was rebooted, the number of cores in use was found to match the thread setting in FAHControl and performance was OK.
2. Errors continued to be reported in the log.
3. Project 16927 was eventually abandoned by the system (because of the error count?)

Detail:
Following my original report, below, I restarted the PC and the core usage went up correctly to the setting in FAHControl (-1).
A shock because all 28 threads started and the temperature hit 90 degrees!
I don't think it has behaved this way before - I thought it responded straight away when the number of threads was changed in FAHControl.
Turned down the thread count to 6 and restarted again - not wanting the CPU to melt.
Then I found that the WU for project 16927 (Unspecified, Temple University) had been abandoned and a new one started (17423 - Myosins, Washington University in St. Louis).
Logging had stopped.
WU for project 17423 completed, apparently successfully.

Logs show error 0x40010004, the thread count being set back to 3 followed by more errors - I think I set it to -1 but the system seems to have set it back to 3.
Then there is an error which is repeated several times.
This is mixed up with the CUDA core (0x22) starting or restarting, which seems unrelated. I have not seen any errors with the GPU.

15:27:12:WARNING:WU00:FS00:FahCore crashed with Windows unhandled exception code 0x40010004, searching for this code online may provide more information
15:27:12:WARNING:WU00:FS00:FahCore returned: UNKNOWN_ENUM (1073807364 = 0x40010004)
15:27:12:WARNING:WU01:FS01:FahCore crashed with Windows unhandled exception code 0x40010004, searching for this code online may provide more information
15:27:12:WARNING:WU01:FS01:FahCore returned: UNKNOWN_ENUM (1073807364 = 0x40010004)
15:27:12:WU00:FS00:Starting
15:27:12:WARNING:WU00:FS00:AS lowered CPUs from 27 to 3
15:27:12:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\david\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7.exe -dir 00 -suffix 01 -version 706 -lifeline 4116 -checkpoint 6 -np 3
15:27:12:WU00:FS00:Started FahCore on PID 46188
15:27:12:WU00:FS00:FahCore 0xa7 started
15:27:13:WARNING:WU00:FS00:FahCore returned an unknown error code which probably indicates that it crashed
15:27:13:WARNING:WU00:FS00:FahCore returned: UNKNOWN_ENUM (-1073741205 = 0xc000026b)
15:27:13:WU01:FS01:Starting
15:27:13:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\david\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 706 -lifeline 4116 -checkpoint 6 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
15:27:13:WU01:FS01:Started FahCore on PID 49052
15:27:13:WU01:FS01:FahCore 0x22 started
15:27:13:WARNING:WU01:FS01:FahCore returned an unknown error code which probably indicates that it crashed
15:27:13:WARNING:WU01:FS01:FahCore returned: UNKNOWN_ENUM (-1073741205 = 0xc000026b)

Later the log says FahCore (presumably 0xa7) crashed several times with the same errors.
Logging seems to have stopped after that time, so there is no record of the WU for project 17423, which started later and completed before I removed the CPU entry from FAHControl.
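
For anyone who wants to check their own log for this kind of thread-count change, here is a minimal Python sketch. The function name and the sample are only for illustration: the sample lines are taken from the logs in this thread (with the long FahCore path left out), and the patterns assume the "AS lowered CPUs from X to Y" wording and the `-np` / `-nt` options seen there.

```python
import re

# Patterns based on the log lines quoted in this thread.
LOWERED = re.compile(r"AS lowered CPUs from (\d+) to (\d+)")
THREADS = re.compile(r"\s-(?:np|nt) (\d+)\b")


def thread_events(log_text):
    """Collect thread-count events from a FAHClient log, in order of appearance."""
    events = []
    for line in log_text.splitlines():
        match = LOWERED.search(line)
        if match:
            # The assignment server reduced the CPU count for this WU.
            events.append(("lowered", int(match.group(1)), int(match.group(2))))
            continue
        match = THREADS.search(line)
        if match:
            # Thread count actually passed to the core on startup.
            events.append(("started", int(match.group(1))))
    return events


# Sample lines; the executable path is left out to keep the example short.
SAMPLE = """\
15:27:12:WARNING:WU00:FS00:AS lowered CPUs from 27 to 3
15:27:12:WU00:FS00:Running FahCore: -dir 00 -suffix 01 -version 706 -lifeline 4116 -checkpoint 6 -np 3
04:42:14:WU01:FS00:0xa7:Calling: mdrun -s frame4.tpr -o frame4.trr -cpt 15 -nt 24
"""

for event in thread_events(SAMPLE):
    print(event)
```

A "lowered" event followed by a "started" event with the smaller number means the WU really is running on fewer threads than the slot is configured for.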

Original:
Still seeing unusually low PPD for project 16927.
Summary for Work Unit(16927 (30, 1175, 33)): low core count used, low PPD per core, 20 days is a surprisingly long time given to complete.
1. Part of the cause for the low PPD is that this work unit is only using four cores, at 100%, out of 14 on a 10940X processor.
I know of no way to find out how fast this particular WU would run on the GPU for comparison.
2. The PPD per core is also low - less than half what I would expect per core for a COVID-19 unit running on the same 0xA7 FahCore.
3. The above unit is allowing 20 days to complete per the previous post. Does this seem high given that the actual ETA was about 4 or 5 hours?

Re: Unusually low PPD on project 16927?

Posted: Sat Feb 13, 2021 7:47 pm
by bruce
Yes, WUs from p169xx are highly variable but that doesn't seem to be your real problem.

I'm not getting p16927 so I have to base my comments on the logs you have posted.

Where are you looking to see "the advertised number of cores"?

Are you adjusting the number of assigned threads manually? If so, when do you do so? Look back through your logs and determine how many cores were configured for the slot at the time FAHClient downloaded that WU.

A slot that's configured for 4 CPUs will download WUs that cannot use more than 4 threads. For any slot that's going to download a new WU soon, increase the number allocated to some realistic number that might actually be available.

Re: Unusually low PPD on project 16927?

Posted: Sat Feb 13, 2021 8:14 pm
by Joe_H
All I can comment on is that the PPD that I have been getting on WUs from this project has been within the normal range for the systems I have. YMMV, but this applies to all projects depending on hardware and other configuration differences.