
Re: Proj 13420 same variability as 13418

Posted: Sun Aug 09, 2020 6:41 am
by themartymonster
Um, as I said, it is running on the GPU as that is the only slot I set up.
I just mentioned the CPU as background.
The point was that these WUs are running 10-20 times longer than other WUs that previously ran on the GPU, for virtually no credit given the time they take.
If a WU that takes 2 hours earns 250,000 credits, why does a WU that runs for a day earn only 60,000 credits?
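
To put rough numbers on that, here is the points-per-day arithmetic for the two cases above, as a back-of-the-envelope sketch in Python (the runtimes and credit figures are simply the ones quoted in this post):

Code:

# Back-of-the-envelope PPD (points per day) comparison for the two
# cases mentioned above; the figures are the ones quoted in this post.
fast_wu_hours, fast_wu_credit = 2.0, 250_000   # 2-hour WU earning 250,000 credits
slow_wu_hours, slow_wu_credit = 24.0, 60_000   # 1-day WU earning 60,000 credits

def ppd(credit, hours):
    """Points per day if the GPU ran WUs like this back to back."""
    return credit * 24.0 / hours

print(f"fast WU: {ppd(fast_wu_credit, fast_wu_hours):,.0f} PPD")  # ~3,000,000 PPD
print(f"slow WU: {ppd(slow_wu_credit, slow_wu_hours):,.0f} PPD")  # ~60,000 PPD, roughly 50x lower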

Re: Proj 13420 same variability as 13418

Posted: Mon Aug 10, 2020 8:06 pm
by HaloJones
themartymonster wrote: Um, as I said, it is running on the GPU as that is the only slot I set up.
I just mentioned the CPU as background.
The point was that these WUs are running 10-20 times longer than other WUs that previously ran on the GPU, for virtually no credit given the time they take.
If a WU that takes 2 hours earns 250,000 credits, why does a WU that runs for a day earn only 60,000 credits?
This is not the majority experience. What's the TPF on the current 13420 and what PRCG is it?

For comparison,

13420 (1564, 44, 2), 2:36
13420 (1531, 23, 2), 2:06
13420 (1526, 32, 2), 2:17
13420 (1718, 25, 2), 2:26
13420 (1686, 71, 2), 2:17
13420 (1501, 26, 2), 2:13

All but the second one are GTX 1070s; the second is a Titan X. All of them should be significantly slower than a properly performing RTX 2070.

Please check a few things. What is the current clock speed? How hot is the card? What is the TPF of this specific unit?

Worst case, the card is dying. Best case, it's simply overheating.

Either way, there is no way bad luck with 13420 units alone explains performance that poor.
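
If it helps, here is one quick way to pull those numbers on an NVIDIA card: a minimal sketch that assumes nvidia-smi (shipped with the NVIDIA driver) is on the PATH. GPU-Z, or nvidia-smi run directly in a terminal, will show the same things.

Code:

# Minimal sketch: ask nvidia-smi for the clock speed, temperature,
# power draw and utilization requested above.
# Assumes an NVIDIA GPU with the driver's nvidia-smi tool on the PATH.
import subprocess

query = "clocks.sm,clocks.mem,temperature.gpu,power.draw,utilization.gpu"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# One line per GPU, e.g. "1875 MHz, 7000 MHz, 64, 180.50 W, 98 %"
for idx, line in enumerate(out.splitlines()):
    print(f"GPU {idx}: {line}")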

Re: Proj 13420 same variability as 13418

Posted: Mon Aug 10, 2020 11:02 pm
by themartymonster
HaloJones wrote: This is not the majority experience. What's the TPF on the current 13420 and what PRCG is it?

For comparison,

13420 (1564, 44, 2), 2:36
13420 (1531, 23, 2), 2:06
13420 (1526, 32, 2), 2:17
13420 (1718, 25, 2), 2:26
13420 (1686, 71, 2), 2:17
13420 (1501, 26, 2), 2:13

All but the second one are GTX 1070s; the second is a Titan X. All of them should be significantly slower than a properly performing RTX 2070.

Please check a few things. What is the current clock speed? How hot is the card? What is the TPF of this specific unit?


======================================

UPDATE

Okay, found the problem.
The GPU is stuck at 300 MHz.
When I turn on the PC, the GPU fans spin up for a few seconds and then stop spinning.
Then the GPU throttles itself to 300 MHz.
Now to see whether it is the power supply or the GPU that is causing it.
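
One way to narrow that down is to ask the driver why it is throttling; a rough sketch, assuming nvidia-smi is on the PATH. If a hardware slowdown or power-related reason reads "Active", that points at power delivery or thermal protection rather than at the folding client.

Code:

# Rough sketch: ask the driver why the card is stuck at 300 MHz.
# "nvidia-smi -q -d PERFORMANCE" prints a "Clocks Throttle Reasons"
# section; this just filters that report down to the relevant lines.
# Assumes nvidia-smi is on the PATH.
import subprocess

report = subprocess.run(
    ["nvidia-smi", "-q", "-d", "PERFORMANCE"],
    capture_output=True, text=True, check=True,
).stdout

for line in report.splitlines():
    if "Throttle" in line or "Slowdown" in line or "Power" in line:
        print(line.rstrip())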

And yes, I took out the two extra RAM sticks and it made no difference.

Just got up and checked the PC.
It is now running WU 16918 (12, 49, 16).
It says 3 hours ETA and an estimated 183,291 points.
So for now, it seems to be back to normal.

If I get another slow WU I will post the info here.

If I remember correctly, the estimated TPF was approximately 2 hours.

Re: Proj 13420 same variability as 13418

Posted: Tue Aug 11, 2020 6:31 pm
by mgetz
GPU TU104 RTX2080, Windows 10 1909

Project: 13420 (Run 7437, Clone 34, Gen 1), ~2:15 per frame, but very sensitive to other GPU usage. Even keeping graphs of power usage open can add 15 s per frame.
In contrast, the prior run was much more consistent on this GPU, as were earlier runs from this project:
Project: 13420 (Run 8019, Clone 18, Gen 1), ~1:27 per frame, with much lower variance from other activity.

This project has generally been consistent, with a few outliers, on both of my TU104 GPUs (the 2080 and a 2070 Super on Linux).
For reference, the Linux GPU is currently on:
Project: 13420 (Run 6977, Clone 82, Gen 1), ~1:24 per frame

Slow WU: 13420 (7207, 66, 1)

Posted: Tue Aug 11, 2020 7:23 pm
by The_Gecko
This WU isn't throwing an error, but something clearly isn't correct. It's utilizing 50-60% of my GTX 1080 Ti. TPF is almost 4 minutes, and PPD is down to 533K. Something is not optimized. Other WUs processed in the recent past have given more reasonable numbers where TPF is 90 to 120 seconds and PPD is 1.0 to 1.2 million.

Based upon the numbers that GPU-Z and Task Manager are telling me, I have to wonder if the GPU isn't being fed fast enough. Perhaps a hardware or software cache isn't large enough to meet the execution requirements of this particular WU, causing the system to constantly fetch uncached data, which in turn slows down the rate at which you can feed the GPU. I don't see any indication of bad power, overheating CPUs, other CPU-power-sucking processes, other competing GPU processes, or disk thrashing which could cause folding to slow down. To me, it looks like the GPU simply isn't being fed fast enough by this particular WU.
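
One way to test that hunch is to sample GPU utilization for a few minutes and see whether it dips while this WU runs; a rough sketch, again assuming nvidia-smi is available (GPU-Z's sensor log shows the same data). The GPU index, sample count and interval are arbitrary choices.

Code:

# Rough sketch: sample GPU utilization for a while and summarize it,
# to see whether the core is being starved intermittently.
# Assumes nvidia-smi is on the PATH; GPU index 0 is assumed to be the
# 1080 Ti here - adjust if your cards are listed in a different order.
import subprocess, time, statistics

SAMPLES, INTERVAL_S = 60, 5   # roughly five minutes of data

readings = []
for _ in range(SAMPLES):
    out = subprocess.run(
        ["nvidia-smi", "-i", "0", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    readings.append(int(out.strip()))
    time.sleep(INTERVAL_S)

print(f"min {min(readings)} %  avg {statistics.mean(readings):.0f} %  max {max(readings)} %")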

My 9-to-5 job for the past 20 years has been in computer infrastructure, so that is where my brain goes. When you're a hammer, everything you see is a nail. :) If the FAH admins think this system is a good corner case for testing experimental WUs, I'm open to that.

Code:

16:53:21:WU00:FS02:0x22:*********************** Log Started 2020-08-11T16:53:20Z ***********************
16:53:21:WU00:FS02:0x22:*************************** Core22 Folding@home Core ***************************
16:53:21:WU00:FS02:0x22:       Core: Core22
16:53:21:WU00:FS02:0x22:       Type: 0x22
16:53:21:WU00:FS02:0x22:    Version: 0.0.11
16:53:21:WU00:FS02:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
16:53:21:WU00:FS02:0x22:  Copyright: 2020 foldingathome.org
16:53:21:WU00:FS02:0x22:   Homepage: https://foldingathome.org/
16:53:21:WU00:FS02:0x22:       Date: Jun 26 2020
16:53:21:WU00:FS02:0x22:       Time: 19:49:16
16:53:21:WU00:FS02:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
16:53:21:WU00:FS02:0x22:     Branch: core22-0.0.11
16:53:21:WU00:FS02:0x22:   Compiler: Visual C++ 2015
16:53:21:WU00:FS02:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
16:53:21:WU00:FS02:0x22:   Platform: win32 10
16:53:21:WU00:FS02:0x22:       Bits: 64
16:53:21:WU00:FS02:0x22:       Mode: Release
16:53:21:WU00:FS02:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
16:53:21:WU00:FS02:0x22:             <peastman@stanford.edu>
16:53:21:WU00:FS02:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 22760 -checkpoint 30
16:53:21:WU00:FS02:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
16:53:21:WU00:FS02:0x22:             0 -gpu 0
16:53:21:WU00:FS02:0x22:************************************ libFAH ************************************
16:53:21:WU00:FS02:0x22:       Date: Jun 26 2020
16:53:21:WU00:FS02:0x22:       Time: 19:47:12
16:53:21:WU00:FS02:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
16:53:21:WU00:FS02:0x22:     Branch: HEAD
16:53:21:WU00:FS02:0x22:   Compiler: Visual C++ 2015
16:53:21:WU00:FS02:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
16:53:21:WU00:FS02:0x22:   Platform: win32 10
16:53:21:WU00:FS02:0x22:       Bits: 64
16:53:21:WU00:FS02:0x22:       Mode: Release
16:53:21:WU00:FS02:0x22:************************************ CBang *************************************
16:53:21:WU00:FS02:0x22:       Date: Jun 26 2020
16:53:21:WU00:FS02:0x22:       Time: 19:46:11
16:53:21:WU00:FS02:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
16:53:21:WU00:FS02:0x22:     Branch: master
16:53:21:WU00:FS02:0x22:   Compiler: Visual C++ 2015
16:53:21:WU00:FS02:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
16:53:21:WU00:FS02:0x22:   Platform: win32 10
16:53:21:WU00:FS02:0x22:       Bits: 64
16:53:21:WU00:FS02:0x22:       Mode: Release
16:53:21:WU00:FS02:0x22:************************************ System ************************************
16:53:21:WU00:FS02:0x22:        CPU: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
16:53:21:WU00:FS02:0x22:     CPU ID: GenuineIntel Family 6 Model 44 Stepping 2
16:53:21:WU00:FS02:0x22:       CPUs: 24
16:53:21:WU00:FS02:0x22:     Memory: 95.99GiB
16:53:21:WU00:FS02:0x22:Free Memory: 83.59GiB
16:53:21:WU00:FS02:0x22:    Threads: WINDOWS_THREADS
16:53:21:WU00:FS02:0x22: OS Version: 6.2
16:53:21:WU00:FS02:0x22:Has Battery: false
16:53:21:WU00:FS02:0x22: On Battery: false
16:53:21:WU00:FS02:0x22: UTC Offset: -4
16:53:21:WU00:FS02:0x22:        PID: 22324
16:53:21:WU00:FS02:0x22:        CWD: C:\Users\Greg\AppData\Roaming\FAHClient\work
16:53:21:WU00:FS02:0x22:********************************************************************************
16:53:21:WU00:FS02:0x22:Project: 13420 (Run 7207, Clone 66, Gen 1)
16:53:21:WU00:FS02:0x22:Unit: 0x0000000212bc7d9a5f2249a5b6fa681e
16:53:21:WU00:FS02:0x22:Reading tar file core.xml
16:53:21:WU00:FS02:0x22:Reading tar file integrator.xml
16:53:21:WU00:FS02:0x22:Reading tar file state.xml.bz2
16:53:21:WU00:FS02:0x22:Reading tar file system.xml.bz2
16:53:21:WU00:FS02:0x22:Digital signatures verified
16:53:21:WU00:FS02:0x22:Folding@home GPU Core22 Folding@home Core
16:53:21:WU00:FS02:0x22:Version 0.0.11
16:53:21:WU00:FS02:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
16:53:21:WU00:FS02:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
16:53:21:WU00:FS02:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
16:53:21:WU00:FS02:0x22:  Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]

Some of you might notice that this system also contains a GTX 1060, which is also folding but not being fully utilized. That is because it is thermally restricted: the 1060 sits physically behind the 1080 Ti, so its air intake is obstructed, and it throttles back when it hits its 70 C thermal limit. This post concerns the 1080 Ti, which is not thermally restricted.
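
For reference, TPF can also be measured straight from the client log rather than read off the ETA display; a rough sketch, assuming the usual "Completed N out of M steps" progress lines that core22 writes as the WU runs. The log path below is a placeholder for wherever your FAHClient data directory lives.

Code:

# Rough sketch: compute time per frame (time per 1% of steps) from a
# FAHClient log using the "Completed N out of M steps" progress lines.
# The path below is a placeholder - point it at your own log file.
# (If several slots are folding, filter on the WU/FS prefix as well.)
import re
from datetime import datetime, timedelta

LOG = r"C:\path\to\FAHClient\log.txt"   # placeholder path

pat = re.compile(r"^(\d\d:\d\d:\d\d):.*:Completed (\d+) out of (\d+) steps")

points = []
with open(LOG, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = pat.match(line)
        if m:
            t = datetime.strptime(m.group(1), "%H:%M:%S")
            pct = 100.0 * int(m.group(2)) / int(m.group(3))
            points.append((t, pct))

# Average seconds per 1% between consecutive progress lines.
tpfs = []
for (t0, p0), (t1, p1) in zip(points, points[1:]):
    dt = (t1 - t0).total_seconds()
    if dt < 0:                    # timestamp rolled over midnight
        dt += 24 * 3600
    if p1 > p0:
        tpfs.append(dt / (p1 - p0))

if tpfs:
    avg = sum(tpfs) / len(tpfs)
    print(f"average TPF: {timedelta(seconds=round(avg))}")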

Re: Proj 13420 same variability as 13418

Posted: Tue Aug 11, 2020 10:40 pm
by PantherX
Welcome to the F@H Forum, The_Gecko,

Please note that the 134XX project series is highly experimental, and there have been cases where the TPF of WUs varies dramatically. There have been quite a few iterations of this project, and each one has allowed the researchers to learn and optimize. While the performance of this experimental project has improved, there is still more that can be done.

I too have a GTX 1080 Ti and have seen TPF vary. This is what HFM.NET (an application one can use to maintain a historical record of WUs) reports:
Min. Time / Frame : 00:01:52 - 1,587,612.65 PPD
Avg. Time / Frame : 00:02:22 - 1,112,086.80 PPD
Cur. Time / Frame : 00:02:56 - 850,120.92 PPD
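
For anyone wondering why PPD swings so much with TPF: as I understand it, the quick-return bonus scales the final credit with the square root of how quickly the WU comes back, so PPD falls off faster than linearly as TPF grows. Below is a rough sketch of that relationship; the base credit, k-factor and timeout are made-up illustrative values, not project 13420's real parameters, so only the shape of the curve matters, not the absolute numbers.

Code:

# Rough sketch of the F@H quick-return bonus (QRB) as commonly documented:
#   final_credit = base_credit * max(1, sqrt(k * timeout_days / elapsed_days))
# BASE_CREDIT, K_FACTOR and TIMEOUT_D are ILLUSTRATIVE values only, not the
# real parameters of project 13420; the point is how PPD scales with TPF.
import math

BASE_CREDIT = 30_000   # hypothetical base points for a WU
K_FACTOR    = 0.75     # hypothetical project k-factor
TIMEOUT_D   = 2.0      # hypothetical timeout, in days
FRAMES      = 100      # a WU is 100 frames of 1% each

def ppd(tpf_seconds):
    """Estimated points per day for a given time per frame, in seconds."""
    elapsed_days = tpf_seconds * FRAMES / 86_400
    credit = BASE_CREDIT * max(1.0, math.sqrt(K_FACTOR * TIMEOUT_D / elapsed_days))
    return credit / elapsed_days

for tpf in (112, 142, 176):   # the min/avg/cur TPFs above, in seconds
    print(f"TPF {tpf // 60}:{tpf % 60:02d} -> ~{ppd(tpf):,.0f} PPD")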

Re: Proj 13420 same variability as 13418

Posted: Wed Aug 12, 2020 9:05 pm
by uyaem
13420 (R7275, C51, G2): 85% GPU usage on a GTX 1660 Super, TPF 4:03 instead of 2:48.