
Core 21, 9704 (Run 42, Clone 7, Gen 105) Huge TPF dropoff @ 38%

Posted: Wed Nov 18, 2015 6:58 pm
by petem
I've been folding successfully for about two weeks (one week since I added another video card) without issues.
This morning I noticed a WU with an unusually poor ETA and researched it without any luck. Details follow:

First the system:
  • Mobo: Asus P8Z77V-Pro
  • CPU: i5-2300
  • Video:
    • 2 x EVGA GeForce GTX 970 04G-P4-3975-KR 4GB SSC GAMING w/ACX 2.0+ (factory overclocked 13% to 1190MHz)
    • 1 x PNY GTX 660 Ti XLR8 (stock clocks)
  • PSU: 750W platinum-rated (system pulls 580W max)
  • OS: Ubuntu 14.04.3 LTS
  • Drivers: NVIDIA 355.11
  • Other software: just enough to set the system up for folding (updates, etc.)
The Log (from the previous successful WU to the fault and restart/recovery):

Code:

00:51:50:WU03:FS01:0x21:Completed 576000 out of 640000 steps (90%)
00:53:57:WU03:FS01:0x21:Completed 582400 out of 640000 steps (91%)
00:56:04:WU03:FS01:0x21:Completed 588800 out of 640000 steps (92%)
00:58:11:WU03:FS01:0x21:Completed 595200 out of 640000 steps (93%)
01:00:18:WU03:FS01:0x21:Completed 601600 out of 640000 steps (94%)
01:02:25:WU03:FS01:0x21:Completed 608000 out of 640000 steps (95%)
01:04:32:WU03:FS01:0x21:Completed 614400 out of 640000 steps (96%)
01:06:39:WU03:FS01:0x21:Completed 620800 out of 640000 steps (97%)
01:08:47:WU03:FS01:0x21:Completed 627200 out of 640000 steps (98%)
01:10:54:WU03:FS01:0x21:Completed 633600 out of 640000 steps (99%)
01:10:55:WU02:FS01:Connecting to 171.67.108.45:80
01:10:55:WU02:FS01:Assigned to work server 171.64.65.98
01:10:55:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GM204 [GeForce GTX 970] from 171.64.65.98
01:10:55:WU02:FS01:Connecting to 171.64.65.98:8080
01:10:56:WU02:FS01:Downloading 15.72MiB
01:11:02:WU02:FS01:Download 62.42%
01:11:05:WU02:FS01:Download complete
01:11:05:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:9704 run:42 clone:7 gen:105 core:0x21 unit:0x000000a5ab404162553ec24437ab4ea2
01:13:01:WU03:FS01:0x21:Completed 640000 out of 640000 steps (100%)
01:13:23:WU03:FS01:0x21:Saving result file logfile_01.txt
01:13:23:WU03:FS01:0x21:Saving result file checkpointState.xml
01:13:27:WU03:FS01:0x21:Saving result file checkpt.crc
01:13:27:WU03:FS01:0x21:Saving result file log.txt
01:13:27:WU03:FS01:0x21:Saving result file positions.xtc
01:13:28:WU03:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
01:13:29:WU03:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
01:13:29:WU03:FS01:Sending unit results: id:03 state:SEND error:NO_ERROR project:9704 run:42 clone:13 gen:98 core:0x21 unit:0x0000008eab404162553ec2447610fe95
01:13:29:WU03:FS01:Uploading 15.01MiB to 171.64.65.98
01:13:29:WU03:FS01:Connecting to 171.64.65.98:8080
01:13:29:WU02:FS01:Starting
01:13:29:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1202 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
01:13:29:WU02:FS01:Started FahCore on PID 11683
01:13:29:WU02:FS01:Core PID:11687
01:13:29:WU02:FS01:FahCore 0x21 started
01:13:29:WU02:FS01:0x21:*********************** Log Started 2015-11-18T01:13:29Z ***********************
01:13:29:WU02:FS01:0x21:Project: 9704 (Run 42, Clone 7, Gen 105)
01:13:29:WU02:FS01:0x21:Unit: 0x000000a5ab404162553ec24437ab4ea2
01:13:29:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
01:13:29:WU02:FS01:0x21:Machine: 1
01:13:29:WU02:FS01:0x21:Reading tar file core.xml
01:13:29:WU02:FS01:0x21:Reading tar file system.xml
01:13:31:WU02:FS01:0x21:Reading tar file integrator.xml
01:13:31:WU02:FS01:0x21:Reading tar file state.xml
01:13:33:WU02:FS01:0x21:Digital signatures verified
01:13:34:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
01:13:34:WU02:FS01:0x21:Version 0.0.12
01:13:35:WU03:FS01:Upload 10.83%
01:13:41:WU03:FS01:Upload 20.40%
01:13:47:WU03:FS01:Upload 28.31%
01:13:53:WU03:FS01:Upload 37.89%
01:13:59:WU03:FS01:Upload 47.05%
01:14:05:WU03:FS01:Upload 56.21%
01:14:11:WU03:FS01:Upload 65.37%
01:14:17:WU03:FS01:Upload 75.36%
01:14:23:WU03:FS01:Upload 83.69%
01:14:29:WU03:FS01:Upload 92.43%
01:15:07:WU03:FS01:Upload complete
01:15:07:WU03:FS01:Server responded WORK_ACK (400)
01:15:07:WU03:FS01:Final credit estimate, 44494.00 points
01:15:08:WU03:FS01:Cleaning up
01:15:32:WU02:FS01:0x21:Completed 0 out of 640000 steps (0%)
01:15:33:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
01:17:57:WU02:FS01:0x21:Completed 6400 out of 640000 steps (1%)
01:20:04:WU02:FS01:0x21:Completed 12800 out of 640000 steps (2%)
01:22:11:WU02:FS01:0x21:Completed 19200 out of 640000 steps (3%)
******************************* Date: 2015-11-18 *******************************
01:24:18:WU02:FS01:0x21:Completed 25600 out of 640000 steps (4%)
01:26:25:WU02:FS01:0x21:Completed 32000 out of 640000 steps (5%)
01:28:32:WU02:FS01:0x21:Completed 38400 out of 640000 steps (6%)
01:30:38:WU02:FS01:0x21:Completed 44800 out of 640000 steps (7%)
01:32:45:WU02:FS01:0x21:Completed 51200 out of 640000 steps (8%)
01:34:52:WU02:FS01:0x21:Completed 57600 out of 640000 steps (9%)
01:36:59:WU02:FS01:0x21:Completed 64000 out of 640000 steps (10%)
01:39:06:WU02:FS01:0x21:Completed 70400 out of 640000 steps (11%)
01:41:12:WU02:FS01:0x21:Completed 76800 out of 640000 steps (12%)
01:43:40:WU02:FS01:0x21:Completed 83200 out of 640000 steps (13%)
01:45:47:WU02:FS01:0x21:Completed 89600 out of 640000 steps (14%)
01:47:54:WU02:FS01:0x21:Completed 96000 out of 640000 steps (15%)
01:50:00:WU02:FS01:0x21:Completed 102400 out of 640000 steps (16%)
01:52:07:WU02:FS01:0x21:Completed 108800 out of 640000 steps (17%)
01:54:14:WU02:FS01:0x21:Completed 115200 out of 640000 steps (18%)
01:56:21:WU02:FS01:0x21:Completed 121600 out of 640000 steps (19%)
01:58:28:WU02:FS01:0x21:Completed 128000 out of 640000 steps (20%)
02:00:34:WU02:FS01:0x21:Completed 134400 out of 640000 steps (21%)
02:02:42:WU02:FS01:0x21:Completed 140800 out of 640000 steps (22%)
02:04:49:WU02:FS01:0x21:Completed 147200 out of 640000 steps (23%)
02:06:56:WU02:FS01:0x21:Completed 153600 out of 640000 steps (24%)
02:09:03:WU02:FS01:0x21:Completed 160000 out of 640000 steps (25%)
02:11:30:WU02:FS01:0x21:Completed 166400 out of 640000 steps (26%)
02:13:37:WU02:FS01:0x21:Completed 172800 out of 640000 steps (27%)
02:15:44:WU02:FS01:0x21:Completed 179200 out of 640000 steps (28%)
02:17:51:WU02:FS01:0x21:Completed 185600 out of 640000 steps (29%)
02:19:58:WU02:FS01:0x21:Completed 192000 out of 640000 steps (30%)
02:22:05:WU02:FS01:0x21:Completed 198400 out of 640000 steps (31%)
02:24:12:WU02:FS01:0x21:Completed 204800 out of 640000 steps (32%)
02:26:19:WU02:FS01:0x21:Completed 211200 out of 640000 steps (33%)
02:28:26:WU02:FS01:0x21:Completed 217600 out of 640000 steps (34%)
02:30:33:WU02:FS01:0x21:Completed 224000 out of 640000 steps (35%)
02:32:40:WU02:FS01:0x21:Completed 230400 out of 640000 steps (36%)
02:34:47:WU02:FS01:0x21:Completed 236800 out of 640000 steps (37%)          <-- TPF up to this point, about 2:07 m:s
02:37:15:WU02:FS01:0x21:Completed 243200 out of 640000 steps (38%)          <-- beginning of the TPF lag
04:09:01:WU02:FS01:0x21:Completed 249600 out of 640000 steps (39%)          <-- TPF now dropped to about 1hr 56 min
06:05:13:WU02:FS01:0x21:Completed 256000 out of 640000 steps (40%)
******************************* Date: 2015-11-18 *******************************
08:01:06:WU02:FS01:0x21:Completed 262400 out of 640000 steps (41%)
09:57:18:WU02:FS01:0x21:Completed 268800 out of 640000 steps (42%)
11:53:53:WU02:FS01:0x21:Completed 275200 out of 640000 steps (43%)
******************************* Date: 2015-11-18 *******************************
13:49:53:WU02:FS01:0x21:Completed 281600 out of 640000 steps (44%)
15:45:52:WU02:FS01:0x21:Completed 288000 out of 640000 steps (45%)          <-- folding interrupted on this slot only
16:44:35:FS01:Shutting core down
16:44:35:WU02:FS01:0x21:Caught signal SIGINT(2) on PID 11687
16:44:35:WU02:FS01:0x21:Exiting, please wait. . .
16:44:50:WU02:FS01:0x21:Folding@home Core Shutdown: INTERRUPTED
16:45:09:FS01:Paused
16:45:19:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
17:02:10:FS01:Unpaused
17:02:33:WU02:FS01:Starting
17:02:33:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1202 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
17:02:33:WU02:FS01:Started FahCore on PID 23489
17:02:33:WU02:FS01:Core PID:23493
17:02:33:WU02:FS01:FahCore 0x21 started
17:02:33:WU02:FS01:0x21:*********************** Log Started 2015-11-18T17:02:33Z ***********************
17:02:33:WU02:FS01:0x21:Project: 9704 (Run 42, Clone 7, Gen 105)
17:02:33:WU02:FS01:0x21:Unit: 0x000000a5ab404162553ec24437ab4ea2
17:02:33:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
17:02:33:WU02:FS01:0x21:Machine: 1
17:02:33:WU02:FS01:0x21:Digital signatures verified
17:02:33:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
17:02:33:WU02:FS01:0x21:Version 0.0.12
17:02:33:WU02:FS01:0x21:  Found a checkpoint file
17:06:06:WU02:FS01:0x21:Completed 240000 out of 640000 steps (37%)
17:06:06:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
17:07:27:WU02:FS01:0x21:Completed 243200 out of 640000 steps (38%)
17:09:35:WU02:FS01:0x21:Completed 249600 out of 640000 steps (39%)
17:11:42:WU02:FS01:0x21:Completed 256000 out of 640000 steps (40%)
17:13:49:WU02:FS01:0x21:Completed 262400 out of 640000 steps (41%)
17:15:57:WU02:FS01:0x21:Completed 268800 out of 640000 steps (42%)
17:18:04:WU02:FS01:0x21:Completed 275200 out of 640000 steps (43%)
17:20:11:WU02:FS01:0x21:Completed 281600 out of 640000 steps (44%)
17:22:18:WU02:FS01:0x21:Completed 288000 out of 640000 steps (45%)
17:24:26:WU02:FS01:0x21:Completed 294400 out of 640000 steps (46%)
Issue/Discussion:
The problem was entirely on Slot 01, a GTX 970. (I have an FAHControl screenshot if needed; basically it shows
Slot 01 at 45.54% on Work ID 02, with a TPF of 1 hour 56 minutes and an ETA of 4.39 days.)

As you can see from the log, it was chugging along at a TPF of about 2 min 07 s.
Then at 38% (02:37:15), the TPF ballooned to almost 2 hours.
I just happened to check the progress of the WUs that morning and saw the problem.

At this point, I moved the slot to idle-only operation and went online to research WU issues.
After finding no specific information on this WU, I started it back up, and after the client
backed up to a previous checkpoint, it started working normally again with the TPF back to 2:07.

I am happy that the problem was resolved, but would like to understand the cause in
order to determine how much babysitting these folding systems require before I
start installing them on my company servers.

I would also suggest a feature: when a WU's TPF drops off precipitously, the folding
client could alert someone (text, email, whatever works), or even automatically
back up to the last good checkpoint and attempt a recovery; failing that, it could dump
the WU entirely, start a new one, and notify the project admin of the problem WU.
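For anyone wanting a stopgap in the meantime, something like this could be scripted outside the client. A rough sketch of a watchdog that parses the "Completed N out of M steps" lines from the log above and flags a TPF spike (the 10x threshold and the idea of watching the log file are my assumptions, not anything the client provides):

```python
import re
from datetime import timedelta

# Matches the step-progress lines FAHClient writes for core 0x21.
STEP_RE = re.compile(
    r"^(\d\d):(\d\d):(\d\d):WU\d+:FS\d+:0x21:Completed (\d+) out of (\d+) steps"
)

def find_tpf_spikes(lines, factor=10.0):
    """Return (step, tpf) pairs where the time-per-frame jumped by more
    than `factor` over the previous frame.  Timestamps are HH:MM:SS
    wall-clock; a negative delta means the log rolled past midnight."""
    spikes, prev_time, prev_tpf = [], None, None
    for line in lines:
        m = STEP_RE.match(line)
        if not m:
            continue
        t = timedelta(hours=int(m[1]), minutes=int(m[2]), seconds=int(m[3]))
        step = int(m[4])
        if prev_time is not None:
            tpf = t - prev_time
            if tpf < timedelta(0):          # wrapped past midnight
                tpf += timedelta(hours=24)
            if prev_tpf and tpf > prev_tpf * factor:
                spikes.append((step, tpf))
            prev_tpf = tpf
        prev_time = t
    return spikes
```

Fed the four log lines around the 37%/38% boundary above, it flags exactly the frame where the TPF jumped from ~2:28 to 1:31:46. Hooking the result up to email/SMS (or an automatic pause/unpause) is left as the obvious next step.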

Re: Core 21, 9704 (Run 42, Clone 7, Gen 105) Huge TPF dropoff

Posted: Wed Nov 18, 2015 8:23 pm
by bollix47
Welcome to the folding@home support forum petem.

Nice first post!

What you saw is being called the 'slowdown bug' by some of us. The development team is aware of the problem, and a possible fix should be in the next version of the core, which is coming "soon". ;)

The procedure you followed (i.e. right-click on the slot, select Pause, and then restart it after the core has had time to shut down properly ... 30 seconds should do it) is what we've been suggesting when it happens. Unfortunately it's a relatively new bug, so there's no formal notification built into the client when it happens. Basically I've just been checking my PPD as shown in monitoring software when I can, and it usually doesn't take long for that figure to fall like a stone, so I've been lucky enough to catch the problem relatively quickly on those rare occasions that it happens.

Hopefully the new version of core_21 will solve the problem and the client won't need any further changes.

As for your company implementation, there is software available such as HFM.net with which all the computers on your network can be monitored, and Advanced Control (aka FAHControl) can be used to Pause and restart (Fold) any slot on those computers from one location. However, you may want to wait for the new version before rolling it out, rather than set up something that will require someone to check all the PPD figures every so often ... a choice only you can make.
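The pause/wait/restart procedure can also be done without touching FAHControl, since the v7 client exposes a remote command interface (the same one FAHControl uses) with `pause` and `unpause` commands. A minimal sketch, assuming the interface is enabled on the default port 36330 and your host is on the client's allow list; the 30-second wait is just the shutdown allowance suggested above:

```python
import socket
import time

def slot_commands(slot):
    """The command sequence we send: pause the slot, later unpause, then exit."""
    return [f"pause {slot}", f"unpause {slot}", "exit"]

def cycle_slot(slot="01", host="127.0.0.1", port=36330, wait=30):
    """Pause a slot, give the core time to shut down cleanly, then resume.
    Assumes FAHClient's remote command interface is listening on `port`
    (36330 is the v7 default) and this host is allowed to connect."""
    pause, unpause, quit_cmd = slot_commands(slot)
    with socket.create_connection((host, port), timeout=5) as s:
        s.recv(4096)                       # swallow the telnet banner
        s.sendall((pause + "\n").encode())
        time.sleep(wait)                   # ~30 s for a clean core shutdown
        s.sendall((unpause + "\n").encode())
        s.sendall((quit_cmd + "\n").encode())
```

Paired with a PPD or TPF check, this is roughly what an automated recovery for the slowdown bug would look like until the new core lands.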

Re: Core 21, 9704 (Run 42, Clone 7, Gen 105) Huge TPF dropoff

Posted: Wed Nov 18, 2015 9:56 pm
by bruce
petem wrote:At this point, I moved the slot to idle-only operation and went online to research WU issues.
After finding no specific information on this WU, I started it back up, and after the client
backed up to a previous checkpoint, it started working normally again with the TPF back to 2:07.
The science side of FAH wants every WU to be completed but they do not make redundant assignments unless it failed when assigned to the previous donor.

The "slow down" bug rarely results in an aborted WU so finding online information is unlikely -- though it's good that you checked. Eventually, the "slowdown bug" will force the client to restart the WU automatically, clearing the problem, although if you notice the problem and restart it manually, that's the best option.

I think the reason they're being assigned at all is because a fairly large percentage of these WUs do finish successfully (even if slowly).

Re: Core 21, 9704 (Run 42, Clone 7, Gen 105) Huge TPF dropoff

Posted: Wed Nov 18, 2015 10:03 pm
by bruce
FYI, the FAHClient sometimes mixes up multiple GPUs.
2 x GTX 970
1 x GTX 660 ti

The one that FAH reports to be a 660Ti might actually be one of the 970s. For the most part, both Kepler and Maxwell2 are eligible for the same class of assignments, but this might not always be the case. The only good way I've found to check them is to suspend them one-at-a-time and monitor the temperature. If you need to override FAH's choice, I can give you instructions.
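The suspend-one-at-a-time check can be scripted: nvidia-smi reports name and temperature by physical device index, independent of FAH's slot numbering, so pausing one slot and watching whose temperature falls identifies the card. A sketch (the query/format flags are standard nvidia-smi options; the sample device layout in the comments is just illustrative):

```python
import subprocess

def parse_temps(csv_text):
    """Parse `nvidia-smi --query-gpu=index,name,temperature.gpu
    --format=csv,noheader` output into (index, name, temp_C) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, name, temp = [field.strip() for field in line.split(",")]
        rows.append((int(idx), name, int(temp)))
    return rows

def read_temps():
    """Snapshot every GPU's temperature straight from the driver.
    Pause one FAH slot, wait a minute, call this again, and the card
    whose temperature dropped is the one that slot was driving."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,temperature.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    return parse_temps(out)
```

Comparing two snapshots (before and after pausing a slot) makes the mapping obvious without waiting on FAH's own labels.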

Re: Core 21, 9704 (Run 42, Clone 7, Gen 105) Huge TPF dropoff

Posted: Thu Nov 19, 2015 3:12 am
by petem
First off before I respond, thanks for the quick replies! :D
bruce wrote:
FYI, the FAHClient sometimes mixes up multiple GPUs.
2 x GTX 970
1 x GTX 660 ti

The one that FAH reports to be a 660Ti might actually be one of the 970s. For the most part, both Kepler and Maxwell2 are eligible for the same class of assignments, but this might not always be the case. The only good way I've found to check them is to suspend them one-at-a-time and monitor the temperature. If you need to override FAH's choice, I can give you instructions.
I built a dedicated Linux system to (hopefully) reduce overhead and improve stability over Windows, and in the process learned a lot more about the folding client (configuring, options, etc.) than I did from the spoon-fed Windows install (such as having to download and copy the GPUs.txt file to the proper location before FAHControl would recognize the cards).

As it turns out, I originally installed the two GTX 970s and logged their power usage before deciding I had enough PSU headroom (and slot bandwidth) left over to add the 660 Ti.

When I added the 660 Ti slot to FAHControl, it recognized it as such. But being the anal person that I am, I still ran power profiles for each slot separately, just to compare to the previous baseline. Both 970s were about the same as before, with the 660 Ti about 40W less. And, as expected, FAHControl showed roughly equal PPD for the two 970s (250-275k typical) and about 1/4 as much for the 660 Ti (60-70k), so it seems to be OK in that respect.

Hopefully I won't have to learn how to switch them manually. If that happened I'd probably just remove the errant slot and add it back in, since that seemed to work OK the first time. :D

- Pete