This morning I noticed an unusually poor ETA WU and researched it without any luck. Details follow:
First the system:
- Mobo: Asus P8Z77V-Pro
CPU: i5-2300
Video:- 2 x EVGA GeForce GTX 970 04G-P4-3975-KR 4GB SSC GAMING w/ACX 2.0+ (factory overclocked 13% to 1190MHz)
1 x PNY GTX 660 ti XLR8 (stock clocks)
OS: Ubuntu 14.04.3 LTS
Drivers: NVidia 355.11
Other Software: just enough to set the system up for folding: Updates, etc.
I've been folding with this rig about 2 weeks now, (1 week since I added the 660ti)
without any problems, and have had pretty consistent results. - 2 x EVGA GeForce GTX 970 04G-P4-3975-KR 4GB SSC GAMING w/ACX 2.0+ (factory overclocked 13% to 1190MHz)
Code: Select all
00:51:50:WU03:FS01:0x21:Completed 576000 out of 640000 steps (90%)
00:53:57:WU03:FS01:0x21:Completed 582400 out of 640000 steps (91%)
00:56:04:WU03:FS01:0x21:Completed 588800 out of 640000 steps (92%)
00:58:11:WU03:FS01:0x21:Completed 595200 out of 640000 steps (93%)
01:00:18:WU03:FS01:0x21:Completed 601600 out of 640000 steps (94%)
01:02:25:WU03:FS01:0x21:Completed 608000 out of 640000 steps (95%)
01:04:32:WU03:FS01:0x21:Completed 614400 out of 640000 steps (96%)
01:06:39:WU03:FS01:0x21:Completed 620800 out of 640000 steps (97%)
01:08:47:WU03:FS01:0x21:Completed 627200 out of 640000 steps (98%)
01:10:54:WU03:FS01:0x21:Completed 633600 out of 640000 steps (99%)
01:10:55:WU02:FS01:Connecting to 171.67.108.45:80
01:10:55:WU02:FS01:Assigned to work server 171.64.65.98
01:10:55:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GM204 [GeForce GTX 970] from 171.64.65.98
01:10:55:WU02:FS01:Connecting to 171.64.65.98:8080
01:10:56:WU02:FS01:Downloading 15.72MiB
01:11:02:WU02:FS01:Download 62.42%
01:11:05:WU02:FS01:Download complete
01:11:05:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:9704 run:42 clone:7 gen:105 core:0x21 unit:0x000000a5ab404162553ec24437ab4ea2
01:13:01:WU03:FS01:0x21:Completed 640000 out of 640000 steps (100%)
01:13:23:WU03:FS01:0x21:Saving result file logfile_01.txt
01:13:23:WU03:FS01:0x21:Saving result file checkpointState.xml
01:13:27:WU03:FS01:0x21:Saving result file checkpt.crc
01:13:27:WU03:FS01:0x21:Saving result file log.txt
01:13:27:WU03:FS01:0x21:Saving result file positions.xtc
01:13:28:WU03:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
01:13:29:WU03:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
01:13:29:WU03:FS01:Sending unit results: id:03 state:SEND error:NO_ERROR project:9704 run:42 clone:13 gen:98 core:0x21 unit:0x0000008eab404162553ec2447610fe95
01:13:29:WU03:FS01:Uploading 15.01MiB to 171.64.65.98
01:13:29:WU03:FS01:Connecting to 171.64.65.98:8080
01:13:29:WU02:FS01:Starting
01:13:29:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1202 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
01:13:29:WU02:FS01:Started FahCore on PID 11683
01:13:29:WU02:FS01:Core PID:11687
01:13:29:WU02:FS01:FahCore 0x21 started
01:13:29:WU02:FS01:0x21:*********************** Log Started 2015-11-18T01:13:29Z ***********************
01:13:29:WU02:FS01:0x21:Project: 9704 (Run 42, Clone 7, Gen 105)
01:13:29:WU02:FS01:0x21:Unit: 0x000000a5ab404162553ec24437ab4ea2
01:13:29:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
01:13:29:WU02:FS01:0x21:Machine: 1
01:13:29:WU02:FS01:0x21:Reading tar file core.xml
01:13:29:WU02:FS01:0x21:Reading tar file system.xml
01:13:31:WU02:FS01:0x21:Reading tar file integrator.xml
01:13:31:WU02:FS01:0x21:Reading tar file state.xml
01:13:33:WU02:FS01:0x21:Digital signatures verified
01:13:34:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
01:13:34:WU02:FS01:0x21:Version 0.0.12
01:13:35:WU03:FS01:Upload 10.83%
01:13:41:WU03:FS01:Upload 20.40%
01:13:47:WU03:FS01:Upload 28.31%
01:13:53:WU03:FS01:Upload 37.89%
01:13:59:WU03:FS01:Upload 47.05%
01:14:05:WU03:FS01:Upload 56.21%
01:14:11:WU03:FS01:Upload 65.37%
01:14:17:WU03:FS01:Upload 75.36%
01:14:23:WU03:FS01:Upload 83.69%
01:14:29:WU03:FS01:Upload 92.43%
01:15:07:WU03:FS01:Upload complete
01:15:07:WU03:FS01:Server responded WORK_ACK (400)
01:15:07:WU03:FS01:Final credit estimate, 44494.00 points
01:15:08:WU03:FS01:Cleaning up
01:15:32:WU02:FS01:0x21:Completed 0 out of 640000 steps (0%)
01:15:33:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
01:17:57:WU02:FS01:0x21:Completed 6400 out of 640000 steps (1%)
01:20:04:WU02:FS01:0x21:Completed 12800 out of 640000 steps (2%)
01:22:11:WU02:FS01:0x21:Completed 19200 out of 640000 steps (3%)
******************************* Date: 2015-11-18 *******************************
01:24:18:WU02:FS01:0x21:Completed 25600 out of 640000 steps (4%)
01:26:25:WU02:FS01:0x21:Completed 32000 out of 640000 steps (5%)
01:28:32:WU02:FS01:0x21:Completed 38400 out of 640000 steps (6%)
01:30:38:WU02:FS01:0x21:Completed 44800 out of 640000 steps (7%)
01:32:45:WU02:FS01:0x21:Completed 51200 out of 640000 steps (8%)
01:34:52:WU02:FS01:0x21:Completed 57600 out of 640000 steps (9%)
01:36:59:WU02:FS01:0x21:Completed 64000 out of 640000 steps (10%)
01:39:06:WU02:FS01:0x21:Completed 70400 out of 640000 steps (11%)
01:41:12:WU02:FS01:0x21:Completed 76800 out of 640000 steps (12%)
01:43:40:WU02:FS01:0x21:Completed 83200 out of 640000 steps (13%)
01:45:47:WU02:FS01:0x21:Completed 89600 out of 640000 steps (14%)
01:47:54:WU02:FS01:0x21:Completed 96000 out of 640000 steps (15%)
01:50:00:WU02:FS01:0x21:Completed 102400 out of 640000 steps (16%)
01:52:07:WU02:FS01:0x21:Completed 108800 out of 640000 steps (17%)
01:54:14:WU02:FS01:0x21:Completed 115200 out of 640000 steps (18%)
01:56:21:WU02:FS01:0x21:Completed 121600 out of 640000 steps (19%)
01:58:28:WU02:FS01:0x21:Completed 128000 out of 640000 steps (20%)
02:00:34:WU02:FS01:0x21:Completed 134400 out of 640000 steps (21%)
02:02:42:WU02:FS01:0x21:Completed 140800 out of 640000 steps (22%)
02:04:49:WU02:FS01:0x21:Completed 147200 out of 640000 steps (23%)
02:06:56:WU02:FS01:0x21:Completed 153600 out of 640000 steps (24%)
02:09:03:WU02:FS01:0x21:Completed 160000 out of 640000 steps (25%)
02:11:30:WU02:FS01:0x21:Completed 166400 out of 640000 steps (26%)
02:13:37:WU02:FS01:0x21:Completed 172800 out of 640000 steps (27%)
02:15:44:WU02:FS01:0x21:Completed 179200 out of 640000 steps (28%)
02:17:51:WU02:FS01:0x21:Completed 185600 out of 640000 steps (29%)
02:19:58:WU02:FS01:0x21:Completed 192000 out of 640000 steps (30%)
02:22:05:WU02:FS01:0x21:Completed 198400 out of 640000 steps (31%)
02:24:12:WU02:FS01:0x21:Completed 204800 out of 640000 steps (32%)
02:26:19:WU02:FS01:0x21:Completed 211200 out of 640000 steps (33%)
02:28:26:WU02:FS01:0x21:Completed 217600 out of 640000 steps (34%)
02:30:33:WU02:FS01:0x21:Completed 224000 out of 640000 steps (35%)
02:32:40:WU02:FS01:0x21:Completed 230400 out of 640000 steps (36%)
02:34:47:WU02:FS01:0x21:Completed 236800 out of 640000 steps (37%) <-- TPF up to this point, about 2:07 m:s
02:37:15:WU02:FS01:0x21:Completed 243200 out of 640000 steps (38%) <-- beginning of the TPF lag
04:09:01:WU02:FS01:0x21:Completed 249600 out of 640000 steps (39%) <-- TPF now dropped to about 1hr 56 min
06:05:13:WU02:FS01:0x21:Completed 256000 out of 640000 steps (40%)
******************************* Date: 2015-11-18 *******************************
08:01:06:WU02:FS01:0x21:Completed 262400 out of 640000 steps (41%)
09:57:18:WU02:FS01:0x21:Completed 268800 out of 640000 steps (42%)
11:53:53:WU02:FS01:0x21:Completed 275200 out of 640000 steps (43%)
******************************* Date: 2015-11-18 *******************************
13:49:53:WU02:FS01:0x21:Completed 281600 out of 640000 steps (44%)
15:45:52:WU02:FS01:0x21:Completed 288000 out of 640000 steps (45%) <-- folding interrupted on this slot only
16:44:35:FS01:Shutting core down
16:44:35:WU02:FS01:0x21:Caught signal SIGINT(2) on PID 11687
16:44:35:WU02:FS01:0x21:Exiting, please wait. . .
16:44:50:WU02:FS01:0x21:Folding@home Core Shutdown: INTERRUPTED
16:45:09:FS01:Paused
16:45:19:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
17:02:10:FS01:Unpaused
17:02:33:WU02:FS01:Starting
17:02:33:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1202 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
17:02:33:WU02:FS01:Started FahCore on PID 23489
17:02:33:WU02:FS01:Core PID:23493
17:02:33:WU02:FS01:FahCore 0x21 started
17:02:33:WU02:FS01:0x21:*********************** Log Started 2015-11-18T17:02:33Z ***********************
17:02:33:WU02:FS01:0x21:Project: 9704 (Run 42, Clone 7, Gen 105)
17:02:33:WU02:FS01:0x21:Unit: 0x000000a5ab404162553ec24437ab4ea2
17:02:33:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
17:02:33:WU02:FS01:0x21:Machine: 1
17:02:33:WU02:FS01:0x21:Digital signatures verified
17:02:33:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
17:02:33:WU02:FS01:0x21:Version 0.0.12
17:02:33:WU02:FS01:0x21: Found a checkpoint file
17:06:06:WU02:FS01:0x21:Completed 240000 out of 640000 steps (37%)
17:06:06:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
17:07:27:WU02:FS01:0x21:Completed 243200 out of 640000 steps (38%)
17:09:35:WU02:FS01:0x21:Completed 249600 out of 640000 steps (39%)
17:11:42:WU02:FS01:0x21:Completed 256000 out of 640000 steps (40%)
17:13:49:WU02:FS01:0x21:Completed 262400 out of 640000 steps (41%)
17:15:57:WU02:FS01:0x21:Completed 268800 out of 640000 steps (42%)
17:18:04:WU02:FS01:0x21:Completed 275200 out of 640000 steps (43%)
17:20:11:WU02:FS01:0x21:Completed 281600 out of 640000 steps (44%)
17:22:18:WU02:FS01:0x21:Completed 288000 out of 640000 steps (45%)
17:24:26:WU02:FS01:0x21:Completed 294400 out of 640000 steps (46%)
The problem was entirely on Slot 1, a GTX 970 (I have an FAHControl Pic if needed, basically it shows
Slot 01 (a GTX 970), Work ID 02 at 45.54% with a TPF of 1 hour 56 minutes and an ETA of 4.39 days)
As you can see from the log, it was chugging along at about a TPF of 2 mins 07 secs,
Then at 38% (02:37:15 AM), it dropped precipitously to almost 2 hours TPF,
I just happened to check the progress of the WUs that morning and saw the problem.
At this point, I moved the slot to idle-only operation, and checked online to research WU issues.
After finding no specific information on this WU, I started it back up it, and after the client
backed up to a previous checkpoint, it started working normally again with the TPF back to 2:07.
I am happy that the problem was resolved, but would like to understand the cause in
order to determine how much babysitting these folding systems require before I
start installing them on my company servers.
I would also recommend a feature that, when a WU TPF drops off precipitously, the
folding client can alert someone (text, email, whatever works) or even automatically
back up to the last good spot and attempt a recovery, and failing that, dump the WU
entirely and start with a new one as well as notify the project admin of the problem WU.