9634 (Run 0, Clone 9, Gen 5)

Moderators: Site Moderators, FAHC Science Team

Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III i7 970 4.3Ghz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

9634 (Run 0, Clone 9, Gen 5)

Post by Grandpa_01 »

I see these are in general circulation now (p9625-9643 to FAH), and I am still having the problem with stalling WUs. Below is the log from the one I am currently running. I can pause it and get it going again, but after a few percent it slows way down to 1+ hours per frame. Last week I changed the driver from 355.11 to 352.xx hoping to clear up the problem, but as you can see it did not work. The card is an EVGA GTX 970 SC running at stock clocks, and it has completed 26 other WUs this past week (roughly 60% core 18 and 40% core 21) with no problems, half of them with an OC of +80 MHz, so I know the card can run all other WUs even when overclocked. I will move the card off F@H for now and run it on something else until these WUs are all gone or fixed, since once the card gets stuck in its endless cycle it is just wasting time and electricity.

I do not know whether it is the card or the p9625-9643 series, but this card has yet to complete one of them without problems. Some of my cards have completed some of them cleanly, but every one of my GTX 9xx cards has had the same problem on at least one WU in this series, while completing others without issues.

P.S. Edit
One thing I noticed: the clocks bounce around while these WUs are running. They fluctuate between 1366 MHz, the stock clock the card is set to, and 1426 MHz, which is quite a bit of OC and which I am guessing is the boost state. All of the other WUs I have watched on this card hold steady at 1366 MHz. Perhaps that will help in figuring out what is going on here.


18:57:20:WU00:FS00:0x21:*********************** Log Started 2015-10-11T18:57:20Z ***********************
18:57:20:WU00:FS00:0x21:Project: 9634 (Run 0, Clone 9, Gen 5)
18:57:20:WU00:FS00:0x21:Unit: 0x00000008ab436c9b5609bee2d21f3fed
18:57:20:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
18:57:20:WU00:FS00:0x21:Machine: 0
18:57:20:WU00:FS00:0x21:Reading tar file core.xml
18:57:20:WU00:FS00:0x21:Reading tar file integrator.xml
18:57:20:WU00:FS00:0x21:Reading tar file state.xml
18:57:20:WU00:FS00:0x21:Reading tar file system.xml
18:57:20:WU00:FS00:0x21:Digital signatures verified
18:57:20:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
18:57:20:WU00:FS00:0x21:Version 0.0.11
18:57:25:WU01:FS00:Upload 29.89%
18:57:31:WU01:FS00:Upload 62.99%
18:57:59:WU00:FS00:0x21:Completed 0 out of 2000000 steps (0%)
18:57:59:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
18:58:36:WU01:FS00:Upload 90.58%
19:00:06:WU00:FS00:0x21:Completed 20000 out of 2000000 steps (1%)
19:02:04:WU00:FS00:0x21:Completed 40000 out of 2000000 steps (2%)
19:04:03:WU00:FS00:0x21:Completed 60000 out of 2000000 steps (3%)
19:06:02:WU00:FS00:0x21:Completed 80000 out of 2000000 steps (4%)
19:08:01:WU00:FS00:0x21:Completed 100000 out of 2000000 steps (5%)
19:10:10:WU00:FS00:0x21:Completed 120000 out of 2000000 steps (6%)
19:12:09:WU00:FS00:0x21:Completed 140000 out of 2000000 steps (7%)
19:14:07:WU00:FS00:0x21:Completed 160000 out of 2000000 steps (8%)
19:15:47:WU01:FS00:Upload 92.42%
19:15:47:WARNING:WU01:FS00:Exception: Failed to send results to work server: Transfer failed
19:15:47:WU01:FS00:Trying to send results to collection server
19:15:47:WU01:FS00:Uploading 13.59MiB to 171.65.103.160
19:15:47:WU01:FS00:Connecting to 171.65.103.160:8080
19:15:53:WU01:FS00:Upload 35.40%
19:15:59:WU01:FS00:Upload 65.75%
19:16:05:WU01:FS00:Upload 96.56%
19:16:06:WU00:FS00:0x21:Completed 180000 out of 2000000 steps (9%)
19:16:07:WU01:FS00:Upload complete
19:16:07:WU01:FS00:Server responded WORK_ACK (400)
19:16:07:WU01:FS00:Final credit estimate, 78406.00 points
19:16:07:WU01:FS00:Cleaning up
19:18:04:WU00:FS00:0x21:Completed 200000 out of 2000000 steps (10%)
19:20:13:WU00:FS00:0x21:Completed 220000 out of 2000000 steps (11%)
19:22:12:WU00:FS00:0x21:Completed 240000 out of 2000000 steps (12%)
19:24:10:WU00:FS00:0x21:Completed 260000 out of 2000000 steps (13%)
19:26:09:WU00:FS00:0x21:Completed 280000 out of 2000000 steps (14%)
19:28:07:WU00:FS00:0x21:Completed 300000 out of 2000000 steps (15%)
19:30:16:WU00:FS00:0x21:Completed 320000 out of 2000000 steps (16%)
19:32:15:WU00:FS00:0x21:Completed 340000 out of 2000000 steps (17%)
19:34:13:WU00:FS00:0x21:Completed 360000 out of 2000000 steps (18%)
19:36:12:WU00:FS00:0x21:Completed 380000 out of 2000000 steps (19%)
19:38:10:WU00:FS00:0x21:Completed 400000 out of 2000000 steps (20%)
19:40:19:WU00:FS00:0x21:Completed 420000 out of 2000000 steps (21%)
19:42:17:WU00:FS00:0x21:Completed 440000 out of 2000000 steps (22%)
19:44:16:WU00:FS00:0x21:Completed 460000 out of 2000000 steps (23%)
19:46:14:WU00:FS00:0x21:Completed 480000 out of 2000000 steps (24%)
19:48:13:WU00:FS00:0x21:Completed 500000 out of 2000000 steps (25%)
19:50:21:WU00:FS00:0x21:Completed 520000 out of 2000000 steps (26%)
19:52:20:WU00:FS00:0x21:Completed 540000 out of 2000000 steps (27%)
19:54:19:WU00:FS00:0x21:Completed 560000 out of 2000000 steps (28%)
19:56:17:WU00:FS00:0x21:Completed 580000 out of 2000000 steps (29%)
19:58:16:WU00:FS00:0x21:Completed 600000 out of 2000000 steps (30%)
20:00:24:WU00:FS00:0x21:Completed 620000 out of 2000000 steps (31%)
20:02:23:WU00:FS00:0x21:Completed 640000 out of 2000000 steps (32%)
20:04:22:WU00:FS00:0x21:Completed 660000 out of 2000000 steps (33%)
20:06:21:WU00:FS00:0x21:Completed 680000 out of 2000000 steps (34%)
20:08:19:WU00:FS00:0x21:Completed 700000 out of 2000000 steps (35%)
20:10:28:WU00:FS00:0x21:Completed 720000 out of 2000000 steps (36%)
20:12:27:WU00:FS00:0x21:Completed 740000 out of 2000000 steps (37%)
20:14:25:WU00:FS00:0x21:Completed 760000 out of 2000000 steps (38%)
20:16:24:WU00:FS00:0x21:Completed 780000 out of 2000000 steps (39%)
20:18:23:WU00:FS00:0x21:Completed 800000 out of 2000000 steps (40%)
20:20:31:WU00:FS00:0x21:Completed 820000 out of 2000000 steps (41%)
20:22:29:WU00:FS00:0x21:Completed 840000 out of 2000000 steps (42%)
20:24:28:WU00:FS00:0x21:Completed 860000 out of 2000000 steps (43%)
20:26:27:WU00:FS00:0x21:Completed 880000 out of 2000000 steps (44%)
20:28:25:WU00:FS00:0x21:Completed 900000 out of 2000000 steps (45%)
20:30:34:WU00:FS00:0x21:Completed 920000 out of 2000000 steps (46%)
20:32:32:WU00:FS00:0x21:Completed 940000 out of 2000000 steps (47%)
20:34:31:WU00:FS00:0x21:Completed 960000 out of 2000000 steps (48%)
20:42:16:WU00:FS00:0x21:Completed 980000 out of 2000000 steps (49%)
21:36:08:WU00:FS00:0x21:Completed 1000000 out of 2000000 steps (50%)
21:36:08:WU00:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint
21:38:04:WU00:FS00:0x21:Completed 920000 out of 2000000 steps (46%)
21:40:00:WU00:FS00:0x21:Completed 940000 out of 2000000 steps (47%)
21:41:57:WU00:FS00:0x21:Completed 960000 out of 2000000 steps (48%)
21:43:53:WU00:FS00:0x21:Completed 980000 out of 2000000 steps (49%)
21:45:49:WU00:FS00:0x21:Completed 1000000 out of 2000000 steps (50%)
21:47:55:WU00:FS00:0x21:Completed 1020000 out of 2000000 steps (51%)
21:49:52:WU00:FS00:0x21:Completed 1040000 out of 2000000 steps (52%)
21:51:49:WU00:FS00:0x21:Completed 1060000 out of 2000000 steps (53%)
21:53:45:WU00:FS00:0x21:Completed 1080000 out of 2000000 steps (54%)
21:55:41:WU00:FS00:0x21:Completed 1100000 out of 2000000 steps (55%)
21:57:48:WU00:FS00:0x21:Completed 1120000 out of 2000000 steps (56%)
21:59:44:WU00:FS00:0x21:Completed 1140000 out of 2000000 steps (57%)
22:01:40:WU00:FS00:0x21:Completed 1160000 out of 2000000 steps (58%)
22:03:37:WU00:FS00:0x21:Completed 1180000 out of 2000000 steps (59%)
22:05:00:FS00:Paused
22:05:00:FS00:Shutting core down
22:05:00:WU00:FS00:0x21:Caught signal SIGINT(2) on PID 33953
22:05:00:WU00:FS00:0x21:Exiting, please wait. . .
22:05:01:WU00:FS00:0x21:Folding@home Core Shutdown: INTERRUPTED
22:05:01:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
22:05:11:FS00:Unpaused
22:05:11:WU00:FS00:Starting
22:05:11:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 00 -suffix 01 -version 704 -lifeline 1733 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
22:05:11:WU00:FS00:Started FahCore on PID 39096
22:05:11:WU00:FS00:Core PID:39100
22:05:11:WU00:FS00:FahCore 0x21 started
22:05:12:WU00:FS00:0x21:*********************** Log Started 2015-10-11T22:05:11Z ***********************
22:05:12:WU00:FS00:0x21:Project: 9634 (Run 0, Clone 9, Gen 5)
22:05:12:WU00:FS00:0x21:Unit: 0x00000008ab436c9b5609bee2d21f3fed
22:05:12:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
22:05:12:WU00:FS00:0x21:Machine: 0
22:05:12:WU00:FS00:0x21:Digital signatures verified
22:05:12:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
22:05:12:WU00:FS00:0x21:Version 0.0.11
22:05:12:WU00:FS00:0x21:  Found a checkpoint file
22:05:52:WU00:FS00:0x21:Completed 1100000 out of 2000000 steps (55%)
22:05:52:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
22:07:59:WU00:FS00:0x21:Completed 1120000 out of 2000000 steps (56%)
22:09:57:WU00:FS00:0x21:Completed 1140000 out of 2000000 steps (57%)
22:11:56:WU00:FS00:0x21:Completed 1160000 out of 2000000 steps (58%)
22:13:55:WU00:FS00:0x21:Completed 1180000 out of 2000000 steps (59%)
22:15:54:WU00:FS00:0x21:Completed 1200000 out of 2000000 steps (60%)
23:01:13:FS00:Paused
23:01:13:FS00:Shutting core down
23:01:13:WU00:FS00:0x21:Caught signal SIGINT(2) on PID 39100
23:01:13:WU00:FS00:0x21:Exiting, please wait. . .
23:01:17:WU00:FS00:0x21:Folding@home Core Shutdown: INTERRUPTED
23:01:17:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
23:01:23:FS00:Unpaused
23:01:23:WU00:FS00:Starting
23:01:23:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 00 -suffix 01 -version 704 -lifeline 1733 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
23:01:23:WU00:FS00:Started FahCore on PID 40569
23:01:23:WU00:FS00:Core PID:40573
23:01:23:WU00:FS00:FahCore 0x21 started
23:01:24:WU00:FS00:0x21:*********************** Log Started 2015-10-11T23:01:23Z ***********************
23:01:24:WU00:FS00:0x21:Project: 9634 (Run 0, Clone 9, Gen 5)
23:01:24:WU00:FS00:0x21:Unit: 0x00000008ab436c9b5609bee2d21f3fed
23:01:24:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
23:01:24:WU00:FS00:0x21:Machine: 0
23:01:24:WU00:FS00:0x21:Digital signatures verified
23:01:24:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
23:01:24:WU00:FS00:0x21:Version 0.0.11
23:01:24:WU00:FS00:0x21:  Found a checkpoint file
23:02:07:WU00:FS00:0x21:Completed 1200000 out of 2000000 steps (60%)
23:02:07:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
23:04:14:WU00:FS00:0x21:Completed 1220000 out of 2000000 steps (61%)
23:06:14:WU00:FS00:0x21:Completed 1240000 out of 2000000 steps (62%)
23:08:14:WU00:FS00:0x21:Completed 1260000 out of 2000000 steps (63%)
23:10:14:WU00:FS00:0x21:Completed 1280000 out of 2000000 steps (64%)
23:12:15:WU00:FS00:0x21:Completed 1300000 out of 2000000 steps (65%)
23:14:25:WU00:FS00:0x21:Completed 1320000 out of 2000000 steps (66%)
23:16:25:WU00:FS00:0x21:Completed 1340000 out of 2000000 steps (67%)
23:18:25:WU00:FS00:0x21:Completed 1360000 out of 2000000 steps (68%)
23:20:24:WU00:FS00:0x21:Completed 1380000 out of 2000000 steps (69%)
23:22:25:WU00:FS00:0x21:Completed 1400000 out of 2000000 steps (70%)
23:24:35:WU00:FS00:0x21:Completed 1420000 out of 2000000 steps (71%)
23:26:35:WU00:FS00:0x21:Completed 1440000 out of 2000000 steps (72%)
23:28:35:WU00:FS00:0x21:Completed 1460000 out of 2000000 steps (73%)
23:30:35:WU00:FS00:0x21:Completed 1480000 out of 2000000 steps (74%)
23:32:35:WU00:FS00:0x21:Completed 1500000 out of 2000000 steps (75%)
23:34:46:WU00:FS00:0x21:Completed 1520000 out of 2000000 steps (76%)
23:36:46:WU00:FS00:0x21:Completed 1540000 out of 2000000 steps (77%)
23:42:18:WU00:FS00:0x21:Completed 1560000 out of 2000000 steps (78%)
******************************* Date: 2015-10-12 *******************************
00:46:39:WU00:FS00:0x21:Completed 1580000 out of 2000000 steps (79%)
01:29:37:FS00:Paused
01:29:38:FS00:Shutting core down
01:29:38:WU00:FS00:0x21:Caught signal SIGINT(2) on PID 40573
01:29:38:WU00:FS00:0x21:Exiting, please wait. . .
01:29:40:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
01:29:51:FS00:Unpaused
01:29:51:WU00:FS00:Starting
01:29:51:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 00 -suffix 01 -version 704 -lifeline 1733 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
01:29:51:WU00:FS00:Started FahCore on PID 44397
01:29:51:WU00:FS00:Core PID:44401
01:29:51:WU00:FS00:FahCore 0x21 started
01:29:52:WU00:FS00:0x21:*********************** Log Started 2015-10-12T01:29:51Z ***********************
01:29:52:WU00:FS00:0x21:Project: 9634 (Run 0, Clone 9, Gen 5)
01:29:52:WU00:FS00:0x21:Unit: 0x00000008ab436c9b5609bee2d21f3fed
01:29:52:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
01:29:52:WU00:FS00:0x21:Machine: 0
01:29:52:WU00:FS00:0x21:Digital signatures verified
01:29:52:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
01:29:52:WU00:FS00:0x21:Version 0.0.11
01:29:52:WU00:FS00:0x21:  Found a checkpoint file
01:30:32:WU00:FS00:0x21:Completed 1500000 out of 2000000 steps (75%)
01:30:32:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
01:32:40:WU00:FS00:0x21:Completed 1520000 out of 2000000 steps (76%)
01:34:40:WU00:FS00:0x21:Completed 1540000 out of 2000000 steps (77%)
01:36:39:WU00:FS00:0x21:Completed 1560000 out of 2000000 steps (78%)
01:38:39:WU00:FS00:0x21:Completed 1580000 out of 2000000 steps (79%)
01:40:38:WU00:FS00:0x21:Completed 1600000 out of 2000000 steps (80%)
01:42:49:WU00:FS00:0x21:Completed 1620000 out of 2000000 steps (81%)
01:44:48:WU00:FS00:0x21:Completed 1640000 out of 2000000 steps (82%)
01:46:48:WU00:FS00:0x21:Completed 1660000 out of 2000000 steps (83%)
01:48:47:WU00:FS00:0x21:Completed 1680000 out of 2000000 steps (84%)
01:50:47:WU00:FS00:0x21:Completed 1700000 out of 2000000 steps (85%)
01:52:57:WU00:FS00:0x21:Completed 1720000 out of 2000000 steps (86%)
01:54:57:WU00:FS00:0x21:Completed 1740000 out of 2000000 steps (87%)
01:56:56:WU00:FS00:0x21:Completed 1760000 out of 2000000 steps (88%)
01:58:56:WU00:FS00:0x21:Completed 1780000 out of 2000000 steps (89%)
02:00:55:WU00:FS00:0x21:Completed 1800000 out of 2000000 steps (90%)
02:03:05:WU00:FS00:0x21:Completed 1820000 out of 2000000 steps (91%)
02:05:05:WU00:FS00:0x21:Completed 1840000 out of 2000000 steps (92%)
02:07:05:WU00:FS00:0x21:Completed 1860000 out of 2000000 steps (93%)
02:09:05:WU00:FS00:0x21:Completed 1880000 out of 2000000 steps (94%)
02:11:05:WU00:FS00:0x21:Completed 1900000 out of 2000000 steps (95%)
02:13:15:WU00:FS00:0x21:Completed 1920000 out of 2000000 steps (96%)
02:15:14:WU00:FS00:0x21:Completed 1940000 out of 2000000 steps (97%)
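The stall is easy to miss when scrolling through a log like the one above, so here is a minimal sketch of how the time per frame could be pulled out of it automatically. It assumes the progress-line format shown in this log; the function names and the 600-second threshold are hypothetical choices for illustration, not anything the client provides. Lines where the step count does not move forward (a rollback after a Bad State, or a restart from a checkpoint) only reset the baseline.

```python
import re
from datetime import datetime, timedelta

# Matches progress lines such as:
# 20:34:31:WU00:FS00:0x21:Completed 960000 out of 2000000 steps (48%)
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)


def frame_times(lines):
    """Yield (steps, seconds since the previous progress line) pairs."""
    prev_time = prev_steps = None
    for line in lines:
        m = PROGRESS.match(line.strip())
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%H:%M:%S")
        steps = int(m.group(2))
        # Only forward progress counts as a frame; rollbacks and restarts
        # just move the baseline.
        if prev_steps is not None and steps > prev_steps:
            # The log carries time of day only, so wrap around midnight.
            elapsed = (t - prev_time) % timedelta(days=1)
            yield steps, int(elapsed.total_seconds())
        prev_time, prev_steps = t, steps


def stalled_frames(lines, threshold_s=600):
    """Return the frames that took longer than threshold_s seconds."""
    return [(s, dt) for s, dt in frame_times(lines) if dt > threshold_s]
```

Feeding it the contents of the client log (for example `stalled_frames(open("log.txt"))`, where the file name is only a placeholder) lists the frames that took longer than the threshold.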
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by bruce »

Grandpa_01 wrote: I see these are in general circulation now (p9625-9643 to FAH), and I am still having the problem with stalling WUs. Below is the log from the one I am currently running. I can pause it and get it going again, but after a few percent it slows way down to 1+ hours per frame. Last week I changed the driver from 355.11 to 352.xx hoping to clear up the problem, but as you can see it did not work. The card is an EVGA GTX 970 SC running at stock clocks, and it has completed 26 other WUs this past week (roughly 60% core 18 and 40% core 21) with no problems, half of them with an OC of +80 MHz, so I know the card can run all other WUs even when overclocked. I will move the card off F@H for now and run it on something else until these WUs are all gone or fixed, since once the card gets stuck in its endless cycle it is just wasting time and electricity.

I do not know whether it is the card or the p9625-9643 series, but this card has yet to complete one of them without problems. Some of my cards have completed some of them cleanly, but every one of my GTX 9xx cards has had the same problem on at least one WU in this series, while completing others without issues.

One thing I noticed: the clocks bounce around while these WUs are running. They fluctuate between 1366 MHz, the stock clock the card is set to, and 1426 MHz, which is quite a bit of OC and which I am guessing is the boost state. All of the other WUs I have watched on this card hold steady at 1366 MHz. Perhaps that will help in figuring out what is going on here.

I suspect it's neither the core nor the WUs themselves. My money is on a driver or GPU BIOS issue. Saying your WUs were 60% core 18 and 40% core 21 and that overall about half of them had problems doesn't help us isolate anything.
artoar_11
Posts: 652
Joined: Sun Nov 22, 2009 8:42 pm
Hardware configuration: AMD R7 3700X @ 4.0 GHz; ASUS ROG STRIX X470-F GAMING; DDR4 2x8GB @ 3.0 GHz; GByte RTX 3060 Ti @ 1890 MHz; Fortron-550W 80+ bronze; Win10 Pro/64
Location: Bulgaria/Team #224497/artoar11_ALL_....

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by artoar_11 »

Grandpa_01 wrote: ..............
P.S. Edit
One thing I noticed: the clocks bounce around while these WUs are running. They fluctuate between 1366 MHz, the stock clock the card is set to, and 1426 MHz, which is quite a bit of OC and which I am guessing is the boost state. All of the other WUs I have watched on this card hold steady at 1366 MHz. Perhaps that will help in figuring out what is going on here.

The frequency and voltage jump up and down because the card hits its Power Limit (PL). I'm not 100% sure, but as far as I can see, most of the problems are with the 900-series Maxwell cards. My Asus 970 Strix has a factory overclock (with Boost) of 1300 MHz. With PL set to 120%, some 0x21 WUs exceed that 120% and the card throttles (↓↑↓↑). The temperature is only about 67°C max, with fans at 55% (1700 rpm).
For several days I left the card at 1200 MHz (PL 120%, without throttling), but I again got a Bad State on "beta".
I'm no expert, but in my humble opinion the hardware is overloaded by some of the 0x21 WUs (at the factory frequency and the default PL of 100%). On 0x17 and 0x18 projects there are no problems even at 1450 MHz, well below the PL.
I use MSI Afterburner 4.1.1.
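On Linux, where Afterburner is not available, the same power-limit check could be done by logging power draw next to the SM clock. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and the driver reports these four query fields for the card (some cards return `[Not Supported]`, which this parser does not handle). The function names and the 98% margin are hypothetical choices for illustration.

```python
import subprocess

# Query fields passed to nvidia-smi --query-gpu.
FIELDS = ["power.draw", "power.limit", "clocks.sm", "temperature.gpu"]


def parse_sample(csv_line):
    """Parse one 'csv,noheader,nounits' output line into a dict of floats."""
    values = [float(v) for v in csv_line.strip().split(",")]
    return dict(zip(FIELDS, values))


def read_gpu(index=0):
    """Take one reading from the GPU with the given index."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
            "-i",
            str(index),
        ],
        text=True,
    )
    return parse_sample(out.splitlines()[0])


def at_power_limit(sample, margin=0.98):
    """True when the draw is within `margin` of the enforced power limit."""
    return sample["power.draw"] >= margin * sample["power.limit"]
```

Calling `read_gpu()` once a second while a 0x21 WU runs, and noting when `at_power_limit()` flips together with a drop in `clocks.sm`, would show whether the clock swings line up with the power cap.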
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by Grandpa_01 »

Bruce,

The 26 other WUs referred to above were from different series (9xxx, 10xxx, etc.). I posted that only as an example: the card ran those 26 WUs between this WU and the last one from the p9625-9643 series, and it is stable on all other work.

@artoar_11, the card is set to 0 OC in xserver, so to my understanding it should not be going above the default 1366 MHz, but it is. Why is it doing this, and why only with the newer series of core21 WUs? I have reported this with other new beta WUs. I have a feeling it has something to do with the new core21, but that is just a theory.

This may be something they want to look into: does this only happen on OCed cards? The card in question is a factory-OCed card.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by bruce »

artoar_11 wrote: ... and the card throttles (↓↑↓↑). The temperature is only about 67°C max, with fans at 55% (1700 rpm).

You might or might not be right, but it seems to me you're suggesting that throttling is not designed right by NV.

Question 1: Why is the fan only at 55%?

Question 2: How fast is the ↓↑↓↑?

FAH has tried embedding a throttling function into certain GPU FahCores, but nothing they've tried has been acceptable for GPUs that don't have a built-in throttling function. IMHO, throttling MIGHT be made to work if it is built into the GPU's bios, but it will probably never work well if it's in the drivers or the App.
toTOW
Site Moderator
Posts: 6349
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by toTOW »

The cause of the problem described by Grandpa_01 has been identified, and a fix should be implemented in the next version of Core 21. We don't know yet how they'll deal with the issue, but at least the WU shouldn't stay stalled anymore.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
artoar_11
Posts: 652
Joined: Sun Nov 22, 2009 8:42 pm
Location: Bulgaria/Team #224497/artoar11_ALL_....

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by artoar_11 »

bruce wrote: You might or might not be right, but it seems to me you're suggesting that throttling is not designed right by NV.

Question 1: Why is the fan only at 55%?

Question 2: How fast is the ↓↑↓↑?

FAH has tried embedding a throttling function into certain GPU FahCores, but nothing they've tried has been acceptable for GPUs that don't have a built-in throttling function. IMHO, throttling MIGHT be made to work if it is built into the GPU's BIOS, but it will probably never work well if it's in the drivers or the app.

Maybe I did not express myself well. Yes, throttling is embedded in the vBIOS of the video card. My concern is that the software (the 0x21 core) overloads the hardware (high power consumption).

The question is put more directly here: https://www.reddit.com/r/foldingathome/ ... rclocking/
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by ChristianVirtual »

It certainly changes frequently above the TDP value (here with P9212)

970, nV355.11, CentOS 7, Core 21

http://imageshack.com/a/img912/5605/8pLRSp.png

With a cap at 151W as per nvidia-smi
Please contribute your logs to http://ppd.fahmm.net
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by Grandpa_01 »

And another one, this time on a brand-new Nvidia GTX 980 Classified, the top-of-the-line Nvidia card. I just got it on Friday, and it completed 22 other WUs without any problems, at exactly the same settings, before receiving the current 96xx series WU. So I am assuming factory-OCed cards are no longer capable of running F@H reliably. Maybe these should be moved back to beta or advanced. :?


22:28:39:WU01:FS01:0x21:*********************** Log Started 2015-10-12T22:28:39Z ***********************
22:28:39:WU01:FS01:0x21:Project: 9627 (Run 0, Clone 10, Gen 8)
22:28:39:WU01:FS01:0x21:Unit: 0x00000008ab436c9b5609bee1f7a821b2
22:28:39:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
22:28:39:WU01:FS01:0x21:Machine: 1
22:28:39:WU01:FS01:0x21:Reading tar file core.xml
22:28:39:WU01:FS01:0x21:Reading tar file integrator.xml
22:28:39:WU01:FS01:0x21:Reading tar file state.xml
22:28:39:WU01:FS01:0x21:Reading tar file system.xml
22:28:39:WU01:FS01:0x21:Digital signatures verified
22:28:39:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
22:28:39:WU01:FS01:0x21:Version 0.0.11
22:28:44:WU00:FS01:Upload 55.47%
22:28:56:WU00:FS01:Upload complete
22:28:56:WU00:FS01:Server responded WORK_ACK (400)
22:28:56:WU00:FS01:Final credit estimate, 39305.00 points
22:28:56:WU00:FS01:Cleaning up
22:29:20:WU01:FS01:0x21:Completed 0 out of 2000000 steps (0%)
22:29:20:WU01:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
22:31:01:WU01:FS01:0x21:Completed 20000 out of 2000000 steps (1%)
22:32:33:WU01:FS01:0x21:Completed 40000 out of 2000000 steps (2%)
22:34:05:WU01:FS01:0x21:Completed 60000 out of 2000000 steps (3%)
22:35:37:WU01:FS01:0x21:Completed 80000 out of 2000000 steps (4%)
******************************* Date: 2015-10-12 *******************************
23:07:01:WU01:FS01:0x21:Completed 100000 out of 2000000 steps (5%)
23:07:01:WU01:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint
23:08:33:WU01:FS01:0x21:Completed 20000 out of 2000000 steps (1%)
23:10:05:WU01:FS01:0x21:Completed 40000 out of 2000000 steps (2%)
23:44:19:WU01:FS01:0x21:Completed 60000 out of 2000000 steps (3%)
00:33:25:WU01:FS01:0x21:Completed 80000 out of 2000000 steps (4%)
01:08:06:FS01:Paused
01:08:06:FS01:Shutting core down
01:08:06:WU01:FS01:0x21:Caught signal SIGINT(2) on PID 9476
01:08:06:WU01:FS01:0x21:Exiting, please wait. . .
01:08:08:WU01:FS01:0x21:Folding@home Core Shutdown: INTERRUPTED
01:08:08:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
01:08:15:FS01:Unpaused
01:08:16:WU01:FS01:Starting
01:08:16:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 01 -suffix 01 -version 704 -lifeline 1781 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
01:08:16:WU01:FS01:Started FahCore on PID 13582
01:08:16:WU01:FS01:Core PID:13586
01:08:16:WU01:FS01:FahCore 0x21 started
01:08:16:WU01:FS01:0x21:*********************** Log Started 2015-10-13T01:08:16Z ***********************
01:08:16:WU01:FS01:0x21:Project: 9627 (Run 0, Clone 10, Gen 8)
01:08:16:WU01:FS01:0x21:Unit: 0x00000008ab436c9b5609bee1f7a821b2
01:08:16:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
01:08:16:WU01:FS01:0x21:Machine: 1
01:08:16:WU01:FS01:0x21:Digital signatures verified
01:08:16:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
01:08:16:WU01:FS01:0x21:Version 0.0.11
01:08:55:WU01:FS01:0x21:Completed 0 out of 2000000 steps (0%)
01:08:55:WU01:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
01:10:36:WU01:FS01:0x21:Completed 20000 out of 2000000 steps (1%)
01:12:08:WU01:FS01:0x21:Completed 40000 out of 2000000 steps (2%)
01:13:41:WU01:FS01:0x21:Completed 60000 out of 2000000 steps (3%)
01:15:13:WU01:FS01:0x21:Completed 80000 out of 2000000 steps (4%)
01:16:46:WU01:FS01:0x21:Completed 100000 out of 2000000 steps (5%)
01:18:29:WU01:FS01:0x21:Completed 120000 out of 2000000 steps (6%)
01:20:02:WU01:FS01:0x21:Completed 140000 out of 2000000 steps (7%)
01:21:35:WU01:FS01:0x21:Completed 160000 out of 2000000 steps (8%)
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by bruce »

artoar_11 wrote: Maybe I did not express myself well. Yes, throttling is embedded in the vBIOS of the video card. My concern is that the software (the 0x21 core) overloads the hardware (high power consumption).

You are expressing yourself very clearly. Whatever is currently in the vBIOS isn't working.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by bruce »

It is regulating with -0 / +10% of 151 W. You're asking for something that regulates the power (temperature) to a tighter tolerance ... or maybe you want to regulate between +0% and -10% rather than going up to +10%.
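To put numbers on the two tolerance bands mentioned here, a small worked example follows. The 151 W cap is the nvidia-smi figure quoted in the previous post; the helper name is hypothetical.

```python
def regulation_band(cap_w, below_pct, above_pct):
    """Return the (low, high) power bounds in watts for a band around cap_w."""
    low = cap_w * (1 - below_pct / 100)
    high = cap_w * (1 + above_pct / 100)
    return low, high


# Regulating at -0 / +10% of the cap versus -10 / +0%.
for below, above in ((0, 10), (10, 0)):
    low, high = regulation_band(151.0, below, above)
    print(f"-{below}% / +{above}%: {low:.1f} W to {high:.1f} W")
```

With a 151 W cap, the first band runs from 151.0 W up to 166.1 W and the second from 135.9 W up to 151.0 W.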

I've designed lots of control systems and I would look for a couple of possible limitations.
1) The thermal sensor is too far away from the heat source, resulting in a serious lag before the reading shows up as incorrect and in need of an adjustment.
2) The adjustments may be limited to discrete choices of frequency resulting in larger steps than needed leading to over-correction rather than being smoothed out to a value that provides a nice average.

Overclocking might be changing some of the parameters that the regulation logic uses, making it operate incorrectly, but I doubt it. Overclock (factory or otherwise) can improve game performance, mostly because the load is intermittent so it spends some time in the unregulated regions. With a continuous load like FAH, overclocking would simply get it to the regulated power (or temperature) sooner, at which point the vBIOS is overriding the overclock. You can't have it both ways.
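The two limitations listed above (sensor lag and discrete frequency steps) can be illustrated with a toy simulation. This is only a sketch under invented assumptions: power is modelled as a straight 0.11 W per MHz, the sensor as a first-order lag, and the regulator as moving exactly one step per tick. None of this is taken from the actual vBIOS logic; the 151 W cap and the 1426 MHz ceiling are simply the figures quoted earlier in the thread.

```python
def simulate(step_mhz, lag=0.3, ticks=400, cap_w=151.0,
             watts_per_mhz=0.11, max_mhz=1426.0):
    """Toy power-cap regulator; returns the clock (MHz) after each tick."""
    clock = max_mhz   # start at the boost ceiling
    sensed = 0.0      # lagged sensor reading, in watts
    history = []
    for _ in range(ticks):
        actual = watts_per_mhz * clock       # invented linear power model
        sensed += lag * (actual - sensed)    # first-order sensor lag
        if sensed > cap_w:
            clock -= step_mhz                # over the cap: one step down
        elif clock + step_mhz <= max_mhz:
            clock += step_mhz                # under the cap: one step up
        history.append(clock)
    return history


def swing(history, tail=100):
    """Peak-to-peak clock variation over the last `tail` ticks."""
    recent = history[-tail:]
    return max(recent) - min(recent)


for step in (13.0, 60.0):
    print(step, swing(simulate(step)))
```

Because every adjustment moves the clock by a whole step, the steady-state swing in this model can never be smaller than one step, and the lag makes the regulator keep stepping after the real power has already crossed the cap. A coarse step therefore forces a wide swing, which is the over-correction described in point 2.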
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by Grandpa_01 »

I am starting to think this is just a Linux problem. I have an EVGA GTX 980 Classified running at a pretty high OC of 1500 MHz that has run 2 WUs of this series with no problem, but all of the Linux rigs have had the problem.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by bruce »

Grandpa_01 wrote: I am starting to think this is just a Linux problem. I have an EVGA GTX 980 Classified running at a pretty high OC of 1500 MHz that has run 2 WUs of this series with no problem, but all of the Linux rigs have had the problem.

Yes, it could be related to the driver version. Does your problem go away if you un-OC (down to the official clock rate)?
GeForce GTX 980 (GM204)
Base core clock: 1126 MHz
Boost core clock: 1216 MHz

If so, start there and treat it like a new case for overclocking, increasing gradually to some number below where the problem occurs -- and let us know what you find.
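The procedure suggested here amounts to a simple upward search. Below is a minimal sketch, assuming a hypothetical `is_stable(clock_mhz)` callback that stands in for "run several WUs of this series at that clock and check the log for stalls or Bad State". The 25 MHz increment and the 1500 MHz ceiling are illustrative choices; only the 1126 MHz base clock comes from the figures above.

```python
def find_stable_clock(is_stable, base_mhz=1126, limit_mhz=1500, step_mhz=25):
    """Step the clock up from base_mhz and return the last stable value.

    Returns None if even the base clock is not stable.
    """
    if not is_stable(base_mhz):
        return None
    last_good = base_mhz
    clock = base_mhz + step_mhz
    while clock <= limit_mhz:
        if not is_stable(clock):
            break               # first failure: stop raising the clock
        last_good = clock
        clock += step_mhz
    return last_good


# Stand-in for real testing: pretend the card fails at 1300 MHz and above.
print(find_stable_clock(lambda mhz: mhz < 1300))
```

In practice each `is_stable` check would take hours of folding, so a larger increment may be a reasonable trade-off between resolution and time spent.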
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by Grandpa_01 »

I can give it a shot. I have never tried the reference card base clock. I will drop it on the next one and see what happens.
bigblock990
Posts: 20
Joined: Wed Sep 09, 2015 12:42 pm

Re: 9634 (Run 0, Clone 9, Gen 5)

Post by bigblock990 »

Not sure if the Windows guys are having problems, but all of the openmm_21 projects are very difficult in Linux. I have backed my OCs way down, and I have a Titan X running at -39 MHz from stock. All my stuff is Nvidia Maxwell, so I can't speak for how they work on AMD GPUs. The older core21 unknown_enum projects work well, as do all core18.

I have 2 GTX 970s that are 24/7 stable @ 1531 on core18 and have to be backed down to 1300ish for openmm_21 projects.