Re: Project 11430 taking 13 hours for 24k credit.
Posted: Sun Oct 09, 2016 11:29 am
by rwh202
PS3EdOlkkola wrote:Project 11430 continues to have issues. Same problem as mentioned 4 posts ago in this thread. Problematic work units are:
11430 (R3, C40, G51) PPD 24657 TPF 12:24
11430 (R0, C16, G51) PPD 19348 TPF 14:35
Both work units were on different systems. Each system was equipped with a 980ti. Normal PPD for each 980ti is in the range of 575K to 700K PPD. The work units were dumped.
Has anyone seen a 11430 with gen >51 ? All of the slow ones reported are gen 51 - wonder if this is either a roadblock or a configured max gen?
I've only had 7 pass by me, with a max gen of 50 - all went happily with PPD on the high end of normal.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Sun Oct 09, 2016 3:31 pm
by PS3EdOlkkola
Yes G53: Checked and see that I got another super-slow 11430 (R4, C49, G53) that was running with a PPD of 13647 and a TPF of 18:23 (normal for this GPU is 475K to 450K PPD). Was running on the i7-3930K 3x 980 unit. When it was running, the system was at 22% CPU utilization (normal), memory utilization was 3.72GB out of 16.32GB available (normal), and the GPU clock (GTX 980) was 1.392GHz (normal). Paused FAH, quit FAH, dumped the work unit, restarted FAH, unpaused and got a different WU. Something isn't right with >G50 work units.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Sun Oct 09, 2016 3:59 pm
by bruce
Dumping the WU isn't recommended. All that means is that (A) the results will be delayed and (B) somebody else will have to process the WU from the beginning.
Having the owner of the project deal with the problem is the only viable route.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Sun Oct 09, 2016 7:05 pm
by PS3EdOlkkola
I agree that dumping work units is considered bad practice and isn't a recommended step.
I think we'd also agree that it's important to make note of what looks like a real problem with a work unit after all routine diagnostic steps have been taken to ensure the issue doesn't lie in the hardware, software, network, or an incorrect configuration setting of the processing system/slot itself. When diagnostics confirm the system is operating properly, then from the donor's perspective the work unit is considered defective, given how far out of the norm it's performing. If the work unit is truly defective, the results will be delayed by definition.
In an ideal world, dumping the work unit and reporting it promptly would immediately get the attention of the PI to actually do the "i" part of the job and pause the project until the issue is resolved and another donor does not waste time and energy processing a work unit that's not constructed to specification, or have the PI confirm 3x to 6x longer processing time is normal. In that context, it's very true that the only viable option to resolve an issue is through the project owner. The issue was identified on 10/6 and it's now 10/9 without a FF post from the PI providing an update. Lacking an "everything is ok, carry on" message then I default to my own experience and make the determination the work unit is defective and dump it and report it immediately. Fast issue identification and rectification is the best process for keeping all projects on-track.
The last time I can recall intentionally dumping a work unit was last December, over 100,000 work units ago. Dumping 5 work units over the last few days dropped my year-to-date work unit return percentage to 99.995%. For the record, I'm not happy with less than 100% work unit return performance.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Mon Oct 10, 2016 9:04 am
by Duce H_K_
My point is that the project consists of 50'000'000 steps while other projects consist of 2'500'000-5'000'000 steps
Code:
14:23:48:WU00:FS00:0x21:Project: 11430 (Run 3, Clone 42, Gen 53)
14:23:48:WU00:FS00:0x21:Unit: 0x000000448ca304f1574a00861670ab83
14:23:48:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
14:23:48:WU00:FS00:0x21:Machine: 0
14:23:48:WU00:FS00:0x21:Reading tar file core.xml
14:23:48:WU00:FS00:0x21:Reading tar file system.xml
14:23:48:WU00:FS00:0x21:Reading tar file integrator.xml
14:23:48:WU00:FS00:0x21:Reading tar file state.xml
14:23:48:WU00:FS00:0x21:Digital signatures verified
14:23:48:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
14:23:48:WU00:FS00:0x21:Version 0.0.17
14:24:01:WU00:FS00:0x21:Completed 0 out of 50000000 steps (0%)
14:24:01:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
14:42:02:WU00:FS00:0x21:Completed 500000 out of 50000000 steps (1%)
15:00:04:WU00:FS00:0x21:Completed 1000000 out of 50000000 steps (2%)
15:18:06:WU00:FS00:0x21:Completed 1500000 out of 50000000 steps (3%)
15:36:09:WU00:FS00:0x21:Completed 2000000 out of 50000000 steps (4%)
15:36:34:Removing old file 'configs/config-20160903-062905.xml'
15:36:34:Saving configuration to config.xml
15:36:34:FS00:Shutting core down
15:36:34:WU00:FS00:0x21:WARNING:Console control signal 1 on PID 2700
15:36:34:WU00:FS00:0x21:Exiting, please wait. . .
15:36:34:WU00:FS00:0x21:Folding@home Core Shutdown: INTERRUPTED
15:36:35:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:36:35:WARNING:WU00:Slot ID 0 no longer exists and there are no other matching slots, dumping
15:36:35:WU00:Sending unit results: id:00 state:SEND error:DUMPED project:11430 run:3 clone:42 gen:53 core:0x21 unit:0x000000448ca304f1574a00861670ab83
15:36:35:WU00:Connecting to 140.163.4.241:8080
15:36:36:WU00:Server responded WORK_ACK (400)
15:36:36:WU00:Cleaning up
Avg. Time / Frame: 00:18:02 - 14 069,9 PPD using a GTX970 OC at 1497MHz, with only one GPU slot running and 3 CPU cores idle. Even core 15's WUs (which consisted of 40'000'000 steps) gave around 70-80k PPD. Can anyone who has membership post here?
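The average time per frame quoted above can be checked against the log timestamps. This is a minimal sketch, assuming each 1% progress line counts as one frame and that the log does not cross midnight; the helper name `average_tpf` is made up for illustration and is not part of the FAH client.

```python
import re
from datetime import timedelta

# Progress lines copied from the log above (slow 11430 WU, Gen 53).
LOG = """\
14:24:01:WU00:FS00:0x21:Completed 0 out of 50000000 steps (0%)
14:42:02:WU00:FS00:0x21:Completed 500000 out of 50000000 steps (1%)
15:00:04:WU00:FS00:0x21:Completed 1000000 out of 50000000 steps (2%)
15:18:06:WU00:FS00:0x21:Completed 1500000 out of 50000000 steps (3%)
15:36:09:WU00:FS00:0x21:Completed 2000000 out of 50000000 steps (4%)
"""

# Captures the HH:MM:SS timestamp and the percentage of each progress line.
PROGRESS = re.compile(
    r"^(\d\d):(\d\d):(\d\d):.*Completed \d+ out of \d+ steps \((\d+)%\)"
)


def average_tpf(log_text):
    """Return the mean time per 1% frame between the first and last progress line."""
    points = []
    for line in log_text.splitlines():
        m = PROGRESS.match(line)
        if m:
            h, mi, s, pct = map(int, m.groups())
            points.append((timedelta(hours=h, minutes=mi, seconds=s), pct))
    (t0, p0), (t1, p1) = points[0], points[-1]
    return (t1 - t0) / (p1 - p0)


print(average_tpf(LOG))  # 0:18:02
```

The result agrees with the 00:18:02 figure reported for this WU.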
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Mon Oct 10, 2016 3:40 pm
by WhitehawkEQ
Every GPU WU has:
Code:
15:02:29:WU01:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
regardless of the type of WU.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Mon Oct 10, 2016 5:00 pm
by Joe_H
Duce H_K_ wrote:My point is that the project consists of 50'000'000 steps while other projects consist of 2'500'000-5'000'000 steps
The number of steps applies per project, so the comparison is only valid if other WU's from the same project normally have a different number of steps compared to the problem WU. Do Project 11430 WUs normally have 50M steps, or is it another number? If the problem WU's that people are reporting here do have an abnormal number of steps compared to other Project 11430 WU's, then that can be reported. So far no one has reported enough about any of the problem WU's to make that determination.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Mon Oct 10, 2016 5:04 pm
by Joe_H
WhitehawkEQ wrote:Every GPU WU has:
Code:
15:02:29:WU01:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
regardless of the type of WU.
Normal, informational message that can be ignored. It is about a feature added to the folding core that comes disabled by default since it proved problematical during testing. The client doesn't have code to suppress display of the message as the core was released after the 7.4.4 version of the client.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Mon Oct 10, 2016 7:35 pm
by JimboPalmer
PS3EdOlkkola wrote: The issue was identified on 10/6 and it's now 10/9 without a FF post from the PI providing an update. Lacking an "everything is ok, carry on" message then I default to my own experience and make the determination the work unit is defective and dump it and report it immediately.
Just a heads up, 10/8 to 10/10 is a major holiday in Canada: Thanksgiving, and a Federal Holiday in the US: Columbus Day. Tuesday will be a normal workday in both countries.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Mon Oct 10, 2016 9:18 pm
by davidcoton
11430 is apparently still in beta (unless I missed something). Support is normally not available here. If you need support, join the beta team as described here (accepting the obligations that entails) and report in the beta forum. Otherwise remove the beta flag and do not run beta projects.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Mon Oct 10, 2016 11:06 pm
by PS3EdOlkkola
None of my machines processing 11430 work units are configured as beta test rigs. They are not even running the advanced flag. According to the configuration of my systems, 11430 is a general release project.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Tue Oct 11, 2016 7:36 am
by rwh202
Yeah, non beta project - I don't have beta set on these rigs and the project is in the general psummary.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Tue Oct 11, 2016 10:45 am
by davidcoton
OK, my mistake. I just wish it was easier to see whether a unit was in beta, advanced, or full release.
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Thu Oct 13, 2016 3:51 pm
by rwh202
Thanks to Duce H_K_ who suggested it, here's a log for a normal 11430 (Gen 50):
Code:
18:29:02:WU02:FS01:Connecting to 171.67.108.45:80
18:29:03:WU02:FS01:Assigned to work server 140.163.4.241
18:29:03:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GP104 [GeForce GTX 1080] from 140.163.4.241
18:29:03:WU02:FS01:Connecting to 140.163.4.241:8080
18:29:04:WU02:FS01:Downloading 2.28MiB
18:29:07:WU02:FS01:Download complete
18:29:07:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:11430 run:1 clone:30 gen:50 core:0x21 unit:0x000000458ca304f1574a00752a3c4586
18:29:07:WU02:FS01:Starting
18:29:07:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1210 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
18:29:07:WU02:FS01:Started FahCore on PID 21895
18:29:07:WU02:FS01:Core PID:21899
18:29:07:WU02:FS01:FahCore 0x21 started
18:29:08:WU02:FS01:0x21:*********************** Log Started 2016-10-06T18:29:07Z ***********************
18:29:08:WU02:FS01:0x21:Project: 11430 (Run 1, Clone 30, Gen 50)
18:29:08:WU02:FS01:0x21:Unit: 0x000000458ca304f1574a00752a3c4586
18:29:08:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
18:29:08:WU02:FS01:0x21:Machine: 1
18:29:08:WU02:FS01:0x21:Reading tar file core.xml
18:29:08:WU02:FS01:0x21:Reading tar file system.xml
18:29:08:WU02:FS01:0x21:Reading tar file integrator.xml
18:29:08:WU02:FS01:0x21:Reading tar file state.xml
18:29:08:WU02:FS01:0x21:Digital signatures verified
18:29:08:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
18:29:08:WU02:FS01:0x21:Version 0.0.17
18:29:11:WU02:FS01:0x21:Completed 0 out of 5000000 steps (0%)
18:29:11:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
18:30:27:WU02:FS01:0x21:Completed 50000 out of 5000000 steps (1%)
18:31:43:WU02:FS01:0x21:Completed 100000 out of 5000000 steps (2%)
18:32:59:WU02:FS01:0x21:Completed 150000 out of 5000000 steps (3%)
18:34:15:WU02:FS01:0x21:Completed 200000 out of 5000000 steps (4%)
18:35:30:WU02:FS01:0x21:Completed 250000 out of 5000000 steps (5%)
18:36:47:WU02:FS01:0x21:Completed 300000 out of 5000000 steps (6%)
18:38:02:WU02:FS01:0x21:Completed 350000 out of 5000000 steps (7%)
18:39:19:WU02:FS01:0x21:Completed 400000 out of 5000000 steps (8%)
18:40:34:WU02:FS01:0x21:Completed 450000 out of 5000000 steps (9%)
18:41:50:WU02:FS01:0x21:Completed 500000 out of 5000000 steps (10%)
18:43:07:WU02:FS01:0x21:Completed 550000 out of 5000000 steps (11%)
18:44:22:WU02:FS01:0x21:Completed 600000 out of 5000000 steps (12%)
18:45:39:WU02:FS01:0x21:Completed 650000 out of 5000000 steps (13%)
18:46:54:WU02:FS01:0x21:Completed 700000 out of 5000000 steps (14%)
18:48:10:WU02:FS01:0x21:Completed 750000 out of 5000000 steps (15%)
18:49:26:WU02:FS01:0x21:Completed 800000 out of 5000000 steps (16%)
18:50:42:WU02:FS01:0x21:Completed 850000 out of 5000000 steps (17%)
18:51:58:WU02:FS01:0x21:Completed 900000 out of 5000000 steps (18%)
18:53:14:WU02:FS01:0x21:Completed 950000 out of 5000000 steps (19%)
18:54:29:WU02:FS01:0x21:Completed 1000000 out of 5000000 steps (20%)
18:55:46:WU02:FS01:0x21:Completed 1050000 out of 5000000 steps (21%)
18:57:01:WU02:FS01:0x21:Completed 1100000 out of 5000000 steps (22%)
18:58:18:WU02:FS01:0x21:Completed 1150000 out of 5000000 steps (23%)
18:59:33:WU02:FS01:0x21:Completed 1200000 out of 5000000 steps (24%)
19:00:49:WU02:FS01:0x21:Completed 1250000 out of 5000000 steps (25%)
19:02:05:WU02:FS01:0x21:Completed 1300000 out of 5000000 steps (26%)
19:03:21:WU02:FS01:0x21:Completed 1350000 out of 5000000 steps (27%)
19:04:38:WU02:FS01:0x21:Completed 1400000 out of 5000000 steps (28%)
19:05:53:WU02:FS01:0x21:Completed 1450000 out of 5000000 steps (29%)
19:07:09:WU02:FS01:0x21:Completed 1500000 out of 5000000 steps (30%)
19:08:25:WU02:FS01:0x21:Completed 1550000 out of 5000000 steps (31%)
19:09:41:WU02:FS01:0x21:Completed 1600000 out of 5000000 steps (32%)
19:10:57:WU02:FS01:0x21:Completed 1650000 out of 5000000 steps (33%)
19:12:12:WU02:FS01:0x21:Completed 1700000 out of 5000000 steps (34%)
19:13:28:WU02:FS01:0x21:Completed 1750000 out of 5000000 steps (35%)
19:14:44:WU02:FS01:0x21:Completed 1800000 out of 5000000 steps (36%)
19:16:00:WU02:FS01:0x21:Completed 1850000 out of 5000000 steps (37%)
19:17:16:WU02:FS01:0x21:Completed 1900000 out of 5000000 steps (38%)
19:18:32:WU02:FS01:0x21:Completed 1950000 out of 5000000 steps (39%)
19:19:47:WU02:FS01:0x21:Completed 2000000 out of 5000000 steps (40%)
19:21:04:WU02:FS01:0x21:Completed 2050000 out of 5000000 steps (41%)
19:22:19:WU02:FS01:0x21:Completed 2100000 out of 5000000 steps (42%)
19:23:36:WU02:FS01:0x21:Completed 2150000 out of 5000000 steps (43%)
19:24:51:WU02:FS01:0x21:Completed 2200000 out of 5000000 steps (44%)
19:26:07:WU02:FS01:0x21:Completed 2250000 out of 5000000 steps (45%)
19:27:23:WU02:FS01:0x21:Completed 2300000 out of 5000000 steps (46%)
19:28:38:WU02:FS01:0x21:Completed 2350000 out of 5000000 steps (47%)
19:29:55:WU02:FS01:0x21:Completed 2400000 out of 5000000 steps (48%)
19:31:10:WU02:FS01:0x21:Completed 2450000 out of 5000000 steps (49%)
19:32:26:WU02:FS01:0x21:Completed 2500000 out of 5000000 steps (50%)
19:33:42:WU02:FS01:0x21:Completed 2550000 out of 5000000 steps (51%)
19:34:57:WU02:FS01:0x21:Completed 2600000 out of 5000000 steps (52%)
19:36:14:WU02:FS01:0x21:Completed 2650000 out of 5000000 steps (53%)
19:37:29:WU02:FS01:0x21:Completed 2700000 out of 5000000 steps (54%)
19:38:45:WU02:FS01:0x21:Completed 2750000 out of 5000000 steps (55%)
19:40:01:WU02:FS01:0x21:Completed 2800000 out of 5000000 steps (56%)
19:41:17:WU02:FS01:0x21:Completed 2850000 out of 5000000 steps (57%)
19:42:33:WU02:FS01:0x21:Completed 2900000 out of 5000000 steps (58%)
19:43:49:WU02:FS01:0x21:Completed 2950000 out of 5000000 steps (59%)
19:45:04:WU02:FS01:0x21:Completed 3000000 out of 5000000 steps (60%)
19:46:21:WU02:FS01:0x21:Completed 3050000 out of 5000000 steps (61%)
19:47:36:WU02:FS01:0x21:Completed 3100000 out of 5000000 steps (62%)
19:48:53:WU02:FS01:0x21:Completed 3150000 out of 5000000 steps (63%)
19:50:08:WU02:FS01:0x21:Completed 3200000 out of 5000000 steps (64%)
19:51:24:WU02:FS01:0x21:Completed 3250000 out of 5000000 steps (65%)
19:52:41:WU02:FS01:0x21:Completed 3300000 out of 5000000 steps (66%)
19:53:56:WU02:FS01:0x21:Completed 3350000 out of 5000000 steps (67%)
19:55:13:WU02:FS01:0x21:Completed 3400000 out of 5000000 steps (68%)
19:56:28:WU02:FS01:0x21:Completed 3450000 out of 5000000 steps (69%)
19:57:44:WU02:FS01:0x21:Completed 3500000 out of 5000000 steps (70%)
19:59:00:WU02:FS01:0x21:Completed 3550000 out of 5000000 steps (71%)
20:00:15:WU02:FS01:0x21:Completed 3600000 out of 5000000 steps (72%)
20:01:33:WU02:FS01:0x21:Completed 3650000 out of 5000000 steps (73%)
20:02:48:WU02:FS01:0x21:Completed 3700000 out of 5000000 steps (74%)
20:04:04:WU02:FS01:0x21:Completed 3750000 out of 5000000 steps (75%)
20:05:21:WU02:FS01:0x21:Completed 3800000 out of 5000000 steps (76%)
20:06:36:WU02:FS01:0x21:Completed 3850000 out of 5000000 steps (77%)
20:07:53:WU02:FS01:0x21:Completed 3900000 out of 5000000 steps (78%)
20:09:08:WU02:FS01:0x21:Completed 3950000 out of 5000000 steps (79%)
20:10:24:WU02:FS01:0x21:Completed 4000000 out of 5000000 steps (80%)
20:11:40:WU02:FS01:0x21:Completed 4050000 out of 5000000 steps (81%)
20:12:56:WU02:FS01:0x21:Completed 4100000 out of 5000000 steps (82%)
20:14:12:WU02:FS01:0x21:Completed 4150000 out of 5000000 steps (83%)
20:15:27:WU02:FS01:0x21:Completed 4200000 out of 5000000 steps (84%)
20:16:43:WU02:FS01:0x21:Completed 4250000 out of 5000000 steps (85%)
20:17:59:WU02:FS01:0x21:Completed 4300000 out of 5000000 steps (86%)
20:19:15:WU02:FS01:0x21:Completed 4350000 out of 5000000 steps (87%)
20:20:31:WU02:FS01:0x21:Completed 4400000 out of 5000000 steps (88%)
20:21:46:WU02:FS01:0x21:Completed 4450000 out of 5000000 steps (89%)
20:23:02:WU02:FS01:0x21:Completed 4500000 out of 5000000 steps (90%)
20:24:18:WU02:FS01:0x21:Completed 4550000 out of 5000000 steps (91%)
20:25:34:WU02:FS01:0x21:Completed 4600000 out of 5000000 steps (92%)
20:26:50:WU02:FS01:0x21:Completed 4650000 out of 5000000 steps (93%)
20:28:05:WU02:FS01:0x21:Completed 4700000 out of 5000000 steps (94%)
20:29:21:WU02:FS01:0x21:Completed 4750000 out of 5000000 steps (95%)
20:30:37:WU02:FS01:0x21:Completed 4800000 out of 5000000 steps (96%)
20:31:53:WU02:FS01:0x21:Completed 4850000 out of 5000000 steps (97%)
20:33:09:WU02:FS01:0x21:Completed 4900000 out of 5000000 steps (98%)
20:34:25:WU02:FS01:0x21:Completed 4950000 out of 5000000 steps (99%)
20:35:40:WU02:FS01:0x21:Completed 5000000 out of 5000000 steps (100%)
20:35:41:WU02:FS01:0x21:Saving result file logfile_01.txt
20:35:41:WU02:FS01:0x21:Saving result file checkpointState.xml
20:35:42:WU02:FS01:0x21:Saving result file checkpt.crc
20:35:42:WU02:FS01:0x21:Saving result file log.txt
20:35:42:WU02:FS01:0x21:Saving result file positions.xtc
20:35:42:WU02:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
20:35:43:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
20:35:43:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:11430 run:1 clone:30 gen:50 core:0x21 unit:0x000000458ca304f1574a00752a3c4586
20:35:43:WU02:FS01:Uploading 6.04MiB to 140.163.4.241
20:35:43:WU02:FS01:Connecting to 140.163.4.241:8080
20:35:49:WU02:FS01:Upload 7.24%
20:35:55:WU02:FS01:Upload 16.54%
20:36:01:WU02:FS01:Upload 25.85%
20:36:07:WU02:FS01:Upload 34.12%
20:36:13:WU02:FS01:Upload 43.43%
20:36:19:WU02:FS01:Upload 52.74%
20:36:25:WU02:FS01:Upload 62.04%
20:36:31:WU02:FS01:Upload 70.31%
20:36:38:WU02:FS01:Upload 79.62%
20:36:44:WU02:FS01:Upload 88.93%
20:36:50:WU02:FS01:Upload 97.20%
20:36:56:WU02:FS01:Upload complete
20:36:56:WU02:FS01:Server responded WORK_ACK (400)
20:36:56:WU02:FS01:Final credit estimate, 66170.00 points
20:36:56:WU02:FS01:Cleaning up
Compare this with the logs posted for the troublesome Gen51 that show 50,000,000 steps instead of 5,000,000. The config generating these WUs needs looking at.
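To put rough numbers on that comparison, here is a sketch that uses only the timestamps and step counts from the two logs in this thread. The GPUs differ (a GTX 970 for the slow WU, a GTX 1080 for the normal one), so the two rates are not directly comparable; the function names are illustrative only.

```python
def seconds(hms):
    """Convert an HH:MM:SS log timestamp to seconds since midnight."""
    h, m, s = map(int, hms.split(":"))
    return h * 3600 + m * 60 + s


def summarize(start, end, steps_done, steps_total):
    """Return (steps per second, projected hours for the whole WU)."""
    rate = steps_done / (seconds(end) - seconds(start))
    return rate, steps_total / rate / 3600


# Slow WU (Run 3, Clone 42, Gen 53) on a GTX 970: 2,000,000 of 50,000,000 steps.
slow_rate, slow_hours = summarize("14:24:01", "15:36:09", 2_000_000, 50_000_000)
# Normal WU (Run 1, Clone 30, Gen 50) on a GTX 1080: all 5,000,000 steps.
normal_rate, normal_hours = summarize("18:29:11", "20:35:40", 5_000_000, 5_000_000)

print(f"slow:   {slow_rate:.0f} steps/s, about {slow_hours:.1f} h projected")
print(f"normal: {normal_rate:.0f} steps/s, about {normal_hours:.1f} h")
```

With these inputs the slow WU runs at roughly 460 steps/s and the normal one at roughly 660 steps/s, so the per-step speed is in the same ballpark and the much longer runtime comes from the 10x step count.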
Re: Project 11430 taking 13 hours for 24k credit.
Posted: Thu Oct 13, 2016 4:07 pm
by Joe_H
You are the first to post a log for a normal WU from Project 11430 to compare with the 2 logs actually posted. The original poster included a log at the beginning, and Duce H_K_ is the only other person who did. I will forward this information to the person running this project.
As for whether it is related to Gen numbers over 50, there appear to be a number of WU's in the database that show completion with normal PPD and have a Gen number over 50.