Project 11430 taking 13 hours for 24k credit.

If you think it might be a driver problem, see viewforum.php?f=79


Kevincav
Posts: 23
Joined: Wed Sep 14, 2016 3:12 am

Project 11430 taking 13 hours for 24k credit.

Post by Kevincav »

My GPU is currently running project 11430 with an estimated completion time of 13+ hours. Has anyone else run into this issue? On a side note, I'm getting the tmax message (seen in the logs below) even though I set -tmax and -twait manually. The GPU itself never gets over 78C. Thanks for the help.

Code:

21:34:15:WU00:FS01:0x21:*********************** Log Started 2016-10-06T21:34:14Z ***********************
21:34:15:WU00:FS01:0x21:Project: 11430 (Run 2, Clone 47, Gen 51)
21:34:15:WU00:FS01:0x21:Unit: 0x000000468ca304f1574a007e80ac26f4
21:34:15:WU00:FS01:0x21:CPU: 0x00000000000000000000000000000000
21:34:15:WU00:FS01:0x21:Machine: 1
21:34:15:WU00:FS01:0x21:Reading tar file core.xml
21:34:15:WU00:FS01:0x21:Reading tar file system.xml
21:34:15:WU00:FS01:0x21:Reading tar file integrator.xml
21:34:15:WU00:FS01:0x21:Reading tar file state.xml
21:34:15:WU00:FS01:0x21:Digital signatures verified
21:34:15:WU00:FS01:0x21:Folding@home GPU Core21 Folding@home Core
21:34:15:WU00:FS01:0x21:Version 0.0.17
21:34:24:WU00:FS01:0x21:Completed 0 out of 50000000 steps (0%)
21:34:24:WU00:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
21:43:51:WU00:FS01:0x21:Completed 500000 out of 50000000 steps (1%)
21:53:19:WU00:FS01:0x21:Completed 1000000 out of 50000000 steps (2%)
22:02:48:WU00:FS01:0x21:Completed 1500000 out of 50000000 steps (3%)
22:12:16:WU00:FS01:0x21:Completed 2000000 out of 50000000 steps (4%)
22:21:44:WU00:FS01:0x21:Completed 2500000 out of 50000000 steps (5%)
22:31:13:WU00:FS01:0x21:Completed 3000000 out of 50000000 steps (6%)
22:40:41:WU00:FS01:0x21:Completed 3500000 out of 50000000 steps (7%)
22:50:11:WU00:FS01:0x21:Completed 4000000 out of 50000000 steps (8%)
22:59:40:WU00:FS01:0x21:Completed 4500000 out of 50000000 steps (9%)
23:09:09:WU00:FS01:0x21:Completed 5000000 out of 50000000 steps (10%)
23:18:37:WU00:FS01:0x21:Completed 5500000 out of 50000000 steps (11%)
23:28:06:WU00:FS01:0x21:Completed 6000000 out of 50000000 steps (12%)
23:37:37:WU00:FS01:0x21:Completed 6500000 out of 50000000 steps (13%)
******************************* Date: 2016-10-06 *******************************
23:47:11:WU00:FS01:0x21:Completed 7000000 out of 50000000 steps (14%)
23:56:44:WU00:FS01:0x21:Completed 7500000 out of 50000000 steps (15%)
00:06:19:WU00:FS01:0x21:Completed 8000000 out of 50000000 steps (16%)

Code:

*********************** Log Started 2016-10-06T05:39:02Z ***********************
05:39:02:************************* Folding@home Client *************************
05:39:02:      Website: http://folding.stanford.edu/
05:39:02:    Copyright: (c) 2009-2014 Stanford University
05:39:02:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
05:39:02:         Args: 
05:39:02:       Config: C:/Users/Kevin/AppData/Roaming/FAHClient/config.xml
05:39:02:******************************** Build ********************************
05:39:02:      Version: 7.4.4
05:39:02:         Date: Mar 4 2014
05:39:02:         Time: 20:26:54
05:39:02:      SVN Rev: 4130
05:39:02:       Branch: fah/trunk/client
05:39:02:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
05:39:02:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
05:39:02:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
05:39:02:     Platform: win32 XP
05:39:02:         Bits: 32
05:39:02:         Mode: Release
05:39:02:******************************* System ********************************
05:39:02:          CPU: Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
05:39:02:       CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
05:39:02:         CPUs: 32
05:39:02:       Memory: 255.88GiB
05:39:02:  Free Memory: 246.92GiB
05:39:02:      Threads: WINDOWS_THREADS
05:39:02:   OS Version: 6.2
05:39:02:  Has Battery: false
05:39:02:   On Battery: false
05:39:02:   UTC Offset: -7
05:39:02:          PID: 14836
05:39:02:          CWD: C:/Users/Kevin/AppData/Roaming/FAHClient
05:39:02:           OS: Windows 10 Pro
05:39:02:      OS Arch: AMD64
05:39:02:         GPUs: 2
05:39:02:        GPU 0: NVIDIA:5 GP102 [GeForce Titan X]
05:39:02:        GPU 1: NVIDIA:5 GP102 [GeForce Titan X]
05:39:02:         CUDA: 6.1
05:39:02:  CUDA Driver: 8000
05:39:02:Win32 Service: false
05:39:02:***********************************************************************
05:39:02:<config>
05:39:02:  <!-- Folding Core -->
05:39:02:  <core-priority v='low'/>
05:39:02:
05:39:02:  <!-- Folding Slot Configuration -->
05:39:02:  <cause v='CANCER'/>
05:39:02:
05:39:02:  <!-- Network -->
05:39:02:  <proxy v=':8080'/>
05:39:02:
05:39:02:  <!-- Slot Control -->
05:39:02:  <power v='full'/>
05:39:02:
05:39:02:  <!-- User Information -->
05:39:02:  <passkey v='********************************'/>
05:39:02:  <team v='231300'/>
05:39:02:  <user v='Kevincav'/>
05:39:02:
05:39:02:  <!-- Folding Slots -->
05:39:02:  <slot id='0' type='CPU'>
05:39:02:    <client-type v='beta'/>
05:39:02:    <cpus v='32'/>
05:39:02:  </slot>
05:39:02:  <slot id='1' type='GPU'>
05:39:02:    <client-type v='advanced'/>
05:39:02:    <extra-core-args v='-tmax=90 -twait=900'/>
05:39:02:  </slot>
05:39:02:  <slot id='2' type='GPU'>
05:39:02:    <client-type v='advanced'/>
05:39:02:    <extra-core-args v='-tmax=90 -twait=900'/>
05:39:02:  </slot>
05:39:02:  <slot id='3' type='CPU'>
05:39:02:    <client-type v='beta'/>
05:39:02:    <cpus v='32'/>
05:39:02:  </slot>
05:39:02:  <slot id='4' type='CPU'>
05:39:02:    <client-type v='beta'/>
05:39:02:    <cpus v='10'/>
05:39:02:  </slot>
05:39:02:  <slot id='5' type='CPU'>
05:39:02:    <client-type v='beta'/>
05:39:02:    <cpus v='12'/>
05:39:02:  </slot>
05:39:02:</config>
05:39:02:Trying to access database...
05:39:02:Successfully acquired database lock
Last edited by Kevincav on Fri Oct 07, 2016 1:48 am, edited 1 time in total.
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Project 11430 taking 13 hours for 24k credit.

Post by JimboPalmer »

I wish you well.
Last edited by JimboPalmer on Fri Oct 07, 2016 3:30 am, edited 2 times in total.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
Kevincav
Posts: 23
Joined: Wed Sep 14, 2016 3:12 am

Re: Project 11430 taking 13 hours for 24k credit.

Post by Kevincav »

JimboPalmer wrote:Your log does not show what GPU you have, what OS you are running, how much memory your PC has, what CPU you have, whether you are also folding on the CPU, or what version of Folding@home you are running. Nor did you mention any of that yourself.

I wish you well.
Just added that, sorry. I actually tried to stop it and successfully got a new WU, only for it to go back to the 13-hour one.
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project 11430 taking 13 hours for 24k credit.

Post by ChristianVirtual »

Are you running all of those slots? Also the CPU slots? Or will they be paused?

If they all run, you are significantly over-allocating system resources.

Keep in mind that each GPU also requires at least one CPU thread permanently available. If you have 32 cores/hyperthreads, you should only allocate up to CPU:30.

Suggestion:
Remove CPU slots 3, 4 & 5.
Reduce slot 0 to CPU:30 (or even 24).
Then try the GPUs again and see whether performance increases (a config sketch follows below).
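Roughly, the Folding Slots section of your config.xml could then look like the sketch below; this is only an illustration based on the slots shown in your posted log, with the GPU slots and their extra-core-args unchanged and just the cpus value reduced (30 or 24, your pick):

Code:

  <!-- Folding Slots -->
  <slot id='0' type='CPU'>
    <client-type v='beta'/>
    <cpus v='30'/>   <!-- or 24; keeps threads free to feed the two GPU slots -->
  </slot>
  <slot id='1' type='GPU'>
    <client-type v='advanced'/>
    <extra-core-args v='-tmax=90 -twait=900'/>
  </slot>
  <slot id='2' type='GPU'>
    <client-type v='advanced'/>
    <extra-core-args v='-tmax=90 -twait=900'/>
  </slot>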
Please contribute your logs to http://ppd.fahmm.net
Kevincav
Posts: 23
Joined: Wed Sep 14, 2016 3:12 am

Re: Project 11430 taking 13 hours for 24k credit.

Post by Kevincav »

ChristianVirtual wrote:Are you running all of those slots? Also the CPU slots? Or will they be paused?

If they all run, you are significantly over-allocating system resources.

Keep in mind that each GPU also requires at least one CPU thread permanently available. If you have 32 cores/hyperthreads, you should only allocate up to CPU:30.

Suggestion:
Remove CPU slots 3, 4 & 5.
Reduce slot 0 to CPU:30 (or even 24).
Then try the GPUs again and see whether performance increases.
Yeah, I'm running all slots, but the math adds up: I leave 2 spots open for the GPUs. 32 + 32 + 12 + 10 = 86 of the 88 threads, leaving one open for each GPU.

TBD on removing slots 3, 4 & 5; I want them to finish their current WUs first.

Edit: I see why you think I'm over-allocating. The log is only showing 1 CPU and 32 threads. My system actually has two E5-2696 v4s with 44 threads each.
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project 11430 taking 13 hours for 24k credit.

Post by ChristianVirtual »

Looking at the log, you have "only":
05:39:02: CPUs: 32

Does non-server Windows support a higher core count? (Sorry, Mac/Linux guy here.)
Kevincav
Posts: 23
Joined: Wed Sep 14, 2016 3:12 am

Re: Project 11430 taking 13 hours for 24k credit.

Post by Kevincav »

Yes, it supports up to 256 cores. I'm going to switch back to Linux, but my password manager apparently doesn't work there, so I have to re-update Chrome's passwords before I do.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project 11430 taking 13 hours for 24k credit.

Post by bruce »

Kevincav wrote:TBD on the 3,4,5 slots being removed. I want it to finish current WUs.
That's an unnecessary precaution. When you delete a slot which is working on a WU, one of two things happens.
1) If you still have a slot which is similar enough to run the WU, the WU will be transferred to that slot and it'll be processed as soon as the current WU in that slot finishes.
2) If there are no such similar slots, the WU will be aborted.

I'd certainly set any slots that exceed your final planned slot count to FINISH so that they don't download new WUs.

I would delete at least one of the CPU slots. That will temporarily suspend the processing of one WU and enqueue it behind the WU in the first CPU slot. As an alternative (or in addition to that) you can pause enough slots or reduce the number of CPUs allocated so that the total does not exceed 32 (less two for the GPUs). When FAH is running one process per CPU, it is significantly more efficient than when they're over-allocated.

If you delete all but, say, two of the slots, almost everything will be enqueued on slot 0 except the current WU that is running in the second CPU slot.

(The only problem with manually pausing some slots is that you have to remember to unpause, then set Finish, when something else finishes.)

Also, I would avoid any change that moves a WU assigned to a slot with a small CPU count to a slot with a large CPU count.

Personally, I'd plan to end up with one slot for CPU:24 and another for CPU:6 -- assuming the total remains at 32.
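Just as an illustration of that end state, the Folding Slots section of config.xml might look roughly like this (the slot IDs here are arbitrary, and the two GPU slots would stay exactly as already configured):

Code:

  <!-- Folding Slots -->
  <slot id='0' type='CPU'>
    <cpus v='24'/>   <!-- main CPU slot -->
  </slot>
  <slot id='1' type='CPU'>
    <cpus v='6'/>    <!-- 24 + 6 CPU threads, plus one free thread per GPU = 32 -->
  </slot>
  <!-- the two existing GPU slots (with their -tmax/-twait extra-core-args) unchanged -->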
Joe_H
Site Admin
Posts: 7937
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Project 11430 taking 13 hours for 24k credit.

Post by Joe_H »

I have run into mentions of limits on how many cores will be recognized by a single process or app, even though the Windows OS version supports more. I used to be able to find documentation of these limits on MS's site, but have not been digging through their pages much recently.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project 11430 taking 13 hours for 24k credit.

Post by ChristianVirtual »

"Processor Groups" seems to be the reason :

https://social.technet.microsoft.com/Fo ... nservergen

https://msdn.microsoft.com/en-us/librar ... 98(v=vs.85).aspx

If the same limit of 64 logical processors per group is still valid on W10, that would explain it.


[talk to myself]
A follow-up question, though: each FAH core is its own process, isn't it? Why would that be an issue?
Possible answer: the FAH cores are child processes and share the processor group of their parent.
[/talk to myself]
PS3EdOlkkola
Posts: 177
Joined: Tue Aug 26, 2014 9:48 pm
Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500
Location: Dallas, TX

Re: Project 11430 taking 13 hours for 24k credit.

Post by PS3EdOlkkola »

There is an issue with project 11430. I've identified at least three work units that are problematic, meaning the time per frame is far longer than it should be. Each work unit was on a different machine, and the GPU clocks had not dropped to 2D performance levels. Diagnostics included pausing the slot, restarting FAH, and then rebooting each machine. There was no change in performance after the reboot on any of the three machines, and each of these work units continued to run extremely slowly (example of slow: a TPF of 10:20 on a Titan XP, resulting in 32,437 PPD).

They are:

11430 (R2, C13, G51)
11430 (R2, C36, G51)
11430 (R0, C41, G51)

I've dumped the first and third work units. The second one, 11430 (R2, C36, G51), is almost finished, and I'll let it upload to see if it can be diagnosed.
Hardware config viewtopic.php?f=66&t=17997&p=277235#p277235
Kevincav
Posts: 23
Joined: Wed Sep 14, 2016 3:12 am

Re: Project 11430 taking 13 hours for 24k credit.

Post by Kevincav »

That's interesting about the core limit. I guess I'm switching over to Linux for this. It would suck, but I can't imagine that a whole lot of W10 users are running dual 22-core Xeons in their machines :).

@PS3, thanks for the info and the help with it. On a side note, how are you earning so many points every day? What are you running?
PS3EdOlkkola
Posts: 177
Joined: Tue Aug 26, 2014 9:48 pm
Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500
Location: Dallas, TX

Re: Project 11430 taking 13 hours for 24k credit.

Post by PS3EdOlkkola »

I'm running 17 systems that host 42 GPU folding slots and 1 CPU folding slot. The GPUs consist of 7 Titan XP, 11 Titan X, 8 GTX 1080, 10 GTX 980 Ti, 3 GTX 980, and 3 Fury X. The majority of the CPUs are 4-core/8-thread Xeon E5 v3 series running 40 PCIe lanes. I'm in the process of consolidating the non-2011 systems to 2011-v3 or v4 motherboards, with the goal of settling on 3 GPUs per system. I've tested up to 6 GPUs per system, and it seems like 3 high-performance GPUs is the sweet spot between maximum PPD, manageability, and heat extraction.

All systems are in 4U Chenbro (Tesla model) steel server chassis that are heavily modified for improved airflow (I've improved my jigsaw-on-steel skill quite a bit :) ). The systems are split between two physical locations, each with dual Internet connections and routers that auto-failover to the backup connection. Each system is connected by Netgear managed gigabit switches. All the rigs are mounted in two 7' racks and two 4' racks. There are six 20-amp circuits, with each circuit powering 3 systems and drawing between 13 and 17 amps on average.

The single CPU folding slot is an Intel Xeon Phi 7210 with 64 cores/256 threads running CentOS 7 Linux, which I'm using for beta testing FAH cores; depending on the results, that might lead me to upgrade the 2011 v3/v4 CPUs to higher core/thread-count parts. All power supplies are Platinum or better rated, but power is still a significant operating cost; a little more in the summer (due to A/C load) and less in the winter, but Texas has relatively low power rates.

I don't generally run much of an overclock on the GPUs. As long as there isn't a problem with a work unit -- or a series of them -- the systems don't require very much maintenance at all except for weekly Windows updates (all GPU slots use Windows 7 x64 as the OS, all on SSDs). I went on a week-long business trip recently and all the systems ran flawlessly while I was away. Basic management tools are HFM, VNC and Afterburner. I monitor temperatures in the server room locations using Acu-Rite wireless thermostats. Glad to answer any specific questions if you want to email me at PS3EdOlkkola@gmail.com.
Kevincav
Posts: 23
Joined: Wed Sep 14, 2016 3:12 am

Re: Project 11430 taking 13 hours for 24k credit.

Post by Kevincav »

Wow, nice writeup on the gear; I'm sending an email right now.
PS3EdOlkkola
Posts: 177
Joined: Tue Aug 26, 2014 9:48 pm
Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500
Location: Dallas, TX

Re: Project 11430 taking 13 hours for 24k credit.

Post by PS3EdOlkkola »

Project 11430 continues to have issues. Same problem as mentioned 4 posts ago in this thread. Problematic work units are:

11430 (R3, C40, G51) PPD 24657 TPF 12:24
11430 (R0, C16, G51) PPD 19348 TPF 14:35

Both work units were on different systems, each equipped with a GTX 980 Ti. Normal PPD for each 980 Ti is in the range of 575K to 700K. The work units were dumped.