Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 4:37 pm
by ChasingTheDream
I can see the Core 16's are being sent out again. I could tell immediately because 4 of my 8 machines were down when I got home. They simply won't run Core 16's, and reducing the GPU clock speeds by 40% had no effect. I've seen these issues before and actually had a whole thread about it, but that was for Core 17 at the time, and the only thing that fixed it was an AMD driver update.
So I'm experiencing all the same things talked about here but for Core 16 this time.
viewtopic.php?f=61&t=26421&p=265530&hilit=chasingthedream#p265530
If I restart 4 machines in sequence, by the time I've got the 4th machine restarted the 1st machine has failed again. It is pointless for me to try to run Core 16's on any multi-GPU system. They will never finish. They will run on my single GPU systems, and that is the exact same situation described in the thread above. So I'm in a bad situation again that appears to have no solution other than shutting down the multi-GPU systems until all the Core 16's are processed, or automating the Core 16 WU dumping on my multi-GPU systems because they won't finish anyway.
I don't understand why the client doesn't allow us to filter cores that will obviously not run on our hardware. If PPD is an issue, then make the Core 16's provide the same points as the Core 17's so nobody can complain about it. The work needs to get done regardless of points.
In any event, I'm trying to figure out how long I'll have to jump through hoops to get through the Core 16's. Any idea how many Core 16's are left to process? The answer will determine what steps I take next.
Also what is my WU completion percentage since I may end up dumping quite a few soon? My folding name is ChasingTheDream.
Also to be clear, I don't care in the least about the number of points the Core 16's produce. What is getting under my skin is the fact that I can't keep my machines running for more than a few minutes while Core 16's are processing.
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 4:51 pm
by bruce
Please post your system configuration. We need to know which version of the client you are running and which type of GPU you have.
Also, please post a segment of the log showing the IP addresses of the servers involved during the assignment process.
Stanford is aware that Donors do not like Core_16 assignments and some have trouble with specific versions of OpenCL. Nevertheless, the priority of Core_16 should be set low enough that you will only be assigned it if your system is unable to get an assignment of Core_17 projects.
We need to figure out why you're not getting Core_17 and that isn't a simple process when you don't provide enough information. (See the link below.)
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 5:22 pm
by 7im
ChasingTheDream wrote:...snip...
They will run on my single GPU systems and that is the exact same situation described in the thread above.
I don't understand why the client doesn't allow us to filter cores that will obviously not run on our hardware.
Also what is my WU completion percentage since I may end up dumping quite a few soon?
They will run on single GPU systems? Sounds like FAH works just fine to me. If not running on multi-GPU systems, that's a driver issue, not a FAH issue. And FAH can't afford to design around every little driver glitch. That's a never ending battle, and we'd never see a new fah client.
The total number of core_16 work units is not known. If it was, they would have stated that when the EOL was announced.
They are not going to give you a completion percentage in order to allow you to drop work units.
Are you running the latest AMD drivers? At least 14.4 or newer?
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 5:31 pm
by ChasingTheDream
bruce wrote:Please post your system configuration. We need to know which version of the client you are running and which type of GPU you have.
Also, please post a segment of the log showing the IP addresses of the servers involved during the assignment process.
Stanford is aware that Donors do not like Core_16 assignments and some have trouble with specific versions of OpenCL. Nevertheless, the priority of Core_16 should be set low enough that you will only be assigned it if your system is unable to get an assignment of Core_17 projects.
We need to figure out why you're not getting Core_17 and that isn't a simple process when you don't provide enough information. (See the link below.)
I'm on the 7.4.4 client. I've dumped all the Core 16 WU's from the multi-GPU systems so they are running along just fine now. When I get another Core 16 assignment the machine will either lock up or BSOD within a few minutes. The temps on the GPU's are no higher than with the Core 17's. I've watched them. In any event, I'll try to get some logs from the next crash when I get one and post them. We've seen this pattern before, though, in the thread I mentioned previously. Literally none of the things we tried made any difference. A driver update allowed the Core 17's to be stable. I suspect this is the same situation because the behavior is identical.
In any event, I'll wait for the next Core 16 WU and send you the crash logs.
I missed the last part of your message. I do get Core 17 WU's most of the time. In fact, when I dropped the Core 16 WU's they were replaced with Core 17 WU's. It only takes one Core 16 WU to drop my multi-GPU systems like a hot potato though!
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 5:35 pm
by ChasingTheDream
7im wrote:ChasingTheDream wrote:...snip...
They will run on my single GPU systems and that is the exact same situation described in the thread above.
I don't understand why the client doesn't allow us to filter cores that will obviously not run on our hardware.
Also what is my WU completion percentage since I may end up dumping quite a few soon?
They will run on single GPU systems? Sounds like FAH works just fine to me. If not running on multi-GPU systems, that's a driver issue, not a FAH issue. And FAH can't afford to design around every little driver glitch. That's a never ending battle, and we'd never see a new fah client.
The total number of core_16 work units is not known. If it was, they would have stated that when the EOL was announced.
They are not going to give you a completion percentage in order to allow you to drop work units.
Are you running the latest AMD drivers? At least 14.4 or newer?
I completely agree. Prior to AMD 14.7RC3 (which I'm on now) I couldn't run Core 17 on multi-GPU systems. It was a driver issue all along. The thread I posted previously shows quite a few things to try to no avail. I suspect this is the same issue. I can't drop back a driver level or the Core 17's won't run.
This is why I said I've got two options. Shut the multi-GPU systems down or drop the Core 16 WU's. It would be nice if the client allowed us to exclude a core so when we do run into driver issues such as this we don't have to go to great lengths to get around it.
From what I saw before they used to tell you your completion percentage when you asked. I read that while searching the threads before. If they don't then I guess they don't, but I have few options and as of right now the Core 16 WU's are getting dropped after they crash the system three times, which essentially means 90% of the core 16 WU's that hit a multi-GPU system are going to get dropped. I've done over 8500 WU's so I suspect I have some room to spare.
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 7:26 pm
by ChasingTheDream
Bruce, this is from the logs after a crash. I don't know if you are going to need additional logs from a directory. I thought I had a Core 16 make it without incident, but it looks like it failed on the send anyway. There is another Core 16 running on this machine, which is a multi-GPU machine. Unfortunately, I got another Core 16 WU to replace the WU that was sent, so I'm guessing the odds of either of the Core 16's actually completing are quite small.
Code:
*********************** Log Started 2014-09-30T19:18:42Z ***********************
19:18:42:************************* Folding@home Client *************************
19:18:42: Website: http://folding.stanford.edu/
19:18:42: Copyright: (c) 2009-2014 Stanford University
19:18:42: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:18:42: Args:
19:18:42: Config: C:/Users/Folder4/AppData/Roaming/FAHClient/config.xml
19:18:42:******************************** Build ********************************
19:18:42: Version: 7.4.4
19:18:42: Date: Mar 4 2014
19:18:42: Time: 20:26:54
19:18:42: SVN Rev: 4130
19:18:42: Branch: fah/trunk/client
19:18:42: Compiler: Intel(R) C++ MSVC 1500 mode 1200
19:18:42: Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
19:18:42: /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
19:18:42: Platform: win32 XP
19:18:42: Bits: 32
19:18:42: Mode: Release
19:18:42:******************************* System ********************************
19:18:42: CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
19:18:42: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
19:18:42: CPUs: 8
19:18:42: Memory: 15.95GiB
19:18:42: Free Memory: 14.75GiB
19:18:42: Threads: WINDOWS_THREADS
19:18:42: OS Version: 6.1
19:18:42: Has Battery: false
19:18:42: On Battery: false
19:18:42: UTC Offset: -5
19:18:42: PID: 2744
19:18:42: CWD: C:/Users/Folder4/AppData/Roaming/FAHClient
19:18:42: OS: Windows 7 Home Premium
19:18:42: OS Arch: AMD64
19:18:42: GPUs: 3
19:18:42: GPU 0: ATI:5 Hawaii [Radeon R9 200X Series]
19:18:42: GPU 1: ATI:5 Hawaii [Radeon R9 200X Series]
19:18:42: GPU 2: ATI:5 Hawaii [Radeon R9 200X Series]
19:18:42: CUDA: Not detected
19:18:42:Win32 Service: false
19:18:42:***********************************************************************
19:18:42:<config>
19:18:42: <!-- Network -->
19:18:42: <proxy v=':8080'/>
19:18:42:
19:18:42: <!-- Slot Control -->
19:18:42: <power v='full'/>
19:18:42:
19:18:42: <!-- User Information -->
19:18:42: <passkey v='********************************'/>
19:18:42: <team v='224497'/>
19:18:42: <user v='ChasingTheDream'/>
19:18:42:
19:18:42: <!-- Folding Slots -->
19:18:42: <slot id='0' type='CPU'>
19:18:42: <cpus v='4'/>
19:18:42: </slot>
19:18:42: <slot id='1' type='GPU'/>
19:18:42: <slot id='2' type='GPU'/>
19:18:42: <slot id='3' type='GPU'/>
19:18:42:</config>
19:18:42:Trying to access database...
19:18:42:Successfully acquired database lock
19:18:42:Enabled folding slot 00: READY cpu:4
19:18:42:Enabled folding slot 01: READY gpu:0:Hawaii [Radeon R9 200X Series]
19:18:42:Enabled folding slot 02: READY gpu:1:Hawaii [Radeon R9 200X Series]
19:18:42:Enabled folding slot 03: READY gpu:2:Hawaii [Radeon R9 200X Series]
19:18:42:WU00:FS01:Starting
19:18:42:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_16.fah/FahCore_16.exe -dir 00 -suffix 01 -version 704 -lifeline 2744 -checkpoint 15 -gpu 0 -gpu-vendor ati
19:18:42:WU00:FS01:Started FahCore on PID 3484
19:18:42:WU00:FS01:Core PID:3496
19:18:42:WU00:FS01:FahCore 0x16 started
19:18:42:WU01:FS02:Starting
19:18:42:WU01:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 01 -suffix 01 -version 704 -lifeline 2744 -checkpoint 15 -gpu 1 -gpu-vendor ati
19:18:42:WU01:FS02:Started FahCore on PID 3516
19:18:42:WU01:FS02:Core PID:3532
19:18:42:WU01:FS02:FahCore 0x17 started
19:18:42:WU04:FS00:Starting
19:18:42:WU04:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 04 -suffix 01 -version 704 -lifeline 2744 -checkpoint 15 -np 4
19:18:42:WU04:FS00:Started FahCore on PID 3544
19:18:42:WU04:FS00:Core PID:3564
19:18:42:WU04:FS00:FahCore 0xa3 started
19:18:42:WU03:FS03:Starting
19:18:42:WU03:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 03 -suffix 01 -version 704 -lifeline 2744 -checkpoint 15 -gpu 2 -gpu-vendor ati
19:18:42:WU03:FS03:Started FahCore on PID 3580
19:18:42:WU03:FS03:Core PID:3596
19:18:42:WU03:FS03:FahCore 0x17 started
19:18:43:WU00:FS01:0x16:
19:18:43:WU00:FS01:0x16:*------------------------------*
19:18:43:WU00:FS01:0x16:Folding@Home GPU Core
19:18:43:WU00:FS01:0x16:Version 2.11 (Thu Dec 9 15:00:14 PST 2010)
19:18:43:WU00:FS01:0x16:
19:18:43:WU00:FS01:0x16:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
19:18:43:WU00:FS01:0x16:Build host: user-f6d030f24f
19:18:43:WU00:FS01:0x16:Board Type: AMD/OpenCL
19:18:43:WU00:FS01:0x16:Core : x=16
19:18:43:WU00:FS01:0x16: Window's signal control handler registered.
19:18:43:WU00:FS01:0x16:Preparing to commence simulation
19:18:43:WU00:FS01:0x16:- Ensuring status. Please wait.
19:18:43:WU01:FS02:0x17:*********************** Log Started 2014-09-30T19:18:42Z ***********************
19:18:43:WU01:FS02:0x17:Project: 13000 (Run 1007, Clone 0, Gen 80)
19:18:43:WU01:FS02:0x17:Unit: 0x00000084538b3db75310b8069f489526
19:18:43:WU01:FS02:0x17:CPU: 0x00000000000000000000000000000000
19:18:43:WU01:FS02:0x17:Machine: 2
19:18:43:WU01:FS02:0x17:Digital signatures verified
19:18:43:WU01:FS02:0x17:Folding@home GPU core17
19:18:43:WU01:FS02:0x17:Version 0.0.52
19:18:43:WU04:FS00:0xa3:
19:18:43:WU04:FS00:0xa3:*------------------------------*
19:18:43:WU04:FS00:0xa3:Folding@Home Gromacs SMP Core
19:18:43:WU04:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
19:18:43:WU04:FS00:0xa3:
19:18:43:WU04:FS00:0xa3:Preparing to commence simulation
19:18:43:WU04:FS00:0xa3:- Ensuring status. Please wait.
19:18:43:WARNING:WU03:FS03:FahCore returned: FAILED_1 (0 = 0x0)
19:18:43:WU03:FS03:Starting
19:18:43:WU03:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 03 -suffix 01 -version 704 -lifeline 2744 -checkpoint 15 -gpu 2 -gpu-vendor ati
19:18:43:WU03:FS03:Started FahCore on PID 3844
19:18:43:WU03:FS03:Core PID:3856
19:18:43:WU03:FS03:FahCore 0x17 started
19:18:43:WU01:FS02:0x17: Found a checkpoint file
19:18:43:WARNING:WU03:FS03:FahCore returned: FAILED_1 (0 = 0x0)
19:18:52:WU00:FS01:0x16:- Looking at optimizations...
19:18:52:WU00:FS01:0x16:- Working with standard loops on this execution.
19:18:52:WU00:FS01:0x16:- Previous termination of core was improper.
19:18:52:WU00:FS01:0x16:- Going to use standard loops.
19:18:52:WU00:FS01:0x16:- Files status OK
19:18:52:WU00:FS01:0x16:sizeof(CORE_PACKET_HDR) = 512 file=<>
19:18:52:WU00:FS01:0x16:- Expanded 44912 -> 171163 (decompressed 381.1 percent)
19:18:52:WU00:FS01:0x16:Called DecompressByteArray: compressed_data_size=44912 data_size=171163, decompressed_data_size=171163 diff=0
19:18:52:WU00:FS01:0x16:- Digital signature verified
19:18:52:WU00:FS01:0x16:
19:18:52:WU00:FS01:0x16:Project: 11293 (Run 4, Clone 64, Gen 49)
19:18:52:WU00:FS01:0x16:
19:18:52:WU00:FS01:0x16:Entering M.D.
19:18:52:WU04:FS00:0xa3:- Looking at optimizations...
19:18:52:WU04:FS00:0xa3:- Working with standard loops on this execution.
19:18:52:WU04:FS00:0xa3:- Previous termination of core was improper.
19:18:52:WU04:FS00:0xa3:- Going to use standard loops.
19:18:52:WU04:FS00:0xa3:- Files status OK
19:18:52:WU04:FS00:0xa3:- Expanded 3793392 -> 4166140 (decompressed 109.8 percent)
19:18:52:WU04:FS00:0xa3:Called DecompressByteArray: compressed_data_size=3793392 data_size=4166140, decompressed_data_size=4166140 diff=0
19:18:52:WU04:FS00:0xa3:- Digital signature verified
19:18:52:WU04:FS00:0xa3:
19:18:52:WU04:FS00:0xa3:Project: 8558 (Run 1, Clone 9, Gen 374)
19:18:52:WU04:FS00:0xa3:
19:18:52:WU04:FS00:0xa3:Entering M.D.
19:18:54:WU00:FS01:0x16:Will resume from checkpoint file 00/wudata_01.ckp
19:18:54:WU00:FS01:0x16:Tpr hash 00/wudata_01.tpr: 1781123853 3205537409 1418442691 832764643 2934550790
19:18:54:WU00:FS01:0x16:Working on ALZHEIMER DISEASE AMYLOID
19:18:54:WU00:FS01:0x16:Client config unavailable.
19:18:54:WU00:FS01:0x16:Starting GUI Server
19:18:56:WU00:FS01:0x16:Resuming from checkpoint
19:18:56:WU00:FS01:0x16:fcCheckPointResume: retreived and current tpr file hash:
19:18:56:WU00:FS01:0x16: 0 1781123853 1781123853
19:18:56:WU00:FS01:0x16: 1 3205537409 3205537409
19:18:56:WU00:FS01:0x16: 2 1418442691 1418442691
19:18:56:WU00:FS01:0x16: 3 832764643 832764643
19:18:56:WU00:FS01:0x16: 4 2934550790 2934550790
19:18:56:WU00:FS01:0x16:fcCheckPointResume: file hashes same.
19:18:56:WU00:FS01:0x16:fcCheckPointResume: state restored.
19:18:56:WU00:FS01:0x16:fcCheckPointResume: name 00/wudata_01.log Verified 00/wudata_01.log
19:18:56:WU00:FS01:0x16:fcCheckPointResume: name 00/wudata_01.trr Verified 00/wudata_01.trr
19:18:56:WU00:FS01:0x16:fcCheckPointResume: name 00/wudata_01.xtc Verified 00/wudata_01.xtc
19:18:56:WU00:FS01:0x16:fcCheckPointResume: name 00/wudata_01.edr Verified 00/wudata_01.edr
19:18:56:WU00:FS01:0x16:fcCheckPointResume: state restored 2
19:18:56:WU00:FS01:0x16:Resumed from checkpoint
19:18:56:WU00:FS01:0x16:Setting checkpoint frequency: 500000
19:18:56:WU00:FS01:0x16:Completed 22000001 out of 50000000 steps (44%).
19:18:58:WU04:FS00:0xa3:Using Gromacs checkpoints
19:18:58:WU04:FS00:0xa3:Mapping NT from 4 to 4
19:18:59:WU04:FS00:0xa3:Resuming from checkpoint
19:18:59:WU04:FS00:0xa3:Verified 04/wudata_01.log
19:18:59:WU04:FS00:0xa3:Verified 04/wudata_01.trr
19:18:59:WU04:FS00:0xa3:Verified 04/wudata_01.edr
19:18:59:WU04:FS00:0xa3:Completed 328385 out of 500000 steps (65%)
19:19:43:WU03:FS03:Starting
19:19:43:WU03:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 03 -suffix 01 -version 704 -lifeline 2744 -checkpoint 15 -gpu 2 -gpu-vendor ati
19:19:43:WU03:FS03:Started FahCore on PID 4528
19:19:43:WU03:FS03:Core PID:4524
19:19:43:WU03:FS03:FahCore 0x17 started
19:19:44:WARNING:WU03:FS03:FahCore returned: FAILED_1 (0 = 0x0)
19:20:43:WU03:FS03:Starting
19:20:43:WU03:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 03 -suffix 01 -version 704 -lifeline 2744 -checkpoint 15 -gpu 2 -gpu-vendor ati
19:20:43:WU03:FS03:Started FahCore on PID 4324
19:20:43:WU03:FS03:Core PID:3432
19:20:43:WU03:FS03:FahCore 0x17 started
19:20:44:WARNING:WU03:FS03:FahCore returned: FAILED_1 (0 = 0x0)
19:21:43:WU03:FS03:Starting
19:21:43:WU03:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 03 -suffix 01 -version 704 -lifeline 2744 -checkpoint 15 -gpu 2 -gpu-vendor ati
19:21:43:WU03:FS03:Started FahCore on PID 5228
19:21:43:WU03:FS03:Core PID:5240
19:21:43:WU03:FS03:FahCore 0x17 started
19:21:43:WARNING:WU03:FS03:FahCore returned: FAILED_1 (0 = 0x0)
19:21:43:WARNING:WU03:FS03:Too many errors, failing
19:21:43:WU03:FS03:Sending unit results: id:03 state:SEND error:FAILED project:9201 run:359 clone:0 gen:173 core:0x17 unit:0x000000f26652edc45399e428148dca68
19:21:43:WU03:FS03:Connecting to 171.67.108.52:8080
19:21:44:WU03:FS03:Server responded WORK_ACK (400)
19:21:44:WU03:FS03:Cleaning up
19:21:44:WU02:FS03:Connecting to 171.67.108.201:80
19:21:45:WU01:FS02:0x17:Completed 1750000 out of 5000000 steps (35%)
19:21:45:WU01:FS02:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 7:29 pm
by 7im
Yes, they will tell you a percentage if you ask for a legitimate reason. Asking for that info to see how many WUs you can safely drop is not, IMO, a legit reason.
Do you really need to shut down the multi-gpu systems? Just remove one of the two GPU slots in FAH for a while until the core_16s run out, or AMD fixes the driver.
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 7:44 pm
by 7im
I see one GPU made it almost half way through a work unit.
19:18:56:WU00:FS01:0x16:Completed 22000001 out of 50000000 steps (44%)
I would expect it to fail a lot sooner if this was a driver issue.
Just to confirm a bad WU or not, maybe a mod can look up the failure/success of this WU?
19:18:52:WU00:FS01:0x16:Project: 11293 (Run 4, Clone 64, Gen 49)
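For anyone who wants to pull the same progress figure out of a longer log, here is a minimal Python sketch. The pattern is inferred only from the "Completed ... out of ... steps" lines in the log posted above, so other cores or client versions may word them differently, and the name `latest_progress` is made up for illustration.

```python
import re

# Line shape assumed from the log excerpt in this thread, e.g.
# "19:18:56:WU00:FS01:0x16:Completed 22000001 out of 50000000 steps (44%)."
PROGRESS = re.compile(
    r"(WU\d+):(FS\d+):(0x[0-9a-f]+):Completed (\d+) out of (\d+) steps"
)


def latest_progress(lines):
    """Return the last reported completed fraction per (WU, slot, core)."""
    progress = {}
    for line in lines:
        m = PROGRESS.search(line)
        if m:
            wu, slot, core, done, total = m.groups()
            # Later lines overwrite earlier ones, so the newest value wins.
            progress[(wu, slot, core)] = int(done) / int(total)
    return progress


sample = [
    "19:18:56:WU00:FS01:0x16:Completed 22000001 out of 50000000 steps (44%).",
    "19:18:59:WU04:FS00:0xa3:Completed 328385 out of 500000 steps (65%)",
]
for key, fraction in latest_progress(sample).items():
    print(key, f"{fraction:.1%}")
```

Fed a whole log file (for example `open("log.txt")`, a hypothetical path), it keeps only the most recent value for each work unit.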
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 8:23 pm
by davidcoton
It appears that the last failure in the log was Core17:
19:21:43:WU03:FS03:Sending unit results: id:03 state:SEND error:FAILED project:9201 run:359 clone:0 gen:173 core:0x17 unit:0x000000f26652edc45399e428148dca68
Another case where the indices of multiple GPUs are not correctly resolved in the FAH software? That may not matter when three identical cards run the same core, but might well cause errors with mixed cores if the cards are mixed up. There's a thread somewhere about how to sort it out manually -- has that been tried?
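To check which core the failures actually belong to, the warnings can be tallied per work unit. A minimal Python sketch follows; the regular expressions are based only on the line shapes in the log posted above (not on any official log format), and the function name `tally_failures` is hypothetical.

```python
import re
from collections import Counter

# Patterns inferred from the 7.4.4 log excerpt in this thread (an assumption).
STARTED = re.compile(r"(WU\d+):(FS\d+):FahCore (0x[0-9a-f]+) started")
FAILED = re.compile(r"WARNING:(WU\d+):(FS\d+):FahCore returned: (\w+)")


def tally_failures(lines):
    """Count 'FahCore returned' warnings per (work unit, slot, core)."""
    core_for = {}         # (WU, FS) -> core most recently started there
    failures = Counter()  # (WU, FS, core) -> number of warning returns
    for line in lines:
        m = STARTED.search(line)
        if m:
            core_for[(m.group(1), m.group(2))] = m.group(3)
            continue
        m = FAILED.search(line)
        if m:
            key = (m.group(1), m.group(2))
            # Fall back to "unknown" if no start line was seen for this WU.
            failures[key + (core_for.get(key, "unknown"),)] += 1
    return failures


sample = [
    "19:18:42:WU00:FS01:FahCore 0x16 started",
    "19:18:42:WU03:FS03:FahCore 0x17 started",
    "19:18:43:WARNING:WU03:FS03:FahCore returned: FAILED_1 (0 = 0x0)",
    "19:19:43:WU03:FS03:FahCore 0x17 started",
    "19:19:44:WARNING:WU03:FS03:FahCore returned: FAILED_1 (0 = 0x0)",
]
for (wu, slot, core), count in tally_failures(sample).items():
    print(wu, slot, core, count)
```

On these sample lines, both warnings are attributed to WU03 on slot FS03 running core 0x17, and the 0x16 unit on FS01 shows none.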
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 8:31 pm
by bollix47
viewtopic.php?p=199379#p199379
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 9:11 pm
by Joe_H
7im wrote:Just to confirm a bad WU or not, maybe a mod can look up the failure/success of this WU?
19:18:52:WU00:FS01:0x16:Project: 11293 (Run 4, Clone 64, Gen 49)
There are successful returns of this WU in the database, so far no failures listed by the database.
Re: Any idea how many Core 16's are left?
Posted: Tue Sep 30, 2014 11:56 pm
by nivedita
I have dual 295x2's and my computer runs core x16 units successfully (driver 14.7RC3). However PPD is insanely low -- the GPUs working on those units essentially generate no points (<5k PPD vs 220-260k PPD on core 0x17 units). TPF is about 5m45s.
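To put that TPF in perspective, here is a rough Python sketch of the arithmetic. It assumes the usual convention that one frame is 1% of a work unit (100 frames per WU), which is not stated in this thread, and the helper name `wu_duration` is hypothetical.

```python
from datetime import timedelta


def wu_duration(tpf, frames=100):
    """Estimate wall-clock time for a whole work unit from time per frame.

    frames=100 is an assumption (one frame = 1% of the WU).
    """
    return tpf * frames


tpf = timedelta(minutes=5, seconds=45)  # TPF quoted in the post above
print(wu_duration(tpf))                 # estimated time for one full WU
```

Under that assumption, a 5m45s TPF works out to roughly 9 hours 35 minutes per Core 16 work unit on one GPU.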
Re: Any idea how many Core 16's are left?
Posted: Wed Oct 01, 2014 12:25 am
by ChasingTheDream
7im wrote:Yes, they will tell you a percentage if you ask for a legitimate reason. Asking for that info to see how many WUs you can safely drop is not, IMO, a legit reason.
Do you really need to shut down the multi-gpu systems? Just remove one of the two GPU slots in FAH for a while until the core_16s run out, or AMD fixes the driver.
I find not being able to complete core 16 WU's 90% of the time on multi-GPU systems to be a legitimate reason. The work units will never complete. I can't keep the machines running long enough to complete them unless I stay available to reset them X times for each WU which I find to be just slightly unreasonable.
No, you are right. I don't have to shut down the multi-GPU systems completely. I could remove slots or the GPU's themselves, which would mean I would have to take 7 of my 15 GPU's offline to accommodate WU's that reached their EOL a year ago, which to me doesn't make sense. I think it is fair to say I would get a lot more done by leaving my GPU's online and simply removing the uncooperative WU's. I would rather have an option in the client that simply allows me to not take Core 16 WU's on the machines I know won't run them. I don't see how this request is unreasonable, especially since Stanford knows there are issues with these WU's on some hardware.
I went through this thread, but all my GPU's are the exact same type. They are all R9 290X TRI-X cards. In GPU-Z all the cards appear to be exactly the same, even the device ID. It would also mean that all my multi-GPU machines would have to have the indexing messed up.
And finally, some more logs. The machine whose logs were shown above started locking up constantly after I got that log. In fact, in a couple of cases I couldn't get the logs before the system locked up again.
In the final log, it appears the client dumped most of the WU's and started over. The machine in question is connected to the LAN wirelessly, so there is a slight delay before the connection is established after a crash. Another part that makes it difficult is that I have to physically reset the computer and then unplug and replug the USB wireless adapter. So each lock-up or crash requires a physical intervention.
Code:
*********************** Log Started 2014-09-30T21:10:19Z ***********************
21:10:19:************************* Folding@home Client *************************
21:10:19: Website: http://folding.stanford.edu/
21:10:19: Copyright: (c) 2009-2014 Stanford University
21:10:19: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:10:19: Args:
21:10:19: Config: C:/Users/Folder4/AppData/Roaming/FAHClient/config.xml
21:10:19:******************************** Build ********************************
21:10:19: Version: 7.4.4
21:10:19: Date: Mar 4 2014
21:10:19: Time: 20:26:54
21:10:19: SVN Rev: 4130
21:10:19: Branch: fah/trunk/client
21:10:19: Compiler: Intel(R) C++ MSVC 1500 mode 1200
21:10:19: Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
21:10:19: /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
21:10:19: Platform: win32 XP
21:10:19: Bits: 32
21:10:19: Mode: Release
21:10:19:******************************* System ********************************
21:10:19: CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
21:10:19: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
21:10:19: CPUs: 8
21:10:19: Memory: 15.95GiB
21:10:19: Free Memory: 14.76GiB
21:10:19: Threads: WINDOWS_THREADS
21:10:19: OS Version: 6.1
21:10:19: Has Battery: false
21:10:19: On Battery: false
21:10:19: UTC Offset: -5
21:10:19: PID: 2700
21:10:19: CWD: C:/Users/Folder4/AppData/Roaming/FAHClient
21:10:19: OS: Windows 7 Home Premium
21:10:19: OS Arch: AMD64
21:10:19: GPUs: 3
21:10:19: GPU 0: ATI:5 Hawaii [Radeon R9 200X Series]
21:10:19: GPU 1: ATI:5 Hawaii [Radeon R9 200X Series]
21:10:19: GPU 2: ATI:5 Hawaii [Radeon R9 200X Series]
21:10:19: CUDA: Not detected
21:10:19:Win32 Service: false
21:10:19:***********************************************************************
21:10:19:<config>
21:10:19: <!-- Network -->
21:10:19: <proxy v=':8080'/>
21:10:19:
21:10:19: <!-- Slot Control -->
21:10:19: <power v='full'/>
21:10:19:
21:10:19: <!-- User Information -->
21:10:19: <passkey v='********************************'/>
21:10:19: <team v='224497'/>
21:10:19: <user v='ChasingTheDream'/>
21:10:19:
21:10:19: <!-- Folding Slots -->
21:10:19: <slot id='0' type='CPU'>
21:10:19: <cpus v='4'/>
21:10:19: </slot>
21:10:19: <slot id='1' type='GPU'/>
21:10:19: <slot id='2' type='GPU'/>
21:10:19: <slot id='3' type='GPU'/>
21:10:19:</config>
21:10:19:Trying to access database...
21:10:19:Successfully acquired database lock
21:10:19:Enabled folding slot 00: READY cpu:4
21:10:19:Enabled folding slot 01: READY gpu:0:Hawaii [Radeon R9 200X Series]
21:10:19:Enabled folding slot 02: READY gpu:1:Hawaii [Radeon R9 200X Series]
21:10:19:Enabled folding slot 03: READY gpu:2:Hawaii [Radeon R9 200X Series]
21:10:19:WU01:FS02:Starting
21:10:19:WU01:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 01 -suffix 01 -version 704 -lifeline 2700 -checkpoint 15 -gpu 1 -gpu-vendor ati
21:10:19:WU01:FS02:Started FahCore on PID 3432
21:10:19:WU01:FS02:Core PID:3456
21:10:19:WU01:FS02:FahCore 0x17 started
21:10:19:WU04:FS00:Starting
21:10:19:WU04:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 04 -suffix 01 -version 704 -lifeline 2700 -checkpoint 15 -np 4
21:10:19:WU04:FS00:Started FahCore on PID 3464
21:10:19:WU04:FS00:Core PID:3480
21:10:19:WU04:FS00:FahCore 0xa3 started
21:10:19:WU02:FS03:Starting
21:10:19:WU02:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_16.fah/FahCore_16.exe -dir 02 -suffix 01 -version 704 -lifeline 2700 -checkpoint 15 -gpu 2 -gpu-vendor ati
21:10:19:WU02:FS03:Started FahCore on PID 3488
21:10:19:WU02:FS03:Core PID:3504
21:10:19:WU02:FS03:FahCore 0x16 started
21:10:19:WU00:FS01:Starting
21:10:19:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_16.fah/FahCore_16.exe -dir 00 -suffix 01 -version 704 -lifeline 2700 -checkpoint 15 -gpu 0 -gpu-vendor ati
21:10:19:WU00:FS01:Started FahCore on PID 3524
21:10:19:WU00:FS01:Core PID:3540
21:10:19:WU00:FS01:FahCore 0x16 started
21:10:19:WU01:FS02:0x17:*********************** Log Started 2014-09-30T21:10:19Z ***********************
21:10:19:WU01:FS02:0x17:Project: 13000 (Run 1007, Clone 0, Gen 80)
21:10:19:WU01:FS02:0x17:Unit: 0x00000084538b3db75310b8069f489526
21:10:19:WU01:FS02:0x17:CPU: 0x00000000000000000000000000000000
21:10:19:WU01:FS02:0x17:Machine: 2
21:10:19:WU01:FS02:0x17:Digital signatures verified
21:10:19:WU01:FS02:0x17:Folding@home GPU core17
21:10:19:WU01:FS02:0x17:Version 0.0.52
21:10:20:WU04:FS00:0xa3:
21:10:20:WU04:FS00:0xa3:*------------------------------*
21:10:20:WU04:FS00:0xa3:Folding@Home Gromacs SMP Core
21:10:20:WU04:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
21:10:20:WU04:FS00:0xa3:
21:10:20:WU04:FS00:0xa3:Preparing to commence simulation
21:10:20:WU04:FS00:0xa3:- Ensuring status. Please wait.
21:10:20:WU02:FS03:0x16:
21:10:20:WU02:FS03:0x16:*------------------------------*
21:10:20:WU02:FS03:0x16:Folding@Home GPU Core
21:10:20:WU02:FS03:0x16:Version 2.11 (Thu Dec 9 15:00:14 PST 2010)
21:10:20:WU02:FS03:0x16:
21:10:20:WU02:FS03:0x16:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
21:10:20:WU02:FS03:0x16:Build host: user-f6d030f24f
21:10:20:WU02:FS03:0x16:Board Type: AMD/OpenCL
21:10:20:WU02:FS03:0x16:Core : x=16
21:10:20:WU02:FS03:0x16: Window's signal control handler registered.
21:10:20:WU02:FS03:0x16:Preparing to commence simulation
21:10:20:WU02:FS03:0x16:- Ensuring status. Please wait.
21:10:20:WU00:FS01:0x16:
21:10:20:WU00:FS01:0x16:*------------------------------*
21:10:20:WU00:FS01:0x16:Folding@Home GPU Core
21:10:20:WU00:FS01:0x16:Version 2.11 (Thu Dec 9 15:00:14 PST 2010)
21:10:20:WU00:FS01:0x16:
21:10:20:WU00:FS01:0x16:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
21:10:20:WU00:FS01:0x16:Build host: user-f6d030f24f
21:10:20:WU00:FS01:0x16:Board Type: AMD/OpenCL
21:10:20:WU00:FS01:0x16:Core : x=16
21:10:20:WU00:FS01:0x16: Window's signal control handler registered.
21:10:20:WU00:FS01:0x16:Preparing to commence simulation
21:10:20:WU00:FS01:0x16:- Ensuring status. Please wait.
21:10:20:WU01:FS02:0x17: Found a checkpoint file
21:10:29:WU04:FS00:0xa3:- Looking at optimizations...
21:10:29:WU04:FS00:0xa3:- Working with standard loops on this execution.
21:10:29:WU04:FS00:0xa3:- Previous termination of core was improper.
21:10:29:WU04:FS00:0xa3:- Going to use standard loops.
21:10:29:WU04:FS00:0xa3:- Files status OK
21:10:29:WU02:FS03:0x16:- Looking at optimizations...
21:10:29:WU02:FS03:0x16:- Working with standard loops on this execution.
21:10:29:WU02:FS03:0x16:- Previous termination of core was improper.
21:10:29:WU02:FS03:0x16:- Going to use standard loops.
21:10:29:WU02:FS03:0x16:- Files status OK
21:10:29:WU02:FS03:0x16:sizeof(CORE_PACKET_HDR) = 512 file=<>
21:10:29:WU02:FS03:0x16:- Expanded 45155 -> 171163 (decompressed 379.0 percent)
21:10:29:WU02:FS03:0x16:Called DecompressByteArray: compressed_data_size=45155 data_size=171163, decompressed_data_size=171163 diff=0
21:10:29:WU02:FS03:0x16:- Digital signature verified
21:10:29:WU02:FS03:0x16:
21:10:29:WU02:FS03:0x16:Project: 11292 (Run 5, Clone 8, Gen 19)
21:10:29:WU02:FS03:0x16:
21:10:29:WU02:FS03:0x16:Entering M.D.
21:10:29:WU00:FS01:0x16:- Looking at optimizations...
21:10:29:WU00:FS01:0x16:- Working with standard loops on this execution.
21:10:29:WU00:FS01:0x16:- Previous termination of core was improper.
21:10:29:WU00:FS01:0x16:- Going to use standard loops.
21:10:29:WU00:FS01:0x16:- Files status OK
21:10:29:WU00:FS01:0x16:sizeof(CORE_PACKET_HDR) = 512 file=<>
21:10:29:WU00:FS01:0x16:- Expanded 44912 -> 171163 (decompressed 381.1 percent)
21:10:29:WU00:FS01:0x16:Called DecompressByteArray: compressed_data_size=44912 data_size=171163, decompressed_data_size=171163 diff=0
21:10:29:WU00:FS01:0x16:- Digital signature verified
21:10:29:WU00:FS01:0x16:
21:10:29:WU00:FS01:0x16:Project: 11293 (Run 4, Clone 64, Gen 49)
21:10:29:WU00:FS01:0x16:
21:10:29:WU00:FS01:0x16:Entering M.D.
21:10:29:WU04:FS00:0xa3:- Expanded 3793392 -> 4166140 (decompressed 109.8 percent)
21:10:29:WU04:FS00:0xa3:Called DecompressByteArray: compressed_data_size=3793392 data_size=4166140, decompressed_data_size=4166140 diff=0
21:10:29:WU04:FS00:0xa3:- Digital signature verified
21:10:29:WU04:FS00:0xa3:
21:10:29:WU04:FS00:0xa3:Project: 8558 (Run 1, Clone 9, Gen 374)
21:10:29:WU04:FS00:0xa3:
21:10:29:WU04:FS00:0xa3:Entering M.D.
21:10:31:WU02:FS03:0x16:Will resume from checkpoint file 02/wudata_01.ckp
21:10:31:WU02:FS03:0x16:Tpr hash 02/wudata_01.tpr: 1452804746 3959827554 1230222674 762078638 2748005047
21:10:31:WU02:FS03:0x16:Working on ALZHEIMER DISEASE AMYLOID
21:10:31:WU02:FS03:0x16:Client config unavailable.
21:10:31:WU00:FS01:0x16:Will resume from checkpoint file 00/wudata_01.ckp
21:10:31:WU00:FS01:0x16:Tpr hash 00/wudata_01.tpr: 1781123853 3205537409 1418442691 832764643 2934550790
21:10:31:WU00:FS01:0x16:Working on ALZHEIMER DISEASE AMYLOID
21:10:31:WU00:FS01:0x16:Client config unavailable.
21:10:31:WU02:FS03:0x16:Starting GUI Server
21:10:31:WU00:FS01:0x16:Starting GUI Server
21:10:33:WU00:FS01:0x16:Resuming from checkpoint
21:10:33:WU00:FS01:0x16:fcCheckPointResume: retreived and current tpr file hash:
21:10:33:WU00:FS01:0x16: 0 1781123853 1781123853
21:10:33:WU00:FS01:0x16: 1 3205537409 3205537409
21:10:33:WU00:FS01:0x16: 2 1418442691 1418442691
21:10:33:WU00:FS01:0x16: 3 832764643 832764643
21:10:33:WU00:FS01:0x16: 4 2934550790 2934550790
21:10:33:WU00:FS01:0x16:fcCheckPointResume: file hashes same.
21:10:33:WU00:FS01:0x16:fcCheckPointResume: state restored.
21:10:33:WU00:FS01:0x16:fcCheckPointResume: name 00/wudata_01.log Verified 00/wudata_01.log
21:10:33:WU00:FS01:0x16:fcCheckPointResume: name 00/wudata_01.trr Verified 00/wudata_01.trr
21:10:33:WU00:FS01:0x16:fcCheckPointResume: name 00/wudata_01.xtc Verified 00/wudata_01.xtc
21:10:33:WU00:FS01:0x16:fcCheckPointResume: name 00/wudata_01.edr Verified 00/wudata_01.edr
21:10:33:WU00:FS01:0x16:fcCheckPointResume: state restored 2
21:10:33:WU00:FS01:0x16:Resumed from checkpoint
21:10:33:WU00:FS01:0x16:Setting checkpoint frequency: 500000
21:10:33:WU00:FS01:0x16:Completed 26500001 out of 50000000 steps (53%).
21:10:33:WU02:FS03:0x16:Resuming from checkpoint
21:10:33:WU02:FS03:0x16:fcCheckPointResume: retreived and current tpr file hash:
21:10:33:WU02:FS03:0x16: 0 1452804746 1452804746
21:10:33:WU02:FS03:0x16: 1 3959827554 3959827554
21:10:33:WU02:FS03:0x16: 2 1230222674 1230222674
21:10:33:WU02:FS03:0x16: 3 762078638 762078638
21:10:33:WU02:FS03:0x16: 4 2748005047 2748005047
21:10:33:WU02:FS03:0x16:fcCheckPointResume: file hashes same.
21:10:33:WU02:FS03:0x16:fcCheckPointResume: state restored.
21:10:33:WU02:FS03:0x16:fcCheckPointResume: name 02/wudata_01.log Verified 02/wudata_01.log
21:10:33:WU02:FS03:0x16:fcCheckPointResume: name 02/wudata_01.trr Verified 02/wudata_01.trr
21:10:33:WU02:FS03:0x16:fcCheckPointResume: name 02/wudata_01.xtc Verified 02/wudata_01.xtc
21:10:33:WU02:FS03:0x16:fcCheckPointResume: name 02/wudata_01.edr Verified 02/wudata_01.edr
21:10:33:WU02:FS03:0x16:fcCheckPointResume: state restored 2
21:10:33:WU02:FS03:0x16:Resumed from checkpoint
21:10:33:WU02:FS03:0x16:Setting checkpoint frequency: 599999
21:10:33:WU02:FS03:0x16:Completed 4199994 out of 59999936 steps (6%).
21:10:33:WU02:FS03:0x16:Completed 4199996 out of 59999936 steps (7%).
21:10:35:WU04:FS00:0xa3:Using Gromacs checkpoints
21:10:35:WU04:FS00:0xa3:Mapping NT from 4 to 4
21:10:36:WU04:FS00:0xa3:Resuming from checkpoint
21:10:36:WU04:FS00:0xa3:Verified 04/wudata_01.log
21:10:36:WU04:FS00:0xa3:Verified 04/wudata_01.trr
21:10:36:WU04:FS00:0xa3:Verified 04/wudata_01.edr
21:10:36:WU04:FS00:0xa3:Completed 352155 out of 500000 steps (70%)
There were probably 5-6 crashes between the log above and this one.
*********************** Log Started 2014-09-30T23:52:09Z ***********************
23:52:09:************************* Folding@home Client *************************
23:52:09: Website: http://folding.stanford.edu/
23:52:09: Copyright: (c) 2009-2014 Stanford University
23:52:09: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
23:52:09: Args:
23:52:09: Config: C:/Users/Folder4/AppData/Roaming/FAHClient/config.xml
23:52:09:******************************** Build ********************************
23:52:09: Version: 7.4.4
23:52:09: Date: Mar 4 2014
23:52:09: Time: 20:26:54
23:52:09: SVN Rev: 4130
23:52:09: Branch: fah/trunk/client
23:52:09: Compiler: Intel(R) C++ MSVC 1500 mode 1200
23:52:09: Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
23:52:09: /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
23:52:09: Platform: win32 XP
23:52:09: Bits: 32
23:52:09: Mode: Release
23:52:09:******************************* System ********************************
23:52:09: CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
23:52:09: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
23:52:09: CPUs: 8
23:52:09: Memory: 15.95GiB
23:52:09: Free Memory: 14.76GiB
23:52:09: Threads: WINDOWS_THREADS
23:52:09: OS Version: 6.1
23:52:09: Has Battery: false
23:52:09: On Battery: false
23:52:09: UTC Offset: -5
23:52:09: PID: 2784
23:52:09: CWD: C:/Users/Folder4/AppData/Roaming/FAHClient
23:52:09: OS: Windows 7 Home Premium
23:52:09: OS Arch: AMD64
23:52:09: GPUs: 3
23:52:09: GPU 0: ATI:5 Hawaii [Radeon R9 200X Series]
23:52:09: GPU 1: ATI:5 Hawaii [Radeon R9 200X Series]
23:52:09: GPU 2: ATI:5 Hawaii [Radeon R9 200X Series]
23:52:09: CUDA: Not detected
23:52:09:Win32 Service: false
23:52:09:***********************************************************************
23:52:09:<config>
23:52:09: <!-- Network -->
23:52:09: <proxy v=':8080'/>
23:52:09:
23:52:09: <!-- Slot Control -->
23:52:09: <power v='full'/>
23:52:09:
23:52:09: <!-- User Information -->
23:52:09: <passkey v='********************************'/>
23:52:09: <team v='224497'/>
23:52:09: <user v='ChasingTheDream'/>
23:52:09:
23:52:09: <!-- Folding Slots -->
23:52:09: <slot id='0' type='CPU'>
23:52:09: <cpus v='4'/>
23:52:09: </slot>
23:52:09: <slot id='1' type='GPU'/>
23:52:09: <slot id='2' type='GPU'/>
23:52:09: <slot id='3' type='GPU'/>
23:52:09:</config>
23:52:09:Trying to access database...
23:52:09:Successfully acquired database lock
23:52:09:Enabled folding slot 00: READY cpu:4
23:52:09:Enabled folding slot 01: READY gpu:0:Hawaii [Radeon R9 200X Series]
23:52:09:Enabled folding slot 02: READY gpu:1:Hawaii [Radeon R9 200X Series]
23:52:09:Enabled folding slot 03: READY gpu:2:Hawaii [Radeon R9 200X Series]
23:52:09:WU01:FS02:Starting
23:52:09:WU01:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 01 -suffix 01 -version 704 -lifeline 2784 -checkpoint 15 -gpu 1 -gpu-vendor ati
23:52:09:WU01:FS02:Started FahCore on PID 3676
23:52:09:WU01:FS02:Core PID:3700
23:52:09:WU01:FS02:FahCore 0x17 started
23:52:09:WU02:FS03:Starting
23:52:09:WU02:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_16.fah/FahCore_16.exe -dir 02 -suffix 01 -version 704 -lifeline 2784 -checkpoint 15 -gpu 2 -gpu-vendor ati
23:52:09:WU02:FS03:Started FahCore on PID 3712
23:52:09:WU02:FS03:Core PID:3728
23:52:09:WU02:FS03:FahCore 0x16 started
23:52:09:WU00:FS01:Sending unit results: id:00 state:SEND error:FAILED project:11293 run:4 clone:64 gen:49 core:0x16 unit:0x000000a96652edbc4d92567ecc82d279
23:52:09:WU04:FS00:Sending unit results: id:04 state:SEND error:FAILED project:8558 run:1 clone:9 gen:374 core:0xa3 unit:0x00000303fbcb017c5203f8f8b6f1bc0c
23:52:09:WU00:FS01:Connecting to 171.67.108.44:8080
23:52:09:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
23:52:09:WU00:FS01:Connecting to 171.67.108.44:80
23:52:09:WU04:FS00:Connecting to 128.143.199.96:8080
23:52:09:WARNING:WU04:FS00:WorkServer connection failed on port 8080 trying 80
23:52:09:WU04:FS00:Connecting to 128.143.199.96:80
23:52:09:WARNING:WU03:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known.
23:52:09:ERROR:WU03:FS00:Exception: Could not get an assignment
23:52:09:WARNING:WU05:FS01:Exception: Could not get IP address for assign-GPU.stanford.edu: No such host is known.
23:52:09:ERROR:WU05:FS01:Exception: Could not get an assignment
23:52:09:WU01:FS02:0x17:*********************** Log Started 2014-09-30T23:52:09Z ***********************
23:52:09:WU01:FS02:0x17:Project: 13000 (Run 1007, Clone 0, Gen 80)
23:52:09:WU01:FS02:0x17:Unit: 0x00000084538b3db75310b8069f489526
23:52:09:WU01:FS02:0x17:CPU: 0x00000000000000000000000000000000
23:52:09:WU01:FS02:0x17:Machine: 2
23:52:09:WU01:FS02:0x17:Digital signatures verified
23:52:09:WU01:FS02:0x17:Folding@home GPU core17
23:52:09:WU01:FS02:0x17:Version 0.0.52
23:52:09:WARNING:WU03:FS00:Exception: Could not get IP address for assign3.stanford.edu: No such host is known.
23:52:09:ERROR:WU03:FS00:Exception: Could not get an assignment
23:52:09:WARNING:WU05:FS01:Exception: Could not get IP address for assign-GPU.stanford.edu: No such host is known.
23:52:09:ERROR:WU05:FS01:Exception: Could not get an assignment
23:52:09:WU02:FS03:0x16:
23:52:09:WU02:FS03:0x16:*------------------------------*
23:52:09:WU02:FS03:0x16:Folding@Home GPU Core
23:52:09:WU02:FS03:0x16:Version 2.11 (Thu Dec 9 15:00:14 PST 2010)
23:52:09:WU02:FS03:0x16:
23:52:09:WU02:FS03:0x16:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
23:52:09:WU02:FS03:0x16:Build host: user-f6d030f24f
23:52:09:WU02:FS03:0x16:Board Type: AMD/OpenCL
23:52:09:WU02:FS03:0x16:Core : x=16
23:52:09:WU02:FS03:0x16: Window's signal control handler registered.
23:52:09:WU02:FS03:0x16:Preparing to commence simulation
23:52:09:WU02:FS03:0x16:- Looking at optimizations...
23:52:09:WU02:FS03:0x16:- Files status OK
23:52:09:WU02:FS03:0x16:sizeof(CORE_PACKET_HDR) = 512 file=<>
23:52:09:WU02:FS03:0x16:- Expanded 45155 -> 171163 (decompressed 379.0 percent)
23:52:09:WU02:FS03:0x16:Called DecompressByteArray: compressed_data_size=45155 data_size=171163, decompressed_data_size=171163 diff=0
23:52:09:WU02:FS03:0x16:- Digital signature verified
23:52:09:WU02:FS03:0x16:
23:52:09:WU02:FS03:0x16:Project: 11292 (Run 5, Clone 8, Gen 19)
23:52:09:WU02:FS03:0x16:
23:52:09:WU02:FS03:0x16:Assembly optimizations on if available.
23:52:09:WU02:FS03:0x16:Entering M.D.
23:52:10:WARNING:WU00:FS01:Exception: Failed to send results to work server: Failed to connect to 171.67.108.44:80: A socket operation was attempted to an unreachable network.
23:52:10:WU01:FS02:0x17: Found a checkpoint file
23:52:10:WARNING:WU04:FS00:Exception: Failed to send results to work server: Failed to connect to 128.143.199.96:80: A socket operation was attempted to an unreachable network.
23:52:10:WU04:FS00:Trying to send results to collection server
23:52:10:WU04:FS00:Connecting to 128.143.231.201:8080
23:52:10:WARNING:WU04:FS00:WorkServer connection failed on port 8080 trying 80
23:52:10:WU04:FS00:Connecting to 128.143.231.201:80
23:52:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAILED project:11293 run:4 clone:64 gen:49 core:0x16 unit:0x000000a96652edbc4d92567ecc82d279
23:52:10:WU00:FS01:Connecting to 171.67.108.44:8080
23:52:10:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
23:52:10:WU00:FS01:Connecting to 171.67.108.44:80
23:52:10:ERROR:WU04:FS00:Exception: Failed to connect to 128.143.231.201:80: A socket operation was attempted to an unreachable network.
23:52:11:WARNING:WU00:FS01:Exception: Failed to send results to work server: Failed to connect to 171.67.108.44:80: A socket operation was attempted to an unreachable network.
23:52:11:WU04:FS00:Sending unit results: id:04 state:SEND error:FAILED project:8558 run:1 clone:9 gen:374 core:0xa3 unit:0x00000303fbcb017c5203f8f8b6f1bc0c
23:52:11:WU04:FS00:Connecting to 128.143.199.96:8080
23:52:11:WARNING:WU04:FS00:WorkServer connection failed on port 8080 trying 80
23:52:11:WU04:FS00:Connecting to 128.143.199.96:80
23:52:11:WARNING:WU04:FS00:Exception: Failed to send results to work server: Failed to connect to 128.143.199.96:80: A socket operation was attempted to an unreachable network.
23:52:11:WU04:FS00:Trying to send results to collection server
23:52:11:WU04:FS00:Connecting to 128.143.231.201:8080
23:52:11:WARNING:WU04:FS00:WorkServer connection failed on port 8080 trying 80
23:52:11:WU04:FS00:Connecting to 128.143.231.201:80
23:52:11:WU02:FS03:0x16:Will resume from checkpoint file 02/wudata_01.ckp
23:52:11:WU02:FS03:0x16:Tpr hash 02/wudata_01.tpr: 1452804746 3959827554 1230222674 762078638 2748005047
23:52:11:WU02:FS03:0x16:Working on ALZHEIMER DISEASE AMYLOID
23:52:11:WU02:FS03:0x16:Client config unavailable.
23:52:11:ERROR:WU04:FS00:Exception: Failed to connect to 128.143.231.201:80: A socket operation was attempted to an unreachable network.
23:52:11:WU02:FS03:0x16:Starting GUI Server
23:52:13:WU02:FS03:0x16:Resuming from checkpoint
23:52:13:WU02:FS03:0x16:fcCheckPointResume: retreived and current tpr file hash:
23:52:13:WU02:FS03:0x16: 0 1452804746 1452804746
23:52:13:WU02:FS03:0x16: 1 3959827554 3959827554
23:52:13:WU02:FS03:0x16: 2 1230222674 1230222674
23:52:13:WU02:FS03:0x16: 3 762078638 762078638
23:52:13:WU02:FS03:0x16: 4 2748005047 2748005047
23:52:13:WU02:FS03:0x16:fcCheckPointResume: file hashes same.
23:52:13:WU02:FS03:0x16:fcCheckPointResume: state restored.
23:52:13:WU02:FS03:0x16:fcCheckPointResume: name 02/wudata_01.log Verified 02/wudata_01.log
23:52:13:WU02:FS03:0x16:fcCheckPointResume: name 02/wudata_01.trr Verified 02/wudata_01.trr
23:52:13:WU02:FS03:0x16:fcCheckPointResume: name 02/wudata_01.xtc Verified 02/wudata_01.xtc
23:52:13:WU02:FS03:0x16:fcCheckPointResume: name 02/wudata_01.edr Verified 02/wudata_01.edr
23:52:13:WU02:FS03:0x16:fcCheckPointResume: state restored 2
23:52:13:WU02:FS03:0x16:Resumed from checkpoint
23:52:13:WU02:FS03:0x16:Setting checkpoint frequency: 599999
23:52:13:WU02:FS03:0x16:Completed 13799978 out of 59999936 steps (22%).
23:52:13:WU02:FS03:0x16:Completed 13799986 out of 59999936 steps (23%).
23:53:09:WU03:FS00:Connecting to 171.67.108.200:8080
23:53:10:WU05:FS01:Connecting to 171.67.108.201:80
23:53:10:WU03:FS00:Assigned to work server 128.252.203.2
23:53:10:WU03:FS00:Requesting new work unit for slot 00: READY cpu:4 from 128.252.203.2
23:53:10:WU03:FS00:Connecting to 128.252.203.2:8080
23:53:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAILED project:11293 run:4 clone:64 gen:49 core:0x16 unit:0x000000a96652edbc4d92567ecc82d279
23:53:10:WU05:FS01:Assigned to work server 140.163.4.231
23:53:10:WU00:FS01:Connecting to 171.67.108.44:8080
23:53:10:WU05:FS01:Requesting new work unit for slot 01: READY gpu:0:Hawaii [Radeon R9 200X Series] from 140.163.4.231
23:53:10:WU05:FS01:Connecting to 140.163.4.231:8080
23:53:10:WU05:FS01:Downloading 4.83MiB
23:53:11:WU00:FS01:Server responded WORK_QUIT (404)
23:53:11:WARNING:WU00:FS01:Server did not like results, dumping
23:53:11:WU00:FS01:Cleaning up
23:53:11:WU04:FS00:Sending unit results: id:04 state:SEND error:FAILED project:8558 run:1 clone:9 gen:374 core:0xa3 unit:0x00000303fbcb017c5203f8f8b6f1bc0c
23:53:11:WU04:FS00:Connecting to 128.143.199.96:8080
23:53:11:WU03:FS00:Downloading 1.16MiB
23:53:11:WU04:FS00:Server responded WORK_QUIT (404)
23:53:11:WARNING:WU04:FS00:Server did not like results, dumping
23:53:11:WU04:FS00:Cleaning up
23:53:14:WU05:FS01:Download complete
23:53:14:WU05:FS01:Received Unit: id:05 state:DOWNLOAD error:NO_ERROR project:13001 run:476 clone:2 gen:70 core:0x17 unit:0x00000093538b3db7532c830daa5f957c
23:53:14:WU05:FS01:Starting
23:53:14:WU05:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 05 -suffix 01 -version 704 -lifeline 2784 -checkpoint 15 -gpu 0 -gpu-vendor ati
23:53:14:WU05:FS01:Started FahCore on PID 404
23:53:14:WU05:FS01:Core PID:4008
23:53:14:WU05:FS01:FahCore 0x17 started
23:53:14:WU05:FS01:0x17:*********************** Log Started 2014-09-30T23:53:14Z ***********************
23:53:14:WU05:FS01:0x17:Project: 13001 (Run 476, Clone 2, Gen 70)
23:53:14:WU05:FS01:0x17:Unit: 0x00000093538b3db7532c830daa5f957c
23:53:14:WU05:FS01:0x17:CPU: 0x00000000000000000000000000000000
23:53:14:WU05:FS01:0x17:Machine: 1
23:53:14:WU05:FS01:0x17:Reading tar file state.xml
23:53:15:WU05:FS01:0x17:Reading tar file system.xml
23:53:15:WU03:FS00:Download complete
23:53:15:WU05:FS01:0x17:Reading tar file integrator.xml
23:53:15:WU05:FS01:0x17:Reading tar file core.xml
23:53:15:WU05:FS01:0x17:Digital signatures verified
23:53:15:WU05:FS01:0x17:Folding@home GPU core17
23:53:15:WU05:FS01:0x17:Version 0.0.52
23:53:15:WU03:FS00:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:10187 run:677 clone:0 gen:16 core:0xa4 unit:0x000000104c71bbb053f63e450794abee
23:53:15:WU03:FS00:Starting
23:53:15:WU03:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Folder4/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 03 -suffix 01 -version 704 -lifeline 2784 -checkpoint 15 -np 4
23:53:15:WU03:FS00:Started FahCore on PID 2420
23:53:15:WU03:FS00:Core PID:4472
23:53:15:WU03:FS00:FahCore 0xa4 started
23:53:16:WU03:FS00:0xa4:
23:53:16:WU03:FS00:0xa4:*------------------------------*
23:53:16:WU03:FS00:0xa4:Folding@Home Gromacs GB Core
23:53:16:WU03:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
23:53:16:WU03:FS00:0xa4:
23:53:16:WU03:FS00:0xa4:Preparing to commence simulation
23:53:16:WU03:FS00:0xa4:- Looking at optimizations...
23:53:16:WU03:FS00:0xa4:- Created dyn
23:53:16:WU03:FS00:0xa4:- Files status OK
23:53:16:WU03:FS00:0xa4:- Expanded 1221021 -> 2730916 (decompressed 223.6 percent)
23:53:16:WU03:FS00:0xa4:Called DecompressByteArray: compressed_data_size=1221021 data_size=2730916, decompressed_data_size=2730916 diff=0
23:53:16:WU03:FS00:0xa4:- Digital signature verified
23:53:16:WU03:FS00:0xa4:
23:53:16:WU03:FS00:0xa4:Project: 10187 (Run 677, Clone 0, Gen 16)
23:53:16:WU03:FS00:0xa4:
23:53:16:WU03:FS00:0xa4:Assembly optimizations on if available.
23:53:16:WU03:FS00:0xa4:Entering M.D.
23:53:21:WU03:FS00:0xa4:Mapping NT from 4 to 4
23:53:22:WU03:FS00:0xa4:Completed 0 out of 500000 steps (0%)
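For anyone who wants to count failures without reading the whole log, here is a minimal sketch that tallies returned work units per core. It assumes the "Sending unit results" line format shown in the log above. The function name `tally_results` and the `sample` list are made up for illustration and are not part of FAHClient. Upload retries repeat the same unit ID, so each unit is only counted once.

```python
import re
from collections import Counter

# Matches client lines of the form seen in the log above, e.g.
# "... Sending unit results: id:00 state:SEND error:FAILED project:11293 ... core:0x16 unit:0x..."
RESULT_RE = re.compile(
    r"Sending unit results: id:\d+ state:\w+ error:(?P<error>\w+) "
    r"project:(?P<project>\d+) run:(?P<run>\d+) clone:(?P<clone>\d+) "
    r"gen:(?P<gen>\d+) core:(?P<core>0x[0-9a-fA-F]+) unit:(?P<unit>0x[0-9a-fA-F]+)"
)


def tally_results(lines):
    """Count distinct work units per (core, error), skipping upload retries."""
    seen_units = set()
    counts = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if match is None:
            continue  # not a result-upload line
        unit = match["unit"]
        if unit in seen_units:
            continue  # same unit being re-sent after a failed connection
        seen_units.add(unit)
        counts[(match["core"], match["error"])] += 1
    return counts


# Three lines copied from the log above; the third is a retry of the first.
sample = [
    "23:52:09:WU00:FS01:Sending unit results: id:00 state:SEND error:FAILED project:11293 run:4 clone:64 gen:49 core:0x16 unit:0x000000a96652edbc4d92567ecc82d279",
    "23:52:09:WU04:FS00:Sending unit results: id:04 state:SEND error:FAILED project:8558 run:1 clone:9 gen:374 core:0xa3 unit:0x00000303fbcb017c5203f8f8b6f1bc0c",
    "23:52:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAILED project:11293 run:4 clone:64 gen:49 core:0x16 unit:0x000000a96652edbc4d92567ecc82d279",
]

for (core, error), n in sorted(tally_results(sample).items()):
    print(f"core {core}: {n} unit(s) returned with error {error}")
```

To run it on a real log, pass an open file handle instead of `sample`. The exact log file name and location depend on the installation, so treat any path as an assumption.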
Re: Any idea how many Core 16's are left?
Posted: Wed Oct 01, 2014 12:30 am
by ChasingTheDream
nivedita wrote: I have dual 295x2's and my computer runs core x16 units successfully (driver 14.7RC3). However PPD is insanely low -- the GPUs working on those units essentially generate no points (<5k PPD vs 220-260k PPD on core 0x17 units). TPF is about 5m45s.
I've got 15 R9 290X TRI-X's and they just don't seem to be able to run the Core 16's on anything other than a single GPU. I've got three machines that run three GPUs each and they will run Core 17's all day long without incident, but they just don't like the Core 16's. I have another machine with just two GPUs that handles them better, but it crashes completely when it has an issue. Full-on BSOD reboot.
Every once in a while a Core 16 will make it through, but the vast majority don't make it to completion and never will due to the constant crashes.
In any event, the PPD aren't a concern for me. I just don't have time to constantly babysit the machines to try to get these WU's through.
Re: Any idea how many Core 16's are left?
Posted: Wed Oct 01, 2014 2:01 am
by bruce
ChasingTheDream wrote: I've got 15 R9 290X TRI-X's and they just don't seem to be able to run the Core 16's on anything other than a single GPU. I've got three machines that run three GPUs each and they will run Core 17's all day long without incident, but they just don't like the Core 16's. I have another machine with just two GPUs that handles them better, but it crashes completely when it has an issue. Full-on BSOD reboot.
Every once in a while a Core 16 will make it through, but the vast majority don't make it to completion and never will due to the constant crashes.
In any event, the PPD aren't a concern for me. I just don't have time to constantly babysit the machines to try to get these WU's through.
I don't know how many of those GPUs represent multiple installations or how they're grouped. Nevertheless, consider the following possibilities.
I can't predict when Core_16 WUs will go back to being mostly inactive, but consider running a single GPU per system until they do. If the drivers are not capable of running multiple dissimilar GPUs, it may be a long time before ATI fixes that problem for older GPUs. Alternatively, consider running only pairs of identical GPUs. Driver development and testing invariably starts with single GPUs, eventually proceeding to CrossFire, where identical pairs of GPUs are tested, which means those options are more likely to work than random combinations.
In fact, the literature describing XDMA specifically mentions that an HD7xxx can be paired with an R9-xxxx, so debugging of the latest drivers on the latest hardware is going to be more thorough.
Let us know what you discover.