AMD GPU Error sortShortList on some projects
Posted: Sat Mar 21, 2020 11:57 am
by sam6861
Some specific projects always fail on both of my AMD GPUs with this error: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
Project 11759 failed 3 times: twice on the AMD RX 580 and once on the AMD Vega 11.
Project 14533 failed on the Vega 11 (FS03) but passed on the NVidia GT 1030 (FS02).
Projects 11747, 11752, 11758, 11759, 11764, 11776, 14533, 14551: none of these projects were able to complete on my AMD GPUs. I believe these projects may work fine on NVidia GPUs even when the AMD GPUs error out.
Most other projects work fine on both of my AMD GPUs with no errors.
Windows 10 64-bit 1909. The AMD driver was 20.2.2, the NVidia driver is 442.50.
I recently updated to AMD driver 20.3.1, but the sortShortList error continues on some projects.
FS01: AMD RX 580
FS02: NVidia GT 1030
FS03: AMD Vega 11 (integrated graphics)
Code: Select all
*********************** Log Started 2020-03-20T07:18:34Z ***********************
07:18:34:************************* Folding@home Client *************************
07:18:34: Copyright: (c) 2009-2018 foldingathome.org
07:18:34: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
07:18:34: Args: --open-web-control
07:18:34: Config: C:\Users\sam86\AppData\Roaming\FAHClient\config.xml
07:18:34:******************************** Build ********************************
07:18:34: Version: 7.5.1
07:18:34: Date: May 11 2018
07:18:34: Time: 13:06:32
07:18:34: Repository: Git
07:18:34: Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
07:18:34: Branch: master
07:18:34: Compiler: Visual C++ 2008
07:18:34: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
07:18:34: Platform: win32 10
07:18:34: Bits: 32
07:18:34: Mode: Release
07:18:34:******************************* System ********************************
07:18:34: CPU: AMD Ryzen 5 2400G with Radeon Vega Graphics
07:18:34: CPU ID: AuthenticAMD Family 23 Model 17 Stepping 0
07:18:34: CPUs: 8
07:18:34: Memory: 31.81GiB
07:18:34: Free Memory: 28.77GiB
07:18:34: Threads: WINDOWS_THREADS
07:18:34: OS Version: 6.2
07:18:34: Has Battery: false
07:18:34: On Battery: false
07:18:34: UTC Offset: -5
07:18:34: PID: 3128
07:18:34: CWD: C:\Users\sam86\AppData\Roaming\FAHClient
07:18:34: OS: Windows 10 Home
07:18:34: OS Arch: AMD64
07:18:34: GPUs: 3
07:18:34: GPU 0: Bus:1 Slot:0 Func:0 AMD:5 Ellesmere XT [Radeon RX
07:18:34: 470/480/570/580]
07:18:34: GPU 1: Bus:6 Slot:0 Func:0 NVIDIA:5 GP108 [GeForce GT 1030]
07:18:34: GPU 2: Bus:8 Slot:0 Func:0 AMD:5 Raven [Ryzen vega 8 moble]
07:18:34: CUDA Device 0: Platform:0 Device:0 Bus:6 Slot:0 Compute:6.1 Driver:10.2
07:18:34:OpenCL Device 0: Platform:0 Device:0 Bus:6 Slot:0 Compute:1.2 Driver:442.50
07:18:34:OpenCL Device 1: Platform:1 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3004.8
07:18:34:OpenCL Device 2: Platform:1 Device:1 Bus:8 Slot:0 Compute:1.2 Driver:3004.8
07:18:34: Win32 Service: false
07:18:34:***********************************************************************
07:18:34:<config>
07:18:34: <!-- Network -->
07:18:34: <proxy v=':8080'/>
07:18:34:
07:18:34: <!-- Slot Control -->
07:18:34: <pause-on-start v='true'/>
07:18:34: <power v='full'/>
07:18:34:
07:18:34: <!-- User Information -->
07:18:34: <passkey v='********************************'/>
07:18:34: <user v='sam6861'/>
07:18:34:
07:18:34: <!-- Folding Slots -->
07:18:34: <slot id='0' type='CPU'>
07:18:34: <cpus v='6'/>
07:18:34: </slot>
07:18:34: <slot id='1' type='GPU'/>
07:18:34: <slot id='2' type='GPU'/>
07:18:34: <slot id='3' type='GPU'/>
07:18:34:</config>
07:18:34:Trying to access database...
07:18:34:Successfully acquired database lock
07:18:34:Enabled folding slot 00: PAUSED cpu:6 (by user)
07:18:34:Enabled folding slot 01: PAUSED gpu:0:Ellesmere XT [Radeon RX 470/480/570/580] (by user)
07:18:34:Enabled folding slot 02: PAUSED gpu:1:GP108 [GeForce GT 1030] (by user)
07:18:34:Enabled folding slot 03: PAUSED gpu:2:Raven [Ryzen vega 8 moble] (by user)
C:\Users\sam86\AppData\Roaming\FAHClient>findstr "state:SEND" logs\*.txt log.txt | findstr "project:14533"
log.txt:16:14:28:WU04:FS03:Sending unit results: id:04 state:SEND error:FAULTY project:14533 run:0 clone:13508 gen:3 core:0x22 unit:0x0000000380fccb025e72f2f0d793be66
log.txt:01:44:29:WU01:FS02:Sending unit results: id:01 state:SEND error:NO_ERROR project:14533 run:0 clone:4088 gen:1 core:0x22 unit:0x0000000280fccb025e72f225d8c17c1e
... removed repeated lines from findstr "state:SEND" logs\*.txt log.txt, and saved as TEST1\1.txt
C:\Users\sam86\AppData\Roaming\FAHClient>findstr "FAULTY" TEST1\1.txt
logs\log-20200317-105142.txt:14:25:58:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11752 run:0 clone:2778 gen:0 core:0x22 unit:0x000000018ca304e75e6a806a9aceef36
logs\log-20200317-105142.txt:04:34:51:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:11747 run:0 clone:735 gen:1 core:0x22 unit:0x000000038ca304e75e6a7fc84de528cd
logs\log-20200317-105142.txt:21:15:57:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:11759 run:0 clone:10460 gen:1 core:0x22 unit:0x0000000180fccb0a5e6eb02ca0f7c315
logs\log-20200317-105142.txt:22:16:24:WU01:FS03:Sending unit results: id:01 state:SEND error:FAULTY project:11759 run:0 clone:5218 gen:1 core:0x22 unit:0x0000000280fccb0a5e6e863ce7701d4f
logs\log-20200318-031735.txt:02:40:40:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:14551 run:0 clone:125 gen:0 core:0x22 unit:0x0000000280fccb025e71637d587dab23
logs\log-20200320-071834.txt:05:16:17:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11764 run:0 clone:5180 gen:1 core:0x22 unit:0x0000000380fccb0a5e71130f33021b57
logs\log-20200320-114942.txt:11:44:49:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11759 run:0 clone:6793 gen:7 core:0x22 unit:0x0000000c80fccb0a5e6e96092ccc76d1
log.txt:16:14:28:WU04:FS03:Sending unit results: id:04 state:SEND error:FAULTY project:14533 run:0 clone:13508 gen:3 core:0x22 unit:0x0000000380fccb025e72f2f0d793be66
log.txt:16:16:46:WU02:FS03:Sending unit results: id:02 state:SEND error:FAULTY project:11776 run:0 clone:1911 gen:2 core:0x22 unit:0x00000003287234c95e73c47a4958b245
log.txt:05:41:54:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11764 run:0 clone:2229 gen:11 core:0x22 unit:0x0000001280fccb0a5e6d81a6378f1eb4
log.txt:10:12:32:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11776 run:0 clone:1538 gen:4 core:0x22 unit:0x0000000c287234c95e73c47d47fd7729
log.txt:10:13:08:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11758 run:0 clone:833 gen:0 core:0x22 unit:0x000000099bf7a4d55e6d77116aaf791f
C:\Users\sam86\AppData\Roaming\FAHClient>findstr "ERROR:exception" logs\*.txt log.txt
logs\log-20200317-105142.txt:14:25:58:WU00:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200317-105142.txt:04:34:50:WU03:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200317-105142.txt:21:15:56:WU03:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200317-105142.txt:22:16:23:WU01:FS03:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200318-031735.txt:02:39:17:WU00:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200320-071834.txt:05:16:17:WU00:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200320-114942.txt:11:44:48:WU00:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
log.txt:16:06:35:WU02:FS03:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
log.txt:16:14:27:WU04:FS03:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
log.txt:05:41:53:WU00:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
log.txt:10:12:31:WU01:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
log.txt:10:13:07:WU02:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
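For anyone who wants to repeat this tally without chaining findstr calls, here is a minimal Python sketch. It assumes the "Sending unit results" line format shown in the logs above; the function name `tally_results` and the regular expression are illustrative, not part of FAHClient. The three sample lines are copied from the output above.

```python
import re
from collections import defaultdict

# Matches the client's "Sending unit results" lines as they appear above.
RESULT_RE = re.compile(
    r"WU\d+:(?P<slot>FS\d+):Sending unit results: .*?"
    r"error:(?P<error>\w+) project:(?P<project>\d+)"
)

def tally_results(lines):
    """Return {project: {slot: {error: count}}} for every result line found."""
    tally = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            project = int(match["project"])
            tally[project][match["slot"]][match["error"]] += 1
    return tally

# Sample lines copied from the findstr output in this post.
sample = [
    "log.txt:16:14:28:WU04:FS03:Sending unit results: id:04 state:SEND "
    "error:FAULTY project:14533 run:0 clone:13508 gen:3 core:0x22 "
    "unit:0x0000000380fccb025e72f2f0d793be66",
    "log.txt:01:44:29:WU01:FS02:Sending unit results: id:01 state:SEND "
    "error:NO_ERROR project:14533 run:0 clone:4088 gen:1 core:0x22 "
    "unit:0x0000000280fccb025e72f225d8c17c1e",
    "log.txt:16:16:46:WU02:FS03:Sending unit results: id:02 state:SEND "
    "error:FAULTY project:11776 run:0 clone:1911 gen:2 core:0x22 "
    "unit:0x00000003287234c95e73c47a4958b245",
]

if __name__ == "__main__":
    results = tally_results(sample)
    for project in sorted(results):
        for slot in sorted(results[project]):
            print(project, slot, dict(results[project][slot]))
```

To run it on real logs, read the lines from log.txt and the files under logs\ instead of the `sample` list.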
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Mar 21, 2020 8:00 pm
by sam6861
I found something more specific. All errors on my AMD GPUs happen on projects with 165550 atoms or more. Should projects with this many atoms be limited to NVidia GPUs only? My NVidia GT 1030 works fine with this many atoms.
The largest atom count my AMD GPUs have completed successfully is 110370 atoms, from project 11763.
logs\log-20200320-071834.txt:16:54:49:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:11763 run:0 clone:692 gen:2 core:0x22 unit:0x0000000680fccb0a5e6d81254462ee3b
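The two numbers above only bracket the failure point; they do not pin it down. A small Python sketch of that reasoning, using just the two atom counts from this post (the helper name `failure_bracket` is made up for illustration):

```python
def failure_bracket(observations):
    """Given (atom_count, passed) pairs, return (largest passing count,
    smallest failing count). Either value is None if no such pair exists."""
    passed = [atoms for atoms, ok in observations if ok]
    failed = [atoms for atoms, ok in observations if not ok]
    largest_pass = max(passed) if passed else None
    smallest_fail = min(failed) if failed else None
    return largest_pass, smallest_fail

# The only two data points stated in this post.
observations = [
    (110370, True),   # project 11763 completed on the AMD GPU
    (165550, False),  # smallest atom count seen failing on the AMD GPUs
]

if __name__ == "__main__":
    low, high = failure_bracket(observations)
    print("threshold lies somewhere above", low, "and at or below", high)
    print("unexplored gap:", high - low, "atoms")
```

More results from projects with atom counts inside that gap would narrow down where the failures actually start.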
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Mar 21, 2020 8:41 pm
by _r2w_ben
I had two similar failures on my RX 460 2GB on Windows 8.1 with Radeon driver 16.12.2. Both match your observation of being projects with 165550 or more atoms.
Code: Select all
13:50:55:WU00:FS01:0x22:*********************** Log Started 2020-03-16T13:50:55Z ***********************
13:50:55:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
13:50:55:WU00:FS01:0x22: Type: 0x22
13:50:55:WU00:FS01:0x22: Core: Core22
13:50:55:WU00:FS01:0x22: Website: https://foldingathome.org/
13:50:55:WU00:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
13:50:55:WU00:FS01:0x22: Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
13:50:55:WU00:FS01:0x22: <rafal.wiewiora@choderalab.org>
13:50:55:WU00:FS01:0x22: Args: -dir 00 -suffix 01 -version 705 -lifeline 5332 -checkpoint 15
13:50:55:WU00:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
13:50:55:WU00:FS01:0x22: Config: <none>
13:50:55:WU00:FS01:0x22:************************************ Build *************************************
13:50:55:WU00:FS01:0x22: Version: 0.0.2
13:50:55:WU00:FS01:0x22: Date: Dec 6 2019
13:50:55:WU00:FS01:0x22: Time: 21:30:31
13:50:55:WU00:FS01:0x22: Repository: Git
13:50:55:WU00:FS01:0x22: Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
13:50:55:WU00:FS01:0x22: Branch: HEAD
13:50:55:WU00:FS01:0x22: Compiler: Visual C++ 2008
13:50:55:WU00:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
13:50:55:WU00:FS01:0x22: Platform: win32 10
13:50:55:WU00:FS01:0x22: Bits: 64
13:50:55:WU00:FS01:0x22: Mode: Release
13:50:55:WU00:FS01:0x22:************************************ System ************************************
13:50:55:WU00:FS01:0x22: CPU: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
13:50:55:WU00:FS01:0x22: CPU ID: AuthenticAMD Family 21 Model 48 Stepping 1
13:50:55:WU00:FS01:0x22: CPUs: 4
13:50:55:WU00:FS01:0x22: Memory: 6.94GiB
13:50:55:WU00:FS01:0x22:Free Memory: 4.85GiB
13:50:55:WU00:FS01:0x22: Threads: WINDOWS_THREADS
13:50:55:WU00:FS01:0x22: OS Version: 6.2
13:50:55:WU00:FS01:0x22:Has Battery: false
13:50:55:WU00:FS01:0x22: On Battery: false
13:50:55:WU00:FS01:0x22: UTC Offset: -4
13:50:55:WU00:FS01:0x22: PID: 4640
13:50:55:WU00:FS01:0x22: CWD: C:\Users\Ben\AppData\Roaming\FAHClient\work
13:50:55:WU00:FS01:0x22: OS: Windows 8.1
13:50:55:WU00:FS01:0x22: OS Arch: AMD64
13:50:55:WU00:FS01:0x22:********************************************************************************
13:50:55:WU00:FS01:0x22:Project: 11758 (Run 0, Clone 2190, Gen 0)
13:50:55:WU00:FS01:0x22:Unit: 0x000000019bf7a4d55e6d77154a8b6c7e
13:50:55:WU00:FS01:0x22:Reading tar file core.xml
13:50:55:WU00:FS01:0x22:Reading tar file integrator.xml
13:50:55:WU00:FS01:0x22:Reading tar file state.xml
13:50:55:WU00:FS01:0x22:Reading tar file system.xml
13:50:56:WU00:FS01:0x22:Digital signatures verified
13:50:56:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
13:50:56:WU00:FS01:0x22:Version 0.0.2
13:51:16:WU00:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
13:51:16:WU00:FS01:0x22:Saving result file ..\logfile_01.txt
13:51:16:WU00:FS01:0x22:Saving result file science.log
13:51:16:WU00:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
13:51:16:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
13:51:16:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11758 run:0 clone:2190 gen:0 core:0x22 unit:0x000000019bf7a4d55e6d77154a8b6c7e
Code: Select all
01:43:18:WU00:FS01:0x22:*********************** Log Started 2020-03-18T01:43:18Z ***********************
01:43:18:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
01:43:18:WU00:FS01:0x22: Type: 0x22
01:43:18:WU00:FS01:0x22: Core: Core22
01:43:18:WU00:FS01:0x22: Website: https://foldingathome.org/
01:43:18:WU00:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
01:43:18:WU00:FS01:0x22: Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
01:43:18:WU00:FS01:0x22: <rafal.wiewiora@choderalab.org>
01:43:18:WU00:FS01:0x22: Args: -dir 00 -suffix 01 -version 705 -lifeline 1068 -checkpoint 15
01:43:18:WU00:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
01:43:18:WU00:FS01:0x22: Config: <none>
01:43:18:WU00:FS01:0x22:************************************ Build *************************************
01:43:18:WU00:FS01:0x22: Version: 0.0.2
01:43:18:WU00:FS01:0x22: Date: Dec 6 2019
01:43:18:WU00:FS01:0x22: Time: 21:30:31
01:43:18:WU00:FS01:0x22: Repository: Git
01:43:18:WU00:FS01:0x22: Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
01:43:18:WU00:FS01:0x22: Branch: HEAD
01:43:18:WU00:FS01:0x22: Compiler: Visual C++ 2008
01:43:18:WU00:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
01:43:18:WU00:FS01:0x22: Platform: win32 10
01:43:18:WU00:FS01:0x22: Bits: 64
01:43:18:WU00:FS01:0x22: Mode: Release
01:43:18:WU00:FS01:0x22:************************************ System ************************************
01:43:18:WU00:FS01:0x22: CPU: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
01:43:18:WU00:FS01:0x22: CPU ID: AuthenticAMD Family 21 Model 48 Stepping 1
01:43:18:WU00:FS01:0x22: CPUs: 4
01:43:18:WU00:FS01:0x22: Memory: 6.94GiB
01:43:18:WU00:FS01:0x22:Free Memory: 5.58GiB
01:43:18:WU00:FS01:0x22: Threads: WINDOWS_THREADS
01:43:18:WU00:FS01:0x22: OS Version: 6.2
01:43:18:WU00:FS01:0x22:Has Battery: false
01:43:18:WU00:FS01:0x22: On Battery: false
01:43:18:WU00:FS01:0x22: UTC Offset: -4
01:43:18:WU00:FS01:0x22: PID: 4416
01:43:18:WU00:FS01:0x22: CWD: C:\Users\Ben\AppData\Roaming\FAHClient\work
01:43:18:WU00:FS01:0x22: OS: Windows 8.1
01:43:18:WU00:FS01:0x22: OS Arch: AMD64
01:43:18:WU00:FS01:0x22:********************************************************************************
01:43:18:WU00:FS01:0x22:Project: 11741 (Run 0, Clone 2624, Gen 6)
01:43:18:WU00:FS01:0x22:Unit: 0x0000000c8ca304f15e693a7373f6bad6
01:43:18:WU00:FS01:0x22:Reading tar file core.xml
01:43:18:WU00:FS01:0x22:Reading tar file integrator.xml
01:43:18:WU00:FS01:0x22:Reading tar file state.xml
01:43:20:WU00:FS01:0x22:Reading tar file system.xml
01:43:21:WU00:FS01:0x22:Digital signatures verified
01:43:21:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
01:43:21:WU00:FS01:0x22:Version 0.0.2
01:43:39:WU00:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
01:43:39:WU00:FS01:0x22:Saving result file ..\logfile_01.txt
01:43:39:WU00:FS01:0x22:Saving result file science.log
01:43:39:WU00:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
01:43:39:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
01:43:39:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11741 run:0 clone:2624 gen:6 core:0x22 unit:0x0000000c8ca304f15e693a7373f6bad6
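As a reading aid for the error lines in these logs: the number in parentheses is an OpenCL status code, and -5 is CL_OUT_OF_RESOURCES in the standard OpenCL headers. The Python sketch below pulls the kernel name, API call and code out of a core log line. The regular expression and the function name `decode_kernel_error` are assumptions made for illustration, and the table lists only a handful of codes, not the full set.

```python
import re

# A few OpenCL status codes from the standard headers (partial list).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -54: "CL_INVALID_WORK_GROUP_SIZE",
}

# Matches the Core22 exception line as it appears in the logs above.
ERROR_RE = re.compile(
    r"ERROR:exception: Error invoking kernel (?P<kernel>\w+): "
    r"(?P<call>\w+) \((?P<code>-?\d+)\)"
)

def decode_kernel_error(line):
    """Return (kernel, api_call, code, symbolic_name) or None if no match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match["code"])
    name = OPENCL_ERRORS.get(code, "unknown")
    return match["kernel"], match["call"], code, name

# Line copied from the first log in this post.
line = ("13:51:16:WU00:FS01:0x22:ERROR:exception: Error invoking kernel "
        "sortShortList: clEnqueueNDRangeKernel (-5)")

if __name__ == "__main__":
    print(decode_kernel_error(line))
```

CL_OUT_OF_RESOURCES from clEnqueueNDRangeKernel generally means the device could not provide what the kernel launch asked for, which fits a failure that shows up only on larger systems.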
There's an interesting comment in the OpenMM source code where it decides on the length of a list that can be considered short.
Code: Select all
unsigned int maxShortList = min(8192, max(maxLocalBuffer, (int) OpenCLContext::ThreadBlockSize*context.getNumThreadBlocks()));
// On Qualcomm's OpenCL, it's essential to check against CL_KERNEL_WORK_GROUP_SIZE. Otherwise you get a crash.
// But AMD's OpenCL returns an inappropriately small value for it that is much shorter than the actual
// maximum, so including the check hurts performance. For the moment I'm going to just comment it out.
// If we officially support Qualcomm in the future, we'll need to do something better.
//maxShortList = min(maxShortList, shortListKernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(context.getDevice()));
isShortList = (length <= maxShortList);
Perhaps the list is slightly too large to fit in the maximum CL_KERNEL_WORK_GROUP_SIZE.
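To make the effect of that commented-out check concrete, here is a rough Python port of the decision in the snippet above. The 8192 cap and the min/max structure come from the snippet. The numeric inputs (4096, 64, 56, 256) and the list length 3000 are made-up illustration values, not values read from OpenMM or from any GPU in this thread.

```python
def max_short_list(max_local_buffer, thread_block_size, num_thread_blocks,
                   kernel_work_group_size=None):
    """Mirror of the maxShortList expression in the OpenMM snippet.

    If kernel_work_group_size is given, also apply the clamp that is
    commented out in the snippet."""
    limit = min(8192, max(max_local_buffer,
                          thread_block_size * num_thread_blocks))
    if kernel_work_group_size is not None:
        limit = min(limit, kernel_work_group_size)
    return limit

def is_short_list(length, limit):
    """Same comparison as `isShortList = (length <= maxShortList)`."""
    return length <= limit

if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the two code paths.
    unclamped = max_short_list(4096, 64, 56)
    clamped = max_short_list(4096, 64, 56, kernel_work_group_size=256)
    length = 3000
    print("without clamp:", unclamped, is_short_list(length, unclamped))
    print("with clamp:   ", clamped, is_short_list(length, clamped))
```

With the clamp left out, a list can be classified as short even though it is far longer than what a single work group reports it can handle. That is the situation the guess above describes.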
In case this becomes a bug report, here's the output from GPU Caps Viewer.
Code: Select all
[ OpenCL Capabilities ]
- Num OpenCL platforms: 1
- CL_PLATFORM_NAME: AMD Accelerated Parallel Processing
- CL_PLATFORM_VENDOR: Advanced Micro Devices, Inc.
- CL_PLATFORM_VERSION: OpenCL 2.0 AMD-APP (2348.4)
- CL_PLATFORM_PROFILE: FULL_PROFILE
- Num devices: 3
- CL_DEVICE_NAME: Baffin
- CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
- CL_DRIVER_VERSION: 2348.4
- CL_DEVICE_PROFILE: FULL_PROFILE
- CL_DEVICE_VERSION: OpenCL 1.2 AMD-APP (2348.4)
- CL_DEVICE_TYPE: GPU
- CL_DEVICE_VENDOR_ID: 0x1002
- CL_DEVICE_MAX_COMPUTE_UNITS: 14
- CL_DEVICE_MAX_CLOCK_FREQUENCY: 1210MHz
- CL_DEVICE_ADDRESS_BITS: 32
- CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1376256KB
- CL_DEVICE_GLOBAL_MEM_SIZE: 2048MB
- CL_DEVICE_MAX_PARAMETER_SIZE: 1024
- CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64 Bytes
- CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 16KB
- CL_DEVICE_ERROR_CORRECTION_SUPPORT: NO
- CL_DEVICE_LOCAL_MEM_TYPE: Local (scratchpad)
- CL_DEVICE_LOCAL_MEM_SIZE: 32KB
- CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 1376256KB
- CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
- CL_DEVICE_MAX_WORK_ITEM_SIZES: [256 ; 256 ; 256]
- CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
- CL_EXEC_NATIVE_KERNEL: 10920428
- CL_DEVICE_IMAGE_SUPPORT: YES
- CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
- CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
- CL_DEVICE_IMAGE2D_MAX_WIDTH: 16384
- CL_DEVICE_IMAGE2D_MAX_HEIGHT: 16384
- CL_DEVICE_IMAGE3D_MAX_WIDTH: 2048
- CL_DEVICE_IMAGE3D_MAX_HEIGHT: 2048
- CL_DEVICE_IMAGE3D_MAX_DEPTH: 2048
- CL_DEVICE_MAX_SAMPLERS: 16
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 4
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 2
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
- CL_DEVICE_EXTENSIONS: 25
- Extensions:
- cl_khr_fp64
- cl_amd_fp64
- cl_khr_global_int32_base_atomics
- cl_khr_global_int32_extended_atomics
- cl_khr_local_int32_base_atomics
- cl_khr_local_int32_extended_atomics
- cl_khr_int64_base_atomics
- cl_khr_int64_extended_atomics
- cl_khr_3d_image_writes
- cl_khr_byte_addressable_store
- cl_khr_fp16
- cl_khr_gl_sharing
- cl_amd_device_attribute_query
- cl_amd_vec3
- cl_amd_printf
- cl_amd_media_ops
- cl_amd_media_ops2
- cl_amd_popcnt
- cl_khr_d3d10_sharing
- cl_khr_d3d11_sharing
- cl_khr_dx9_media_sharing
- cl_khr_image2d_from_buffer
- cl_khr_spir
- cl_khr_gl_event
- cl_amd_liquid_flash
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Mar 21, 2020 8:53 pm
by alxbelu
I've observed the same issue with my R9 290X (4 GB) on Win10, specifically for projects 11759, 11746, 11747 and 14533, which are indeed all projects with more than 165550 atoms. I had struggled to see a pattern in why they failed, but they all fail with the same error, immediately after starting, e.g.:
Code: Select all
12:26:54:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:11759 run:0 clone:2210 gen:14 core:0x22 unit:0x0000001480fccb0a5e6d7c9833f7cd06
12:26:54:WU03:FS01:Starting
12:26:54:WU03:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\alxbelu\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 03 -suffix 01 -version 705 -lifeline 15036 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
12:26:54:WU03:FS01:Started FahCore on PID 18348
12:26:54:WU03:FS01:Core PID:21604
12:26:54:WU03:FS01:FahCore 0x22 started
12:26:55:WU03:FS01:0x22:*********************** Log Started 2020-03-21T12:26:54Z ***********************
12:26:55:WU03:FS01:0x22:*************************** Core22 Folding@home Core ***************************
12:26:55:WU03:FS01:0x22: Type: 0x22
12:26:55:WU03:FS01:0x22: Core: Core22
12:26:55:WU03:FS01:0x22: Website: https://foldingathome.org/
12:26:55:WU03:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
12:26:55:WU03:FS01:0x22: Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
12:26:55:WU03:FS01:0x22: <rafal.wiewiora@choderalab.org>
12:26:55:WU03:FS01:0x22: Args: -dir 03 -suffix 01 -version 705 -lifeline 18348 -checkpoint 15
12:26:55:WU03:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
12:26:55:WU03:FS01:0x22: Config: <none>
12:26:55:WU03:FS01:0x22:************************************ Build *************************************
12:26:55:WU03:FS01:0x22: Version: 0.0.2
12:26:55:WU03:FS01:0x22: Date: Dec 6 2019
12:26:55:WU03:FS01:0x22: Time: 21:30:31
12:26:55:WU03:FS01:0x22: Repository: Git
12:26:55:WU03:FS01:0x22: Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
12:26:55:WU03:FS01:0x22: Branch: HEAD
12:26:55:WU03:FS01:0x22: Compiler: Visual C++ 2008
12:26:55:WU03:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
12:26:55:WU03:FS01:0x22: Platform: win32 10
12:26:55:WU03:FS01:0x22: Bits: 64
12:26:55:WU03:FS01:0x22: Mode: Release
12:26:55:WU03:FS01:0x22:************************************ System ************************************
12:26:55:WU03:FS01:0x22: CPU: Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
12:26:55:WU03:FS01:0x22: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
12:26:55:WU03:FS01:0x22: CPUs: 4
12:26:55:WU03:FS01:0x22: Memory: 23.88GiB
12:26:55:WU03:FS01:0x22:Free Memory: 16.84GiB
12:26:55:WU03:FS01:0x22: Threads: WINDOWS_THREADS
12:26:55:WU03:FS01:0x22: OS Version: 6.2
12:26:55:WU03:FS01:0x22:Has Battery: false
12:26:55:WU03:FS01:0x22: On Battery: false
12:26:55:WU03:FS01:0x22: UTC Offset: 1
12:26:55:WU03:FS01:0x22: PID: 21604
12:26:55:WU03:FS01:0x22: CWD: C:\Users\alxbelu\AppData\Roaming\FAHClient\work
12:26:55:WU03:FS01:0x22: OS: Windows 10 Pro
12:26:55:WU03:FS01:0x22: OS Arch: AMD64
12:26:55:WU03:FS01:0x22:********************************************************************************
12:26:55:WU03:FS01:0x22:Project: 11759 (Run 0, Clone 2210, Gen 14)
12:26:55:WU03:FS01:0x22:Unit: 0x0000001480fccb0a5e6d7c9833f7cd06
12:26:55:WU03:FS01:0x22:Reading tar file core.xml
12:26:55:WU03:FS01:0x22:Reading tar file integrator.xml
12:26:55:WU03:FS01:0x22:Reading tar file state.xml
12:26:55:WU03:FS01:0x22:Reading tar file system.xml
12:26:56:WU03:FS01:0x22:Digital signatures verified
12:26:56:WU03:FS01:0x22:Folding@home GPU Core22 Folding@home Core
12:26:56:WU03:FS01:0x22:Version 0.0.2
12:27:09:WU03:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
I also tried updating my AMD drivers to the latest version (20.3.1), as someone recommended here on the forum, without any effect. On the other hand, the driver version reported to FAH doesn't seem to have changed since the update: it still reports "OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3004.8". I'm assuming the OpenCL driver simply didn't change from my previous version.
edit: also attaching GPU Caps OpenCL info:
Code: Select all
===================================[ OpenCL Capabilities ]
- Num OpenCL platforms: 2
- CL_PLATFORM_NAME: AMD Accelerated Parallel Processing
- CL_PLATFORM_VENDOR: Advanced Micro Devices, Inc.
- CL_PLATFORM_VERSION: OpenCL 2.1 AMD-APP (3004.8)
- CL_PLATFORM_PROFILE: FULL_PROFILE
- Num devices: 1
- CL_DEVICE_NAME: Hawaii
- CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
- CL_DRIVER_VERSION: 3004.8
- CL_DEVICE_PROFILE: FULL_PROFILE
- CL_DEVICE_VERSION: OpenCL 1.2 AMD-APP (3004.8)
- CL_DEVICE_TYPE: GPU
- CL_DEVICE_VENDOR_ID: 0x1002
- CL_DEVICE_MAX_COMPUTE_UNITS: 44
- CL_DEVICE_MAX_CLOCK_FREQUENCY: 1500MHz
- CL_DEVICE_ADDRESS_BITS: 32
- CL_DEVICE_MAX_MEM_ALLOC_SIZE: 3145728KB
- CL_DEVICE_GLOBAL_MEM_SIZE: 3072MB
- CL_DEVICE_MAX_PARAMETER_SIZE: 1024
- CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64 Bytes
- CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 16KB
- CL_DEVICE_ERROR_CORRECTION_SUPPORT: NO
- CL_DEVICE_LOCAL_MEM_TYPE: Local (scratchpad)
- CL_DEVICE_LOCAL_MEM_SIZE: 32KB
- CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 3145728KB
- CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
- CL_DEVICE_MAX_WORK_ITEM_SIZES: [1024 ; 1024 ; 1024]
- CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
- CL_EXEC_NATIVE_KERNEL: 12886508
- CL_DEVICE_IMAGE_SUPPORT: YES
- CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
- CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
- CL_DEVICE_IMAGE2D_MAX_WIDTH: 16384
- CL_DEVICE_IMAGE2D_MAX_HEIGHT: 16384
- CL_DEVICE_IMAGE3D_MAX_WIDTH: 2048
- CL_DEVICE_IMAGE3D_MAX_HEIGHT: 2048
- CL_DEVICE_IMAGE3D_MAX_DEPTH: 2048
- CL_DEVICE_MAX_SAMPLERS: 16
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 4
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 2
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
- CL_DEVICE_EXTENSIONS: 24
- Extensions:
- cl_khr_fp64
- cl_amd_fp64
- cl_khr_global_int32_base_atomics
- cl_khr_global_int32_extended_atomics
- cl_khr_local_int32_base_atomics
- cl_khr_local_int32_extended_atomics
- cl_khr_int64_base_atomics
- cl_khr_int64_extended_atomics
- cl_khr_3d_image_writes
- cl_khr_byte_addressable_store
- cl_khr_gl_sharing
- cl_amd_device_attribute_query
- cl_amd_vec3
- cl_amd_printf
- cl_amd_media_ops
- cl_amd_media_ops2
- cl_amd_popcnt
- cl_khr_d3d10_sharing
- cl_khr_d3d11_sharing
- cl_khr_dx9_media_sharing
- cl_khr_image2d_from_buffer
- cl_khr_spir
- cl_khr_gl_event
- cl_amd_liquid_flash
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Mar 21, 2020 11:49 pm
by _r2w_ben
I was reading through more source code and came across references to values from AMD OpenCL extensions that were not included in the output from standard clinfo. Here is the output from another clinfo tool that tries to pull all possible values.
Code: Select all
Device Name Baffin
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 2.0 AMD-APP (2348.4)
Driver Version 2348.4
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
Device Board Name (AMD) Radeon(TM) RX 460 Graphics
Device Topology (AMD) PCI-E, 01:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 14
SIMD per compute unit (AMD) 4
SIMD width (AMD) 16
SIMD instruction width (AMD) 1
Max clock frequency 1210MHz
Graphics IP (AMD) 8.0
Device Partition (core)
Max number of sub-devices 14
Supported partition types (n/a)
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size (AMD) <printDeviceInfo:40: get CL_DEVICE_PREFERRED_WORK_GROUP_SIZE_AMD : error -30>
Max work group size (AMD) <printDeviceInfo:41: get CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD : error -30>
Preferred work group size multiple 64
Wavefront width (AMD) 64
Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 1 / 1 (cl_khr_fp16)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 2147483648 (2GiB)
Global free memory (AMD) 2045385 (1.951GiB)
Global memory channels (AMD) 4
Global memory banks per channel (AMD) 16
Global memory bank width (AMD) 256 bytes
Error Correction support No
Max memory allocation 1409286144 (1.313GiB)
Unified memory for Host and Device No
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing Yes
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 2048 bits (256 bytes)
Preferred alignment for atomics
SVM 0 bytes
Global 0 bytes
Local 0 bytes
Max size for global variable 1268357376 (1.181GiB)
Preferred total size of global vars 2147483648 (2GiB)
Global Memory cache type Read/Write
Global Memory cache size 16384 (16KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 64
Max number of read/write image args 64
Max number of pipe args 16
Max active pipe reservations 16
Max pipe packet size 1409286144 (1.313GiB)
Local memory type Local
Local memory size 32768 (32KiB)
Local memory syze per CU (AMD) 65536 (64KiB)
Local memory banks (AMD) 32
Max number of constant args 8
Max constant buffer size 1409286144 (1.313GiB)
Preferred constant buffer size (AMD) <printDeviceInfo:133: get CL_DEVICE_PREFERRED_CONSTANT_BUFFER_SIZE_AMD : error -30>
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution No
Profiling Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 262144 (256KiB)
Max size 8388608 (8MiB)
Max queues on device 1
Max events on device 1024
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Profiling timer offset since Epoch (AMD) 1584799148102833223ns (Sat Mar 21 09:59:08 2020)
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Thread trace supported (AMD) Yes
Number of async queues (AMD) 2
Max real-time compute queues (AMD) 0
Max real-time compute units (AMD) 0
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Device Extensions cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_khr_gl_depth_images cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash
Values of constants for RX 460:
OpenCLSort.cpp:
- CL_DEVICE_MAX_WORK_GROUP_SIZE = 256
- CL_DEVICE_LOCAL_MEM_SIZE = 32768
- CL_KERNEL_WORK_GROUP_SIZE = ?
OpenCLContext.cpp:
- CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD = 4
- CL_DEVICE_WAVEFRONT_WIDTH_AMD = 64
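Some back-of-the-envelope arithmetic with these values, as a Python sketch. The 32768-byte local memory size is the RX 460 value listed above, and 8192 is the cap from the OpenMM snippet quoted earlier in the thread. The element sizes (4, 8 and 16 bytes) are assumptions for illustration only, as is the idea that the whole short list must sit in local memory at once.

```python
LOCAL_MEM_BYTES = 32768      # CL_DEVICE_LOCAL_MEM_SIZE reported for the RX 460
MAX_SHORT_LIST_CAP = 8192    # upper bound used in the OpenMM snippet

def elements_that_fit(local_mem_bytes, element_bytes):
    """How many elements of the given size fit in local memory."""
    return local_mem_bytes // element_bytes

def fits_in_local_memory(length, element_bytes,
                         local_mem_bytes=LOCAL_MEM_BYTES):
    """True if a list of `length` elements fits entirely in local memory."""
    return length * element_bytes <= local_mem_bytes

if __name__ == "__main__":
    # Element sizes are hypothetical; the real sort element type may differ.
    for element_bytes in (4, 8, 16):
        capacity = elements_that_fit(LOCAL_MEM_BYTES, element_bytes)
        ok = fits_in_local_memory(MAX_SHORT_LIST_CAP, element_bytes)
        print(element_bytes, "byte elements:", capacity, "fit;",
              "a list at the 8192 cap fits" if ok else
              "a list at the 8192 cap does not fit")
```

Under these assumptions, a list at the 8192 cap fits in 32 KiB only when each element is 4 bytes or smaller. So a limit derived from thread counts rather than from local memory could plausibly ask for more than the device has.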
Re: AMD GPU Error sortShortList on some projects
Posted: Sun Mar 22, 2020 8:58 am
by somata
Nice detective work _r2w_ben! Been having the same issue on my RX 480, and made the connection to larger simulations but didn't go so far as to look at the OpenMM code. Perhaps this insight can help Pande Labs/OpenMM/AMD figure out a solution.
Re: AMD GPU Error sortShortList on some projects
Posted: Sun Mar 22, 2020 6:16 pm
by bruce
I'd say this is one of the disadvantages of supporting a wealth of different GPU models with a single hardware device code.
Surely the ID 0x1002:0x67df (also known as the Ellesmere XT [Radeon RX 470/480/570/580]) represents some variation in hardware capabilities.
FAH treats them as identical hardware.
Re: AMD GPU Error sortShortList on some projects
Posted: Mon Mar 23, 2020 10:34 am
by _r2w_ben
I don't think that's the problem here. These GPUs are all very similar. The 470 is a cut-down 480. The 570 and 580 are the same GPU with minor tweaks to power and clock speeds, launched the following year.
https://www.anandtech.com/show/11278/am ... 570-review
OpenMM appears to make an incorrect guess of the memory size available within a CU on Polaris 10 and Polaris 11. The researchers tend to have access to higher-end hardware so this is understandable.
peastman would probably be the most able to confirm this.
Re: AMD GPU Error sortShortList on some projects
Posted: Mon Mar 23, 2020 1:52 pm
by alxbelu
As mentioned, I seem to be experiencing the same issue on an R9 290X as well, though I guess technically they are all GCN-based, albeit different generations.
edit: adding clinfo for comparison's sake
Code:
Number of platforms 2
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (3004.8)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices
Platform Host timer resolution 100ns
Platform Extensions function suffix AMD
Platform Name AMD Accelerated Parallel Processing
Number of devices 1
Device Name Hawaii
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 2.0 AMD-APP (3004.8)
Driver Version 3004.8
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
Device Board Name (AMD) AMD Radeon R9 200 Series
Device Topology (AMD) PCI-E, 01:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 44
SIMD per compute unit (AMD) 4
SIMD width (AMD) 16
SIMD instruction width (AMD) 1
Max clock frequency 1000MHz
Graphics IP (AMD) 7.2
Device Partition (core)
Max number of sub-devices 44
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Preferred work group size (AMD) 256
Max work group size (AMD) 1024
Preferred work group size multiple 64
Wavefront width (AMD) 64
Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 1 / 1 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 4294967296 (4GiB)
Global free memory (AMD) 4142762 (3.951GiB)
Global memory channels (AMD) 16
Global memory banks per channel (AMD) 16
Global memory bank width (AMD) 256 bytes
Error Correction support No
Max memory allocation 3422552064 (3.188GiB)
Unified memory for Host and Device No
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing Yes
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 2048 bits (256 bytes)
Preferred alignment for atomics
SVM 0 bytes
Global 0 bytes
Local 0 bytes
Max size for global variable 3080296704 (2.869GiB)
Preferred total size of global vars 4294967296 (4GiB)
Global Memory cache type Read/Write
Global Memory cache size 16384 (16KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 64
Max number of read/write image args 64
Max number of pipe args 16
Max active pipe reservations 16
Max pipe packet size 3422552064 (3.188GiB)
Local memory type Local
Local memory size 32768 (32KiB)
Local memory syze per CU (AMD) 65536 (64KiB)
Local memory banks (AMD) 32
Max number of constant args 8
Max constant buffer size 3422552064 (3.188GiB)
Preferred constant buffer size (AMD) 16384 (16KiB)
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution No
Profiling Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 262144 (256KiB)
Max size 8388608 (8MiB)
Max queues on device 1
Max events on device 1024
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Profiling timer offset since Epoch (AMD) 1584955033209549400ns (Mon Mar 23 10:17:13 2020)
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Thread trace supported (AMD) Yes
Number of async queues (AMD) 2
Max real-time compute queues (AMD) 2
Max real-time compute units (AMD) 12
SPIR versions 1.2
printf() buffer size 4194304 (4MiB)
Built-in kernels (n/a)
Device Extensions cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_khr_gl_depth_images cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash cl_amd_planar_yuv
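Comparing clinfo dumps like the one above by eye is tedious. The following Python sketch pulls out a handful of fields so that several devices can be lined up side by side. The field labels are copied from the dump above; the function name and the matching rule are illustrative assumptions, not part of clinfo itself.

```python
# Sketch: pull a few fields out of clinfo's text output for comparison.
# The labels are copied from the dump above; everything else is illustrative.

FIELDS = [
    "Device Name",
    "Max compute units",
    "Max work group size",
    "Wavefront width (AMD)",
    "Local memory size",
]


def extract_fields(clinfo_text, labels=FIELDS):
    """Return {label: value} for the first matching line of each label."""
    found = {}
    for raw in clinfo_text.splitlines():
        line = raw.strip()
        for label in labels:
            if label in found or not line.startswith(label):
                continue
            rest = line[len(label):]
            # The label must be followed by whitespace, and the value must not
            # start with "(", which skips vendor-suffixed variants such as
            # "Max work group size (AMD)".
            if rest.startswith(" ") and not rest.lstrip().startswith("("):
                found[label] = rest.strip()
    return found


sample = """\
Device Name Hawaii
Max compute units 44
Max work group size 256
Max work group size (AMD) 1024
Wavefront width (AMD) 64
Local memory size 32768 (32KiB)
"""

info = extract_fields(sample)
for label, value in info.items():
    print(f"{label}: {value}")
```

Running the same function over the dump from each affected card gives a small table of work-group size, wavefront width, and local memory size to compare.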
Re: AMD GPU Error sortShortList on some projects
Posted: Mon Mar 23, 2020 7:49 pm
by muziqaz
These were failing on Radeon VII and Vega 64 as well. This is some sort of issue with the OpenCL config on these projects. The author of the projects is aware of the issue, but with everything that is going on in the FAH world right now, it's kind of hard to concentrate on finding it.
Simulation size is not the cause, as AMD GPUs are known to work better in large atom count simulations. AMD GPUs have worked flawlessly with atom counts up to 260k. One of those projects was quite recent and is still circulating; I can't remember the number.
Also, those failing projects have been tested on a very old AMD APU with an integrated HD 8250 class GPU and an ancient OpenCL version (1800), and what do you know, it is folding.
This is an AMD-only issue, unfortunately.
Nonetheless, nice investigative work there.
Re: AMD GPU Error sortShortList on some projects
Posted: Mon Mar 23, 2020 8:05 pm
by alxbelu
Was the 260k atom count also with Core22? (I'm new, but got the impression that Core22 was fairly new as well?)
Worth mentioning is that there seem to be at least six different owners of the projects that have failed here: rafal.wiewiora, voelz, vithanin, totowfr, joseph and jrporter.
(I personally have logs for 11746, 11747, 11752, 11759, 11764, 11776, 11781 and 14533; the OP has additionally noted 11758 and 14551.)
edit: bruce mentioned it had been escalated to the OpenMM team in this thread: viewtopic.php?f=81&t=32771
Re: AMD GPU Error sortShortList on some projects
Posted: Mon Mar 23, 2020 8:19 pm
by muziqaz
p11738: atom count 287229, OpenMM core_22.
It worked flawlessly on my Radeon VII at 2.2M PPD.
All the listed failing projects have the same roots, and the owners share things with each other. If they didn't, they would be stepping on each other's toes.
And yes, the researchers are aware of these.
Thank you
Re: AMD GPU Error sortShortList on some projects
Posted: Mon Mar 23, 2020 8:37 pm
by alxbelu
muziqaz wrote:
All the listed failing projects have same roots, and owners share things between each other. If they didn't they would be stepping on each other's toes
And yes, researchers are aware of these
Alright, obviously makes sense
Thanks for clarifying!
Re: AMD GPU Error sortShortList on some projects
Posted: Tue Mar 24, 2020 9:42 am
by sam6861
9 more errors, now with projects 11747, 11758, 11759, 11776, 11781, 14533
My hardware IDs as shown in Device Manager:
(FS01) AMD RX 580: PCI\VEN_1002&DEV_67DF&SUBSYS_C5801682&REV_E7
(FS03) Vega 11: PCI\VEN_1002&DEV_15DD&SUBSYS_876B1043&REV_C6
My other computer (Windows 10, AMD driver 19.2.3) has a somewhat slow GPU and slow work unit downloads. It has completed just 5 work units and hasn't downloaded any of the faulty project numbers, so nothing has gone wrong there so far.
AMD RX 550: PCI\VEN_1002&DEV_67FF&SUBSYS_E3671DA2&REV_FF
Going further, I copied a failed work unit and ran FahCore_22 manually from the command line. This specific work unit fails every time on all 3 of my AMD GPUs (RX 580, RX 550, Vega 11) with the sortShortList error. Using -gpu-vendor nvidia on the computer with the NVidia GPU (my GT 1030) appears to work fine: it uses NVidia CUDA processing, and the log shows percent complete until I manually quit with Ctrl+C.
Maybe this sortShortList error on the affected projects happens on all AMD GCN GPUs?
Code:
logs\log-20200321-112208.txt:10:12:32:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11776 run:0 clone:1538 gen:4 core:0x22 unit:0x0000000c287234c95e73c47d47fd7729
logs\log-20200321-112208.txt:10:13:08:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11758 run:0 clone:833 gen:0 core:0x22 unit:0x000000099bf7a4d55e6d77116aaf791f
logs\log-20200324-084817.txt:14:28:13:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:14533 run:0 clone:3956 gen:2 core:0x22 unit:0x0000000680fccb025e72f222517f50f6
logs\log-20200324-084817.txt:09:18:55:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11747 run:0 clone:5750 gen:4 core:0x22 unit:0x000000098ca304e75e6bab8cb89c5ccb
logs\log-20200324-084817.txt:05:48:26:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11759 run:0 clone:7605 gen:11 core:0x22 unit:0x0000001180fccb0a5e6ea0d3aa6b008e
logs\log-20200324-084817.txt:06:34:47:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11781 run:0 clone:2504 gen:7 core:0x22 unit:0x000000090d5a98395e73c51775cd22d4
logs\log-20200324-084817.txt:12:28:20:WU04:FS01:Sending unit results: id:04 state:SEND error:FAULTY project:11747 run:0 clone:515 gen:9 core:0x22 unit:0x000000118ca304e75e6a7fc710aa491e
logs\log-20200324-084817.txt:18:01:21:WU03:FS03:Sending unit results: id:03 state:SEND error:FAULTY project:11758 run:0 clone:1241 gen:0 core:0x22 unit:0x000000099bf7a4d55e6d77129c1e48c9
logs\log-20200324-084817.txt:08:12:14:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11776 run:0 clone:7964 gen:2 core:0x22 unit:0x00000006287234c95e74337cabc82607
C:\Users\sam86\AppData\Roaming\FAHClient>findstr "ERROR:exception" logs\*.txt log.txt
logs\log-20200321-112208.txt:10:12:31:WU01:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200321-112208.txt:10:13:07:WU02:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200324-084817.txt:14:28:12:WU01:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200324-084817.txt:09:18:54:WU01:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200324-084817.txt:05:48:26:WU01:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200324-084817.txt:06:34:46:WU02:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200324-084817.txt:12:28:19:WU04:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200324-084817.txt:18:01:19:WU03:FS03:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
logs\log-20200324-084817.txt:08:12:14:WU02:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
C:\Users\sam86\AppData\Roaming\FAHClient\TEST1\99>C:\Users\sam86\AppData\Roaming\FAHClient\cores\cores.foldingathome.org\v7\win\64bit\Core_22.fah\FahCore_22.exe -dir "C:\Users\sam86\AppData\Roaming\FAHClient\TEST1\99" -suffix 01 -version 705 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
C:\Users\sam86\AppData\Roaming\FAHClient\TEST1\99>type logfile_01.txt
*********************** Log Started 2020-03-24T09:07:46Z ***********************
*************************** Core22 Folding@home Core ***************************
Type: 0x22
Core: Core22
Copyright: (c) 2009-2018 foldingathome.org
Author: John Chodera <john.chodera ...> and Rafal Wiewiora
<rafal.wiewiora ...>
Args: -dir C:\Users\sam86\AppData\Roaming\FAHClient\TEST1\99 -suffix 01
-version 705 -checkpoint 15 -gpu-vendor amd -opencl-platform 0
-opencl-device 0 -gpu 0
Config: <none>
************************************ Build *************************************
Version: 0.0.2
Date: Dec 6 2019
Time: 21:30:31
Repository: Git
Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
Branch: HEAD
Compiler: Visual C++ 2008
Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
Platform: win32 10
Bits: 64
Mode: Release
************************************ System ************************************
CPU: AMD Ryzen 5 2400G with Radeon Vega Graphics
CPU ID: AuthenticAMD Family 23 Model 17 Stepping 0
CPUs: 8
Memory: 31.81GiB
Free Memory: 26.16GiB
Threads: WINDOWS_THREADS
OS Version: 6.2
Has Battery: false
On Battery: false
UTC Offset: -5
PID: 4740
CWD: C:\Users\sam86\AppData\Roaming\FAHClient\TEST1\99
OS: Windows 10 Home
OS Arch: AMD64
********************************************************************************
Project: 11776 (Run 0, Clone 1911, Gen 2)
Unit: 0x00000003287234c95e73c47a4958b245
Reading tar file core.xml
Reading tar file integrator.xml
Reading tar file state.xml
Reading tar file system.xml
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.2
ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
Saving result file ..\logfile_01.txt
Saving result file science.log
Folding@home Core Shutdown: BAD_WORK_UNIT
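The findstr searches above can also be done with a short script that tallies failures per slot and project and decodes the OpenCL status number. This is only a sketch: the regular expressions are assumptions based on the log lines shown in this post, and the function names are made up for illustration. The three status codes are the values defined in the standard OpenCL header.

```python
import re
from collections import Counter

# A few OpenCL status codes as defined in the standard cl.h header.
CL_ERRORS = {
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Patterns assumed from the log lines shown above.
FAULTY_RE = re.compile(
    r":(FS\d+):Sending unit results:.*error:FAULTY project:(\d+)")
KERNEL_RE = re.compile(r"Error invoking kernel (\w+): (\w+) \((-?\d+)\)")


def tally_faulty(lines):
    """Count FAULTY results per (slot, project)."""
    counts = Counter()
    for line in lines:
        m = FAULTY_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


def decode_kernel_error(line):
    """Return (kernel, API call, status name), or None if no match."""
    m = KERNEL_RE.search(line)
    if not m:
        return None
    code = int(m.group(3))
    return m.group(1), m.group(2), CL_ERRORS.get(code, f"unknown ({code})")


# Three lines copied from the log excerpts above.
sample = [
    r"logs\log-20200321-112208.txt:10:12:32:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11776 run:0 clone:1538 gen:4 core:0x22 unit:0x0000000c287234c95e73c47d47fd7729",
    r"logs\log-20200324-084817.txt:18:01:21:WU03:FS03:Sending unit results: id:03 state:SEND error:FAULTY project:11758 run:0 clone:1241 gen:0 core:0x22 unit:0x000000099bf7a4d55e6d77129c1e48c9",
    r"logs\log-20200321-112208.txt:10:12:31:WU01:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)",
]

counts = tally_faulty(sample)
for (slot, project), n in sorted(counts.items()):
    print(slot, project, n)
print(decode_kernel_error(sample[2]))
```

To run it over real logs, read the lines from the files under the FAHClient logs directory instead of the inline sample.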
Re: AMD GPU Error sortShortList on some projects
Posted: Tue Mar 24, 2020 12:51 pm
by muziqaz
Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Thank you for understanding