Project 17435 (Run 0, Clone 172, Gen 111) Stalled Before 0%
Posted: Tue Mar 30, 2021 6:07 am
Log excerpts below, in order: head of log; work unit details and stall; work unit refuses to release GPU; systemd unable to stop FAHClient.
This work unit stalled as soon as it had a lock on the GPU, which it ran at ~100% utilization. The GPU stayed busy, and a system shutdown hung for more than 30 minutes, at which point the machine was hard reset. The work unit had to be dumped manually before folding could resume. The system is a stock CentOS 8.3 desktop running kernel 4.18.0-240.1.1.el8_3.x86_64 and libopencl-amdgpu-pro.x86_64 19.50. Newer kernels and newer amdgpu-pro releases have both failed to run Folding@home at all on this hardware. Other work units from project 17435 and its series have shown similar behavior. What can be done to guard against a work unit that misbehaves so badly that it consumes 100% of the GPU without doing any real work and also prevents the system from shutting down?
viewtopic.php?f=108&t=36871#p349658
Head of log:
*********************** Log Started 2021-03-29T19:10:22Z ***********************
2021-03-29:19:10:22:************************** libFAH **************************
2021-03-29:19:10:22: Date: Oct 20 2020
2021-03-29:19:10:22: Time: 20:36:41
2021-03-29:19:10:22: Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
2021-03-29:19:10:22: Branch: master
2021-03-29:19:10:22: Compiler: GNU 4.9.4
2021-03-29:19:10:22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22: -O3 -funroll-loops
2021-03-29:19:10:22: Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22: Bits: 64
2021-03-29:19:10:22: Mode: Release
2021-03-29:19:10:22:************************ FAHClient *************************
2021-03-29:19:10:22: Version: 7.6.21
2021-03-29:19:10:22: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
2021-03-29:19:10:22: Copyright: 2020 foldingathome.org
2021-03-29:19:10:22: Homepage: https://foldingathome.org/
2021-03-29:19:10:22: Date: Oct 20 2020
2021-03-29:19:10:22: Time: 20:38:59
2021-03-29:19:10:22: Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
2021-03-29:19:10:22: Branch: master
2021-03-29:19:10:22: Compiler: GNU 4.9.4
2021-03-29:19:10:22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22: -O3 -funroll-loops
2021-03-29:19:10:22: Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22: Bits: 64
2021-03-29:19:10:22: Mode: Release
2021-03-29:19:10:22: Args: --config=/etc/fahclient/config.xml --chdir=/var/lib/fahclient/
2021-03-29:19:10:22: Config: /etc/fahclient/config.xml
2021-03-29:19:10:22:************************** CBang ***************************
2021-03-29:19:10:22: Date: Oct 20 2020
2021-03-29:19:10:22: Time: 18:38:01
2021-03-29:19:10:22: Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
2021-03-29:19:10:22: Branch: master
2021-03-29:19:10:22: Compiler: GNU 4.9.4
2021-03-29:19:10:22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22: -O3 -funroll-loops -fPIC
2021-03-29:19:10:22: Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22: Bits: 64
2021-03-29:19:10:22: Mode: Release
2021-03-29:19:10:22:************************** System **************************
2021-03-29:19:10:22: CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
2021-03-29:19:10:22: CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
2021-03-29:19:10:22: CPUs: 64
2021-03-29:19:10:22: Memory: 125.51GiB
2021-03-29:19:10:22: Free Memory: 124.62GiB
2021-03-29:19:10:22: Threads: POSIX_THREADS
2021-03-29:19:10:22: OS Version: 4.18
2021-03-29:19:10:22: Has Battery: false
2021-03-29:19:10:22: On Battery: false
2021-03-29:19:10:22: UTC Offset: -7
2021-03-29:19:10:22: PID: 1661
2021-03-29:19:10:22: CWD: /var/lib/fahclient
2021-03-29:19:10:22: OS: Linux 4.18.0-240.1.1.el8_3.x86_64 x86_64
2021-03-29:19:10:22: OS Arch: AMD64
2021-03-29:19:10:22: GPUs: 2
2021-03-29:19:10:22: GPU 0: Bus:10 Slot:0 Func:0 AMD:5 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22: GPU 1: Bus:69 Slot:0 Func:0 AMD:5 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22: CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
2021-03-29:19:10:22: libcuda.so: cannot open shared object file: No such file or
2021-03-29:19:10:22: directory
2021-03-29:19:10:22:OpenCL Device 0: Platform:0 Device:0 Bus:69 Slot:0 Compute:2.0 Driver:3004.6
2021-03-29:19:10:22:OpenCL Device 1: Platform:0 Device:1 Bus:10 Slot:0 Compute:2.0 Driver:3004.6
2021-03-29:19:10:22:************************************************************
2021-03-29:19:10:22:<config>
2021-03-29:19:10:22: <!-- Error Handling -->
2021-03-29:19:10:22: <max-slot-errors v='20'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- Folding Core -->
2021-03-29:19:10:22: <checkpoint v='5'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- Folding Slot Configuration -->
2021-03-29:19:10:22: <client-type v='advanced'/>
2021-03-29:19:10:22: <cpus v='54'/>
2021-03-29:19:10:22: <disable-viz v='true'/>
2021-03-29:19:10:22: <max-packet-size v='big'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- GUI -->
2021-03-29:19:10:22: <gui-enabled v='false'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- HTTP Server -->
2021-03-29:19:10:22: <max-connect-time v='604800'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- Logging -->
2021-03-29:19:10:22: <log-date v='true'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- Remote Command Server -->
2021-03-29:19:10:22: <command-address v='127.0.0.1'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- Slot Control -->
2021-03-29:19:10:22: <pause-on-battery v='false'/>
2021-03-29:19:10:22: <power v='FULL'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- User Information -->
2021-03-29:19:10:22: <passkey v='*****'/>
2021-03-29:19:10:22: <team v='40524'/>
2021-03-29:19:10:22: <user v='Whompithian'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- Web Server Sessions -->
2021-03-29:19:10:22: <session-lifetime v='0'/>
2021-03-29:19:10:22: <session-timeout v='0'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- Work Unit Control -->
2021-03-29:19:10:22: <stall-detection-enabled v='true'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22: <!-- Folding Slots -->
2021-03-29:19:10:22: <slot id='0' type='CPU'/>
2021-03-29:19:10:22: <slot id='1' type='GPU'>
2021-03-29:19:10:22: <pci-bus v='10'/>
2021-03-29:19:10:22: <pci-slot v='0'/>
2021-03-29:19:10:22: </slot>
2021-03-29:19:10:22: <slot id='2' type='GPU'>
2021-03-29:19:10:22: <pci-bus v='69'/>
2021-03-29:19:10:22: <pci-slot v='0'/>
2021-03-29:19:10:22: </slot>
2021-03-29:19:10:22:</config>
2021-03-29:19:10:22:Trying to access database...
2021-03-29:19:10:22:Successfully acquired database lock
2021-03-29:19:10:22:FS00:Initialized folding slot 00: cpu:54
2021-03-29:19:10:22:FS01:Initialized folding slot 01: gpu:10:0 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22:FS02:Initialized folding slot 02: gpu:69:0 Vega 10 XL/XT [Radeon RX Vega 56/64]
Work unit details and stall:
2021-03-29:23:05:23:WU01:FS02:0x22:*********************** Log Started 2021-03-29T23:05:22Z ***********************
2021-03-29:23:05:23:WU01:FS02:0x22:*************************** Core22 Folding@home Core ***************************
2021-03-29:23:05:23:WU01:FS02:0x22: Core: Core22
2021-03-29:23:05:23:WU01:FS02:0x22: Type: 0x22
2021-03-29:23:05:23:WU01:FS02:0x22: Version: 0.0.13
2021-03-29:23:05:23:WU01:FS02:0x22: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
2021-03-29:23:05:23:WU01:FS02:0x22: Copyright: 2020 foldingathome.org
2021-03-29:23:05:23:WU01:FS02:0x22: Homepage: https://foldingathome.org/
2021-03-29:23:05:23:WU01:FS02:0x22: Date: Sep 19 2020
2021-03-29:23:05:23:WU01:FS02:0x22: Time: 01:10:35
2021-03-29:23:05:23:WU01:FS02:0x22: Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
2021-03-29:23:05:23:WU01:FS02:0x22: Branch: core22-0.0.13
2021-03-29:23:05:23:WU01:FS02:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:23:05:23:WU01:FS02:0x22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:23:05:23:WU01:FS02:0x22: -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
2021-03-29:23:05:23:WU01:FS02:0x22: Platform: linux2 4.19.76-linuxkit
2021-03-29:23:05:23:WU01:FS02:0x22: Bits: 64
2021-03-29:23:05:23:WU01:FS02:0x22: Mode: Release
2021-03-29:23:05:23:WU01:FS02:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
2021-03-29:23:05:23:WU01:FS02:0x22: <peastman@stanford.edu>
2021-03-29:23:05:23:WU01:FS02:0x22: Args: -dir 01 -suffix 01 -version 706 -lifeline 4216 -checkpoint 5
2021-03-29:23:05:23:WU01:FS02:0x22: -opencl-platform 0 -opencl-device 0 -gpu-vendor amd -gpu 0
2021-03-29:23:05:23:WU01:FS02:0x22: -gpu-usage 100
2021-03-29:23:05:23:WU01:FS02:0x22:************************************ libFAH ************************************
2021-03-29:23:05:23:WU01:FS02:0x22: Date: Sep 15 2020
2021-03-29:23:05:23:WU01:FS02:0x22: Time: 05:14:43
2021-03-29:23:05:23:WU01:FS02:0x22: Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
2021-03-29:23:05:23:WU01:FS02:0x22: Branch: HEAD
2021-03-29:23:05:23:WU01:FS02:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:23:05:23:WU01:FS02:0x22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:23:05:23:WU01:FS02:0x22: -funroll-loops
2021-03-29:23:05:23:WU01:FS02:0x22: Platform: linux2 4.19.76-linuxkit
2021-03-29:23:05:23:WU01:FS02:0x22: Bits: 64
2021-03-29:23:05:23:WU01:FS02:0x22: Mode: Release
2021-03-29:23:05:23:WU01:FS02:0x22:************************************ CBang *************************************
2021-03-29:23:05:23:WU01:FS02:0x22: Date: Sep 15 2020
2021-03-29:23:05:23:WU01:FS02:0x22: Time: 05:11:04
2021-03-29:23:05:23:WU01:FS02:0x22: Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
2021-03-29:23:05:23:WU01:FS02:0x22: Branch: HEAD
2021-03-29:23:05:23:WU01:FS02:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:23:05:23:WU01:FS02:0x22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:23:05:23:WU01:FS02:0x22: -funroll-loops -fPIC
2021-03-29:23:05:23:WU01:FS02:0x22: Platform: linux2 4.19.76-linuxkit
2021-03-29:23:05:23:WU01:FS02:0x22: Bits: 64
2021-03-29:23:05:23:WU01:FS02:0x22: Mode: Release
2021-03-29:23:05:23:WU01:FS02:0x22:************************************ System ************************************
2021-03-29:23:05:23:WU01:FS02:0x22: CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
2021-03-29:23:05:23:WU01:FS02:0x22: CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
2021-03-29:23:05:23:WU01:FS02:0x22: CPUs: 64
2021-03-29:23:05:23:WU01:FS02:0x22: Memory: 125.51GiB
2021-03-29:23:05:23:WU01:FS02:0x22:Free Memory: 122.25GiB
2021-03-29:23:05:23:WU01:FS02:0x22: Threads: POSIX_THREADS
2021-03-29:23:05:23:WU01:FS02:0x22: OS Version: 4.18
2021-03-29:23:05:23:WU01:FS02:0x22:Has Battery: false
2021-03-29:23:05:23:WU01:FS02:0x22: On Battery: false
2021-03-29:23:05:23:WU01:FS02:0x22: UTC Offset: -7
2021-03-29:23:05:23:WU01:FS02:0x22: PID: 4220
2021-03-29:23:05:23:WU01:FS02:0x22: CWD: /var/lib/fahclient/work
2021-03-29:23:05:23:WU01:FS02:0x22:************************************ OpenMM ************************************
2021-03-29:23:05:23:WU01:FS02:0x22: Revision: 189320d0
2021-03-29:23:05:23:WU01:FS02:0x22:********************************************************************************
2021-03-29:23:05:23:WU01:FS02:0x22:Project: 17435 (Run 0, Clone 172, Gen 111)
2021-03-29:23:05:23:WU01:FS02:0x22:Unit: 0x00000000000000000000000000000000
2021-03-29:23:05:23:WU01:FS02:0x22:Reading tar file core.xml
2021-03-29:23:05:23:WU01:FS02:0x22:Reading tar file integrator.xml.bz2
2021-03-29:23:05:23:WU01:FS02:0x22:Reading tar file state.xml.bz2
2021-03-29:23:05:23:WU01:FS02:0x22:Reading tar file system.xml.bz2
2021-03-29:23:05:23:WU01:FS02:0x22:Digital signatures verified
2021-03-29:23:05:23:WU01:FS02:0x22:Folding@home GPU Core22 Folding@home Core
2021-03-29:23:05:23:WU01:FS02:0x22:Version 0.0.13
2021-03-29:23:05:23:WU01:FS02:0x22: Checkpoint write interval: 25000 steps (2%) [50 total]
2021-03-29:23:05:23:WU01:FS02:0x22: JSON viewer frame write interval: 12500 steps (1%) [100 total]
2021-03-29:23:05:23:WU01:FS02:0x22: XTC frame write interval: 10000 steps (0.8%) [125 total]
2021-03-29:23:05:23:WU01:FS02:0x22: Global context and integrator variables write interval: disabled
2021-03-29:23:05:23:WU01:FS02:0x22:There are 3 platforms available.
2021-03-29:23:05:23:WU01:FS02:0x22:Platform 0: Reference
2021-03-29:23:05:23:WU01:FS02:0x22:Platform 1: CPU
2021-03-29:23:05:23:WU01:FS02:0x22:Platform 2: OpenCL
2021-03-29:23:05:23:WU01:FS02:0x22: opencl-device 0 specified
2021-03-29:23:05:28:WU03:FS02:Upload 56.24%
2021-03-29:23:05:33:WU03:FS02:Upload complete
2021-03-29:23:05:33:WU03:FS02:Server responded WORK_ACK (400)
2021-03-29:23:05:33:WU03:FS02:Final credit estimate, 97604.00 points
2021-03-29:23:05:33:WU03:FS02:Cleaning up
2021-03-29:23:05:42:WU01:FS02:0x22:Attempting to create OpenCL context:
2021-03-29:23:05:42:WU01:FS02:0x22: Configuring platform OpenCL
2021-03-29:23:06:03:WU00:FS00:0xa8:Completed 2300000 out of 5000000 steps (46%)
2021-03-29:23:06:33:WU02:FS01:0x22:Completed 540000 out of 1000000 steps (54%)
2021-03-29:23:07:40:WU00:FS00:0xa8:Completed 2350000 out of 5000000 steps (47%)
2021-03-29:23:08:37:WU02:FS01:0x22:Completed 550000 out of 1000000 steps (55%)
2021-03-29:23:08:37:WU02:FS01:0x22:Checkpoint completed at step 550000
2021-03-29:23:09:12:WU00:FS00:0xa8:Completed 2400000 out of 5000000 steps (48%)
2021-03-29:23:10:42:WU02:FS01:0x22:Completed 560000 out of 1000000 steps (56%)
2021-03-29:23:10:49:WU00:FS00:0xa8:Completed 2450000 out of 5000000 steps (49%)
2021-03-29:23:12:23:WU00:FS00:0xa8:Completed 2500000 out of 5000000 steps (50%)
2021-03-29:23:12:48:WU02:FS01:0x22:Completed 570000 out of 1000000 steps (57%)
2021-03-29:23:13:59:WU00:FS00:0xa8:Completed 2550000 out of 5000000 steps (51%)
2021-03-29:23:14:51:WU02:FS01:0x22:Completed 580000 out of 1000000 steps (58%)
2021-03-29:23:15:37:WU00:FS00:0xa8:Completed 2600000 out of 5000000 steps (52%)
2021-03-29:23:16:54:WU02:FS01:0x22:Completed 590000 out of 1000000 steps (59%)
2021-03-29:23:17:14:WU00:FS00:0xa8:Completed 2650000 out of 5000000 steps (53%)
2021-03-29:23:18:50:WU00:FS00:0xa8:Completed 2700000 out of 5000000 steps (54%)
2021-03-29:23:18:56:WU02:FS01:0x22:Completed 600000 out of 1000000 steps (60%)
2021-03-29:23:18:57:WU02:FS01:0x22:Checkpoint completed at step 600000
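In the excerpt above, WU01 on FS02 never logs a `Completed ... steps` line after its core starts at 23:05:23, while WU00 and WU02 keep reporting. Below is a minimal sketch of an external check for that pattern, assuming the log line layout shown here. The function name `find_silent_units` and the 10-minute threshold are hypothetical, and this is not a description of how FAHClient's own stall detection works.

```python
import re
from datetime import datetime, timedelta

# Matches core output lines such as
# 2021-03-29:23:05:23:WU01:FS02:0x22:Project: 17435 (Run 0, Clone 172, Gen 111)
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}:\d{2}:\d{2}:\d{2}):(WU\d+):(FS\d+):0x[0-9a-f]+:(.*)$"
)
PROGRESS = re.compile(r"Completed \d+ out of \d+ steps")


def find_silent_units(lines, now, max_silence=timedelta(minutes=10)):
    """Return {(wu, slot): first_seen} for cores that have logged output
    but no progress line, and were first seen more than max_silence ago."""
    first_seen = {}
    progressed = set()
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # client-level lines (uploads, warnings) are ignored
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d:%H:%M:%S")
        key = (m.group(2), m.group(3))
        first_seen.setdefault(key, stamp)
        if PROGRESS.search(m.group(4)):
            progressed.add(key)
    return {
        key: seen
        for key, seen in first_seen.items()
        if key not in progressed and now - seen > max_silence
    }


# A few lines copied from the log above.
sample = [
    "2021-03-29:23:05:23:WU01:FS02:0x22:Project: 17435 (Run 0, Clone 172, Gen 111)",
    "2021-03-29:23:05:42:WU01:FS02:0x22: Configuring platform OpenCL",
    "2021-03-29:23:06:03:WU00:FS00:0xa8:Completed 2300000 out of 5000000 steps (46%)",
    "2021-03-29:23:18:56:WU02:FS01:0x22:Completed 600000 out of 1000000 steps (60%)",
]
stalled = find_silent_units(sample, now=datetime(2021, 3, 29, 23, 18, 57))
print(stalled)
```

This only flags a unit that never reports progress, which is the case in this thread. A unit that progresses and then stops would need a per-unit "last progress" timestamp instead.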
Work unit refuses to release GPU:
2021-03-30:00:55:05:Caught signal SIGTERM(15) on PID 1698
2021-03-30:00:55:05:Exiting, please wait. . .
2021-03-30:00:55:05:WU01:FS02:0x22:Caught signal SIGTERM(15) on PID 4220
2021-03-30:00:55:05:WU01:FS02:0x22:Exiting, please wait. . .
2021-03-30:00:55:05:WU00:FS01:0x22:Caught signal SIGTERM(15) on PID 4707
2021-03-30:00:55:05:WU00:FS01:0x22:Exiting, please wait. . .
2021-03-30:00:55:05:WU00:FS01:0x22:Folding@home Core Shutdown: INTERRUPTED
2021-03-30:00:55:05:WU03:FS00:0xa7:Caught signal SIGTERM(15) on PID 4643
2021-03-30:00:55:05:WU03:FS00:0xa7:Exiting, please wait. . .
2021-03-30:00:55:06:FS02:Shutting core down
2021-03-30:00:56:07:WARNING:FS02:Killing WU01
2021-03-30:00:56:07:WARNING:FS02:Killing WU01
...thousands of these...
2021-03-30:00:56:34:WARNING:FS02:Killing WU01
2021-03-30:00:56:35:WARNING:FS02:Killing WU01
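The client log stamps its lines in UTC (the header reads `Log Started 2021-03-29T19:10:22Z`), while the systemd journal excerpt that follows uses local time, and the client reports `UTC Offset: -7`. Below is a small sketch, assuming that fixed offset, for lining the two logs up. The helper name `to_local` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the fixed offset reported in the client log ("UTC Offset: -7").
LOCAL = timezone(timedelta(hours=-7))


def to_local(fah_stamp):
    """Convert a FAHClient log timestamp (UTC) to local time."""
    utc = datetime.strptime(fah_stamp, "%Y-%m-%d:%H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    return utc.astimezone(LOCAL)


# First SIGTERM in the client log:
print(to_local("2021-03-30:00:55:05").strftime("%d %H:%M:%S"))  # 29 17:55:05
# Last "Killing WU01" warning shown:
print(to_local("2021-03-30:00:56:35").strftime("%d %H:%M:%S"))  # 29 17:56:35
```

Read this way, the last `Killing WU01` warning at 00:56:35 UTC coincides with systemd's 'stop-sigterm' timeout at 17:56:35 local time, 90 seconds after the stop request.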
systemd unable to stop FAHClient:
Mar 29 17:55:05 folding.home systemd[1]: Stopping Folding@home V7 Client...
Mar 29 17:56:35 folding.home systemd[1]: FAHClient.service: State 'stop-sigterm' timed out. Killing.
Mar 29 17:56:35 folding.home systemd[1]: FAHClient.service: Killing process 1698 (FAHClient) with signal SIGKILL.
Mar 29 17:56:35 folding.home systemd[1]: FAHClient.service: Killing process 4220 (FahCore_22) with signal SIGKILL.
Mar 29 17:58:05 folding.home systemd[1]: FAHClient.service: Processes still around after SIGKILL. Ignoring.
Mar 29 17:59:35 folding.home systemd[1]: FAHClient.service: State 'stop-final-sigterm' timed out. Killing.
Mar 29 17:59:35 folding.home systemd[1]: FAHClient.service: Killing process 1698 (FAHClient) with signal SIGKILL.
Mar 29 17:59:35 folding.home systemd[1]: FAHClient.service: Killing process 4220 (FahCore_22) with signal SIGKILL.
Mar 29 18:01:05 folding.home systemd[1]: FAHClient.service: Processes still around after final SIGKILL. Entering failed mode.
Mar 29 18:01:05 folding.home systemd[1]: FAHClient.service: Failed with result 'timeout'.
Mar 29 18:01:05 folding.home systemd[1]: Stopped Folding@home V7 Client.
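The journal shows `FahCore_22` surviving SIGKILL. On Linux that is typical of a process blocked in uninterruptible sleep (state `D`), often inside a driver call, though nothing in these logs confirms that this is what happened here. Below is a minimal sketch for checking a PID's state from `/proc/<pid>/stat`. The helper names are hypothetical and the sample line is made up for illustration.

```python
from pathlib import Path


def parse_state(stat_text):
    """Extract the one-letter state from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid):
    """Return the state letter for a live PID (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())


# Illustrative stat line, NOT taken from the affected system.
example = "4220 (FahCore_22) D 1 1 1 0"
print(parse_state(example))  # D
```

If a core stays in state `D`, signals (SIGKILL included) remain pending until the kernel call returns, so a watchdog built on this can detect and report the condition but cannot clear it.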
bruce wrote: Show us the log of what conditions led to 100% utilization with zero progress after a day. Something is seriously wrong and we want to prevent that from happening to others.