Bad State detected on GPU (AMD)
Posted: Tue May 12, 2020 8:05 pm
by 4n0n
I'm running F@H on Linux Mint 19.3 using an AMD Radeon RX 5500 XT with the original AMD drivers and FAHClient v7.6.9.
From time to time I get the following error on my GPU slot:
Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
Then the result is uploaded (why is this 20 MiB when the WU was cancelled at 0%?) and the next WU download starts. This next WU folds fine until it finishes successfully.
No, neither my system nor my GPU is overclocked. Every setting is left as it originally was.
Code:
19:48:38:WU00:FS01:Download complete
19:48:38:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:11761 run:0 clone:14757 gen:19 core:0x22 unit:0x0000002780fccb0a5e7113f3d50b53da
19:48:38:WU00:FS01:Starting
19:48:38:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 1093 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
19:48:38:WU00:FS01:Started FahCore on PID 26214
19:48:38:WU00:FS01:Core PID:26218
19:48:38:WU00:FS01:FahCore 0x22 started
19:48:38:WU00:FS01:0x22:*********************** Log Started 2020-05-12T19:48:38Z ***********************
19:48:38:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
19:48:38:WU00:FS01:0x22: Type: 0x22
19:48:38:WU00:FS01:0x22: Core: Core22
19:48:38:WU00:FS01:0x22: Website: https://foldingathome.org/
19:48:38:WU00:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
19:48:38:WU00:FS01:0x22: Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
19:48:38:WU00:FS01:0x22: <rafal.wiewiora@choderalab.org>
19:48:38:WU00:FS01:0x22: Args: -dir 00 -suffix 01 -version 706 -lifeline 26214 -checkpoint 15
19:48:38:WU00:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
19:48:38:WU00:FS01:0x22: Config: <none>
19:48:38:WU00:FS01:0x22:************************************ Build *************************************
19:48:38:WU00:FS01:0x22: Version: 0.0.5
19:48:38:WU00:FS01:0x22: Date: Apr 22 2020
19:48:38:WU00:FS01:0x22: Time: 03:57:11
19:48:38:WU00:FS01:0x22: Repository: Git
19:48:38:WU00:FS01:0x22: Revision: 2d69202c898bd9bb3e093f51cd32bf411c2a0388
19:48:38:WU00:FS01:0x22: Branch: HEAD
19:48:38:WU00:FS01:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
19:48:38:WU00:FS01:0x22: Options: -std=c++11 -O3 -funroll-loops
19:48:38:WU00:FS01:0x22: Platform: linux2 4.19.76-linuxkit
19:48:38:WU00:FS01:0x22: Bits: 64
19:48:38:WU00:FS01:0x22: Mode: Release
19:48:38:WU00:FS01:0x22:************************************ System ************************************
19:48:38:WU00:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
19:48:38:WU00:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
19:48:38:WU00:FS01:0x22: CPUs: 12
19:48:38:WU00:FS01:0x22: Memory: 31.37GiB
19:48:38:WU00:FS01:0x22:Free Memory: 25.09GiB
19:48:38:WU00:FS01:0x22: Threads: POSIX_THREADS
19:48:38:WU00:FS01:0x22: OS Version: 5.6
19:48:38:WU00:FS01:0x22:Has Battery: false
19:48:38:WU00:FS01:0x22: On Battery: false
19:48:38:WU00:FS01:0x22: UTC Offset: 2
19:48:38:WU00:FS01:0x22: PID: 26218
19:48:38:WU00:FS01:0x22: CWD: /var/lib/fahclient/work
19:48:38:WU00:FS01:0x22: OS: Linux 5.6.6-050606-generic x86_64
19:48:38:WU00:FS01:0x22: OS Arch: AMD64
19:48:38:WU00:FS01:0x22:********************************************************************************
19:48:38:WU00:FS01:0x22:Project: 11761 (Run 0, Clone 14757, Gen 19)
19:48:38:WU00:FS01:0x22:Unit: 0x0000002780fccb0a5e7113f3d50b53da
19:48:38:WU00:FS01:0x22:Reading tar file core.xml
19:48:38:WU00:FS01:0x22:Reading tar file integrator.xml
19:48:38:WU00:FS01:0x22:Reading tar file state.xml
19:48:38:WU00:FS01:0x22:Reading tar file system.xml
19:48:38:WU00:FS01:0x22:Digital signatures verified
19:48:38:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
19:48:38:WU00:FS01:0x22:Version 0.0.5
19:49:03:WU00:FS01:0x22:Completed 0 out of 2000000 steps (0%)
19:49:03:WU00:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
19:49:16:WU00:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
19:49:16:WU00:FS01:0x22:Following exception occured: Particle coordinate is nan
19:49:24:WU00:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
19:49:24:WU00:FS01:0x22:Following exception occured: Particle coordinate is nan
19:49:32:WU00:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
19:49:32:WU00:FS01:0x22:Following exception occured: Particle coordinate is nan
19:49:32:WU00:FS01:0x22:ERROR:114: Max Retries Reached
19:49:32:WU00:FS01:0x22:Saving result file ../logfile_01.txt
19:49:32:WU00:FS01:0x22:Saving result file badstate-0.xml
19:49:32:WU00:FS01:0x22:Saving result file badstate-1.xml
19:49:32:WU00:FS01:0x22:Saving result file badstate-2.xml
19:49:32:WU00:FS01:0x22:Saving result file checkpt.crc
19:49:32:WU00:FS01:0x22:Saving result file science.log
19:49:32:WU00:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
19:49:33:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
19:49:33:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11761 run:0 clone:14757 gen:19 core:0x22 unit:0x0000002780fccb0a5e7113f3d50b53da
[...]
Any ideas what's wrong and what I could do?
Thanks in advance
Re: Bad State detected on GPU (AMD)
Posted: Tue May 12, 2020 8:43 pm
by bollix47
I see you're using version 0.0.5 of Core 22 ... that's a new version which has been in widespread testing, including full FAH. It's not unusual to see errors when testing cores ... the developer uses the information returned to help diagnose and fix those types of errors ... the 3 Bad State errors are documented and returned to the WS ... hence the 20 MiB size.
You're lucky in that it happened very early so not much time wasted. Know that you're helping with the development of a new core which will eventually be used for all GPU projects.
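The pattern described above (three Bad State retries, then the core gives up) can be checked mechanically. The following is only a minimal Python sketch, assuming log lines keep the `HH:MM:SS:WUnn:FSnn:` prefix and the message texts seen in the log in this thread; the function name `bad_state_summary` is an illustrative assumption and not part of FAHClient. WU ids are reused over time, so it is meant to be run on one excerpt at a time.

```python
import re
from collections import defaultdict

# Prefix layout taken from the log excerpt in this thread:
# time stamp, optional "WARNING:", work-unit id, folding-slot id, message.
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}:(?:WARNING:)?(WU\d+):FS\d+:(.*)$")

def bad_state_summary(lines):
    """Map each WU id to (bad_state_count, hit_max_retries)."""
    counts = defaultdict(int)
    aborted = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a per-WU log line
        wu, message = match.groups()
        if "Bad State detected" in message:
            counts[wu] += 1
        elif "Max Retries Reached" in message:
            aborted.add(wu)
    return {wu: (counts[wu], wu in aborted) for wu in counts}

if __name__ == "__main__":
    # Lines copied from the first log in this thread.
    sample = [
        "19:49:16:WU00:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?",
        "19:49:16:WU00:FS01:0x22:Following exception occured: Particle coordinate is nan",
        "19:49:24:WU00:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?",
        "19:49:32:WU00:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?",
        "19:49:32:WU00:FS01:0x22:ERROR:114: Max Retries Reached",
    ]
    print(bad_state_summary(sample))  # {'WU00': (3, True)}
```

A WU that shows one or two Bad States and then continues would appear with a count below 3 and `False`, which is a different situation from the immediate aborts posted here.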
Re: Bad State detected on GPU (AMD)
Posted: Tue May 12, 2020 9:04 pm
by Joe_H
And you should also receive partial credit for the time the WU spent on your computer.
I looked up in another person's log file what download size they saw for a Project 11761 WU: it was 29.5 MB. So you are returning the error information plus part of the WU data, in an upload smaller than the original download.
Re: Bad State detected on GPU (AMD)
Posted: Wed May 13, 2020 5:43 am
by 4n0n
Thanks bollix47 and Joe_H for your replies.
So there's nothing I can do but hope for the best.
bollix47 wrote: You're lucky in that it happened very early so not much time wasted.
I have now seen this kind of error about 10 times, and folding was always cancelled at 0%.
I'll keep hoping for the next release...
Re: Bad State detected on GPU (AMD)
Posted: Wed May 13, 2020 7:37 am
by bruce
Is that 10 different times with Project: 11761 (Run 0, Clone 14757, Gen 19) or was that on 10 different WUs?
It certainly is unusual to see NaN errors that early in a WU.
Re: Bad State detected on GPU (AMD)
Posted: Wed May 13, 2020 11:01 pm
by 4n0n
bruce wrote: Is that 10 different times with Project: 11761 (Run 0, Clone 14757, Gen 19) or was that on 10 different WUs?
I think these must have been different WUs, as I noticed it now and then over the last couple of weeks. But I didn't check the project number. Maybe these were also 11761 WUs, maybe not.
Next time I will check and post the log here in this thread.
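Checking which WUs failed does not require reading the whole log by eye. A minimal Python sketch, assuming the client keeps writing the `Sending unit results: ... error:FAULTY project:... run:... clone:... gen:...` line shown in the log above; the function name `faulty_units` is an illustrative assumption, not something FAHClient provides.

```python
import re

# Matches the client's "Sending unit results" line for a unit flagged FAULTY.
# The field order is taken from the log excerpt in this thread.
FAULTY_RE = re.compile(
    r"Sending unit results:.*?error:FAULTY "
    r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)"
)

def faulty_units(lines):
    """Return (project, run, clone, gen) tuples for every FAULTY upload."""
    units = []
    for line in lines:
        match = FAULTY_RE.search(line)
        if match:
            units.append(tuple(int(group) for group in match.groups()))
    return units

if __name__ == "__main__":
    # Lines copied from the first log in this thread.
    sample = [
        "19:49:33:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
        "19:49:33:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY "
        "project:11761 run:0 clone:14757 gen:19 core:0x22 "
        "unit:0x0000002780fccb0a5e7113f3d50b53da",
    ]
    print(faulty_units(sample))  # [(11761, 0, 14757, 19)]
```

Because the function accepts any iterable of lines, an open log file can be passed in directly; where that file lives depends on the installation.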
Re: Bad State detected on GPU (AMD)
Posted: Wed May 13, 2020 11:52 pm
by bruce
FACT: there will always be a small percentage of WU errors. The only way to figure that out is to run the WU. FAH does attempt to eliminate undesirable WUs before they're assigned (such as ones producing your NaN errors), but the exact same error might have been caused by hardware with a marginally unstable overclock, so the WU is automatically reassigned to someone else who probably has a stable machine.
I checked the first two projects in your log. The first one failed on 5 different machines and was withdrawn as a "BAD WU." The second one failed several times and then was successfully completed.
Re: Bad State detected on GPU (AMD)
Posted: Sun May 17, 2020 8:04 am
by 4n0n
The issue occurred again. The progress bar in FAHControl showed 0.07%, so I guess the WU was not cancelled at the very beginning and some cycles must have been calculated correctly before it was cancelled.
This time it was Project: 13405 (Run 357, Clone 97, Gen 4)
Code:
*********************** Log Started 2020-05-17T07:45:40Z ***********************
07:54:48:FS01:Unpaused
07:54:48:WU01:FS01:Starting
07:54:48:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 1092 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
07:54:48:WU01:FS01:Started FahCore on PID 4160
07:54:48:WU01:FS01:Core PID:4164
07:54:48:WU01:FS01:FahCore 0x22 started
07:54:48:WU01:FS01:0x22:*********************** Log Started 2020-05-17T07:54:48Z ***********************
07:54:48:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
07:54:48:WU01:FS01:0x22: Type: 0x22
07:54:48:WU01:FS01:0x22: Core: Core22
07:54:48:WU01:FS01:0x22: Website: https://foldingathome.org/
07:54:48:WU01:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
07:54:48:WU01:FS01:0x22: Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
07:54:48:WU01:FS01:0x22: <rafal.wiewiora@choderalab.org>
07:54:48:WU01:FS01:0x22: Args: -dir 01 -suffix 01 -version 706 -lifeline 4160 -checkpoint 15
07:54:48:WU01:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
07:54:48:WU01:FS01:0x22: Config: <none>
07:54:48:WU01:FS01:0x22:************************************ Build *************************************
07:54:48:WU01:FS01:0x22: Version: 0.0.5
07:54:48:WU01:FS01:0x22: Date: Apr 22 2020
07:54:48:WU01:FS01:0x22: Time: 03:57:11
07:54:48:WU01:FS01:0x22: Repository: Git
07:54:48:WU01:FS01:0x22: Revision: 2d69202c898bd9bb3e093f51cd32bf411c2a0388
07:54:48:WU01:FS01:0x22: Branch: HEAD
07:54:48:WU01:FS01:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
07:54:48:WU01:FS01:0x22: Options: -std=c++11 -O3 -funroll-loops
07:54:48:WU01:FS01:0x22: Platform: linux2 4.19.76-linuxkit
07:54:48:WU01:FS01:0x22: Bits: 64
07:54:48:WU01:FS01:0x22: Mode: Release
07:54:48:WU01:FS01:0x22:************************************ System ************************************
07:54:48:WU01:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
07:54:48:WU01:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
07:54:48:WU01:FS01:0x22: CPUs: 12
07:54:48:WU01:FS01:0x22: Memory: 31.37GiB
07:54:48:WU01:FS01:0x22:Free Memory: 28.20GiB
07:54:48:WU01:FS01:0x22: Threads: POSIX_THREADS
07:54:48:WU01:FS01:0x22: OS Version: 5.6
07:54:48:WU01:FS01:0x22:Has Battery: false
07:54:48:WU01:FS01:0x22: On Battery: false
07:54:48:WU01:FS01:0x22: UTC Offset: 2
07:54:48:WU01:FS01:0x22: PID: 4164
07:54:48:WU01:FS01:0x22: CWD: /var/lib/fahclient/work
07:54:48:WU01:FS01:0x22: OS: Linux 5.6.6-050606-generic x86_64
07:54:48:WU01:FS01:0x22: OS Arch: AMD64
07:54:48:WU01:FS01:0x22:********************************************************************************
07:54:48:WU01:FS01:0x22:Project: 13405 (Run 357, Clone 97, Gen 4)
07:54:48:WU01:FS01:0x22:Unit: 0x0000000a12bc7d9a5eb3a38c88265414
07:54:48:WU01:FS01:0x22:Reading tar file core.xml
07:54:48:WU01:FS01:0x22:Reading tar file integrator.xml
07:54:48:WU01:FS01:0x22:Reading tar file state.xml
07:54:48:WU01:FS01:0x22:Reading tar file system.xml
07:54:48:WU01:FS01:0x22:Digital signatures verified
07:54:48:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
07:54:48:WU01:FS01:0x22:Version 0.0.5
07:54:57:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)
07:54:57:WU01:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
07:55:22:WU01:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
07:55:22:WU01:FS01:0x22:Following exception occured: Particle coordinate is nan
07:55:40:WU01:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
07:55:40:WU01:FS01:0x22:Following exception occured: Particle coordinate is nan
07:56:04:WU01:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
07:56:04:WU01:FS01:0x22:Following exception occured: Particle coordinate is nan
07:56:04:WU01:FS01:0x22:ERROR:114: Max Retries Reached
07:56:04:WU01:FS01:0x22:Saving result file ../logfile_01.txt
07:56:04:WU01:FS01:0x22:Saving result file badstate-0.xml
07:56:04:WU01:FS01:0x22:Saving result file badstate-1.xml
07:56:04:WU01:FS01:0x22:Saving result file badstate-2.xml
07:56:04:WU01:FS01:0x22:Saving result file checkpt.crc
07:56:04:WU01:FS01:0x22:Saving result file globals.csv
07:56:04:WU01:FS01:0x22:Saving result file science.log
07:56:04:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
07:56:05:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
07:56:05:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13405 run:357 clone:97 gen:4 core:0x22 unit:0x0000000a12bc7d9a5eb3a38c88265414
07:56:05:WU01:FS01:Uploading 101.16KiB to 18.188.125.154
07:56:05:WU01:FS01:Connecting to 18.188.125.154:8080
07:56:05:WU02:FS01:Connecting to 65.254.110.245:80
07:56:10:WU02:FS01:Assigned to work server 3.133.76.19
07:56:10:WU02:FS01:Requesting new work unit for slot 01: READY gpu:0:Navi 14 [Radeon RX 5500/5500M / Pro 5500M] from 3.133.76.19
07:56:10:WU02:FS01:Connecting to 3.133.76.19:8080
07:56:12:WU01:FS01:Upload complete
07:56:13:WU01:FS01:Server responded WORK_ACK (400)
07:56:13:WU01:FS01:Cleaning up
Re: Bad State detected on GPU (AMD)
Posted: Sun May 17, 2020 8:07 am
by PantherX
Please note that WUs from Projects 13404 and 13405 are highly experimental, so the BAD_WORK_UNIT results can be ignored for now. Useful data is still being sent to the researchers, which allows them to investigate and fine-tune their system. This hasn't been done before by the F@H project, so errors are okay for those 2 projects.
Re: Bad State detected on GPU (AMD)
Posted: Sun May 17, 2020 8:30 am
by 4n0n
Another one. This time it's Project 11761 again (Run 0, Clone 4315, Gen 53).
Code:
08:04:53:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:11761 run:0 clone:4315 gen:53 core:0x22 unit:0x0000006780fccb0a5e6d7d6454e5ebbb
08:04:53:WU02:FS01:Starting
08:04:53:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 1092 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
08:04:53:WU02:FS01:Started FahCore on PID 5540
08:04:53:WU02:FS01:Core PID:5544
08:04:53:WU02:FS01:FahCore 0x22 started
08:04:53:WU02:FS01:0x22:*********************** Log Started 2020-05-17T08:04:53Z ***********************
08:04:53:WU02:FS01:0x22:*************************** Core22 Folding@home Core ***************************
08:04:53:WU02:FS01:0x22: Type: 0x22
08:04:53:WU02:FS01:0x22: Core: Core22
08:04:53:WU02:FS01:0x22: Website: https://foldingathome.org/
08:04:53:WU02:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
08:04:53:WU02:FS01:0x22: Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
08:04:53:WU02:FS01:0x22: <rafal.wiewiora@choderalab.org>
08:04:53:WU02:FS01:0x22: Args: -dir 02 -suffix 01 -version 706 -lifeline 5540 -checkpoint 15
08:04:53:WU02:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
08:04:53:WU02:FS01:0x22: Config: <none>
08:04:53:WU02:FS01:0x22:************************************ Build *************************************
08:04:53:WU02:FS01:0x22: Version: 0.0.5
08:04:53:WU02:FS01:0x22: Date: Apr 22 2020
08:04:53:WU02:FS01:0x22: Time: 03:57:11
08:04:53:WU02:FS01:0x22: Repository: Git
08:04:53:WU02:FS01:0x22: Revision: 2d69202c898bd9bb3e093f51cd32bf411c2a0388
08:04:53:WU02:FS01:0x22: Branch: HEAD
08:04:53:WU02:FS01:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
08:04:53:WU02:FS01:0x22: Options: -std=c++11 -O3 -funroll-loops
08:04:53:WU02:FS01:0x22: Platform: linux2 4.19.76-linuxkit
08:04:53:WU02:FS01:0x22: Bits: 64
08:04:53:WU02:FS01:0x22: Mode: Release
08:04:53:WU02:FS01:0x22:************************************ System ************************************
08:04:53:WU02:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
08:04:53:WU02:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
08:04:53:WU02:FS01:0x22: CPUs: 12
08:04:53:WU02:FS01:0x22: Memory: 31.37GiB
08:04:53:WU02:FS01:0x22:Free Memory: 28.01GiB
08:04:53:WU02:FS01:0x22: Threads: POSIX_THREADS
08:04:53:WU02:FS01:0x22: OS Version: 5.6
08:04:53:WU02:FS01:0x22:Has Battery: false
08:04:53:WU02:FS01:0x22: On Battery: false
08:04:53:WU02:FS01:0x22: UTC Offset: 2
08:04:53:WU02:FS01:0x22: PID: 5544
08:04:53:WU02:FS01:0x22: CWD: /var/lib/fahclient/work
08:04:53:WU02:FS01:0x22: OS: Linux 5.6.6-050606-generic x86_64
08:04:53:WU02:FS01:0x22: OS Arch: AMD64
08:04:53:WU02:FS01:0x22:********************************************************************************
08:04:53:WU02:FS01:0x22:Project: 11761 (Run 0, Clone 4315, Gen 53)
08:04:53:WU02:FS01:0x22:Unit: 0x0000006780fccb0a5e6d7d6454e5ebbb
08:04:53:WU02:FS01:0x22:Reading tar file core.xml
08:04:53:WU02:FS01:0x22:Reading tar file integrator.xml
08:04:53:WU02:FS01:0x22:Reading tar file state.xml
08:04:53:WU02:FS01:0x22:Reading tar file system.xml
08:04:53:WU02:FS01:0x22:Digital signatures verified
08:04:53:WU02:FS01:0x22:Folding@home GPU Core22 Folding@home Core
08:04:53:WU02:FS01:0x22:Version 0.0.5
08:05:17:WU02:FS01:0x22:Completed 0 out of 2000000 steps (0%)
08:05:17:WU02:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
08:05:37:WU02:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
08:05:37:WU02:FS01:0x22:Following exception occured: Particle coordinate is nan
08:05:46:WU02:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
08:05:46:WU02:FS01:0x22:Following exception occured: Particle coordinate is nan
08:05:55:WU02:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
08:05:55:WU02:FS01:0x22:Following exception occured: Particle coordinate is nan
08:05:55:WU02:FS01:0x22:ERROR:114: Max Retries Reached
08:05:55:WU02:FS01:0x22:Saving result file ../logfile_01.txt
08:05:55:WU02:FS01:0x22:Saving result file badstate-0.xml
08:05:55:WU02:FS01:0x22:Saving result file badstate-1.xml
08:05:55:WU02:FS01:0x22:Saving result file badstate-2.xml
08:05:55:WU02:FS01:0x22:Saving result file checkpt.crc
08:05:55:WU02:FS01:0x22:Saving result file science.log
08:05:55:WU02:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
08:05:56:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
08:05:56:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11761 run:0 clone:4315 gen:53 core:0x22 unit:0x0000006780fccb0a5e6d7d6454e5ebbb
08:05:56:WU02:FS01:Uploading 20.51MiB to 128.252.203.10
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 7:14 am
by 4n0n
And two more. This time they are neither 11761 nor 13404/13405. They are
Project:11742 run:0 clone:5828 gen:96 and
Project:11749 run:0 clone:8011 gen:11.
I think I will stop posting these logs now so as not to spam the forum. But I have one last question: is it "normal" to see "Bad State detected" as frequently as I do?
Code:
06:36:13:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:11742 run:0 clone:5828 gen:96 core:0x22 unit:0x000000958ca304f15e6bc525f4ab60cd
06:36:13:WU01:FS01:Starting
06:36:13:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 1092 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
06:36:13:WU01:FS01:Started FahCore on PID 3977
06:36:13:WU01:FS01:Core PID:3981
06:36:13:WU01:FS01:FahCore 0x22 started
06:36:14:WU01:FS01:0x22:*********************** Log Started 2020-05-18T06:36:13Z ***********************
06:36:14:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
06:36:14:WU01:FS01:0x22: Type: 0x22
06:36:14:WU01:FS01:0x22: Core: Core22
06:36:14:WU01:FS01:0x22: Website: https://foldingathome.org/
06:36:14:WU01:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
06:36:14:WU01:FS01:0x22: Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
06:36:14:WU01:FS01:0x22: <rafal.wiewiora@choderalab.org>
06:36:14:WU01:FS01:0x22: Args: -dir 01 -suffix 01 -version 706 -lifeline 3977 -checkpoint 15
06:36:14:WU01:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
06:36:14:WU01:FS01:0x22: Config: <none>
06:36:14:WU01:FS01:0x22:************************************ Build *************************************
06:36:14:WU01:FS01:0x22: Version: 0.0.5
06:36:14:WU01:FS01:0x22: Date: Apr 22 2020
06:36:14:WU01:FS01:0x22: Time: 03:57:11
06:36:14:WU01:FS01:0x22: Repository: Git
06:36:14:WU01:FS01:0x22: Revision: 2d69202c898bd9bb3e093f51cd32bf411c2a0388
06:36:14:WU01:FS01:0x22: Branch: HEAD
06:36:14:WU01:FS01:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
06:36:14:WU01:FS01:0x22: Options: -std=c++11 -O3 -funroll-loops
06:36:14:WU01:FS01:0x22: Platform: linux2 4.19.76-linuxkit
06:36:14:WU01:FS01:0x22: Bits: 64
06:36:14:WU01:FS01:0x22: Mode: Release
06:36:14:WU01:FS01:0x22:************************************ System ************************************
06:36:14:WU01:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
06:36:14:WU01:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
06:36:14:WU01:FS01:0x22: CPUs: 12
06:36:14:WU01:FS01:0x22: Memory: 31.37GiB
06:36:14:WU01:FS01:0x22:Free Memory: 8.74GiB
06:36:14:WU01:FS01:0x22: Threads: POSIX_THREADS
06:36:14:WU01:FS01:0x22: OS Version: 5.6
06:36:14:WU01:FS01:0x22:Has Battery: false
06:36:14:WU01:FS01:0x22: On Battery: false
06:36:14:WU01:FS01:0x22: UTC Offset: 2
06:36:14:WU01:FS01:0x22: PID: 3981
06:36:14:WU01:FS01:0x22: CWD: /var/lib/fahclient/work
06:36:14:WU01:FS01:0x22: OS: Linux 5.6.6-050606-generic x86_64
06:36:14:WU01:FS01:0x22: OS Arch: AMD64
06:36:14:WU01:FS01:0x22:********************************************************************************
06:36:14:WU01:FS01:0x22:Project: 11742 (Run 0, Clone 5828, Gen 96)
06:36:14:WU01:FS01:0x22:Unit: 0x000000958ca304f15e6bc525f4ab60cd
06:36:14:WU01:FS01:0x22:Reading tar file core.xml
06:36:14:WU01:FS01:0x22:Reading tar file integrator.xml
06:36:14:WU01:FS01:0x22:Reading tar file state.xml
06:36:15:WU01:FS01:0x22:Reading tar file system.xml
06:36:15:WU01:FS01:0x22:Digital signatures verified
06:36:15:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
06:36:15:WU01:FS01:0x22:Version 0.0.5
06:36:40:WU01:FS01:0x22:Completed 0 out of 2000000 steps (0%)
06:36:40:WU01:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
06:37:06:WU01:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
06:37:06:WU01:FS01:0x22:Following exception occured: Particle coordinate is nan
06:37:14:WU01:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
06:37:14:WU01:FS01:0x22:Following exception occured: Particle coordinate is nan
06:37:22:WU01:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
06:37:22:WU01:FS01:0x22:Following exception occured: Particle coordinate is nan
06:37:22:WU01:FS01:0x22:ERROR:114: Max Retries Reached
06:37:22:WU01:FS01:0x22:Saving result file ../logfile_01.txt
06:37:22:WU01:FS01:0x22:Saving result file badstate-0.xml
06:37:25:WU01:FS01:0x22:Saving result file badstate-1.xml
06:37:29:WU01:FS01:0x22:Saving result file badstate-2.xml
06:37:32:WU01:FS01:0x22:Saving result file checkpt.crc
06:37:32:WU01:FS01:0x22:Saving result file science.log
06:37:32:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
06:37:33:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
06:37:33:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11742 run:0 clone:5828 gen:96 core:0x22 unit:0x000000958ca304f15e6bc525f4ab60cd
06:37:33:WU01:FS01:Uploading 35.16KiB to 140.163.4.241
Code:
06:52:47:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:11749 run:0 clone:8011 gen:11 core:0x22 unit:0x0000001a8ca304e75e6bb93c5c95109d
06:52:47:WU02:FS01:Starting
06:52:47:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 1092 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
06:52:47:WU02:FS01:Started FahCore on PID 4286
06:52:47:WU02:FS01:Core PID:4290
06:52:47:WU02:FS01:FahCore 0x22 started
06:52:47:WU02:FS01:0x22:*********************** Log Started 2020-05-18T06:52:47Z ***********************
06:52:47:WU02:FS01:0x22:*************************** Core22 Folding@home Core ***************************
06:52:47:WU02:FS01:0x22: Type: 0x22
06:52:47:WU02:FS01:0x22: Core: Core22
06:52:47:WU02:FS01:0x22: Website: https://foldingathome.org/
06:52:47:WU02:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
06:52:47:WU02:FS01:0x22: Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
06:52:47:WU02:FS01:0x22: <rafal.wiewiora@choderalab.org>
06:52:47:WU02:FS01:0x22: Args: -dir 02 -suffix 01 -version 706 -lifeline 4286 -checkpoint 15
06:52:47:WU02:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
06:52:47:WU02:FS01:0x22: Config: <none>
06:52:47:WU02:FS01:0x22:************************************ Build *************************************
06:52:47:WU02:FS01:0x22: Version: 0.0.5
06:52:47:WU02:FS01:0x22: Date: Apr 22 2020
06:52:47:WU02:FS01:0x22: Time: 03:57:11
06:52:47:WU02:FS01:0x22: Repository: Git
06:52:47:WU02:FS01:0x22: Revision: 2d69202c898bd9bb3e093f51cd32bf411c2a0388
06:52:47:WU02:FS01:0x22: Branch: HEAD
06:52:47:WU02:FS01:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
06:52:47:WU02:FS01:0x22: Options: -std=c++11 -O3 -funroll-loops
06:52:47:WU02:FS01:0x22: Platform: linux2 4.19.76-linuxkit
06:52:47:WU02:FS01:0x22: Bits: 64
06:52:47:WU02:FS01:0x22: Mode: Release
06:52:47:WU02:FS01:0x22:************************************ System ************************************
06:52:47:WU02:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
06:52:47:WU02:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
06:52:47:WU02:FS01:0x22: CPUs: 12
06:52:47:WU02:FS01:0x22: Memory: 31.37GiB
06:52:47:WU02:FS01:0x22:Free Memory: 8.37GiB
06:52:47:WU02:FS01:0x22: Threads: POSIX_THREADS
06:52:47:WU02:FS01:0x22: OS Version: 5.6
06:52:47:WU02:FS01:0x22:Has Battery: false
06:52:47:WU02:FS01:0x22: On Battery: false
06:52:47:WU02:FS01:0x22: UTC Offset: 2
06:52:47:WU02:FS01:0x22: PID: 4290
06:52:47:WU02:FS01:0x22: CWD: /var/lib/fahclient/work
06:52:47:WU02:FS01:0x22: OS: Linux 5.6.6-050606-generic x86_64
06:52:47:WU02:FS01:0x22: OS Arch: AMD64
06:52:47:WU02:FS01:0x22:********************************************************************************
06:52:47:WU02:FS01:0x22:Project: 11749 (Run 0, Clone 8011, Gen 11)
06:52:47:WU02:FS01:0x22:Unit: 0x0000001a8ca304e75e6bb93c5c95109d
06:52:47:WU02:FS01:0x22:Reading tar file core.xml
06:52:47:WU02:FS01:0x22:Reading tar file integrator.xml
06:52:47:WU02:FS01:0x22:Reading tar file state.xml
06:52:48:WU02:FS01:0x22:Reading tar file system.xml
06:52:49:WU02:FS01:0x22:Digital signatures verified
06:52:49:WU02:FS01:0x22:Folding@home GPU Core22 Folding@home Core
06:52:49:WU02:FS01:0x22:Version 0.0.5
06:53:14:WU02:FS01:0x22:Completed 0 out of 2000000 steps (0%)
06:53:14:WU02:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
06:53:27:WU02:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
06:53:27:WU02:FS01:0x22:Following exception occured: Particle coordinate is nan
06:53:35:WU02:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
06:53:35:WU02:FS01:0x22:Following exception occured: Particle coordinate is nan
06:53:43:WU02:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
06:53:43:WU02:FS01:0x22:Following exception occured: Particle coordinate is nan
06:53:43:WU02:FS01:0x22:ERROR:114: Max Retries Reached
06:53:43:WU02:FS01:0x22:Saving result file ../logfile_01.txt
06:53:43:WU02:FS01:0x22:Saving result file badstate-0.xml
06:53:47:WU02:FS01:0x22:Saving result file badstate-1.xml
06:53:50:WU02:FS01:0x22:Saving result file badstate-2.xml
06:53:54:WU02:FS01:0x22:Saving result file checkpt.crc
06:53:54:WU02:FS01:0x22:Saving result file science.log
06:53:54:WU02:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
06:53:55:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
06:53:55:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11749 run:0 clone:8011 gen:11 core:0x22 unit:0x0000001a8ca304e75e6bb93c5c95109d
06:53:55:WU02:FS01:Uploading 35.09KiB to 140.163.4.231
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 7:21 am
by Joe_H
Except for the specifically mentioned projects, which use new techniques and where a higher level of errors is expected, no, this is not normal.
Usual causes are an excessive overclock, overheating, a corrupted driver install, or another issue such as failing hardware.
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 12:40 pm
by _r2w_ben
Sometimes a driver crash successfully resets the display portion but needs a full system reboot to restore compute capabilities.
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 3:02 pm
by 4n0n
_r2w_ben wrote: Sometimes a driver crash successfully resets the display portion but needs a full system reboot to restore compute capabilities.
OK, I didn't know that. In case of a driver crash I would expect all of the following WUs to crash. In my case only some WUs crash before eventually another WU folds fine again, without rebooting.
A few minutes ago I had a sequence of 4 WUs failing (Projects 11751, 13404, 13405, 13404) before the fifth WU (Project 16441) started running correctly.
Digging through the logs now makes me think that it may not be only an issue with my hardware but also with the projects. The failures are mainly related to 117xx and 1340x; I have never seen one on a 16xxx project. Maybe it is a bad combination of hardware, project, and core 0x22 v0.0.5 that leads to the high frequency of failures?
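The grouping by project family can be made explicit with a short tally. This is a minimal Python sketch using only the project numbers of the failed WUs reported in this thread; the helper names `family` and `failures_by_family` are illustrative assumptions.

```python
from collections import Counter

def family(project, width=2):
    """Mask the last `width` digits of a project number, e.g. 11761 -> '117xx'."""
    text = str(project)
    return text[:-width] + "x" * width

def failures_by_family(projects, width=2):
    """Count failed WUs per project family."""
    return Counter(family(project, width) for project in projects)

if __name__ == "__main__":
    # Projects of the failed WUs reported in this thread, in posting order.
    failed = [11761, 13405, 11761, 11742, 11749, 11751, 13404, 13405, 13404]
    print(failures_by_family(failed).most_common())
    # [('117xx', 5), ('134xx', 4)]
```

A count of failures alone cannot separate a project problem from a hardware problem. It would need the number of WUs completed per family as a denominator, which the logs posted here do not contain.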
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 3:16 pm
by 4n0n
Another question comes to mind:
Will a high frequency of failed WUs (such as described here) lead to a loss of credibility?
I think I read somewhere that the bonus score is only added if more than 80 percent of the WUs were returned successfully.
Are the returned WUs considered "successfully returned" in my case?