[Solved, kinda] Strange crash/reboot and CMOS corruption only with F@H

Posted: Thu Nov 04, 2021 3:59 am
by Marius
[UPDATE, 10/25/22]: @FrankMB has reported that by just changing his motherboard, the problem was solved. My suspicions of the problem being in software were wrong. I replaced the CPU with an AMD 7950X, got a new motherboard from Asus and new DDR5 RAM, and the problem has gone away. My suspicions now fall on the power delivery circuits of the Gigabyte X570S Aorus Master motherboard, which seem to fail under high load. Thanks to everyone who contributed.

I've been supporting F@H since the beginning of the pandemic, around March 2020. My Gentoo Linux box was an AMD Threadripper 1950x, on an Asus Zenith Extreme motherboard, 128GB RAM and two Nvidia 1070 FE cards, in CUDA mode. I used this system for about a year and a half, running with the F@H Linux CPU and GPU clients without problems.

A few weeks ago I upgraded the GPU, from the two 1070 GPUs to a single 3080 Ti card from EVGA, with the FTW3 cooling system. The single 3080 Ti in CUDA mode is around three times faster than the two 1070s, but then I started experiencing strange crashes where the PC rebooted inexplicably when Folding. I could not detect any errors in the logs; the system just rebooted. After that, the BIOS/UEFI got stuck at some point trying to find the video card, and all I got was a black screen.

What I found out is that the CMOS/NVRAM had been corrupted, and it was only possible to boot again after I cleared the CMOS and reconfigured the BIOS settings. I checked the battery and replaced it with another one, just in case, but the battery was not the problem.

The reboot problem started happening more frequently, and randomly. It could happen only minutes after starting the F@H client, or hours later. After a few days of this, the motherboard wouldn't boot anymore, not even after clearing the CMOS. The diagnostic display shows that the CMOS is corrupted, and now I can't even get to the UEFI configuration screen.

So I replaced that system with an AMD 5950X and a Gigabyte X570S Aorus Master motherboard, but kept the other components. I ran memtest86 overnight to make sure the memory timing was correct, and found no problem. As soon as I configured the system and started Folding, the crash/reboot/corrupt cmos problem happened again. The symptoms were the same; I had to clear the CMOS to be able to reboot.

I started suspecting the power supply, which is an EVGA Supernova 1600 P2. But it was operating at only around 50% capacity with the current configuration. To test that scenario, I booted into Windows and ran the FurMark benchmark full-screen at 4K resolution, with the CPU burner running in the background on all 16 cores. I left this running for about 10 hours, burning around 850W, without problems. That meant that neither the power supply nor any of the other hardware components was the problem.
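
As a rough check of the "around 50% capacity" figure, the arithmetic can be sketched as below. This assumes the "1600" in the PSU's model name is its rated output in watts, and uses the roughly 850W reading from the FurMark run; the post does not say where that reading was taken, so the result is only approximate. The function name `load_fraction` is made up for illustration.

```python
# Rough PSU load estimate from the figures quoted in the post.
# Assumption: the "1600" in the model name is the rated output in watts.
PSU_RATED_W = 1600
FURMARK_LOAD_W = 850  # draw reported during the 10-hour FurMark + CPU burner run


def load_fraction(load_w: float, rated_w: float) -> float:
    """Return the load as a fraction of the PSU's rated output."""
    if rated_w <= 0:
        raise ValueError("rated_w must be positive")
    return load_w / rated_w


if __name__ == "__main__":
    frac = load_fraction(FURMARK_LOAD_W, PSU_RATED_W)
    print(f"Load: {frac:.0%} of rated output")
```

That works out to a little over half of the rated output, which matches the "around 50%" estimate.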

So I booted back into Linux, and started the F@H client again, but this time running only the CPU threads, with the GPU idle. It ran for about 48 hours, but then the crash/reboot/corrupt cmos problem happened again. I also tested running from 16 to 32 CPU threads, with the same results. While testing only the GPU client, it would crash within minutes to a few hours.

I then tested the Windows F@H client, in CPU only, GPU only and CPU + GPU configurations. The results were the same as the Linux client; I would get the same crash/reboot/corrupt cmos problem.

It might be worth mentioning that I did not overclock the CPU or GPU, and that the temps for the CPU were about 70C when running all 32 threads, which is acceptable. The GPU was around 82C, which is also in the acceptable range.

So now I started suspecting that the F@H client itself is causing the problem somehow. To test this, I joined the BOINC/Rosetta project and ran their CPU client for several days, in both Windows and Linux, and found no problems.

Given that I tested all my hardware with different tools and found no obvious problem, and that the crash/reboot/corrupt cmos problem only happens when running the F@H client, I think that there is some software problem with the F@H client. The version I tested with was 7.6.13, obtained from the Gentoo repository (sci-biology/foldingathome-7.6.13-r1). The Windows version was the same.

Unfortunately, I really can't provide much more information about the crash, because it simply reboots, not leaving any messages in the log file. And thus I will not be able to continue to contribute to this project, until this crash/reboot/corrupt cmos problem is diagnosed and fixed. :(

Re: Strange crash and CMOS corruption after switching to 308

Posted: Thu Nov 04, 2021 6:43 am
by prcowley
Hi Marius

There is an updated client, 7.6.21, which I suggest you try.
From your FAH log file, can you post everything from the very start of the work unit until it completes 1%, so we can see what is happening when FaH starts?

Cheers
Pete

Re: Strange crash and CMOS corruption after switching to 308

Posted: Thu Nov 04, 2021 9:41 am
by Marius
Hi Pete,

Thanks for the reply. I will try the newer version and report later. Meanwhile, here is the head of one of the log files when it crashed, using the GPU and 30 CPU threads:

Code:

*********************** Log Started 2021-10-23T04:49:48Z ***********************
04:49:48:Trying to access database...
04:49:48:Successfully acquired database lock
04:49:48:Downloading GPUs.txt from assign1.foldingathome.org:80
04:49:48:Connecting to assign1.foldingathome.org:80
04:49:48:Read GPUs.txt
04:49:48:Enabled folding slot 00: READY cpu:30
04:49:48:Enabled folding slot 01: READY gpu:0:GA102 [GeForce RTX 3080 Ti]
04:49:48:****************************** FAHClient ******************************
04:49:48:        Version: 7.6.13
04:49:48:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:49:48:      Copyright: 2020 foldingathome.org
04:49:48:       Homepage: https://foldingathome.org/
04:49:48:           Date: Apr 28 2020
04:49:48:           Time: 04:20:27
04:49:48:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
04:49:48:         Branch: master
04:49:48:       Compiler: GNU 4.9.4
04:49:48:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
04:49:48:                 -funroll-loops
04:49:48:       Platform: linux2 4.19.0-5-amd64
04:49:48:           Bits: 64
04:49:48:           Mode: Release
04:49:48:         Config: /home/marius/fahclient/config.xml
04:49:48:******************************** CBang ********************************
04:49:48:           Date: Apr 25 2020
04:49:48:           Time: 00:07:55
04:49:48:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
04:49:48:         Branch: master
04:49:48:       Compiler: GNU 4.9.4
04:49:48:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
04:49:48:                 -funroll-loops -fPIC
04:49:48:       Platform: linux2 4.19.0-5-amd64
04:49:48:           Bits: 64
04:49:48:           Mode: Release
04:49:48:******************************* System ********************************
04:49:48:            CPU: AMD Ryzen 9 5950X 16-Core Processor
04:49:48:         CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
04:49:48:           CPUs: 32
04:49:48:         Memory: 125.72GiB
04:49:48:    Free Memory: 121.00GiB
04:49:48:        Threads: POSIX_THREADS
04:49:48:     OS Version: 5.14
04:49:48:    Has Battery: false
04:49:48:     On Battery: false
04:49:48:     UTC Offset: -7
04:49:48:            PID: 4796
04:49:48:            CWD: /home/marius/fahclient
04:49:48:             OS: Linux 5.14.10-gentoo x86_64
04:49:48:        OS Arch: AMD64
04:49:48:           GPUs: 1
04:49:48:          GPU 0: Bus:9 Slot:0 Func:0 NVIDIA:8 GA102 [GeForce RTX 3080 Ti]
04:49:48:  CUDA Device 0: Platform:0 Device:0 Bus:9 Slot:0 Compute:8.6 Driver:11.5
04:49:48:OpenCL Device 0: Platform:0 Device:0 Bus:9 Slot:0 Compute:3.0 Driver:495.29
04:49:48:******************************* libFAH ********************************
04:49:48:           Date: Apr 15 2020
04:49:48:           Time: 21:43:27
04:49:48:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
04:49:48:         Branch: master
04:49:48:       Compiler: GNU 4.9.4
04:49:48:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
04:49:48:                 -funroll-loops
04:49:48:       Platform: linux2 4.19.0-5-amd64
04:49:48:           Bits: 64
04:49:48:           Mode: Release
04:49:48:***********************************************************************
04:49:48:<config>
04:49:48:  <!-- Folding Slot Configuration -->
04:49:48:  <cause v='COVID_19'/>
04:49:48:  <cpus v='30'/>
04:49:48:
04:49:48:  <!-- Slot Control -->
04:49:48:  <power v='FULL'/>
04:49:48:
04:49:48:  <!-- User Information -->
04:49:48:  <passkey v='*****'/>
04:49:48:  <team v='11298'/>
04:49:48:  <user v='MariusCaldas'/>
04:49:48:
04:49:48:  <!-- Folding Slots -->
04:49:48:  <slot id='0' type='CPU'/>
04:49:48:  <slot id='1' type='GPU'/>
04:49:48:</config>
04:49:48:WU02:FS01:Starting
04:49:48:WU02:FS01:Running FahCore: /opt/foldingathome/FAHCoreWrapper /home/marius/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 4796 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
04:49:48:WU02:FS01:Started FahCore on PID 4816
04:49:48:WU02:FS01:Core PID:4820
04:49:48:WU02:FS01:FahCore 0x22 started
04:49:48:WU00:FS00:Starting
04:49:48:WU00:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /home/marius/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 706 -lifeline 4796 -checkpoint 15 -np 30
04:49:48:WU00:FS00:Started FahCore on PID 4821
04:49:48:WU00:FS00:Core PID:4825
04:49:48:WU00:FS00:FahCore 0xa8 started
04:49:49:WU02:FS01:0x22:*********************** Log Started 2021-10-23T04:49:48Z ***********************
04:49:49:WU02:FS01:0x22:*************************** Core22 Folding@home Core ***************************
04:49:49:WU02:FS01:0x22:       Core: Core22
04:49:49:WU02:FS01:0x22:       Type: 0x22
04:49:49:WU02:FS01:0x22:    Version: 0.0.13
04:49:49:WU02:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:49:49:WU02:FS01:0x22:  Copyright: 2020 foldingathome.org
04:49:49:WU02:FS01:0x22:   Homepage: https://foldingathome.org/
04:49:49:WU02:FS01:0x22:       Date: Sep 19 2020
04:49:49:WU02:FS01:0x22:       Time: 01:10:35
04:49:49:WU02:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
04:49:49:WU02:FS01:0x22:     Branch: core22-0.0.13
04:49:49:WU02:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
04:49:49:WU02:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
04:49:49:WU02:FS01:0x22:             -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
04:49:49:WU02:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
04:49:49:WU02:FS01:0x22:       Bits: 64
04:49:49:WU02:FS01:0x22:       Mode: Release
04:49:49:WU02:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
04:49:49:WU02:FS01:0x22:             <peastman@stanford.edu>
04:49:49:WU02:FS01:0x22:       Args: -dir 02 -suffix 01 -version 706 -lifeline 4816 -checkpoint 15
04:49:49:WU02:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
04:49:49:WU02:FS01:0x22:             0 -gpu 0
04:49:49:WU02:FS01:0x22:************************************ libFAH ************************************
04:49:49:WU02:FS01:0x22:       Date: Sep 15 2020
04:49:49:WU02:FS01:0x22:       Time: 05:14:43
04:49:49:WU02:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
04:49:49:WU02:FS01:0x22:     Branch: HEAD
04:49:49:WU02:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
04:49:49:WU02:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
04:49:49:WU02:FS01:0x22:             -funroll-loops
04:49:49:WU02:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
04:49:49:WU02:FS01:0x22:       Bits: 64
04:49:49:WU02:FS01:0x22:       Mode: Release
04:49:49:WU02:FS01:0x22:************************************ CBang *************************************
04:49:49:WU02:FS01:0x22:       Date: Sep 15 2020
04:49:49:WU02:FS01:0x22:       Time: 05:11:04
04:49:49:WU02:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
04:49:49:WU02:FS01:0x22:     Branch: HEAD
04:49:49:WU02:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
04:49:49:WU02:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
04:49:49:WU02:FS01:0x22:             -funroll-loops -fPIC
04:49:49:WU02:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
04:49:49:WU02:FS01:0x22:       Bits: 64
04:49:49:WU02:FS01:0x22:       Mode: Release
04:49:49:WU02:FS01:0x22:************************************ System ************************************
04:49:49:WU02:FS01:0x22:        CPU: AMD Ryzen 9 5950X 16-Core Processor
04:49:49:WU02:FS01:0x22:     CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
04:49:49:WU02:FS01:0x22:       CPUs: 32
04:49:49:WU02:FS01:0x22:     Memory: 125.72GiB
04:49:49:WU02:FS01:0x22:Free Memory: 120.94GiB
04:49:49:WU02:FS01:0x22:    Threads: POSIX_THREADS
04:49:49:WU02:FS01:0x22: OS Version: 5.14
04:49:49:WU02:FS01:0x22:Has Battery: false
04:49:49:WU02:FS01:0x22: On Battery: false
04:49:49:WU02:FS01:0x22: UTC Offset: -7
04:49:49:WU02:FS01:0x22:        PID: 4820
04:49:49:WU02:FS01:0x22:        CWD: /home/marius/fahclient/work
04:49:49:WU02:FS01:0x22:************************************ OpenMM ************************************
04:49:49:WU02:FS01:0x22:   Revision: 189320d0
04:49:49:WU02:FS01:0x22:********************************************************************************
04:49:49:WU02:FS01:0x22:Project: 18201 (Run 31949, Clone 0, Gen 6)
04:49:49:WU02:FS01:0x22:Unit: 0x00000000000000000000000000000000
04:49:49:WU02:FS01:0x22:Digital signatures verified
04:49:49:WU02:FS01:0x22:Folding@home GPU Core22 Folding@home Core
04:49:49:WU02:FS01:0x22:Version 0.0.13
04:49:49:WU02:FS01:0x22:  Checkpoint write interval: 25000 steps (2%) [50 total]
04:49:49:WU02:FS01:0x22:  JSON viewer frame write interval: 12500 steps (1%) [100 total]
04:49:49:WU02:FS01:0x22:  XTC frame write interval: 20000 steps (1.6%) [62 total]
04:49:49:WU02:FS01:0x22:  Global context and integrator variables write interval: disabled
04:49:49:WU02:FS01:0x22:There are 4 platforms available.
04:49:49:WU02:FS01:0x22:Platform 0: Reference
04:49:49:WU02:FS01:0x22:Platform 1: CPU
04:49:49:WU02:FS01:0x22:Platform 2: OpenCL
04:49:49:WU02:FS01:0x22:  opencl-device 0 specified
04:49:49:WU02:FS01:0x22:Platform 3: CUDA
04:49:49:WU02:FS01:0x22:  cuda-device 0 specified
04:49:49:WU00:FS00:0xa8:*********************** Log Started 2021-10-23T04:49:48Z ***********************
04:49:49:WU00:FS00:0xa8:************************** Gromacs Folding@home Core ***************************
04:49:49:WU00:FS00:0xa8:       Core: Gromacs
04:49:49:WU00:FS00:0xa8:       Type: 0xa8
04:49:49:WU00:FS00:0xa8:    Version: 0.0.12
04:49:49:WU00:FS00:0xa8:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:49:49:WU00:FS00:0xa8:  Copyright: 2020 foldingathome.org
04:49:49:WU00:FS00:0xa8:   Homepage: https://foldingathome.org/
04:49:49:WU00:FS00:0xa8:       Date: Jan 16 2021
04:49:49:WU00:FS00:0xa8:       Time: 19:24:44
04:49:49:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
04:49:49:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
04:49:49:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
04:49:49:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
04:49:49:WU00:FS00:0xa8:       Bits: 64
04:49:49:WU00:FS00:0xa8:       Mode: Release
04:49:49:WU00:FS00:0xa8:       SIMD: avx2_256
04:49:49:WU00:FS00:0xa8:     OpenMP: ON
04:49:49:WU00:FS00:0xa8:       CUDA: OFF
04:49:49:WU00:FS00:0xa8:       Args: -dir 00 -suffix 01 -version 706 -lifeline 4821 -checkpoint 15 -np
04:49:49:WU00:FS00:0xa8:             30
04:49:49:WU00:FS00:0xa8:************************************ libFAH ************************************
04:49:49:WU00:FS00:0xa8:       Date: Jan 16 2021
04:49:49:WU00:FS00:0xa8:       Time: 19:21:38
04:49:49:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
04:49:49:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
04:49:49:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
04:49:49:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
04:49:49:WU00:FS00:0xa8:       Bits: 64
04:49:49:WU00:FS00:0xa8:       Mode: Release
04:49:49:WU00:FS00:0xa8:************************************ CBang *************************************
04:49:49:WU00:FS00:0xa8:       Date: Jan 16 2021
04:49:49:WU00:FS00:0xa8:       Time: 19:21:24
04:49:49:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
04:49:49:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
04:49:49:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
04:49:49:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
04:49:49:WU00:FS00:0xa8:       Bits: 64
04:49:49:WU00:FS00:0xa8:       Mode: Release
04:49:49:WU00:FS00:0xa8:************************************ System ************************************
04:49:49:WU00:FS00:0xa8:        CPU: AMD Ryzen 9 5950X 16-Core Processor
04:49:49:WU00:FS00:0xa8:     CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
04:49:49:WU00:FS00:0xa8:       CPUs: 32
04:49:49:WU00:FS00:0xa8:     Memory: 125.72GiB
04:49:49:WU00:FS00:0xa8:Free Memory: 120.94GiB
04:49:49:WU00:FS00:0xa8:    Threads: POSIX_THREADS
04:49:49:WU00:FS00:0xa8: OS Version: 5.14
04:49:49:WU00:FS00:0xa8:Has Battery: false
04:49:49:WU00:FS00:0xa8: On Battery: false
04:49:49:WU00:FS00:0xa8: UTC Offset: -7
04:49:49:WU00:FS00:0xa8:        PID: 4825
04:49:49:WU00:FS00:0xa8:        CWD: /home/marius/fahclient/work
04:49:49:WU00:FS00:0xa8:********************************************************************************
04:49:49:WU00:FS00:0xa8:Project: 16958 (Run 77, Clone 16, Gen 4)
04:49:49:WU00:FS00:0xa8:Unit: 0x00000000000000000000000000000000
04:49:49:WU00:FS00:0xa8:Digital signatures verified
04:49:49:WU00:FS00:0xa8:Calling: mdrun -c frame4.gro -s frame4.tpr -x frame4.xtc -cpi state.cpt -cpt 15 -nt 30 -ntmpi 1
04:49:49:WU00:FS00:0xa8:Steps: first=0 total=5000000
04:49:49:WU00:FS00:0xa8:Completed 444652 out of 5000000 steps (8%)
04:49:50:WARNING:3:127.0.0.1:404 HTTP NOT FOUND /undefined
04:49:50:4:127.0.0.1:New Web session
04:49:53:WU00:FS00:0xa8:Completed 450000 out of 5000000 steps (9%)
04:49:57:WU02:FS01:0x22:Attempting to create CUDA context:
04:49:57:WU02:FS01:0x22:  Configuring platform CUDA
04:50:05:WU02:FS01:0x22:  Using CUDA and gpu 0
04:50:05:WU02:FS01:0x22:Completed 575000 out of 1250000 steps (46%)
04:50:27:WU00:FS00:0xa8:Completed 500000 out of 5000000 steps (10%)
04:50:58:WU02:FS01:0x22:Completed 587500 out of 1250000 steps (47%)
04:51:00:WU00:FS00:0xa8:Completed 550000 out of 5000000 steps (11%)
04:51:34:WU00:FS00:0xa8:Completed 600000 out of 5000000 steps (12%)
04:51:52:WU02:FS01:0x22:Completed 600000 out of 1250000 steps (48%)
04:51:52:WU02:FS01:0x22:Checkpoint completed at step 600000
04:52:07:WU00:FS00:0xa8:Completed 650000 out of 5000000 steps (13%)
04:52:39:WU00:FS00:0xa8:Completed 700000 out of 5000000 steps (14%)
04:52:46:WU02:FS01:0x22:Completed 612500 out of 1250000 steps (49%)
04:53:12:WU00:FS00:0xa8:Completed 750000 out of 5000000 steps (15%)
04:53:39:WU02:FS01:0x22:Completed 625000 out of 1250000 steps (50%)
04:53:40:WU02:FS01:0x22:Checkpoint completed at step 625000
The log ends with the lines below, without any sign of anything wrong before it rebooted:

Code:

09:05:55:WU01:FS01:0x22:Checkpoint completed at step 500000
09:06:30:WU01:FS01:0x22:Completed 525000 out of 2500000 steps (21%)
09:07:05:WU01:FS01:0x22:Completed 550000 out of 2500000 steps (22%)
09:07:40:WU01:FS01:0x22:Completed 575000 out of 2500000 steps (23%)
09:07:51:WU02:FS00:0xa8:Completed 300000 out of 10000000 steps (3%)
09:08:16:WU01:FS01:0x22:Completed 600000 out of 2500000 steps (24%)
09:08:51:WU01:FS01:0x22:Completed 625000 out of 2500000 steps (25%)
09:08:51:WU01:FS01:0x22:Checkpoint completed at step 625000
09:09:26:WU01:FS01:0x22:Completed 650000 out of 2500000 steps (26%)
09:10:01:WU01:FS01:0x22:Completed 675000 out of 2500000 steps (27%)
09:10:36:WU01:FS01:0x22:Completed 700000 out of 2500000 steps (28%)
If there is any more helpful information I can assist with, please let me know.
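
For anyone who wants to check whether a work unit was slowing down before a reboot, here is a minimal Python sketch that measures the time between consecutive "Completed ... steps" lines for each folding slot. The regular expression is only inferred from the log lines shown in this thread, not from any official log format description, and the names `PROGRESS`, `progress_intervals` and `TAIL` are made up for illustration.

```python
import re
from collections import defaultdict

# Matches FAHClient progress lines such as
# "09:10:36:WU01:FS01:0x22:Completed 700000 out of 2500000 steps (28%)"
PROGRESS = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):WU\d+:(FS\d+):0x[0-9a-f]+:"
    r"Completed (\d+) out of (\d+) steps"
)


def progress_intervals(lines):
    """Return {slot: [seconds between consecutive progress lines]}."""
    last_seen = {}
    intervals = defaultdict(list)
    for line in lines:
        m = PROGRESS.match(line.strip())
        if not m:
            continue  # checkpoint lines and anything else are skipped
        h, mi, s = (int(x) for x in m.group(1, 2, 3))
        t = h * 3600 + mi * 60 + s
        slot = m.group(4)
        if slot in last_seen:
            # Log timestamps carry no date, so assume a wrap past midnight
            # when time appears to go backwards.
            intervals[slot].append((t - last_seen[slot]) % 86400)
        last_seen[slot] = t
    return dict(intervals)


# The last lines of the log quoted above.
TAIL = """\
09:06:30:WU01:FS01:0x22:Completed 525000 out of 2500000 steps (21%)
09:07:05:WU01:FS01:0x22:Completed 550000 out of 2500000 steps (22%)
09:07:40:WU01:FS01:0x22:Completed 575000 out of 2500000 steps (23%)
09:07:51:WU02:FS00:0xa8:Completed 300000 out of 10000000 steps (3%)
09:08:16:WU01:FS01:0x22:Completed 600000 out of 2500000 steps (24%)
09:08:51:WU01:FS01:0x22:Completed 625000 out of 2500000 steps (25%)
09:08:51:WU01:FS01:0x22:Checkpoint completed at step 625000
09:09:26:WU01:FS01:0x22:Completed 650000 out of 2500000 steps (26%)
09:10:01:WU01:FS01:0x22:Completed 675000 out of 2500000 steps (27%)
09:10:36:WU01:FS01:0x22:Completed 700000 out of 2500000 steps (28%)
""".splitlines()

if __name__ == "__main__":
    for slot, gaps in progress_intervals(TAIL).items():
        print(slot, gaps)
```

On the tail shown above, the GPU slot (FS01) reports progress every 35 to 36 seconds right up to the last line, which fits the observation that nothing looked wrong before the reboot.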

Re: Strange crash and CMOS corruption after switching to 308

Posted: Fri Nov 05, 2021 6:44 pm
by gordonbb
Marius,

This is a weird one. I've found that F@H hammers a GPU much more than most BOINC Projects.

I'm wondering if you are experiencing issues with the Vcore VRM on the 3080Ti like those experienced by users of New World, just not quite as fatal.

Have you tried running F@H with a power limit set on the GPU?

The EVGA 3080Ti FTW Ultra has a TDP of 350W and the current BIOS on the card has an adjustment range of -75 to +12% so you should be able to limit through the range of 87.5W to 392W.

It's also a good idea to power limit GPUs in general as the default set-points for power are typically quite inefficient. Typically I run my 2070 Supers at 170W rather than the 215W default, so at about 80% of the default, which would translate to about 275W for your 3080Ti. This results in a significant power saving for an almost insignificant drop in Production (PPD).

Code:

nvidia-smi -pm 1
nvidia-smi -i 0 -pl 275
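
The wattages in this post can be reproduced with a few lines of Python. This is only a worked example of the arithmetic, using the 350W TDP and the -75% to +12% adjustment range quoted above; the helper names `power_limit_range` and `scaled_limit` are made up for illustration.

```python
def power_limit_range(tdp_w, min_pct, max_pct):
    """Return (lowest, highest) settable power limit in watts."""
    return tdp_w * (1 + min_pct / 100), tdp_w * (1 + max_pct / 100)


def scaled_limit(reference_limit_w, reference_default_w, target_default_w):
    """Apply the same limit/default ratio to a card with another default."""
    return target_default_w * reference_limit_w / reference_default_w


if __name__ == "__main__":
    # 350 W TDP with an adjustment range of -75% to +12%.
    low, high = power_limit_range(350, -75, 12)
    print(f"Adjustable range: {low:.1f} W to {high:.1f} W")

    # 170 W on a 215 W card, carried over to a 350 W card.
    print(f"2070 Super ratio: {170 / 215:.0%}")
    print(f"Same ratio on a 350 W card: {scaled_limit(170, 215, 350):.0f} W")
```

The range comes out at 87.5W to 392W, and the 170W/215W ratio applied to 350W gives roughly 277W, which is where the rounded 275W in the command above comes from.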
Also, have you loaded the EVGA Precision X1 utility in Windoze and checked whether there is an updated vBIOS for the card?

Re: Strange crash and CMOS corruption after switching to 308

Posted: Sat Nov 06, 2021 1:21 am
by Marius
@prcowley:
Hi Pete, I tried the new version (7.6.21) on Windows 10 and had the same results. This time I was browsing with FF while the F@H client ran in the background. It only took a few minutes for it to happen. The GUI froze for a second and then it simply rebooted, without bothering to produce a BSOD. Nothing in the logs either, as posted above.

@gordonbb:
Hi Gordon, yes, it's a very weird one. My PS is an EVGA Supernova 1600 P2, which in theory should be able to handle a single 3080 Ti easily. On my UPS display, I could watch the whole system use up to 900W, for a few seconds. But the power consumption swung wildly between 500 and 900W, averaging about 800W. The power was peaking so quickly that it made the lights in my room flicker.

However, I left it running FurMark @ 4K for 10 hours overnight, with the CPU burner in the background running 32 threads. No problems there, but the power usage was more stable, around 850W. When I used F@H with the 2 1070's plus 28 threads in the CPU, the power consumption was around 750W, but it was also more stable, without the wild peaks and bottoms I saw with the 3080 Ti. I never had a problem with that configuration.
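
To put numbers on that kind of swing, readings can be logged and summarized with a short script. The sketch below is a minimal example under a few assumptions: it polls `nvidia-smi --query-gpu=power.draw`, which reports GPU board power only (not the whole-system draw a UPS display shows), and polling once per second will miss spikes shorter than that. The helper names `read_gpu_power_w`, `summarize` and `sample` are made up for illustration.

```python
import statistics
import subprocess
import time


def read_gpu_power_w(gpu_index=0):
    """Read the current GPU board power draw in watts via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip().splitlines()[0])


def summarize(samples_w):
    """Return min, max, mean and peak-to-peak swing for a list of readings."""
    low, high = min(samples_w), max(samples_w)
    return {
        "min": low,
        "max": high,
        "mean": statistics.fmean(samples_w),
        "swing": high - low,
    }


def sample(seconds=60, interval_s=1.0, reader=read_gpu_power_w):
    """Collect readings for roughly `seconds` seconds."""
    samples = []
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        samples.append(reader())
        time.sleep(interval_s)
    return samples


# Example (needs an NVIDIA GPU and driver installed):
#   print(summarize(sample(seconds=60)))
```

Running it once under F@H and once under FurMark would let the two "swing" values be compared directly, instead of judging by eye from the UPS display.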

All my BIOS and FW are up to date, so I don't think that is the problem. I will try limiting the power as you suggested above and see if that helps; that was a good idea.

Thanks to you both,
Marius

Re: Strange crash and CMOS corruption after switching to 308

Posted: Sat Nov 06, 2021 4:17 am
by aetch
Typically the black screen reboot is the PSU switching off and back on again.
It's something that has affected a number of my previous upgrades over the years.
It's not that the PSU isn't powerful enough, it's that it cannot react quickly enough to the sudden spikes in power draw from the graphics cards causing the supply voltage to drop out.
Also, with the more powerful power supplies the voltage supplies are split among a number of rails, not all 12V+ supplies in your power supply are connected together.
With that in mind, I wonder if your 3080 is connected to a single rail and it's overloading that single rail.
It could be worth taking a closer look at your power supply and pairing up a couple of different rails to your graphics card. They should be labelled 12V1+, 12V2+, 12V3+, etc.

I'm also for power limiting your GPU.
I typically use MSI afterburner in Windows, it's great to configure and will survive reboots with a couple of clicks.
nvidia-smi is packaged with the GeForce drivers, so it is available whichever operating system you choose; it just needs a little more work to survive computer reboots.
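
One way to do that "little more work" is a small script that re-applies the limit and is run at every boot. The sketch below wraps the two nvidia-smi commands quoted earlier in this thread; the GPU index 0 and the 275W value come from that post, and the names `build_commands` and `apply_power_limit` are made up for illustration. How the script gets started at boot (for example a root cron `@reboot` entry, a systemd unit, or a local startup script) is an assumption that depends on the system.

```python
import subprocess

# Values from the nvidia-smi commands quoted earlier in this thread.
GPU_INDEX = 0
POWER_LIMIT_W = 275


def build_commands(gpu_index=GPU_INDEX, limit_w=POWER_LIMIT_W):
    """Return the nvidia-smi invocations as argument lists."""
    return [
        # Enable persistence mode.
        ["nvidia-smi", "-pm", "1"],
        # Set the power limit, in watts, for the selected GPU.
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(limit_w)],
    ]


def apply_power_limit(gpu_index=GPU_INDEX, limit_w=POWER_LIMIT_W):
    """Run the commands; needs root privileges and an NVIDIA driver."""
    for cmd in build_commands(gpu_index, limit_w):
        subprocess.run(cmd, check=True)


# Example, to be run as root from whatever starts jobs at boot:
#   apply_power_limit()
```

Keeping the command list in `build_commands` means the values can be checked without touching the GPU, and `check=True` makes the script fail loudly if the driver rejects the limit.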

Re: Strange crash and CMOS corruption after switching to 308

Posted: Sat Nov 06, 2021 5:00 am
by Marius
Hi aetch,

The 3080 is definitely connected to 3 different rails, but the reboot also occurred when the 3080 was sitting idle, and I was running _only_ the F@H CPU threads. The power consumption remained constant, and at around 400W, with 30 threads. So I think the PS is not the problem, although the symptoms seem to point at it.

Thanks,
Marius

Re: Strange crash and CMOS corruption after switching to 308

Posted: Sat Nov 06, 2021 11:09 am
by Marius
@gordonbb, @aetch,

I just ran a test on my Gentoo Linux box with the 3080 enabled, and its power restricted from the default of 400W down to 250W, and 31 CPU threads. It ran for about 4.5 hours, but the problem persists. It rebooted, and the BIOS was stuck in a black screen because the CMOS had once again been corrupted. I don't think this is caused by the PS, but I will be switching it with a newer one, just in case. I will let you know.

Thanks,
Marius

Re: Strange crash and CMOS corruption after switching to 308

Posted: Thu Nov 11, 2021 8:42 am
by gordonbb
At this point the only two components you haven’t tried replacing are the power supply and GPU.

I know those P2s have a solid reputation and should be able to handle the load but it’s beginning to look like it might have a defect.

Re: Strange crash and CMOS corruption after switching to 308

Posted: Fri Nov 12, 2021 1:51 am
by Marius
@gordonbb

So I actually replaced the P2 with the Corsair AX1600i a couple of days ago. I bought it because it has GaN MOSFETs and can respond to power surges much faster. It is considered by most reviewers to be the best PS you can buy right now. But unfortunately it didn't help.

Yesterday, I left the Linux F@H client running overnight, but it crashed/rebooted again. I limited power to the GPU, giving it only 275W out of its 400W nominal limit, and each of the 3 power connectors was running off its own rail from the power supply.

Given the absurd prices for GPUs right now, the only GPU I can replace the 3080 Ti with is my older 1070 FE. That's the last thing I can try hardware-wise. I will let you guys know later.

Thanks,
Marius

Re: Strange crash and CMOS corruption after switching to 308

Posted: Sun Nov 14, 2021 8:09 am
by gordonbb
So the GPU is the last man standing then.

I’d open a case with EVGA. They’ve got really good support. I *might* own 12 of their cards :-)

Re: Strange crash and CMOS corruption after switching to 308

Posted: Wed Nov 17, 2021 10:20 am
by Marius
Well, I'm out of ideas right now. I just tested with my older 1070 FE overnight in place of the 3080; it ran for about 18 hours, but eventually it also crashed/rebooted. The only thing I can replace now is the motherboard itself. I thought it might be related to the MB's chipset, an AMD X570S: because it has no fan, it might overheat with the massive FTW3 heat sink blowing right on top of it. So I maxed out all fans to increase air flow over the chipset, but that didn't work either. It crashed again. The main points that I found are that the crash happens even with just the CPU threads working and the GPU sitting idle, and that this crash _only_ happens when running the F@H client, be it the Windows or the Linux version. So for now it's goodbye F@H. Maybe I will try again later when a new version comes out.
Thanks all for the ideas.

Re: Strange crash and CMOS corruption after switching to 308

Posted: Wed Nov 17, 2021 11:06 am
by aetch
I'm going to suggest testing your CPU for stability.

Monitoring tools
CPUid HWMonitor
HWInfo/HWInfo64
Your motherboard should have monitoring tools as well

CPU Stress test
Prime 95 (the more recent versions support AVX/AVX256/AVX512) - this helped me stabilise my Ryzen when I first got it, and diagnose it six months ago when my system fell over 3 days on the trot. I replaced the motherboard because the B450 VRMs were too weak for my Ryzen.

Memory Stress Test
memtest86 (this is operating-system independent and requires a blank flash drive)

Re: Strange crash and CMOS corruption after switching to 308

Posted: Wed Nov 17, 2021 11:32 am
by Neil-B
Marius wrote: So for now it's goodbye F@H. Maybe I will try again later when a new version comes out.
Thanks all for the ideas.
Given you are the only reported case having this issue (out of thousands) it is extremely unlikely it is something coded in FaH that is causing your crashes ... It is however perfectly possible that the workload that FaH puts on your system be it cpu/gpu is something that due to a peculiarity in your setup (software, or more likely hardware) means an instability is occurring.

I perfectly understand that if your kit is crashing when FaH is running, then not running FaH makes absolute sense ... It is however relatively unlikely that future versions of FaH will load your kit any differently, and as such tbh I wouldn't hold out much hope that with your current setup future FaH software will be any less of an issue - but fingers crossed if/when you try.

Sometimes causes are just too hard to track down and it is easier/more sensible/the right thing to do to not persevere !!

Re: Strange crash and CMOS corruption after switching to 308

Posted: Wed Nov 17, 2021 6:18 pm
by Marius
@aetch, @Neil-B:

I ran many tests and benchmarks, in Windows and Linux, without any problems. CPU, Memory and GPU are absolutely stable in every test and benchmark I ran. For example, I ran FurMark, which really taxes the GPU, while at the same time running its "CPU Burner" benchmark in the background, with 32 threads @100% utilization, for 10 hours. Memtest86 ran overnight for more than 10 hours. No problem there. This is a really strange situation, and I ran out of things to try. To fully debug this, I would have to put in much more time and instrumentation than I have available right now.

Thanks all for the ideas.