[Solved, kinda] Strange crash/reboot and CMOS corruption only with F@H

gunnarre · Post by **gunnarre** » Fri Nov 19, 2021 12:35 pm

Marius wrote:For example, I ran FurMark, which really taxes the GPU, while at the same time running its "CPU Burner" benchmark in the background, with 32 threads @100% utilization, for 10 hours.

Unless FurMark's CPU Burner runs AVX, AVX_256 and AVX_512 loads, it's not a realistic benchmark for folding or other vector workloads. Please try Prime95's AVX tests, alone or alongside the FurMark GPU benchmark, if you want to replicate a load similar to folding.

Marius · Post by **Marius** » Sat Nov 20, 2021 6:21 am

@gunnarre

I just ran a 17-hour stress test as you suggested, with Prime95 AVX on the CPU and FurMark on the GPU. The PC was drawing about 650W as measured by the AX1600i PSU. There were no problems, and the PC didn't crash/reboot as it does with the F@H client. So it seems really stable hardware wise. But thanks for yet another idea for testing.

JimF · Post by **JimF** » Sat Nov 20, 2021 8:34 pm

Marius wrote:I ran memtest86 overnight to make sure the memory timing was correct, and found no problem. As soon as I configured the system and started Folding, the crash/reboot/corrupt cmos problem happened again. And the symptoms were the same; I had to clear the CMOS to be able to reboot.

It is the memory. Memtest may find a faulty module, but instability is another bag.

Do you have four modules? If you use the XMP profiles, reduce them to two modules.
If you need four, then try the motherboard defaults, but for Folding, you certainly will not need much.

Marius · Post by **Marius** » Mon Nov 22, 2021 11:37 pm

@JimF

[UPDATE 11/23/21]: So the system fooled me into thinking it was memory timing.

A few hours later, the system crashed again. With DDR4 2100 settings. Case NOT closed!

Yes, I have 4 sticks of DDR4 that were set for XMP 3200. After I set it back to 2100, it has been running for 42 hours without problems.

gunnarre · Post by **gunnarre** » Tue Nov 23, 2021 10:48 am

Try running with just two sticks, then the other two sticks.

Post by **bruce** » Tue Nov 23, 2021 10:59 am

Also try un-overclocking.

Marius · Post by **Marius** » Tue Nov 23, 2021 11:10 am

gunnarre wrote:Try running with just two sticks, then the other two sticks.

Yes, testing that now. We'll see.

bruce wrote:Also try un-overclocking.

This system is not overclocked.

Thanks for all the ideas!

Marius · Post by **Marius** » Wed Nov 24, 2021 11:12 am

Well, I split the original 4 sticks of 32GB dual rank DDR4 into 2 groups, and tested each separately at 2100 timings. No deal, both tests crashed and rebooted after a few hours. The CMOS was again corrupted, resulting in an unbootable system. CMOS clearing is necessary to get past that. I also tested a different set of 2 sticks of 8GB single rank DDR4 3600, but at 2100. Nope, that didn't work either. It rebooted in a mere 10 minutes, while I was looking at the screen monitoring temps, which were OK. So at this point I can say for sure it's not memory instability, or a bad stick of RAM, or any kind of hardware problem on my side. Again, the crash only occurs when using F@H. Since I have already replaced and tested __every__ __single__ system component, I'm running out of ideas. Anyway, happy Thanksgiving to you guys in the US.

Post by **toTOW** » Wed Nov 24, 2021 7:24 pm

If I were you, I'd get my motherboard replaced ... it's not a normal behaviour.

Marius · Post by **Marius** » Wed Nov 24, 2021 11:42 pm

I already replaced the mobo! This is the second board showing the same behavior, and it is totally unrelated to the first, an Asus Zenith Extreme for AMD Threadripper. The current one is a Gigabyte X570S Aorus Master, which came out in June, paired with an AMD 5950X. So, different processor and mobo!! It looks like I have a transmissible Gremlin!

JimF · Post by **JimF** » Tue Dec 07, 2021 6:22 pm

You haven't mentioned the disk drive, which I assume is an SSD.
I don't think that I have ever had one go bad on me, but it looks like you have eliminated all other possibilities.
(It isn't FAH).

Also, I always use a write cache to protect the SSD. On Ubuntu, I use the built-in Linux write cache.
https://lonesysadmin.net/2013/12/22/bet ... rty_ratio/

On Win10, I use PrimoCache.
https://www.romexsoftware.com/en-us/pri ... index.html

It is not necessary for FAH, which is very light on writes, but for some of the BOINC projects it is.
It looks like the Rosetta pythons are bad, though I don't know how bad yet.
I use at least 2 GB of memory for the cache, and at least 30 minutes of latency (write-delay), though less will help.
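As a rough illustration of the Linux side, here is a minimal Python sketch that turns "2 GB of cache, 30 minutes of write-delay" into kernel writeback sysctl values. The function name `writeback_sysctls` is hypothetical, and setting the background threshold to half the cache size is an assumption for illustration, not JimF's actual configuration. The sysctl names themselves (`vm.dirty_bytes`, `vm.dirty_background_bytes`, `vm.dirty_expire_centisecs`) are standard Linux tunables.

```python
def writeback_sysctls(cache_gib, delay_minutes):
    """Map a cache size (GiB) and write delay (minutes) to Linux vm.* sysctls.

    Assumption: background writeback starts at half the cache size.
    """
    if cache_gib <= 0 or delay_minutes <= 0:
        raise ValueError("cache size and delay must be positive")
    cache_bytes = int(cache_gib * 1024 ** 3)
    return {
        # Hard limit of dirty page-cache data before writers are throttled.
        "vm.dirty_bytes": cache_bytes,
        # Threshold at which background flushing begins (assumed: half).
        "vm.dirty_background_bytes": cache_bytes // 2,
        # Age after which dirty data becomes eligible for writeout,
        # expressed in hundredths of a second.
        "vm.dirty_expire_centisecs": int(delay_minutes * 60 * 100),
    }


if __name__ == "__main__":
    # Print lines in the "key = value" form used by sysctl.conf files.
    for key, value in writeback_sysctls(2, 30).items():
        print(f"{key} = {value}")
```

Values like these would normally be applied with `sysctl -w` or a file under `/etc/sysctl.d/`. A long write delay means more unwritten data is lost if the machine crashes, which matters on a system that is already rebooting unexpectedly.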

PS - You may have transmitted it to my Ryzen 3950X, which started seizing up a few days ago, but it has 128 GB of memory and I suspect one stick may be bad.
But try to deal with it at your end, please.

Marius · Post by **Marius** » Wed Dec 08, 2021 12:40 am

@JimF

The drive is an Areca SSD RAID card, with battery back-up and a bank of super-capacitors to protect the write-cache. That card hasn't given any problems yet. Removing it and testing with another drive will be a chore that I don't have time for yet. Maybe over the holiday break.

But yes, I also have 128GB of RAM. I tested the memory configuration extensively, as I documented above. Even with other memory sticks. It's not that either. I'm not sure what it is at this point. The problem is really elusive and difficult to trace. Ugh! I hope you haven't caught the Gremlin! It's a really nasty one!

Other than the RAID card above, my system config is an AMD 5950X on the Gigabyte Aorus Master X570S mobo, 4x 32GB (128GB total) Corsair Vengeance LPX 3200, an EVGA 3080Ti with 12GB, a Corsair H170i Capellix CPU AIO cooler, and a Corsair AX1600i PSU.

I really just over-dimensioned it for my needs, expecting to not have any problems with power delivery or CPU overheating. I used the Kryonaut thermal paste from Thermal Grizzly. The CPU stayed in the low 70's (Celsius) when running F@H, with 32 threads. That is well below the 95C max temp expected by AMD. I haven't overclocked it yet, and I don't expect to have to.

Well Happy Holidays everybody. Hopefully I will find the issue some time soon.

pcwolf · Post by **pcwolf** » Fri Dec 10, 2021 6:27 pm

No solution to offer, just another data point for you:

I run Manjaro Linux on Gigabyte x570S Aorus Master with Ryzen 5950x, NVidia RTX 2070 and GTX 1650, 16c/32t all F@H.
About every 2 to 4 days the system will spontaneously reboot, with errors I will add here to the end of my post. CMOS (ver 53c) is unaffected by reboot.
I have de-rated the RTX 2070 power using the "Coolbits" setting to run at 160W limit.
I have also set the UEFI/BIOS to run the Ryzen on ECO mode.
Both power settings give me nearly the same PPD but run significantly cooler - CPU @ 52°C and GPU @ 81°C

Errors:
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000001000108
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: TSC 0 ADDR 7fbdc4a4bae1 MISC d012000200000000 SYND 4d000000 IPID 500b000000000
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1639063104 SOCKET 0 APIC 0 microcode a201016
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: TSC 0 ADDR 8a3a8a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1639063104 SOCKET 0 APIC 18 microcode a201016

-Phil
Yorktown, Virginia, USA

aetch · Post by **aetch** » Fri Dec 10, 2021 6:48 pm

pcwolf wrote:About every 2 to 4 days the system will spontaneously reboot, with errors I will add here to the end of my post.

I really hate to say this but your system is unstable.
You really need to look at it and fix it before you do any more folding.
You should be able to run it continuously for weeks/months on end without issue.

Neil-B · Post by **Neil-B** » Fri Dec 10, 2021 7:07 pm

@marius @pcwolf

crop from https://wiki.archlinux.org/title/Ryzen ...

With Ryzen 5000-series CPUs, particularly the enthusiast 5950X and 5900X models, there seem to be some slight instability issues under Linux, possibly related to the 5.11+ kernel, as shown by this kernel bug. After investigating and reading reports on the Internet I discovered that, out of the box, Windows seems to run the CPUs at higher voltage and lower peak frequencies than the stock Linux kernel, which, depending on your draw in the silicon lottery, could cause a host of random application crashes or hardware errors that lead to reboots. You will recognise those by dmesg logs that look like:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
lightbringer kernel: mce: [Hardware Error]: TSC 0 ADDR 7ea8f5b00 MISC d012000000000000 IPID 100b000000000 
lightbringer kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636645367 SOCKET 0 APIC d microcode a201016

The CPU ID and the Processor number may vary. To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies.

It might be worth trying slightly higher voltages to confirm or rule out the above as the cause.
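For anyone comparing their own logs against the ones posted in this thread, here is a minimal Python sketch that pulls the CPU, bank and status word out of a "Machine Check" line and names the top flag bits of the status register. The function name `decode_mce_line` and the regular expression are assumptions written for this log format. Only the seven high bits (63 down to 57) that are common to x86 machine-check status registers are decoded; the vendor-specific error code fields are left alone.

```python
import re

# High flag bits of an x86 machine-check status register (MCi_STATUS).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}

# Hypothetical pattern matching lines like the ones quoted in this thread.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)


def decode_mce_line(line):
    """Return CPU, bank, status and set flags for one log line, or None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group("status"), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "cpu": int(match.group("cpu")),
        "bank": int(match.group("bank")),
        "status": status,
        "flags": flags,
    }


if __name__ == "__main__":
    sample = ("Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: "
              "CPU 0: Machine Check: 0 Bank 5: bea0000001000108")
    print(decode_mce_line(sample))
```

Run against pcwolf's first line, the UC and PCC bits come out set, meaning an uncorrected error with possibly corrupt processor context, which is consistent with a spontaneous reboot. The example from the Arch wiki (Bank 1, `bc800800060c0859`) has UC set but not PCC.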
