Marius wrote: For example, I ran FurMark, which really taxes the GPU, while at the same time running its "CPU Burner" benchmark in the background, with 32 threads @100% utilization, for 10 hours.
Unless FurMark's CPU Burner runs AVX, AVX_256 and AVX_512 loads, it's not a realistic benchmark for folding or other vector operations. Please try Prime95's AVX tests alone, or at the same time as the FurMark GPU benchmark, if you want to replicate a load similar to folding.
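Before picking a stress test to stand in for folding, it can help to confirm which vector extensions the CPU actually exposes. The snippet below is only a sketch for Linux: it assumes the usual `/proc/cpuinfo` layout with a `flags` line, and the function name `simd_flags` is made up for this example.

```python
def simd_flags(cpuinfo_text):
    """Report which AVX-family flags appear in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # The first "flags" line lists the feature bits of one logical CPU.
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break
    wanted = ("avx", "avx2", "avx512f", "fma")
    return {name: name in flags for name in wanted}


# Shortened, made-up cpuinfo excerpt used only to exercise the function.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2 fma\n"
print(simd_flags(sample))
```

On a real machine the call would be `simd_flags(open("/proc/cpuinfo").read())`. A Zen 3 part such as the 5950X should report `avx` and `avx2` but not `avx512f`, so an AVX2 load is probably the closest match there.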
[Solved, kinda] Strange crash/reboot and CMOS corruption only with F@H
Re: Strange crash and CMOS corruption after switching to 308
Online: GTX 1660 Super + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 1050 Ti 4G OC, RX580
Re: Strange crash and CMOS corruption after switching to 308
@gunnarre
I just ran a 17-hour stress test as you suggested, with Prime95 AVX on the CPU and FurMark on the GPU. The PC was drawing about 650W as measured by the AX1600i PSU. There were no problems, and the PC didn't crash/reboot as it does with the F@H client. So it seems really stable hardware wise. But thanks for yet another idea for testing.
Re: Strange crash and CMOS corruption after switching to 308
Marius wrote: I ran memtest86 overnight to make sure the memory timing was correct, and found no problem. As soon as I configured the system and started Folding, the crash/reboot/corrupt cmos problem happened again. And the symptoms were the same; I had to clear the CMOS to be able to reboot.
It is the memory. Memtest may find a faulty module, but instability is another bag.
Do you have four modules? If you use the XMP profiles, reduce them to two modules.
If you need four, then try the motherboard defaults, but for Folding, you certainly will not need much.
Re: Strange crash and CMOS corruption after switching to 308
@JimF
[UPDATE 11/23/21]: So the system fooled me into thinking it was memory timing. A few hours later, the system crashed again. With DDR4 2100 settings. Case NOT closed!
Yes, I have 4 sticks of DDR4 that were set for XMP 3200. After I set it back to 2100, it has been running for 42 hours without problems.
Re: Strange crash/reboot and CMOS corruption only with F@H
Try running with just two sticks, then the other two sticks.
Re: Strange crash/reboot and CMOS corruption only with F@H
Also try un-overclocking.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: Strange crash/reboot and CMOS corruption only with F@H
gunnarre wrote: Try running with just two sticks, then the other two sticks.
Yes, testing that now. We'll see.
bruce wrote: Also try un-overclocking.
This system is not overclocked.
Thanks for all the ideas!
Re: Strange crash/reboot and CMOS corruption only with F@H
Well, I split the original 4 sticks of 32GB dual rank DDR4 into 2 groups, and tested each separately at 2100 timings. No deal, both tests crashed and rebooted after a few hours. The CMOS was again corrupted, resulting in an unbootable system. CMOS clearing is necessary to get past that.
I also tested a different set of 2 sticks of 8GB single rank DDR4 3600, but at 2100. Nope, that didn't work either. It rebooted in a mere 10 minutes, while I was looking at the screen monitoring temps, which were OK.
So at this point I can say for sure it's not memory instability, or a bad stick of RAM, or any kind of hardware problem on my side. Again, the crash only occurs when using F@H. Since I have already replaced and tested every single system component, I'm running out of ideas. Anyway, happy Thanksgiving to you guys in the US.
- Site Moderator
- Posts: 6349
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
Re: Strange crash/reboot and CMOS corruption only with F@H
If I were you, I'd get my motherboard replaced ... it's not a normal behaviour.
Re: Strange crash/reboot and CMOS corruption only with F@H
I already replaced the mobo! It's the second one that displays the same behavior, and is totally unrelated to the first, which was a Zenith Extreme for the AMD Threadripper from Asus. The current one is a Gigabyte Aorus Master x570s that came out in June, which I use with an AMD 5950x. So, different processor and mobo!! It looks like I have a transmissible Gremlin!
Re: Strange crash/reboot and CMOS corruption only with F@H
You haven't mentioned the disk drive, which I assume is an SSD.
I don't think that I have ever had one go bad on me, but it looks like you have eliminated all other possibilities.
(It isn't FAH).
Also, I always use a write cache to protect the SSD. On Ubuntu, I use the built-in Linux write cache.
https://lonesysadmin.net/2013/12/22/bet ... rty_ratio/
On Win10, I use PrimoCache.
https://www.romexsoftware.com/en-us/pri ... index.html
It is not necessary for FAH, which is very light on writes, but for some of the BOINC projects it is.
It looks like the Rosetta pythons are bad, though I don't know how bad yet.
I use at least 2 GB of memory for the cache, and at least 30 minutes of latency (write-delay), though less will help.
PS - You may have transmitted it to my Ryzen 3950X, which started seizing up a few days ago, but it has 128 GB of memory and I suspect one stick may be bad.
But try to deal with it at your end, please.
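For the Linux side of the write-cache suggestion above, the numbers quoted (2 GB of cache, 30 minutes of write delay) can be turned into kernel writeback settings. This is a rough sketch, not the poster's actual configuration: it assumes the `vm.dirty_*` sysctls are the knobs meant by "the built-in Linux write cache", and the helper name and the 50% background threshold are choices made here for illustration.

```python
def writeback_sysctls(cache_bytes, delay_minutes, background_fraction=0.5):
    """Map a cache size and write delay onto Linux vm.dirty_* sysctl values."""
    return {
        # Hard ceiling on dirty page-cache data before writers are throttled.
        "vm.dirty_bytes": int(cache_bytes),
        # Threshold at which background flushing starts (assumed: half the ceiling).
        "vm.dirty_background_bytes": int(cache_bytes * background_fraction),
        # Age after which dirty data becomes eligible for writeback, in 1/100 s.
        "vm.dirty_expire_centisecs": int(delay_minutes * 60 * 100),
    }


settings = writeback_sysctls(2 * 1024**3, 30)
for key, value in settings.items():
    print(f"{key} = {value}")
```

The printed lines are in the `key = value` form used by `/etc/sysctl.conf`. Keep in mind that on a machine that reboots unexpectedly, anything still sitting in the cache is lost, so a long delay trades SSD wear against data safety.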
Re: Strange crash/reboot and CMOS corruption only with F@H
@JimF
The drive is an Areca SSD RAID card, with battery back-up and a bank of super-capacitors to protect the write-cache. That card hasn't given any problems yet. To remove it and test with another drive will be a chore that I don't have time yet to do. Maybe on the holiday breaks.
But yes, I also have 128GB of RAM. I tested the memory configuration extensively, as I documented above. Even with other memory sticks. It's not that either. I'm not sure what it is at this point. The problem is really elusive and difficult to trace. Ugh! I hope you haven't caught the Gremlin! It's a really nasty one!
Other than the RAID card above, my system config is an AMD 5950X on the Gigabyte Aorus Master X570S mobo, 4x 32GB Corsair Vengeance LPX 3200 (128GB total), an EVGA 3080Ti with 12GB, a Corsair H170i Capellix CPU AIO cooler, and a Corsair AX1600i PSU.
I really just over-dimensioned it for my needs, expecting to not have any problems with power delivery or CPU overheating. I used the Kryonaut thermal paste from Thermal Grizzly. The CPU stayed in the low 70's (Celsius) when running F@H, with 32 threads. That is well below the 95C max temp expected by AMD. I haven't overclocked it yet, and I don't expect to have to.
Well Happy Holidays everybody. Hopefully I will find the issue some time soon.
- Posts: 61
- Joined: Fri Apr 03, 2020 4:49 pm
- Hardware configuration: Manjaro Linux - AsRock B550 Taichi - Ryzen 5950X - NVidia RTX 4070ti - FAH v8-4.3
- Location: Yorktown, Virginia, USA
Re: Strange crash/reboot and CMOS corruption only with F@H
No solution to offer, just another data point for you:
I run Manjaro Linux on Gigabyte x570S Aorus Master with Ryzen 5950x, NVidia RTX 2070 and GTX 1650, 16c/32t all F@H.
About every 2 to 4 days the system will spontaneously reboot, with errors I will add here to the end of my post. CMOS (ver 53c) is unaffected by reboot.
I have de-rated the RTX 2070 power using the "Coolbits" setting to run at 160W limit.
I have also set the UEFI/BIOS to run the Ryzen on ECO mode.
Both power settings give me nearly the same PPD but run significantly cooler: CPU @ 52°C and GPU @ 81°C.
Errors:
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000001000108
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: TSC 0 ADDR 7fbdc4a4bae1 MISC d012000200000000 SYND 4d000000 IPID 500b000000000
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1639063104 SOCKET 0 APIC 0 microcode a201016
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: TSC 0 ADDR 8a3a8a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1639063104 SOCKET 0 APIC 18 microcode a201016
-Phil
Yorktown, Virginia, USA
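The status word in each "Machine Check" line above packs several flags into its top bits, and unpacking them makes the log easier to read. The sketch below assumes only the upper MCA status bits that Intel and AMD document identically (bits 63 down to 57); the helper names are invented for this example, and it does not try to identify the bank or the error type.

```python
import re

# Upper MCA status bits shared by the Intel and AMD definitions.
STATUS_BITS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context may be corrupt
}

MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]{16})")


def decode_status(status_hex):
    """Return the names of the flag bits set in a 64-bit status word."""
    value = int(status_hex, 16)
    return [name for bit, name in STATUS_BITS.items() if (value >> bit) & 1]


def parse_mce(log_text):
    """Extract CPU, bank and decoded flags from kernel 'mce:' log lines."""
    events = []
    for match in MCE_LINE.finditer(log_text):
        cpu, bank, status = match.groups()
        events.append({"cpu": int(cpu), "bank": int(bank),
                       "status": status, "flags": decode_status(status)})
    return events


# Two of the lines from the log quoted in this post.
sample = """\
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000001000108
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
"""
for event in parse_mce(sample):
    print(event)
```

Both entries decode with UC and PCC set, meaning an uncorrected error that may have corrupted processor state. That would fit a machine that resets rather than recovering, though the flags alone do not say what caused it.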
Re: Strange crash/reboot and CMOS corruption only with F@H
pcwolf wrote: About every 2 to 4 days the system will spontaneously reboot, with errors I will add here to the end of my post.
I really hate to say this but your system is unstable.
You really need to look at it and fix it before you do any more folding.
You should be able to run it continuously for weeks/months on end without issue.
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
- Location: UK
Re: Strange crash/reboot and CMOS corruption only with F@H
@marius @pcwolf
crop from https://wiki.archlinux.org/title/Ryzen ...
With Ryzen 5, particularly the enthusiast models of 5950X and 5900X there seem to be some slight instability issues under Linux, related possibly to the 5.11+ kernel, as shown by this kernel bug. After investigating and reading reports on the Internet I discovered that out of the box, Windows seems to run the CPUs at higher voltage and lower peak frequencies, compared to the stock Linux kernel, which depending on your draw from the silicon lottery could cause a host of random application crashes or hardware errors that lead to reboots. You will recognise those by dmesg logs that look like:
Code: Select all
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
lightbringer kernel: mce: [Hardware Error]: TSC 0 ADDR 7ea8f5b00 MISC d012000000000000 IPID 100b000000000
lightbringer kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636645367 SOCKET 0 APIC d microcode a201016
The CPU ID and the Processor number may vary. To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies.
It might be worth trying slightly higher voltages, to either confirm or discount the above as the cause.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)