[Solved, kinda] Strange crash/reboot and CMOS corruption only with F@H
Re: Strange crash/reboot and CMOS corruption only with F@H
@FrankMB
I ran Linpack Extreme in Linux at max settings for about 74 hours. Not a hiccup. This machine is very stable. The only software that causes the silent reset issue is still F@H. Not sure what else I can try.
Thanks,
Marius.
Re: Strange crash/reboot and CMOS corruption only with F@H
Reporting a similar issue: System reboot with a Machine Check Error (MCE) while running F@H for several hours (usually less than 24h running with 16 threads out of 32). No reboot with other stress tests (Prime95 and Linpack Xtreme running for days). Issue occurs on Linux and Windows alike. Tested with default overclock, underclock, memory at 3200 and 2400 MHz. All eventually fail with F@H while other stress tests are stable. There is no CMOS corruption (maybe due to the different mobo).
I suspect F@H is hitting a CPU issue. Maybe it is bad luck with my specimen. I wonder whether other AMD Ryzen 9 5950X users have experienced such an issue.
A typical Linux message shown after reboot:
mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1643461814 SOCKET 0 APIC 14 microcode a201016
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 8a3a5a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Which translates to:
Machine check events logged
Hardware event. This is not a software error.
CPU 10 5 fixed-issue reoder
MISC d012000100000000 ADDR 8a3a5a
bit55 = res23
bit57 = processor context corrupt
bit59 = misc error valid
bit61 = error uncorrected
memory/cache 00000108 MCGSTATUS 0
SYND 4d000000 IPID 500b000000000
mcelog: Unknown CPU type vendor 2 family 25 model 1
Hardware event. This is not a software error.
CPU 0 0 data cache
TIME 1643461814 Sat Jan 29 15:10:14 2022
STATUS 0 MCGSTATUS 0
error 'generic error mem transaction, generic transaction, level 0'
STATUS bea00000CPUID Vendor AMD Family 25 Model 1 Step 0
SOCKET 0 APIC 14 microcode a201016
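For anyone who wants to read the status word by hand: the sketch below unpacks a 64-bit machine-check status value such as bea0000000000108 into the common x86 flag bits and the low 16-bit error code. The function name and the wording of the labels are illustrative, not taken from mcelog, and vendor-specific bits (for example 53 and 55 in this value) are deliberately left undecoded.

```python
# Common x86 machine-check status flag bits (bit number -> meaning).
# Labels are this sketch's own wording, not mcelog output.
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error uncorrected)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Fields of a memory-hierarchy error code laid out as 0000 0001 RRRR TTLL.
TRANSACTIONS = {0: "instruction", 1: "data", 2: "generic"}
LEVELS = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_status(status: int) -> dict:
    """Split a 64-bit machine-check status value into readable parts."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    error_code = status & 0xFFFF
    result = {"flags": flags, "error_code": error_code}
    if (error_code >> 8) == 0x01:  # memory-hierarchy error format
        result["transaction"] = TRANSACTIONS.get((error_code >> 2) & 0b11, "reserved")
        result["level"] = LEVELS[error_code & 0b11]
    return result


if __name__ == "__main__":
    info = decode_status(0xBEA0000000000108)
    for flag in info["flags"]:
        print(flag)
    print(f"error code 0x{info['error_code']:04x}:",
          info.get("transaction"), "transaction,", info.get("level"))
```

For the value above this reports an uncorrected error with processor context corrupt and a generic memory transaction at level 0, which agrees with the mcelog lines.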
A typical Windows error message:
Log Name: System
Source: Microsoft-Windows-WHEA-Logger
Date: 3/21/2022 4:05:37 PM
Event ID: 18
Task Category: None
Level: Error
Keywords:
User: LOCAL SERVICE
Computer: DESKTOP-UFRTUPL
Description:
A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 30
Windows error event XML:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-WHEA-Logger" Guid="{c26c4f3c-3f66-4e99-8f8a-39405cfed220}" />
<EventID>18</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2022-03-21T14:05:37.8429511Z" />
<EventRecordID>2618</EventRecordID>
<Correlation ActivityID="{b3bd9fc6-0489-4f3c-bee7-054ca83d87c7}" />
<Execution ProcessID="2424" ThreadID="4600" />
<Channel>System</Channel>
<Computer>DESKTOP-UFRTUPL</Computer>
<Security UserID="S-1-5-19" />
</System>
<EventData>
<Data Name="ErrorSource">3</Data>
<Data Name="ApicId">30</Data>
<Data Name="MCABank">5</Data>
<Data Name="MciStat">0xbea0000001000108</Data>
<Data Name="MciAddr">0x7ff64e605d51</Data>
<Data Name="MciMisc">0xd01a0ffe00000000</Data>
<Data Name="ErrorType">9</Data>
<Data Name="TransactionType">2</Data>
<Data Name="Participation">256</Data>
<Data Name="RequestType">0</Data>
<Data Name="MemorIO">256</Data>
<Data Name="MemHierarchyLvl">0</Data>
<Data Name="Timeout">256</Data>
<Data Name="OperationType">256</Data>
<Data Name="Channel">256</Data>
<Data Name="Length">936</Data>
<Data Name="RawData">435045521002FFFFFFFF03000100000002000000A80300001A050E00150316140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131FE6FF5E89C91C54CBA8865ABE14913BBA31E15B52C3DD80102000000000000000000000000000000000000000000000058010000C00000000003000001000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000001000000000000000000000000000000000000000000000018020000800000000003000000000000B0A03EDC44A19747B95B53FA242B6E1D0000000000000000000000000000000001000000000000000000000000000000000000000000000098020000100100000003000000000000011D1E8AF94257459C33565E5CC3F7E8000000000000000000000000000000000100000000000000000000000000000000000000000000007F010000000000000002010000000000100FA2000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001E00000000000000000000000000000000000000000000000000000000000000000000000000000007000000000000001E00000000000000100FA2000008201E0B32F87EFFFB8B170000000000000000000000000000000000000000000000000000000000000000F50157A5EFE3DE43AC72249B573FAD2C03000000000000009F00020600000000515D604EF67F00000000000000000000000000000000000000000000000000000200000002000000AD5ACBB62C3DD8011E0000000000000000000000000000000000000005000000080100010000A0BE515D604EF67F000000000000FE0F1AD0000000001E00000000000000B00005000000004D00000000F9010000230000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003B00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Data>
</EventData>
</Event>
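When comparing several of these WHEA events, it may help to pull the key fields out of the XML rather than reading it by eye. The sketch below does that with Python's standard library. The function name and the trimmed sample event are illustrative; the sample keeps only a few fields from the event above and omits RawData.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML.
NS = {"ev": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event above (illustrative subset of fields).
SAMPLE_EVENT = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<EventID>18</EventID>
<TimeCreated SystemTime="2022-03-21T14:05:37.8429511Z" />
</System>
<EventData>
<Data Name="ApicId">30</Data>
<Data Name="MCABank">5</Data>
<Data Name="MciStat">0xbea0000001000108</Data>
<Data Name="MciAddr">0x7ff64e605d51</Data>
</EventData>
</Event>"""


def summarize_whea_event(xml_text: str) -> dict:
    """Extract the machine-check fields from one WHEA-Logger event."""
    root = ET.fromstring(xml_text)
    # Map each <Data Name="..."> element to its text content.
    data = {d.get("Name"): (d.text or "")
            for d in root.findall("ev:EventData/ev:Data", NS)}
    return {
        "event_id": int(root.findtext("ev:System/ev:EventID", namespaces=NS)),
        "time": root.find("ev:System/ev:TimeCreated", NS).get("SystemTime"),
        "apic_id": int(data["ApicId"]),
        "bank": int(data["MCABank"]),
        "status": int(data["MciStat"], 16),
        "address": int(data["MciAddr"], 16),
    }


if __name__ == "__main__":
    summary = summarize_whea_event(SAMPLE_EVENT)
    print(summary["time"], "bank", summary["bank"],
          "APIC", summary["apic_id"], hex(summary["status"]))
```

Collecting the bank and APIC ID from each crash this way makes it easier to see whether the errors always hit the same bank or wander across cores.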
My setup:
Mobo: GIGABYTE B550 AORUS ELITE AX (latest BIOS)
CPU: AMD Ryzen 9 5950X
Memory: Kingston HyperX Predator RGB 2x32GB DDR4 3200MHz CL16
Cooler: Arctic Liquid Freezer II 120
PSU: Corsair TX650M Modular 650W Gold Active PFC 12cm Fan
Re: Strange crash/reboot and CMOS corruption only with F@H
Hi @mitalapo,
Thanks for the info. It seems there might be an issue when running F@H on 5950Xs in Gigabyte motherboards; you are the third person to report a similar problem in this thread. I don't get any MCEs; the system just reboots silently, and only when running F@H.
Re: Strange crash/reboot and CMOS corruption only with F@H
@Marius
Got this tip from the F@H Discord channel:
FN4 on Discord wrote: In BIOS try setting power idle control to typical current idle. I have seen similar problems with Ryzen 3000 hardware
My system is at the repair shop for a possible RMA, so I can't test it myself right now.
Another comment:
FN4 on Discord wrote: F@H puts load on AVX2, which some other "stress tests" don't load
I noticed that Prime95 can be tuned to use AVX2 (the default would be AVX512 on the Ryzen 9), so this may be worth a try too.
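Before tuning a stress test to a particular instruction set, it can be worth checking which vector extensions the CPU actually advertises. On Linux the flags line of /proc/cpuinfo lists them. The sketch below is a minimal check; the function name is made up for illustration, and it only looks for the avx, avx2, and avx512f flags.

```python
def vector_extensions(cpuinfo_text: str) -> dict:
    """Report which of a few vector extensions appear in cpuinfo flags."""
    wanted = ("avx", "avx2", "avx512f")
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the colon is a space-separated flag list.
            flags = set(line.split(":", 1)[1].split())
            return {name: name in flags for name in wanted}
    # No flags line found (for example, not a Linux x86 system).
    return {name: False for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(vector_extensions(f.read()))
    except FileNotFoundError:
        print("no /proc/cpuinfo on this system")
```

If avx512f comes back False, a stress test cannot be using AVX512 on that machine, whatever its default setting is.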
Re: Strange crash/reboot and CMOS corruption only with F@H
@mitalapo
Thanks for the info, will try those settings.
Re: Strange crash/reboot and CMOS corruption only with F@H
I am sorry to report that setting "power idle control" to "typical current idle" in the BIOS did not prevent the crash (though it took 4 days to crash compared to 1-2 days without this setting).
Re: Strange crash/reboot and CMOS corruption only with F@H
@mitalapo
Sorry to hear that. I also tried running the Linux Prime95 AVX stress tests, but that did not generate any problems, even after several hours. The mystery continues.
Re: Strange crash/reboot and CMOS corruption only with F@H
@Marius
I have been running the Prime95 stress test with AVX512 disabled (assuming this implies AVX2 is used more) for the last two weeks without any issue. F@H remains the only way I know to crash the CPU.
Re: Strange crash/reboot and CMOS corruption only with F@H
@mitalapo
Yes, that has been my experience as well. I hope we can find what the problem is. Thanks for sharing your data.
- Posts: 2
- Joined: Mon Jun 06, 2022 12:14 pm
Re: Strange crash/reboot and CMOS corruption only with F@H
I have the same issue too. I have two desktops with 5950X CPUs. One has an MSI B550 motherboard; the other has an ASUS C8DH (Thor 1200W PSU, Acer Predator 3600c14, Colorful 3090 Vulcan OC; all parts are top quality). I have run many test tools on both computers, and I use both for other science programs that run 24/7; they only crash when running FAH on the CPU. On the C8DH under Windows I found WHEA 18 events when the computer rebooted abnormally. The WHEA 18 errors occur on random cores. I tried an RMA for a new CPU, new sets of RAM, disabling PBO, disabling turbo boost, adjusting voltage calibration, adding voltage, updating the BIOS, changing the OS, and reinstalling the OS; none of them helped. I guess there is something about switching work states in the FAH code that is incompatible with using the whole 5950X CPU. The problem happens about once per day (running FAH for the full 24 hours).
Re: Strange crash/reboot and CMOS corruption only with F@H
Thanks for sharing your data, @ElephantFly. It appears FAH code really has problems with the 5950x. I hope developers are paying attention.
- Posts: 942
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x; 7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
Re: Strange crash/reboot and CMOS corruption only with F@H
FAH is not specifically causing this issue. FAH uses AVX256 instructions, which tax the CPU like only a few other power viruses do. The only thing that taxes the CPU more is AVX512. The only fix for the 5950x issue is an RMA with AMD; they send you a new one. Test whether the replacement has the same issue, and if it does, RMA again.
I am now on my 3rd 5950x, and finally this one seems to be stable under FAH.
The 5900x has similar issues, though they are far less widespread.
The 5800x3d seems to be stable so far.
FAH Omega tester
Re: Strange crash/reboot and CMOS corruption only with F@H
@muziqaz
Actually, this problem started some time ago on my previous system, which was based on the AMD 1950x on an Asus board. It showed the same symptoms, and that's why I upgraded my system. No other software causes this problem, so I doubt this is hardware. I ran AVX benchmarks on Linux and had no problems either.
Re: Strange crash/reboot and CMOS corruption only with F@H
No other software is hitting cache like F@H. Benchmarks exist to benchmark very specific things, not to simulate complicated proteins. Linpack and Prime95 are ancient pieces of code with patches to include AVX2. Is there as much cache communication in Prime95 as there is in F@H? I doubt it.
I believe I've seen someone crash their 5000 series with Blender, since that can be set up to render for months on the CPU. Also, we FAH people are only a select few in the sea of 5000 series users who are having the exact same WHEA error issues while doing other stuff, or doing nothing. The majority of WHEA-Logger victims have never folded in their lives. If it were a software thing, the 5800x3d would be crashing as well, and so would my 3rd 5950x. But they are not.
Re: Strange crash/reboot and CMOS corruption only with F@H
@muziqaz
Thanks for sharing your data. I'm going to wait until Zen 4 comes out and upgrade my system then. If it is a hardware issue, hopefully it will be gone in the new architecture. Meanwhile, I'm contributing to other BOINC projects like Rosetta@home and GPUGRID without problems. F@H will have to wait until then.