[Solved, kinda] Strange crash/reboot and CMOS corruption only with F@H
Posted: Thu Nov 04, 2021 3:59 am
[UPDATE, 10/25/22]: @FrankMB has reported that simply changing his motherboard solved the problem. My suspicions that the problem was in software were wrong. I replaced the CPU with an AMD 7950X, got a new motherboard from Asus and new DDR5 RAM, and the problem has gone away. My suspicions now fall on the power delivery circuits of the Gigabyte X570S Aorus Master motherboard, which seem to fail under high load. Thanks to everyone who contributed.
I've been supporting F@H since the beginning of the pandemic, around March 2020. My Gentoo Linux box was an AMD Threadripper 1950X on an Asus Zenith Extreme motherboard, with 128GB RAM and two Nvidia 1070 FE cards in CUDA mode. I used this system for about a year and a half, running the F@H Linux CPU and GPU clients without problems.
A few weeks ago I upgraded the GPU, from the two 1070 GPUs to a single 3080 Ti card from EVGA, with the FTW3 cooling system. The single 3080 Ti in CUDA mode is around three times faster than the two 1070s, but then I started experiencing strange crashes where the PC rebooted inexplicably when Folding. I could not detect any errors in the logs; the system just rebooted. After that, the BIOS/UEFI got stuck at some point trying to find the video card, and all I got was a black screen.
What I found out is that the CMOS/NVRAM had been corrupted, and the system would only boot again after I cleared the CMOS and reconfigured the BIOS settings. I checked the battery and replaced it with another one, just in case, but the battery was not the problem.
The reboot problem started happening more frequently, and at random. It could happen only minutes after starting the F@H client, or hours later. After a few days of this, the motherboard wouldn't boot anymore, not even after clearing the CMOS. The diagnostic display showed that the CMOS was corrupted, and I couldn't even get to the UEFI configuration screen.
So I replaced that system with an AMD 5950X on a Gigabyte X570S Aorus Master motherboard, but kept the other components. I ran memtest86 overnight to make sure the memory timing was correct, and found no problem. As soon as I configured the system and started Folding, the crash/reboot/corrupt CMOS problem happened again. The symptoms were the same; I had to clear the CMOS to be able to reboot.
I started suspecting the power supply, which is an EVGA Supernova 1600 P2, even though it was operating at only around 50% capacity with the current configuration. To test that scenario, I booted into Windows and ran the FurMark benchmark full-screen at 4K resolution, with the CPU burner running in the background on all 16 cores. I left this running for about 10 hours, burning around 850W, without problems. I took that to mean that neither the power supply nor any of the other hardware components was the problem.
So I booted back into Linux and started the F@H client again, but this time running only the CPU threads, with the GPU idle. It ran for about 48 hours, but then the crash/reboot/corrupt CMOS problem happened again. I also tested running from 16 to 32 CPU threads, with the same results. When running only the GPU client, it would crash within minutes to a few hours.
I then tested the Windows F@H client in CPU-only, GPU-only and CPU + GPU configurations. The results were the same as with the Linux client; I would get the same crash/reboot/corrupt CMOS problem.
It might be worth mentioning that I did not overclock the CPU or GPU, and that the temps for the CPU were about 70C when running all 32 threads, which is acceptable. The GPU was around 82C, which is also in the acceptable range.
So I started suspecting that the F@H client itself was causing the problem somehow. To test this, I joined the BOINC/Rosetta project and ran their CPU client for several days, in both Windows and Linux, and found no problems.
Given that I tested all my hardware with different tools and found no obvious problem, and that the crash/reboot/corrupt CMOS problem only happens when running the F@H client, I think that there is some software problem with the F@H client. The version I tested with was 7.6.13, obtained from the Gentoo repository (sci-biology/foldingathome-7.6.13-r1). The Windows version was the same.
Unfortunately, I really can't provide much more information about the crash, because the system simply reboots without leaving any messages in the log file. I will therefore not be able to continue contributing to this project until this crash/reboot/corrupt CMOS problem is diagnosed and fixed.
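For anyone chasing a similar reset that leaves nothing in the logs, one option is to record sensor readings in a file that is forced to disk after every sample, so the last values before the reboot survive. The following is only a minimal sketch, assuming a Linux system that exposes temperatures through the hwmon sysfs interface (`/sys/class/hwmon/hwmon*/temp*_input`, in millidegrees Celsius). The function names and the `sensor_trace.log` file name are made up for illustration and are not part of F@H or any other tool mentioned here.

```python
import os
import time
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees_C} for each readable temp*_input file."""
    readings = {}
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        try:
            chip = (chip_dir / "name").read_text().strip()
        except OSError:
            chip = "unknown"
        for sensor in sorted(chip_dir.glob("temp*_input")):
            try:
                # hwmon reports temperatures in millidegrees Celsius
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now, skip it
            label = sensor.name[: -len("_input")]
            readings[f"{chip_dir.name}:{chip}/{label}"] = millideg / 1000.0
    return readings


def append_durably(path, line):
    """Append one line and fsync it, so it is on disk before a sudden reset."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def log_forever(path="sensor_trace.log", interval=2.0):
    """Hypothetical logging loop: one timestamped line every `interval` seconds."""
    while True:
        temps = read_hwmon_temps()
        fields = " ".join(f"{k}={v:.1f}" for k, v in temps.items())
        append_durably(path, f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}")
        time.sleep(interval)
```

Calling `log_forever()` in a second terminal before starting the F@H client would leave a trace whose last line shows the temperatures a couple of seconds before the reset. It cannot show voltage droop or VRM faults unless the board's driver happens to expose those as sensors, so an unremarkable trace does not clear the hardware.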