F@H pausing itself?

Moderators: Site Moderators, FAHC Science Team

muziqaz
Posts: 1544
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: F@H pausing itself?

Post by muziqaz »

It can be the RAM or the CPU memory controller, but it is most likely the GPU, GPU memory controller, VRAM, or GPU VRM.
If it were Linux, we could easily blame the FAHCores, but on Windows things are relatively stable on that front.
FAH Omega tester
vica153
Posts: 31
Joined: Thu Mar 19, 2020 7:29 pm

Re: F@H pausing itself?

Post by vica153 »

I upped the GPU voltage by 6 mV to 963 mV @ 1801 MHz, and it's been stable for ~50 WUs over the last few weeks. So apparently my perfectly stable GPU wasn't as stable as I had thought. Interesting that F@H seems to be more sensitive than any other usage.
muziqaz
Posts: 1544
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: F@H pausing itself?

Post by muziqaz »

It is not more sensitive, it is just properly loading your hardware.
FAH Omega tester
arisu
Posts: 264
Joined: Mon Feb 24, 2025 11:11 pm

Re: F@H pausing itself?

Post by arisu »

Mild instability when playing video games causes artifacts that you probably won't notice. Mild instability when folding can cause errors in the simulation that make it converge to an impossible or unrealistic state, which is caught by sanity checks (in this case, the position of a particle has become NaN). Folding doesn't make a system less stable, but it will catch small instabilities that other workloads will not.

Code:

05:04:10:I1:WU145:An exception occurred at step 17067: Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan
05:04:10:I1:WU145:Max number of attempts to resume from last checkpoint (2) reached. Aborting.
I think this means that the incorrect calculation happened before the last checkpoint, so the bad simulation state was saved into the checkpoint. When the core tried to resume from the checkpoint with the bad data, the simulation converged into a state where a particle's position was NaN (an invalid floating-point number). It retried twice and reached that state each time, so the core concluded that the checkpoint itself contained bad data (which was probably true).
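To make that retry behaviour concrete, here is a rough sketch of such a loop, using the OpenMM Python API: advance the simulation in chunks, checkpoint after each clean chunk, and on a NaN roll back to the last checkpoint, giving up after a fixed number of attempts. This is a hypothetical illustration of the logic described in the log, not the actual FAHCore source; the system, step counts, and exception class are made up for the example.

Code:

# Hypothetical sketch only -- not the actual FAHCore source. Uses the real
# OpenMM Python API to mimic the behaviour described above: run in chunks,
# checkpoint after each clean chunk, and on a NaN roll back to the last
# checkpoint, aborting after a fixed number of resume attempts.
import numpy as np
import openmm as mm
import openmm.unit as unit

# Minimal toy system (two bonded particles) so the loop below can run.
system = mm.System()
for _ in range(2):
    system.addParticle(1.0 * unit.amu)
bond = mm.HarmonicBondForce()
bond.addBond(0, 1, 0.1 * unit.nanometer,
             1000.0 * unit.kilojoule_per_mole / unit.nanometer**2)
system.addForce(bond)
integrator = mm.VerletIntegrator(2.0 * unit.femtoseconds)
context = mm.Context(system, integrator)
context.setPositions(np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]) * unit.nanometer)

MAX_RESUME_ATTEMPTS = 2       # the "(2)" from the log line above
STEPS_PER_CHECKPOINT = 1000
TOTAL_STEPS = 10000

class NanDetected(RuntimeError):
    """Raised when the sanity check finds a non-finite particle coordinate."""

def positions_are_finite(ctx):
    pos = ctx.getState(getPositions=True).getPositions(asNumpy=True)
    return bool(np.isfinite(pos.value_in_unit(unit.nanometer)).all())

checkpoint = context.createCheckpoint()   # in-memory checkpoint (bytes)
steps_done = 0
attempts = 0

while steps_done < TOTAL_STEPS:
    try:
        integrator.step(STEPS_PER_CHECKPOINT)
        if not positions_are_finite(context):
            raise NanDetected("Particle coordinate is NaN")
    except NanDetected:
        attempts += 1
        if attempts >= MAX_RESUME_ATTEMPTS:
            # If bad data was already saved into the checkpoint, every
            # resume ends up here again, so the work unit is aborted.
            print(f"Max number of attempts to resume from last checkpoint "
                  f"({MAX_RESUME_ATTEMPTS}) reached. Aborting.")
            break
        context.loadCheckpoint(checkpoint)   # roll back and retry the chunk
        continue
    # Chunk finished with finite coordinates: record progress, refresh checkpoint.
    steps_done += STEPS_PER_CHECKPOINT
    checkpoint = context.createCheckpoint()
    attempts = 0

In a scenario like the one in the log, the corrupted (but still finite) state has already been saved into the checkpoint, so every resume converges to NaN again, the attempt counter runs out, and the work unit is aborted.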