How does FAH protect against memory errors?
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 168
- Joined: Tue Apr 07, 2020 2:38 pm
How does FAH protect against memory errors?
Most servers & supercomputers use ECC to prevent memory errors, but FAH relies on mostly consumer grade hardware that generally does not use ECC. Is anything done to detect and protect against memory errors? Are memory errors not a big deal for this kind of system?
-
- Site Admin
- Posts: 7990
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
- Location: W. MA
Re: How does FAH protect against memory errors?
In the case of processing on GPUs, periodic sanity checks are done on the WU data, with those check calculations being done by the CPU. If a sanity check fails, the WU restarts from the previous checkpoint. If too many errors cause restarts, the WU is failed out and a report is sent back so that the WU gets assigned to another system.
For CPU processing, these kinds of errors will usually result in a fault condition such as a NaN error. Again, there will be a restart from a prior checkpoint, etc.
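The checkpoint-and-restart behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual FAH core code: the function names, the `MAX_RESTARTS` limit, and the "all values finite" check are assumptions made for the example, since the thread does not give the real checks or thresholds.

```python
import math

MAX_RESTARTS = 3  # hypothetical limit; the real threshold is not stated in this thread


def is_sane(state):
    """Basic sanity check: every value must be finite (no NaN or infinity)."""
    return all(math.isfinite(x) for x in state)


def run_work_unit(step_fn, initial_state, total_steps, checkpoint_every):
    """Advance the state, rolling back to the last checkpoint when a check fails.

    Returns (final_state, restarts) on success. Raises RuntimeError once
    more than MAX_RESTARTS rollbacks have happened.
    """
    state = list(initial_state)
    checkpoint = (0, list(state))  # (step number, copy of the state at that step)
    restarts = 0
    step = 0
    while step < total_steps:
        state = step_fn(state, step)
        step += 1
        if step % checkpoint_every == 0 or step == total_steps:
            if is_sane(state):
                # Check passed: this becomes the new restart point.
                checkpoint = (step, list(state))
            else:
                restarts += 1
                if restarts > MAX_RESTARTS:
                    raise RuntimeError("work unit failed; report it for reassignment")
                # Check failed: discard the work done since the last good checkpoint.
                step, state = checkpoint[0], list(checkpoint[1])
    return state, restarts
```

A transient fault (one bad value that does not recur) costs only the steps since the last checkpoint, while a persistent fault exhausts the restart budget and the WU is handed to another machine.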
All WUs get basic sanity and other checks when received by the servers. The next Gen WU is created from that return and sent out.
Ultimately the Markov State Model methods being used are statistical in nature, so individual trajectories are only part of the statistics being analyzed. With the calculations being spread over a range of systems, an error that escaped other checks should not be enough to change the final results.
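To give a feel for why one bad trajectory matters little, here is a toy sketch. It assumes a made-up two-state chain (`TRUE_P`), 400 short trajectories, and the simplest possible estimator (row-normalized transition counts); none of this is taken from the FAH analysis pipeline, which is far more involved.

```python
import random

TRUE_P = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical two-state transition matrix


def simulate(rng, length):
    """Generate one trajectory of state indices by sampling from TRUE_P."""
    state = 0
    traj = [state]
    for _ in range(length - 1):
        state = 0 if rng.random() < TRUE_P[state][0] else 1
        traj.append(state)
    return traj


def estimate(trajectories):
    """Pool transition counts over all trajectories and normalize each row."""
    counts = [[0, 0], [0, 0]]
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]


rng = random.Random(0)
clean = [simulate(rng, 250) for _ in range(400)]
# Replace one trajectory with coin-flip noise, standing in for an undetected error.
corrupted = clean[:-1] + [[rng.randrange(2) for _ in range(250)]]

p_clean = estimate(clean)
p_corrupt = estimate(corrupted)
max_shift = max(
    abs(p_clean[i][j] - p_corrupt[i][j]) for i in range(2) for j in range(2)
)
print(f"largest change in any transition probability: {max_shift:.4f}")
```

Because the corrupted trajectory contributes only 249 of roughly 100,000 pooled transitions, it can move each estimated probability only slightly.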
-
- Posts: 168
- Joined: Tue Apr 07, 2020 2:38 pm
Re: How does FAH protect against memory errors?
Thanks Joe_H! That is super cool. I wish I could see how this is implemented in detail. While I'm a complete novice, probability theory is a big area of interest for me; particularly when applied to computers.
Re: How does FAH protect against memory errors?
I've run GPU WUs for a few years, and I occasionally (perhaps 6 times a year, est.) run into errors that I can't classify.
Most of the errors happen due to incorrectly set voltages or overclock settings.
But with those out of the equation, I think modern memory has come a long way.
Since WUs only sit in memory for 1 to 24 hours on most GPUs and CPUs, the chance of an error is also lower.
If WUs were to process for days, ECC memory might be necessary.
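The exposure-time argument can be put in rough numbers. The sketch below assumes memory errors arrive as a Poisson process with a constant rate, so the chance of at least one error in t hours is 1 - exp(-rate * t). The rate used here is invented purely for illustration and is not a measured figure for any hardware.

```python
import math


def prob_at_least_one_error(rate_per_hour, hours):
    """P(one or more events) for a Poisson process with a constant rate."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-rate_per_hour * hours)


HYPOTHETICAL_RATE = 1e-4  # errors per hour; made up for illustration, not a measurement

for hours in (1, 24, 72):
    p = prob_at_least_one_error(HYPOTHETICAL_RATE, hours)
    print(f"{hours:>3} h in memory -> P(error) = {p:.4%}")
```

For small rates the probability grows almost linearly with run time, so a WU that stays in memory three times as long is about three times as likely to be hit.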