
How does FAH protect against memory errors?

Posted: Wed May 06, 2020 1:50 pm
by NoMoreQuarantine
Most servers and supercomputers use ECC memory to guard against memory errors, but FAH relies mostly on consumer-grade hardware that generally does not use ECC. Is anything done to detect and protect against memory errors? Or are memory errors not a big deal for this kind of system?

Re: How does FAH protect against memory errors?

Posted: Wed May 06, 2020 2:12 pm
by Joe_H
In the case of processing on GPUs, periodic sanity checks are done on the WU data, with the check calculations being done by the CPU. If a sanity check fails, the WU restarts from the previous checkpoint. If too many errors cause restarts, the WU is failed out and a report is sent back so the WU gets assigned to another system.
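The check-and-restart behaviour described above can be sketched roughly as follows. This is only an illustration, not FAH's actual code: the function name `run_with_checkpoints`, the parameters `check_every` and `max_restarts`, and the use of a single float as the "state" are all assumptions made for the example.

```python
import math

def run_with_checkpoints(step, state, n_steps, check_every=10, max_restarts=3):
    """Advance `state` by n_steps, rolling back to the last good checkpoint
    whenever a periodic sanity check fails (hypothetical sketch)."""
    checkpoint = (0, state)   # (step index, last state that passed the check)
    restarts = 0
    i = 0
    while i < n_steps:
        state = step(state, i)
        i += 1
        if i % check_every == 0 or i == n_steps:
            if math.isfinite(state):
                checkpoint = (i, state)        # check passed: save progress
            else:
                restarts += 1
                if restarts > max_restarts:    # give up and report the failure
                    raise RuntimeError("too many restarts; work unit failed")
                i, state = checkpoint          # roll back and redo the work
    return state, restarts
```

A transient fault costs one rollback and the run still finishes, while a fault that repeats on every attempt exhausts `max_restarts` and the work unit is reported as failed.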

For CPU processing, these kinds of errors will usually result in a fault condition such as a NaN error. Again, there will be a restart from a prior checkpoint, etc.
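As a rough illustration of why a memory error often shows up as a NaN or infinity, here is a small sketch that flips a single bit in a 64-bit float. The helper names `flip_bit` and `has_fault` are made up for this example and have nothing to do with the FAH cores themselves.

```python
import math
import struct

def flip_bit(x, bit):
    """Return float64 x with one bit inverted (0 = lowest mantissa bit, 63 = sign)."""
    (n,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", n ^ (1 << bit)))
    return y

def has_fault(values):
    """True if any value is NaN or infinite."""
    return any(not math.isfinite(v) for v in values)

coords = [1.0, 2.0, 3.0]
loud = [flip_bit(coords[0], 62)] + coords[1:]   # top exponent bit: 1.0 becomes inf
quiet = [flip_bit(coords[0], 0)] + coords[1:]   # lowest mantissa bit: tiny change

print(has_fault(loud))    # True
print(has_fault(quiet))   # False
```

A flip in a high exponent bit is caught by a simple finiteness check, but a flip in a low mantissa bit is not, so a check like this cannot catch every error on its own.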

All WUs get basic sanity and other checks when received by the servers. The next Gen WU is created from that return and sent out.

Ultimately the Markov State Model methods being used are statistical in nature, so individual trajectories are only part of the statistics being analyzed. With the calculations being spread over a range of systems, an error that escaped other checks should not be enough to change the final results.
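To give a feel for the statistical argument, here is a toy calculation, not the real Markov State Model analysis. It assumes each trajectory reports how many A-to-B transitions it saw out of a number of steps spent in state A, and pools those counts into one estimate. All numbers are invented for the example.

```python
def pooled_transition_probability(counts):
    """counts: list of (transitions_A_to_B, steps_from_A), one pair per trajectory."""
    total_transitions = sum(t for t, _ in counts)
    total_steps = sum(n for _, n in counts)
    return total_transitions / total_steps

# 1000 hypothetical trajectories, each seeing 30 transitions in 100 steps.
clean = [(30, 100)] * 1000

# The same set, but one trajectory is corrupted and reports 100 out of 100.
corrupted = clean[:-1] + [(100, 100)]

p_clean = pooled_transition_probability(clean)
p_corrupted = pooled_transition_probability(corrupted)
print(p_clean, p_corrupted)
```

With these made-up numbers the pooled estimate moves from 0.3 to about 0.3007, so a single bad trajectory out of a thousand shifts the result by well under one percent.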

Re: How does FAH protect against memory errors?

Posted: Wed May 06, 2020 2:22 pm
by NoMoreQuarantine
Thanks Joe_H! That is super cool. I wish I could see how this is implemented in detail. While I'm a complete novice, probability theory is a big area of interest for me, particularly when applied to computers.

Re: How does FAH protect against memory errors?

Posted: Wed May 06, 2020 4:06 pm
by MeeLee
I've run GPU WUs for a few years, and I occasionally (perhaps 6 times a year, est.) run into errors that I can't classify.
Most of the errors happen due to incorrectly set voltages or overclock settings.
But with those out of the equation, I think modern memory has come a long way.
Since WUs only sit in memory for between 1 and 24 hours on most GPUs and CPUs, the chance of errors is also lower.
If WUs were to process for days, ECC memory might be necessary.
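The point about run time can be put into rough numbers. The sketch below assumes memory errors arrive independently at a constant rate (a Poisson model), and the rate used is purely hypothetical, not a measured figure for any real hardware.

```python
import math

def p_at_least_one_error(rate_per_hour, hours):
    """Probability of one or more errors in `hours`, assuming a constant,
    independent error rate (Poisson model)."""
    return 1.0 - math.exp(-rate_per_hour * hours)

HYPOTHETICAL_RATE = 1e-4   # errors per hour; an invented value for illustration

for hours in (1, 24, 72):
    print(hours, p_at_least_one_error(HYPOTHETICAL_RATE, hours))
```

While the probability is small it grows almost linearly with time, so under this model a WU that stays in memory three times as long is about three times as likely to be hit.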