Linux reset procedure on stalled WUs due to FAH server error
Posted: Sat Oct 19, 2019 8:45 pm
If you've experienced stalled WUs due to FAH server errors, there are a couple of ways to go about it.
I am trying to better understand the procedures, and find an optimized way to restart the system, without wasting too much time.
If you're like me, folding on Linux with overclocked GPUs and custom fan speeds, you'll know that restarting everything takes a few minutes of your time.
You may have tried restarting FAHClient, but for some reason this doesn't seem to work. FAHControl stays stuck on the previous WUs and no new slots become available, causing FAHControl to freeze (in a state where no GPUs are displayed, waiting for web input).
Running '/etc/init.d/FAHClient log' occasionally shows the database being recaptured, but in most cases you'll be greeted with a 'database locked' message, and FAH can't continue.
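To check for the lock without watching the log scroll by, a minimal sketch along these lines counts the recent log lines that mention it. It assumes the log lives at /var/lib/fahclient/log.txt (the usual location for the packaged client, but verify on your system) and that the message contains the text 'database locked' as quoted above; the function name is made up for illustration.

```shell
#!/bin/sh
# Assumed default log location; override with FAH_LOG if yours differs.
FAH_LOG="${FAH_LOG:-/var/lib/fahclient/log.txt}"

# Print how many of the last N lines of a log file contain the given text
# (case-insensitive, fixed-string match).
# Usage: count_recent_matches FILE TEXT [N]
count_recent_matches() {
    file=$1
    text=$2
    lines=${3:-200}
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # grep -c exits non-zero when the count is 0, so mask that status.
    tail -n "$lines" "$file" | grep -ci -F -- "$text" || true
}
```

For example, 'count_recent_matches "$FAH_LOG" "database locked"' prints 0 when the last 200 lines are clean. The exact wording in your log may differ, so adjust the text if nothing matches while the client is clearly stuck.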
1- The most common procedure is to just restart your PC.
The drawback is twofold:
a. At shutdown, FAHClient hangs, and the PC will take several minutes to shut down unless you 'hard reset' it.
b. At startup, the OC settings, fan speeds, and power cap levels need to be readjusted, taking well over 5 minutes for a single GPU and 10 minutes for multi-GPU systems.
2- Dump the stalled slots and re-add them.
The pro is that GPU fan curves, OC values, and power caps don't need to be readjusted.
This works for occasionally stalled WUs (e.g. in a 4-GPU setup, if 1 or 2 GPUs are stalled).
The con of this method is that you might dump a fully processed, and perhaps perfectly good, WU. (@Bruce, any idea if this is the case?)
3- Kill FAHClient.
Rather than the above 2 solutions, I have found it far easier to just kill FAHClient.
Lubuntu offers 'qps' as task manager (run "sudo qps" in a terminal to get root elevation);
Ubuntu probably uses gnome-system-monitor.
Once it is started with root privileges, right-click on any fahclient processes and kill them.
If you are running headless, you can also use 'sudo top -u fahclient' to locate a process ID (PID), and use that to kill the FAH processes ("sudo kill #####", in which '#####' is the PID FAH is running under).
Then go back to the terminal and start FAHClient again (sudo /etc/init.d/FAHClient start).
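The kill-and-restart steps above could be scripted roughly as follows. This is only a sketch: the helper name 'stop_pids' is made up, and it assumes a plain "kill" (SIGTERM) is worth trying first, with SIGKILL as a fallback for anything still alive after a timeout.

```shell
#!/bin/sh
# Ask the given PIDs to exit (SIGTERM), wait up to TIMEOUT seconds,
# then force-kill (SIGKILL) whatever is still alive.
# Usage: stop_pids TIMEOUT_SECONDS PID...
stop_pids() {
    timeout=$1
    shift
    [ "$#" -gt 0 ] || return 0
    kill -TERM "$@" 2>/dev/null || true
    while [ "$timeout" -gt 0 ]; do
        alive=0
        for pid in "$@"; do
            # kill -0 sends no signal; it only tests whether the PID exists.
            if kill -0 "$pid" 2>/dev/null; then
                alive=1
            fi
        done
        if [ "$alive" -eq 0 ]; then
            return 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$@" 2>/dev/null || true
    return 0
}
```

Run as root, usage would look like 'stop_pids 30 $(pgrep -u fahclient)' followed by '/etc/init.d/FAHClient start'. Here 'pgrep -u fahclient' prints the PIDs of all processes owned by the fahclient user, which replaces reading them off 'top' by hand.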
I have found this works best: no reboot of the system, and no risk of throwing away any processed WUs.
It will release the database lock on WUs.
FAHControl will (*should) show that all inactive GPU slots are downloading actual WUs, and start them as soon as they're ready.
I have yet to fine-tune my solution as to the exact name of the process to kill; in my case (4-GPU system) there are 6 FAHClient processes.
The reason I didn't go with htop is that it shows A LOT of FAHClient PIDs, and it is hard to determine which ones to kill.
But perhaps once I find out exactly which process name to kill, even htop can be used.
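One likely reason htop shows so many entries is that it lists userland threads as separate rows by default (the H key toggles this), whereas 'ps' prints one line per process. A possible way to narrow things down, sketched below with a made-up helper name, is to list the fahclient-owned processes and keep only those whose parent is not itself in the list. Whether terminating just that top-level process is enough to clear the lock is an untested assumption.

```shell
#!/bin/sh
# Read "PID PPID NAME" lines on stdin and print "PID NAME" for every
# process whose parent is not among the listed PIDs (the top-level ones).
top_level_pids() {
    awk '{ pid[NR] = $1; ppid[NR] = $2; name[NR] = $3; seen[$1] = 1 }
         END {
             for (i = 1; i <= NR; i++)
                 if (!(ppid[i] in seen)) print pid[i], name[i]
         }'
}
```

Usage would be 'ps -u fahclient -o pid=,ppid=,comm= | top_level_pids'. The trailing '=' signs suppress the header row, so the function only sees process lines.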