I am an HPC sysadmin at a university and often run FAH 7.x in slurm jobs, both for testing and to make use of idle nodes. To do this, a slurm job launches a container with FAH 7 installed and runs FAHClient with the options "--exit-when-done --max-units 1 --max-queue 1".
These options don't exist in 8.x. It would be great if 8.x had some way to limit the number of WUs, so that a slurm job has a finite amount of work to do in X time.
Running FAH 8.x in slurm jobs
-
calxalot
- Site Moderator
- Posts: 1840
- Joined: Sat Dec 08, 2007 1:33 am
- Location: San Francisco, CA
- Contact:
Re: Running FAH 8.x in slurm jobs
Relevant enhancement request: https://github.com/FoldingAtHome/fah-cl ... issues/251
See also https://github.com/JWhyFR/fah-v8
I think JWhy could help you with v8.
Meanwhile, there is nothing wrong with using v7.
Re: Running FAH 8.x in slurm jobs
I can try to help but I am no expert ... and I didn't even know v7 had these options:
Code: Select all
exit-when-done <boolean=false>
    Exit when all slots are paused.
max-queue <integer=16>
    Maximum units per slot in the work queue.
max-units <integer=0>
    Process at most this number of units, then pause.
Since none of this exists in V8, I think we could create a startup script (adapted from firedfly’s or mine) that would:
- install python3 and the other prerequisites
- install lufah
- download F@H V8
- configure (with account token + other parameters: GPU only? CPU only?) and launch F@H
- wait a few moments, then check with lufah to see if a work unit has been downloaded and is being processed
- if so, send a "finish" command via lufah
- then wait (i.e. check with lufah in a loop, every X minutes) until the status is "paused" and there are no "units" left in the queue
- check if the WU has been credited (with lufah history, probably)
- and when all this is OK, and if it's necessary to explicitly stop fahclient: kill the fahclient process!
NB: a few things to adjust if you're doing calculations on a multi-GPU setup
Let us know if you think this could work with your setup.
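For the "check in a loop, every X minutes" step above, here is a rough bash sketch of a polling helper with an upper time bound, so the slurm job can't wait forever. `poll_until` is a name made up for this example, not something lufah or F@H ships. The check command you pass in is deliberately left open: how to detect "paused and no units" from lufah's output depends on your lufah version, and I'm not assuming any particular output format here.

```bash
#!/usr/bin/env bash
# poll_until INTERVAL TIMEOUT CMD [ARGS...]
# Hypothetical helper: run CMD every INTERVAL seconds until it succeeds
# (exit status 0) or until another attempt would go past TIMEOUT seconds.
# Returns 0 if CMD succeeded, 1 if the time budget ran out.
poll_until() {
    local interval=$1 timeout=$2
    shift 2
    # SECONDS is a bash builtin: seconds since the shell started.
    local deadline=$(( SECONDS + timeout ))
    while true; do
        # Run the check; success ends the wait immediately.
        if "$@"; then
            return 0
        fi
        # Give up if sleeping once more would pass the deadline.
        if (( SECONDS + interval > deadline )); then
            return 1
        fi
        sleep "$interval"
    done
}
```

In a job script this might look like `poll_until 300 "$max_wait" client_is_idle`, where `client_is_idle` is a function you'd write yourself (a placeholder here) that returns 0 once lufah reports the client as paused with nothing left in the queue. Setting `max_wait` a bit below the slurm time limit leaves room to stop the client cleanly.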
Last edited by JWhy on Thu Apr 30, 2026 10:48 am, edited 1 time in total.
calxalot
Re: Running FAH 8.x in slurm jobs
Although it is not 100% reliable, there is also
Code: Select all
lufah wait-until-paused
which I sometimes use after sending finish.
One could stop the client job after it becomes paused. Or kill -TERM if not using systemd.
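A rough sketch of that sequence in bash, for the non-systemd case. Assumptions: `finish_and_stop` is a made-up name, the finish command is spelled `lufah finish` on your version, lufah talks to the local client by default, and you have the client's PID from when the job launched it. Since wait-until-paused is not 100% reliable, the sketch caps it with coreutils `timeout` rather than letting it block the whole job.

```bash
#!/usr/bin/env bash
# LUFAH can be overridden; it defaults to whatever "lufah" is on PATH.
LUFAH=${LUFAH:-lufah}

# finish_and_stop MAX_WAIT_SECONDS FAH_PID
# Hypothetical helper. Returns 0 if the client paused in time and was sent
# SIGTERM, 1 if sending finish failed, 2 if the wait failed or timed out.
finish_and_stop() {
    local max_wait=$1 fah_pid=$2

    # Ask the client to finish the current WU and not fetch another.
    "$LUFAH" finish || return 1

    # Bounded wait: timeout kills the lufah call after MAX_WAIT_SECONDS.
    timeout "$max_wait" "$LUFAH" wait-until-paused || return 2

    # Not running under systemd here, so signal the client directly.
    kill -TERM "$fah_pid"
}
```

If the function returns 2, the WU is probably still running; whether to kill the client anyway or let slurm end the job is a policy call for your site.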
calxalot
Re: Running FAH 8.x in slurm jobs
@JWhy
I should point out lufah error messages may have changed from what you have in fah-watchdog.sh
Or you should expect that to happen in the next version.
Re: Running FAH 8.x in slurm jobs
Thanks for the heads up!