Running FAHClient on cloud resources on temporary VMs

Post by **Gavelock** » Thu May 07, 2020 9:44 pm

Hello!

I work in an organization. We have a private cloud for our business needs.
The problem is that the cloud is never 100% busy, there are always some resources available: 40-100 CPU cores.
We have an idea to use these "free" resources for Folding@home.
The Operating System on VMs is CentOS7.
Unfortunately, we cannot allow VMs to run for a week continuously.
What we can do is to create a new VM, complete exactly one work unit, and delete the VM. If there are still free resources available: repeat.
What I need right now is a command like:
FAHClient --amount-of-workunits=1 --user=username --team=12345 --passkey=***** --gpu=false --cpu-usage=100

This command should request a work unit and, when that unit is done, exit with code 0.
I did not find anything like that in the FAHClient help. I tried the cycles option, but it does something different.

Basically, there are two questions:
1. Is that use case with a cloud useful for the Folding@home project? (VMs created in the cloud and removed after one work unit is done)
2. If the first answer is yes, how can we restrict the number of work units done by FAHClient during one run?

Post by **PantherX** » Fri May 08, 2020 5:18 am

Welcome to the F@H Forum Gavelock,

This is the option that would work for you:

Code:

  max-units <integer=0>
    Process at most this number of units, then pause.

You can experiment with the value, say 3 WUs, which could potentially be finished within 1 week, assuming the VM runs 24/7 and has multiple CPUs to fold with.

Since you're using company-owned hardware, please ensure that you have permission (usually written) from the person authorized to make such decisions (Internal IT, CTO, etc.). Folding on CPUs is valuable and important scientific work, so whatever your business can contribute would be appreciated.

Post by **Gavelock** » Fri May 08, 2020 12:47 pm

Hi PantherX,

Thank you for your answer!
I have started a test run with the max-units parameter.
The company is interested in helping COVID-19 research projects. Right now it is just a request to study the possibility of participating in F@H. Once that is done, I hope we will run real campaigns.

Post by **MeeLee** » Fri May 08, 2020 2:39 pm

I think it'll be better to run a script on your servers that pauses the VMs once your spare resources drop below a set number of threads.
Run one major VM running FAH on multiple cores, and run a few smaller ones that you can easily pause (like running 4 to 8 cores).

That way you don't have to set up and reload each VM.
As long as the (average) WU is able to continue within ~8-14 hours (on average hardware of ~3 GHz quad core or more), it should make the deadline.
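
A minimal sketch of the kind of pause script suggested above. It assumes the number of spare threads is obtained from some site-specific source (cloud scheduler, monitoring), and that the installed v7 client accepts `FAHClient --send-pause` and `FAHClient --send-unpause`; check `FAHClient --help` before relying on them. The function names and the threshold are hypothetical.

```shell
#!/bin/bash
# Hypothetical threshold: pause folding when fewer spare threads remain.
MIN_SPARE_THREADS=4

# Print "pause" when the spare thread count is below the threshold,
# otherwise print "unpause".
decide_action() {
    local spare="$1" threshold="$2"
    if [ "$spare" -lt "$threshold" ]; then
        echo pause
    else
        echo unpause
    fi
}

# Apply the decision to the local client.
# Assumption: the v7 FAHClient --send-pause / --send-unpause commands exist.
apply_action() {
    case "$1" in
        pause)   FAHClient --send-pause ;;
        unpause) FAHClient --send-unpause ;;
    esac
}
```

A cron job or loop could then call `apply_action "$(decide_action "$spare" "$MIN_SPARE_THREADS")"`, where `$spare` comes from whatever the site uses to measure free capacity.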

Post by **[WHGT]Cyberman** » Mon May 11, 2020 5:07 am

Instead of a full VM just for FAH, you could also run FAH inside a docker container that runs inside any other VM.
Probably somewhat less efficient, but much less work to set up every time.
There are several Dockerfiles on Docker Hub to use as inspiration.

Post by **PantherX** » Mon May 11, 2020 7:09 am

If you're planning on using Docker, have a look here (https://github.com/FoldingAtHome/containers). If you're planning to use VMware, then have a look here (https://flings.vmware.com/vmware-applia ... lding-home). Please note that the VMware appliance isn't officially supported.

Post by **bruce** » Tue May 12, 2020 2:12 am

See also enhancement suggestion: https://github.com/FoldingAtHome/fah-issues/issues/1474

Post by **Catalina588** » Tue May 12, 2020 2:37 pm

See also issue with managing preemptible VMs https://github.com/FoldingAtHome/fah-issues/issues/1458

Post by **PeterGarlic** » Tue May 12, 2020 3:29 pm

Hi Gavelock,
I have a similar situation, and I would like to ask what configuration you are using for your VMs (vCPU, RAM, disk).
Like you, we are testing a private cloud deployment (KVM clusters), and our next step is to find the best VM configuration for maximum performance.
Thanks in advance

Post by **PantherX** » Wed May 13, 2020 7:41 am

The most stable CPU counts are 2, 4, 8, 12, and 16. RAM would be what the OS needs plus a bit more, as F@H isn't RAM intensive when folding on CPUs only. For storage, a fast disk means less time writing checkpoints, but F@H isn't disk heavy; it only touches the disk when reading/writing checkpoints and packing/unpacking WUs to be sent/received.

Post by **Neil-B** » Wed May 13, 2020 7:50 am

24 and 32 are also pretty rock solid, so if your VMs can scale to that, they will complete WUs much faster: depending on the underlying hardware and the specific project, probably in the 45 minutes to 4 hours window.
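
As an illustration of the sizing advice above, here is a small sketch that picks the largest of the thread counts named in this thread (2, 4, 8, 12, 16, 24, 32) that fits the vCPUs available. The function name `pick_thread_count` is hypothetical, and the list only reflects what posters here reported as stable, not an official specification.

```shell
#!/bin/bash
# Thread counts reported as stable in this thread.
STABLE_COUNTS="2 4 8 12 16 24 32"

# Print the largest stable count that does not exceed the available vCPUs.
# If fewer than 2 vCPUs are available, print the available count unchanged.
pick_thread_count() {
    local available="$1" best="$1" n
    for n in $STABLE_COUNTS; do
        if [ "$n" -le "$available" ]; then
            best="$n"
        fi
    done
    echo "$best"
}
```

For example, `pick_thread_count "$(nproc)"` on a 6-vCPU VM prints 4, which could then be used when sizing the VM or the folding slot.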

Post by **Gavelock** » Thu Sep 03, 2020 2:53 pm

Hello PeterGarlic,

We are using 1 CPU core, 4 GB RAM, 15 GB disks, CentOS7 for that task. That was done to allow filling even the smallest pieces of free CPU resources.

Kind regards,

Post by **Gavelock** » Thu Sep 03, 2020 3:24 pm

Hello PantherX,

Thanks again for your help with FAHClient. I would like to share some information about our solution.
At the institute (Joint Institute for Nuclear Research) we have a cloud. Some other members of our institute also have clouds. These clouds are partially used to run batch jobs, either on dedicated resources or on free ones. A batch job here is a shell script to be executed. All clouds are joined together with DIRAC Interware, an open-source platform used in science to organize distributed heterogeneous systems and run High Throughput Computing workloads on them. When jobs for cloud resources appear, DIRAC spawns VMs on the available clouds. After contextualization, each VM asks the central DIRAC service for one job, and DIRAC sends it one job from the queue. When the job is done, the VM asks DIRAC to delete it. If there are still jobs in the queue, DIRAC tries to spawn new VMs on the freed resources.

So the task was to create a shell script that runs FAHClient as a job and finishes once the F@H work unit is completed. The shell script for the job is super simple:

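Code:

#!/bin/bash
# Trace each command in the job log.
set -x
# $1 = F@H user name, $2 = F@H passkey (both passed in by the DIRAC job).
echo "$1"
echo "$2"
# Fold exactly one work unit, then exit.
FAHClient --cause=covid-19 --user="$1" --team=265602 --passkey="$2" --gpu=false --cycles=-1 --cpu-usage=100 --exit-when-done --max-units=1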
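
A possible hardening of the job script above, offered only as a sketch: it checks that both parameters were supplied and masks the passkey before it reaches the job log. The helper names `check_job_args` and `mask_passkey`, and the script name `fah_job.sh` in the usage message, are hypothetical and not part of the original solution.

```shell
#!/bin/bash
# Return 0 only when exactly two non-empty arguments (user, passkey) are given.
check_job_args() {
    if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: fah_job.sh <fah-user> <fah-passkey>" >&2
        return 2
    fi
    return 0
}

# Print a masked form of the passkey for logging: four asterisks followed by
# the last four characters, or only the asterisks for very short keys.
mask_passkey() {
    local key="$1"
    if [ "${#key}" -le 4 ]; then
        echo "****"
    else
        echo "****${key: -4}"
    fi
}
```

In the job script this could be used as `check_job_args "$@" || exit 2`, followed by `echo "passkey: $(mask_passkey "$2")"` in place of echoing the raw value. Note that `set -x` would still print the full FAHClient command line, so tracing would need to be switched off around that call.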

Another part is a program that sends jobs to the DIRAC job queue, but it is tied closely to the DIRAC API, so I will not post it here. This program checks the status of queues and resources and sends F@H jobs to the queue. Each job carries parameters that depend on the resource on which it will run. The parameters contain the F@H username and F@H passkey, which allows us to keep track of each cloud in the joint infrastructure.

Our team is Joint Institute for Nuclear Research, ID: 265602. It has been 3 months since the start of this activity. The team is ranked around 7000 and has received 23M credits (https://stats.foldingathome.org/team/265602). We are happy that idle resources are now used for a good cause.

Thank you PantherX for your help!
Thanks, everybody, for reacting to this thread! It was a surprise for me when I came here today. I found very interesting suggestions and ideas.

Kind regards,
Igor Pelevanyuk

Post by **gunnarre** » Fri Sep 04, 2020 8:20 am

Thank you for helping out with the research.

The advantage to running one WU at a time like you do - compared to running it as an interruptible instance - is that you'll more likely complete the work within the timeout. So that is preferable if the alternative is to have the instance paused for days - which might cause the effort to be wasted.

On a very heterogeneous platform, doing it as you have, with just one CPU thread per instance, may indeed be the way to go. But also note that CPU folding takes good advantage of multi-threading, so for most VM hosts it might be better to run just one or a few instances with, say, 8 or 16 threads on low priority to use idle resources.

Post by **Neil-B** » Fri Sep 04, 2020 11:44 am

... so to consider ... whilst a multi-core VM will tie up more cores, it will do so for a shorter time ... even just 2 or 4 cores will significantly assist the science by returning WUs quicker than 2x or 4x single-core VMs
