High Throughput Resources
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 3
- Joined: Sat Mar 21, 2020 5:53 pm
High Throughput Resources
I'm a member of a large collaboration with access to a respectable number of high throughput computing resources. I've been experimenting and have a small scale batch of CPU jobs now running under HTCondor from a singularity container and I'm now looking at scaling up with a GPU-enabled container.
My current setup is extremely rudimentary: I'm just running multiple instances of the command-line client, and I still have my work cut out to enable periodic checkpointing properly, but for now I'm running where there's little to no contention.
Couple of questions:
1. Has anyone already addressed this? Don't necessarily want to reinvent the wheel etc. Pointers would be appreciated.
2. I'm getting a lot of "17:02:35:ERROR:WU00:FS00:Exception: Could not get an assignment" -- I assume this is just demand outstripping supply?
3. Ideally, I would like to engineer a workflow in which I pre-download a bunch of WUs, send them out with the jobs and upload the results afterwards (avoiding wasting cycles waiting for WUs to become available). Is there any support for doing this? Or any recommended contacts?
-
- Site Moderator
- Posts: 2850
- Joined: Mon Jul 18, 2011 4:44 am
- Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4 - Location: Western Washington
Re: High Throughput Resources
Hi and welcome to the forum! This is fantastic and thank you so much for the contributions.
1. FAHClient as a process does periodic checkpointing; by default every 15 minutes, but you can set that in the config as well. The software also divides resources into "folding slots", each of which can be configured for a particular GPU or a particular number of CPU cores. This way a single FAHClient instance can manage numerous folding slots across a system with many different resources. You can then use FAHControl to connect to FAHClient over the network. The software should run fairly well if each GPU folding slot is set for a single GPU, since the workunits don't really scale efficiently across multiple GPUs. I hope that helps you get started.
2. Yes, "could not get an assignment" and "no WUs available for this configuration" both indicate that demand is outstripping demand and there aren't any workunits in the queue for the CPU or GPU folding slots at that exact moment. The research teams are getting more servers online and more projects into the queue to keep up. There was about a 10x increase in the number of F@h users over the past several weeks, so there's been huge load on the servers and the queue of workunits, but I expect that will be balanced out in the next couple of days.
3. FAHClient can do this. There's a per-slot configuration option called "next-unit-percentage". If set to 90, the minimum, FAHClient will download the next workunit when the current one reaches 90%. You can apply this option to a folding slot through FAHControl or by manually editing the config.xml file, but I recommend using FAHControl if possible. A rough example config is sketched below.
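To make that concrete, here's a rough sketch of a config.xml with the options mentioned above, written out from a shell script. Treat the exact element names and values as illustrative and double-check them against your client version; the 4-core CPU slot and the single GPU slot are just example assumptions for your hardware.
```
# Sketch only: write a minimal config.xml for a headless FAHClient.
# Verify option names against your FAHClient version before relying on this.
mkdir -p test-job
cat > test-job/config.xml <<'EOF'
<config>
  <!-- checkpoint interval in minutes (applies to CPU folding) -->
  <checkpoint v='15'/>
  <!-- fetch the next WU when the current one reaches 90% (can also be set per slot) -->
  <next-unit-percentage v='90'/>
  <!-- one 4-core CPU slot and one GPU slot; adjust to the hardware you actually have -->
  <slot id='0' type='CPU'>
    <cpus v='4'/>
  </slot>
  <slot id='1' type='GPU'/>
</config>
EOF
```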
F@h is now the top computing platform on the planet, and nothing unites people like a dedicated fight against a common enemy. This virus affects all of us. Let's end it together.
Re: High Throughput Resources
I'm just a normal community member, so I can't help with the first one, but based on what I have seen on the forums over the last few years, the project typically doesn't like pre-downloading (caching) WUs; there are a couple of posts out there about why that is bad. COVID-19 has changed a lot, though, so you could always ask. Given the reply above, I fear I may be misunderstanding your question. I also didn't know about the 90% thing, thanks Jesse!
The error message you're seeing is exactly what you thought; the servers are still trying their best and just need a little more time (there has been a roughly 10x increase in FAH's computing power over the last month, from ~100 PFLOPS to ~958 PFLOPS).
Thank you for your contributions, keep up the good work!
Rig: i3-8350K, GTX 1660Ti, GTX 750Ti, 16GB DDR4-3000MHz.
-
- Posts: 3
- Joined: Sat Mar 21, 2020 5:53 pm
Re: High Throughput Resources
Great, thank you both, thrilled to have gotten the ball rolling!
What I have are shared resources where I should assume that jobs will not run continuously for more than a couple of hours, before being evicted and moved to another machine.
So, what I'd really like is a single instantiation of FAHClient (xN, independently) which:
1. Uses a configurable number of cores (check! --cpus n)
2. Processes a single work unit and exits completely with exit code 0 when the job is complete
3. Periodic checkpointing: dump results to the local disk after a configurable length of time / number of iterations and, here's the catch, I'd like it to exit with, say, exit code 77. The specific exit code is required so that the workflow management system recognises that the job is checkpointing and should be resumed when it matches with a slot again. I don't expect FAHClient to support this already, but I can fudge it easily enough with a wrapper script
4. Bring all the data back with the job to the submit location so that it can be shipped back out when the job restarts
What I have in my wrapper so far for the CPU-bound jobs is basically (not sure if markdown is going to work here..):
```
# create and enter a scratch directory for this job
mkdir -p test-job
pushd test-job

# one CPU-only run: 4 cores, no GPU, a single WU at a time,
# and (in theory) exit when done; --checkpoint takes a value in minutes
FAHClient --cpu-usage 100 --cpus 4 --gpu=False \
    --exit-when-done \
    --max-queue 1 \
    --max-units 1 \
    --checkpoint 1

popd
```
I've been experimenting with --cycles to configure an effective checkpoint period, but that, of course, means it exits with exit code zero and the system thinks it's complete. What I could do easily enough is test for the existence/non-existence of files that would only be there if the job was *incomplete* and exit appropriately (rough sketch below).
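Roughly, the exit-code part of the wrapper would look something like this; is_wu_incomplete is just a placeholder, since the real test depends on the answer to question 1 below:
```
# Placeholder check: the real test depends on which files FAHClient
# leaves behind for an unfinished WU (see question 1 below).
is_wu_incomplete() {
    [ -n "$(ls -A test-job/work 2>/dev/null)" ]  # assumed work dir location
}

if is_wu_incomplete; then
    exit 77   # tell the workflow manager: checkpointed, please resume
else
    exit 0    # WU finished and uploaded: normal completion
fi
```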
So, further questions:
1. Can you tell me what files I should look for when running with --cycles that would indicate an *incomplete* job? Or any files which would indicate a *complete* job?
2. The last time a WU completed, I don't think the --exit-when-done option actually resulted in the process stopping; it seemed to just sit there. Is that expected, and if so, is there another native way to run FAHClient such that it starts, does some work, and stops?
Finally, regarding caching: I think X-Wing may have understood my intention/request here -- I'd like to download WUs to the submit location and ship them out with the jobs all in one go. I can understand that's probably not a feature available to the public, though, and it should be OK to let each job download its own data (the throughput isn't overwhelming here).
Thank you so much for your work and answers!
Re: High Throughput Resources
AFAIK stopping after a work unit is finished is done at the slot level. The work unit will run to completion, get uploaded, and new work will not be requested. FAHClient continues to run.
In the numbered subfolders within the work directory, a file named wuresults.dat is created when the work unit finishes. It won't exist for long because it's deleted on successful upload. You could instead watch for the removal of the numbered subfolder itself (sketch below).
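A rough sketch of that idea, assuming the client was started in test-job/ as in your wrapper and that the WU's numbered subfolder is test-job/work/00 (both assumptions; check the actual layout on your install):
```
# Poll until the numbered work subfolder disappears, which (per the note
# above) indicates the WU finished and its results were uploaded.
WORK_DIR="test-job/work/00"   # assumed path; verify on your install

while [ -d "$WORK_DIR" ]; do
    sleep 60
done
echo "WU finished and uploaded"
```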
-
- Site Admin
- Posts: 7943
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2 - Location: W. MA
Re: High Throughput Resources
One addendum to what Jesse_V mentioned about checkpoints: the client can set a value between 3 and 30 minutes. This only applies to folding using the CPU core and does not apply to folding on a GPU.
For GPU folding the checkpoint frequency is set by the researcher and is typically between 2-5% of the progress. Each project has its specific checkpoint value set, so all WUs from that project will be the same. The checkpoint frequency may also be the same across a group of similar projects that examine variants of a specific protein system.
At a checkpoint, the GPU core writes out some data needed for later analysis, runs a sanity check on the progress within a CPU process, and writes the actual checkpoint. Then it continues processing until the next checkpoint.
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Re: High Throughput Resources
X-Wing wrote: ...they typically don't like pre-downloading (caching) WUs, there are a couple of posts out there about why that is bad, though COVID-19 has changed a lot, so you could always ask.
WU caching is still prohibited, both by fiat and by the Quick Return Bonus. A WU which has been cached is worth significantly less than the same WU returned more quickly. The servers note the time a WU is issued and subtract it from the time it's returned, and a significant amount of the bonus is based on minimizing that total time. Even the 90% early distribution has a negative effect on the points you earn.
Then, too, the servers are bandwidth-limited almost 24 hours per day. It would be pretty tricky to pre-download WUs in whatever narrow window opens in the middle of the night, and the upload of the results faces the same problem of saturated bandwidth. (Search the forum for "slow uploads".)
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 3
- Joined: Sat Mar 21, 2020 5:53 pm
Re: High Throughput Resources
Thank you again for the notes.
Bit of progress: jobs are running and completing ok but uploads fail with:
01:39:40:WU00:FS00:Connecting to 155.247.166.219:8080
01:39:40:All slots are done, exiting
01:39:40:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
01:39:41:Clean exit
I don't know how FAHClient wants to connect but fwiw I cannot ping from inside the container with the current settings:
Singularity> ping 155.247.166.219
ping: socket: Operation not permitted
I assume these are related issues - look familiar to anyone?
Re: High Throughput Resources
jclark.astro wrote: ...I don't know how FAHClient wants to connect but fwiw I cannot ping from inside the container with the current settings:
Singularity> ping 155.247.166.219
ping: socket: Operation not permitted
Ping inside the container is the wrong test: the ping binary needs CAP_NET_RAW or the setuid bit, so unless you start your Singularity container as root, ping cannot be used.
Re: High Throughput Resources
How about opening http(s)://155.247.166.219? Can you see the server's landing page?
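If you want to script that check from inside the container, a plain HTTP request is a more direct test than ping. The address and port below come from your log above, and curl is assumed to be available in the container image:
```
# Can the container reach the work server over HTTP?
# 155.247.166.219:8080 is the address/port from the failed upload log.
curl -sv --max-time 15 http://155.247.166.219:8080/ -o /dev/null
echo "curl exit code: $?"   # 0 means the TCP/HTTP path is open
```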
Posting FAH's log:
How to provide enough info to get helpful support.
Re: High Throughput Resources
Ping is also unlikely to get all the way to the work servers.
Being able to download multiple units to a local work server would be great for keeping your machines busy, but it would be terrible for the science. It's never going to be allowed, sorry.
single 1070