High Throughput Resources

jclark.astro
Posts: 3
Joined: Sat Mar 21, 2020 5:53 pm

High Throughput Resources

Post by jclark.astro »

I'm a member of a large collaboration with access to a respectable number of high-throughput computing resources. I've been experimenting and have a small-scale batch of CPU jobs running under HTCondor from a Singularity container, and I'm now looking at scaling up with a GPU-enabled container.

My current setup is extremely rudimentary: I'm just running multiple instances of the command-line client. I still have my work cut out to really enable periodic checkpointing, but for now I'm just running where there's little to no contention.

Couple of questions:
1. Has anyone already addressed this? Don't necessarily want to reinvent the wheel etc. Pointers would be appreciated.
2. I'm getting a lot of "17:02:35:ERROR:WU00:FS00:Exception: Could not get an assignment" -- I assume this is just demand outstripping supply?
3. Ideally, I would like to engineer a workflow in which I pre-download a bunch of WUs, send them out with the jobs, and upload the results afterwards (to avoid wasting cycles waiting for WUs to become available). Is there any support for doing this? Or any recommended contacts?
Jesse_V
Site Moderator
Posts: 2850
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: High Throughput Resources

Post by Jesse_V »

Hi and welcome to the forum! This is fantastic and thank you so much for the contributions.

1. FAHClient as a process does periodic checkpointing; by default every 15 minutes, but you can set that in the config as well. The software also divides resources into "folding slots", each of which can be configured for a particular GPU or a particular number of CPU cores. This way a single FAHClient instance can manage numerous folding slots across a system with many different resources. You can then use FAHControl to connect to FAHClient over the network. The software should run fairly well if each GPU folding slot is set for a single GPU, since the work units don't really scale efficiently across multiple GPUs. I hope that helps you get started.

2. Yes, "could not get an assignment" and "no WUs available for this configuration" both indicate that demand is outstripping supply and there aren't any work units in the queue for the CPU or GPU folding slots at that exact moment. The research teams are getting more servers online and more projects into the queue to keep up. There was about a 10x increase in the number of F@h users over the past several weeks, so there has been a huge load on the servers and the work unit queue, but I expect that will balance out in the next couple of days.

3. FAHClient can do this. There's a configuration option called "next-unit-percentage". If it's set to 90, the minimum, then FAHClient will download another work unit when the current one reaches 90%. You can set this option on the folding slot through FAHControl or by manually editing the config.xml file, but I recommend using FAHControl if possible.
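For illustration, the relevant parts of config.xml might look roughly like this (the values are only examples -- generate your real config through FAHControl and adjust from there):
```
<config>
  <!-- CPU checkpoint interval in minutes -->
  <checkpoint v='15'/>
  <!-- download the next WU when the current one reaches 90% -->
  <next-unit-percentage v='90'/>
  <!-- one CPU slot with four cores -->
  <slot id='0' type='CPU'>
    <cpus v='4'/>
  </slot>
  <!-- one slot per GPU -->
  <slot id='1' type='GPU'/>
</config>
```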
F@h is now the top computing platform on the planet and nothing unites people like a dedicated fight against a common enemy. This virus affects all of us. Let's end it together.
X-Wing
Posts: 54
Joined: Sat Apr 27, 2019 11:43 pm

Re: High Throughput Resources

Post by X-Wing »

I'm just a normal community member, so I can't help with the first one, but based on what I have seen on the forums over the last few years, they typically don't like pre-downloading (caching) WUs. There are a couple of posts out there about why that is bad, though COVID-19 has changed a lot, so you could always ask. Given the reply above, though, I fear I may be misunderstanding your question. I also didn't know about the 90% thing, thanks Jesse!

The error message you're seeing is exactly what you thought; the servers are still trying their best, they just need a little more time (there has been a 10x increase in FAH's computing power over the last month, from roughly 100 PFLOPS to roughly 958 PFLOPS).

Thank you for your contributions, keep up the good work!
Rig: i3-8350K, GTX 1660Ti, GTX 750Ti, 16GB DDR4-3000MHz.
jclark.astro
Posts: 3
Joined: Sat Mar 21, 2020 5:53 pm

Re: High Throughput Resources

Post by jclark.astro »

Great, thank you both, thrilled to have gotten the ball rolling!

What I have are shared resources where I should assume that jobs will not run continuously for more than a couple of hours before being evicted and moved to another machine.

So, what I'd really like is a single instantiation of FAHClient (xN, independently) which:
1. Uses a configurable number of cores (check! --cpus n)
2. Analyses a single work unit and exits cleanly with exit code 0 when the job is complete
3. Periodic checkpointing: dumps results to the local disk after a configurable length of time / number of iterations and, here's the catch, I'd like it to exit with, say, exit code 77. The specific exit code is required so that the workflow management system recognises that the job is checkpointing and should be resumed when it matches with a slot again (see the submit-file sketch after this list). I don't expect FAHClient to support this already, but I can fudge it easily enough with a wrapper script :)
4. Brings all the data back with the job to the submit location so that it can be shipped back out when the job restarts
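For context, the HTCondor side would look very roughly like this -- an untested sketch, the file names are placeholders, and I'm assuming the self-checkpointing checkpoint_exit_code support in recent HTCondor releases:
```
# hypothetical submit-file sketch -- names/paths are placeholders
executable              = fah_wrapper.sh
request_cpus            = 4
should_transfer_files   = YES
# ship the client directory out with the job; the details of transferring the
# checkpointed state back still need checking against the HTCondor manual
transfer_input_files    = test-job/
when_to_transfer_output = ON_EXIT
# treat exit code 77 as "checkpointed, reschedule and resume me"
checkpoint_exit_code    = 77
output                  = fah.$(Cluster).$(Process).out
error                   = fah.$(Cluster).$(Process).err
log                     = fah.$(Cluster).log
queue
```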

What I have in my wrapper so far for the CPU-bound jobs is basically (not sure if markdown is going to work here..):
```
mkdir -p test-job
pushd test-job
FAHClient --cpu-usage 100 --cpus 4 --gpu=False \
--exit-when-done \
--max-queue 1 \
--max-units 1 \
--checkpoint 1
popd
```

I've been experimenting with --cycles to configure an effective checkpoint period, but that, of course, means it exits with exit code zero and the system thinks it's complete. What I could do easily enough is test for the existence/non-existence of files that would only be there if the job was *incomplete* and exit appropriately.

So, further questions:
1. Can you tell me what files I should look for when running with --cycles that would indicate an *incomplete* job? Or any files which would indicate a *complete* job?
2. The last time a WU completed, I don't think the --exit-when-done option actually resulted in the process stopping -- it seemed to just sit there. Is that expected and, if so, is there another native way to run FAHClient such that it starts, does some work, and stops?

Finally, regarding caching: I think X-Wing may have understood my intention/request here -- I'd like to download WUs to the submit location and ship them out with the jobs all in one go. I can understand that's probably not a feature available to the public, though, and it should be OK to let each job download its own data (the throughput isn't overwhelming here).

Thank you so much for your work and answers!
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: High Throughput Resources

Post by _r2w_ben »

AFAIK stopping after a work unit is finished is done at the slot level. The work unit will run to completion, get uploaded, and new work will not be requested. FAHClient continues to run.

In the numbered subfolders within the work directory, a file will be created named wuresults.dat when the work unit finishes. It won't exist for long because it's deleted on successful upload. You could potentially watch for the removal of the numbered subfolder.
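If it helps, a wrapper along these lines might work -- purely a sketch, the paths and the exit code 77 are taken from earlier in the thread and I haven't tested it against FAHClient:
```
#!/bin/bash
# rough sketch -- assumes FAHClient is started inside $JOBDIR, so its work/
# directory (with the numbered WU subfolders) lives underneath it
JOBDIR=test-job
mkdir -p "$JOBDIR"
cd "$JOBDIR" || exit 1

FAHClient --cpu-usage 100 --cpus 4 --gpu=false \
          --exit-when-done \
          --max-units 1 \
          --cycles 1000     # bound the run so it fits the eviction window

# if a numbered subfolder still exists under work/, the WU was not finished
# and uploaded, so ask the workflow manager to checkpoint/resume this job
if ls -d work/[0-9]* >/dev/null 2>&1; then
    exit 77
fi
exit 0
```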
Joe_H
Site Admin
Posts: 7943
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: High Throughput Resources

Post by Joe_H »

One addendum to what Jesse_V mentioned about checkpoints: the client can set a value between 3 and 30 minutes. This only applies to folding using the CPU core, and does not apply to folding on a GPU.

For GPU folding the checkpoint frequency is set by the researcher and is typically between 2-5% of the progress. Each project will have its specific checkpoint value set, so all WUs from that project will be the same. The checkpoint frequency may also be shared by a group of similar projects that examine variants of a specific protein system.

At a checkpoint the GPU core writes out some data needed for the later analysis, runs a sanity check on the progress in a CPU process, and writes the actual checkpoint. Then it continues processing until the next checkpoint.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: High Throughput Resources

Post by bruce »

X-Wing wrote:I'm just a normal community member, so I can't help with the first one, but based on what I have seen on the forums over the last few years, they typically don't like pre-downloading (caching) WUs. There are a couple of posts out there about why that is bad, though COVID-19 has changed a lot, so you could always ask. Given the reply above, though, I fear I may be misunderstanding your question. I also didn't know about the 90% thing, thanks Jesse!
WU caching is still prohibited, both by fiat and by the Quick Return Bonus. A WU which has been cached is worth significantly less than the same WU returned more quickly. The servers note the time a WU is issued and subtract it from the time it's returned, and a significant amount of the bonus is based on minimizing that total time. Even the 90% early download has a negative effect on the points you earn.
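(If I remember the points FAQ correctly, the bonus multiplier is roughly max(1, sqrt(k * deadline_length / elapsed_time)), with k a per-project bonus factor and elapsed_time measured from assignment to return -- so any time a WU spends sitting in a cache eats directly into the bonus.)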

Then, too, the servers are bandwidth-limited almost 24 hours per day. It would be pretty tricky to pre-download WUs in whatever brief window exists in the middle of the night -- and uploading the results faces the same problem of saturated bandwidth. (Search the forum for "slow uploads".)
jclark.astro
Posts: 3
Joined: Sat Mar 21, 2020 5:53 pm

Re: High Throughput Resources

Post by jclark.astro »

Thank you again for the notes.

Bit of progress: jobs are running and completing OK, but uploads fail with:
01:39:40:WU00:FS00:Connecting to 155.247.166.219:8080
01:39:40:All slots are done, exiting
01:39:40:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
01:39:41:Clean exit

I don't know how FAHClient wants to connect, but FWIW I cannot ping from inside the container with the current settings:
Singularity> ping 155.247.166.219
ping: socket: Operation not permitted

I assume these are related issues - look familiar to anyone?
gw666
Posts: 14
Joined: Thu Apr 09, 2020 8:53 am

Re: High Throughput Resources

Post by gw666 »

jclark.astro wrote:I don't know how FAHClient wants to connect, but FWIW I cannot ping from inside the container with the current settings:
Singularity> ping 155.247.166.219
ping: socket: Operation not permitted
Ping inside the container is the wrong test: the ping binary needs CAP_NET_RAW (or the setuid bit), so unless you start your Singularity container as root, ping cannot be used.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: High Throughput Resources

Post by bruce »

How about opening http(s)://155.247.166.219? Can you see the server's landing page?
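If curl or wget is available inside the container, that's also a more meaningful connectivity test than ping -- for example (port 8080 is taken from your log; adjust as needed):

Singularity> curl -sI http://155.247.166.219:8080/

If that returns any HTTP response at all, basic connectivity out of the container is fine and the failed upload is more likely the server being overloaded.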
HaloJones
Posts: 906
Joined: Thu Jul 24, 2008 10:16 am

Re: High Throughput Resources

Post by HaloJones »

Ping is also unlikely to get all the way to the work servers.

Being able to download multiple units to a local work server would be great for keeping your machines busy, but it would be terrible for the science. It's never going to be allowed, sorry.
single 1070
