
Hosting and new servers spin up time

Posted: Mon Mar 16, 2020 10:08 pm
by codecaine
Hello,

I work in DevOps with a particular focus on highly volatile workloads, and it seems I could really help FAH with both the AS and WS issues. I would love to volunteer some time to help WUs get promptly assigned and collected, given the large influx we have seen for COVID19 and that people are partly blaming the infrastructure. @Nathan_P said that it's taking days to spin up new servers (viewtopic.php?f=16&t=32551), which in my experience is extremely long. I would expect around 3 hours to build just a server image, and the slowness is likely due to not having the infrastructure codified in something like Terraform (Ansible, Chef, or Puppet can also be used). I see here (viewtopic.php?f=38&t=32641) that one of the AS systems is already in the cloud. Once system images are created, new servers can come online in 1 - 2 minutes.

I honestly believe I can set all of this up for a new server in AWS, including an Ansible script (which can be used even on offline and totally independent networks), in 12 - 24 hours. That system would remain cost effective and could auto scale to handle 50,000 or more users' requests within 1 - 2 minutes. To give you my background: I do this type of work for a company, and I'm very interested in FAH. The company I work for needs these types of networks for things like "sales" on a website during on-air TV promotions. These systems go from a normal 100-user workload to tens of thousands of users in a minute, and then back down to normal size (if requests drop off) within 2 hours, tapering down every 5 to 10 minutes. I think FAH could use this kind of setup, given that I see assign rates like "10,771.42/hr" on https://apps.foldingathome.org/serverstats. When the work is done, the system scales back down to small instances based on incoming requests, to save money. To be honest, there are 3 systems reporting close to 10,800 assigns per hour, and users are still reporting their GPUs going hours without work. This means that it's not scaling to what you need; the real need is probably somewhere in the range of 50,000 to 75,000 assigns per hour. The main problem is that we shouldn't build server systems to always run at this high a rate 24/7. When a burst happens, the system should be able to scale up to handle the influx, and then scale back down to normal operations once it passes, to be cost effective.
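
To make the sizing argument above concrete, here is a minimal Python sketch of the arithmetic. It assumes, without any confirmation from FAH, that assign capacity grows linearly with the number of servers and that each server sustains roughly the "10,771.42/hr" rate quoted from the server stats page. The function names and the one-server-per-interval taper are hypothetical illustrations, not FAH's or AWS's actual scaling policy.

```python
import math

# Assumption: per-server assign rate taken from the figure quoted in the post.
PER_SERVER_RATE = 10_771.42  # assigns per hour


def servers_needed(target_assigns_per_hour: float, per_server_rate: float) -> int:
    """Smallest server count whose combined rate meets the target.

    Assumes capacity scales linearly with server count (unverified for FAH).
    """
    if per_server_rate <= 0:
        raise ValueError("per_server_rate must be positive")
    return max(1, math.ceil(target_assigns_per_hour / per_server_rate))


def taper_down(current: int, desired: int, max_step: int = 1) -> int:
    """Next server count for one scaling interval.

    Scale out immediately when demand rises, but remove at most
    max_step servers per interval when demand falls.
    """
    if desired >= current:
        return desired
    return max(desired, current - max_step)


if __name__ == "__main__":
    for target in (50_000, 75_000):
        count = servers_needed(target, PER_SERVER_RATE)
        print(f"{target} assigns/hr needs about {count} servers")
```

Under these assumptions, the 50,000 to 75,000 assigns per hour range works out to roughly 5 to 7 servers at the quoted rate, compared with the 3 reporting now. Calling taper_down once per 5 to 10 minute interval would give the gradual scale-down described in the post.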

Copied info from: viewtopic.php?f=16&t=32551 for clarity
Requirements mentioned from @Nathan_P:
Fast I/O - check - uses SSD drives (don't need 100TB here, see last item)
Fast network connection - check - depends on server size, and AWS also has specific instance types for high network workloads such as this one, for example the "c5n" EC2 instance
Tons of fast, scalable and highly available storage - check - can mount an EFS drive using the NFS 4.1 protocol, with encryption at rest and TLS in transit. Can also apply a policy based on when data was last used to save on storage cost.

@toTOW responded (viewtopic.php?p=312968#p312968) that work generation has higher priority than serving work, and that connection errors do not necessarily mean there is an issue, as the server is likely generating work units. I understand the point, but the server stats page (https://apps.foldingathome.org/serverstats) also shows that there are still 4,000 WUs per project that can be assigned. I would like to learn more about these requirements.

Cheers,

P.S. - Please don't read into this thinking I'm trying to promote the company I work for. This is science, and if people have GPUs going unused for hours, I don't like the response "getting better slowly". I'm patient, but I also know how to fix this problem. What I care about is science, a great FAH experience for the influx of new users, and helping solve COVID19 and other diseases. It just so happens that I'm good at this one thing FAH seems to need, and I want to tell you why it is a better way, with an example to back it up.

Re: Hosting and new servers spin up time

Posted: Mon Mar 16, 2020 11:09 pm
by davidcoton
A bit of background. F@H comes from an academic biomedical setting, where professional programming resources are not always affordable. The server infrastructure has been adequate for several years, with each new project able to spin up its server and integrate it into the overall project. Suddenly F@H has (for good reasons) been heavily promoted over several channels, so the number of clients seeking work has spiked to an unprecedented and unpredicted level. To keep up, F@H needs to mobilise new hardware, deploy servers, configure them for COVID19 projects, and generate the work units. It is not as easy an operation as it could be, because there has been little investment in making it easy (there were always higher-priority tasks). There are discussions about acquiring cloud-based resources (with limited or no budget), and people are working long hours to prep new projects.

It will all happen, but not overnight. Two nights might do it ;). While I'm sure those involved will appreciate your offer and all the others, they are just a bit busy at present and may not be able to respond quickly.

Re: Hosting and new servers spin up time

Posted: Wed Mar 18, 2020 1:56 pm
by codecaine
Good to know about the background. I'm here to volunteer some time to help make this easier. Even if FAH doesn't take all of my suggestions or let me do that (maybe for security reasons - who knows) the suggestions still apply, and are cost effective. I want to help to make it easy. Things like IT server setup are probably of little interest to the researchers. A responsive system can go a long way, and I don't mean "responsive" as just "super fast".

Re: Hosting and new servers spin up time

Posted: Wed Mar 18, 2020 3:23 pm
by codecaine
Even if you wanted to just stick to what you're doing now, the #1 thing that could cut 80% of the setup workload is Ansible. All the teams could use it. It's Python, so no additional scripts are needed to run it, and OSes like Ubuntu and others have it in their core. I'd be willing to write the Ansible script for FAH to cut server updates from 24 - 48 hours to 1 - 3 hours. It makes the process much easier, and it won't matter which server you want to apply it to. It can be used even on servers not in a cloud provider (which is what FAH is doing now), and it's free!

Re: Hosting and new servers spin up time

Posted: Fri Mar 20, 2020 12:36 am
by toTOW
Thanks for your input on the subject. I asked for an official answer, I hope someone will get in touch with you.

To complete your input data, here is what takes most of the time in the situation we have now:
- getting new server hardware: finding funds or partners that will provide free resources is a long and complicated process.
- setting up projects and validating them also takes a considerable amount of time. I'm not talking about generating WUs, but about building the simulation and equilibrating it so that it can "live".

Re: Hosting and new servers spin up time

Posted: Fri Mar 20, 2020 3:03 am
by sukritsingh
Hi! I appreciate your support and patience, and thanks so much for your offer and advice on how we could configure things! If you want to talk more about what specifically you might contribute, shoot me an email at sukrit.singh@wustl.edu and we can connect and discuss!

Re: Hosting and new servers spin up time

Posted: Sat Mar 21, 2020 3:09 pm
by codecaine
Thank you @toTOW and @sukritsingh.

I believe I can help with a good portion, but maybe not all, since I'm not a biochemist. I would love to learn more about the setup and validation of the simulations.
With the IT and cloud hosting, I believe I could still be of value even though more people are now consistently folding. It's a good thing that we see fewer posts about this.

@sukritsingh
I sent you an email. Let's talk.