Hosting and new server spin-up time
Posted: Mon Mar 16, 2020 10:08 pm
Hello,
I work in DevOps with a particular focus on high-volatility workloads, and it seems I could really help FAH here with both AS and WS issues. I would love to volunteer some time to help WUs get promptly assigned and collected given the large influx we have seen for COVID-19, since people are blaming the infrastructure somewhat. @Nathan_P said it's taking days to spin up new servers (viewtopic.php?f=16&t=32551), which in my experience is extremely long. I would expect this to take around 3 hours just to build a server image, and I suspect the slowness comes from not having the infrastructure codified in something like Terraform (Ansible, Chef, or Puppet would also work). I see here (viewtopic.php?f=38&t=32641) that one of the AS systems is already in the cloud. Once system images are created, new servers can come online from them in a matter of 1 - 2 minutes.
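To make that concrete, here is a minimal sketch of what launching a work server from a pre-built image looks like with AWS's boto3 SDK. The AMI ID, key pair, and security group below are placeholders I made up for illustration, not anything FAH actually runs:

```
# Minimal sketch: launch a pre-baked work-server image on EC2.
# All identifiers here (AMI ID, key name, security group) are
# hypothetical placeholders, not real FAH infrastructure.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # pre-built WS image (placeholder)
    InstanceType="c5n.4xlarge",        # network-optimized family, see below
    MinCount=1,
    MaxCount=1,
    KeyName="fah-ws-key",              # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "fah-work-server"}],
    }],
)
print("Launched:", response["Instances"][0]["InstanceId"])
```

Because all the software is already baked into the image, the time from this call to a serving instance is the 1 - 2 minutes I mentioned, not days.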
I honestly believe I could set all of this up for a new server, including an Ansible playbook (which could be reused even for offline, totally independent networks), in 12 - 24 hours on AWS. The resulting system would remain cost effective and could auto-scale to 50,000 or more user requests within 1 - 2 minutes. To give you my background: I do this type of work for a company, I'm very interested in FAH, and my employer needs these kinds of networks for things like sales spikes when on-air TV promotions drive traffic to a website. Those systems go from a normal 100-user workload to tens of thousands of users in a minute, then back down to normal size (if requests drop off) within 2 hours, tapering down every 5 to 10 minutes. I think FAH could use this kind of setup, given assign rates like "10,771.42/hr" on https://apps.foldingathome.org/serverstats. When the work is done, the fleet scales back down to small systems based on incoming requests, to save money. To be honest, there are 3 systems reporting close to 10,800 assigns per hour, and there are users still reporting going hours without GPU work. That means the system isn't scaling to what you need, which is probably somewhere in the range of 50,000 to 75,000 assigns per hour. But the point is that we shouldn't build server systems to run at that rate 24/7. When a burst of requests hits, the system should scale up to handle the influx, then scale back down to normal operations once it's over, to stay cost effective.
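As a sketch of the scale-up/scale-down behavior I'm describing, here's roughly what a target-tracking policy on an EC2 Auto Scaling group looks like in boto3. The group name, capacity bounds, and CPU target are assumptions for illustration:

```
# Sketch: request-driven scale-out/scale-in for a fleet of work servers.
# The Auto Scaling group name and capacity bounds are made-up examples.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Bound the fleet: small at idle, large enough for a burst-scale influx.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="fah-ws-fleet",  # placeholder group name
    MinSize=1,
    MaxSize=20,
    DesiredCapacity=1,
)

# Target tracking: AWS adds instances when average load passes the target
# and removes them as requests taper off, which gives the 5-10 minute
# ramp-down behavior described above.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="fah-ws-fleet",
    PolicyName="fah-ws-target-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,  # keep fleet-average CPU near 60%
    },
)
```

The key design point is that nobody has to be awake to react to a burst: the policy itself grows and shrinks the fleet, so you only pay for the big footprint while the influx lasts.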
Copied info from: viewtopic.php?f=16&t=32551 for clarity
Requirements mentioned by @Nathan_P:
Fast I/O - check - EC2 instances use SSD-backed storage (you don't need 100 TB locally here; see the last item)
Fast network connection - check - depends on instance size, and AWS also offers instance families built for network-heavy workloads like this one; take the "c5n" EC2 family, for example
Tons of fast, scalable, highly available storage - check - you can mount an EFS volume over the NFS 4.1 protocol, with encryption at rest and TLS in transit, and apply a lifecycle policy that moves files to cheaper storage based on when they were last accessed (see the sketch after this list).
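For that last item, here is a rough sketch of provisioning such an EFS volume in boto3: encryption at rest is a create-time flag, and the lifecycle policy is the standard 30-day infrequent-access transition. The creation token is a placeholder I invented:

```
# Sketch: encrypted, elastic NFS 4.1 storage with a cost-saving
# lifecycle policy. The creation token is a made-up placeholder.
import boto3

efs = boto3.client("efs", region_name="us-east-1")

fs = efs.create_file_system(
    CreationToken="fah-ws-storage",   # placeholder idempotency token
    PerformanceMode="generalPurpose",
    Encrypted=True,                   # encryption at rest
)
fs_id = fs["FileSystemId"]

# Files untouched for 30 days move to the cheaper Infrequent Access class.
efs.put_lifecycle_configuration(
    FileSystemId=fs_id,
    LifecyclePolicies=[{"TransitionToIA": "AFTER_30_DAYS"}],
)

# Each server then mounts this volume over NFS 4.1; TLS in transit is
# handled on the instance side by the EFS mount helper.
print("File system ready:", fs_id)
```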
@toTOW responded (viewtopic.php?p=312968#p312968) that work generation is a higher priority than serving work, and that connection errors don't necessarily mean there is a problem, since the server is likely busy generating work units. I understand the point, but the server stats page https://apps.foldingathome.org/serverstats still shows around 4,000 WUs per project available to assign out. I would like to learn more about these requirements.
Cheers,
P.S. - Please don't read this as me trying to promote the company I work for. This is science, and if people have GPUs sitting unused for hours, I don't like "getting better slowly" as the answer. I'm patient, but I also know how to fix this problem. What I care about is the science, a great FAH experience for the influx of new users, and helping solve COVID-19 along with other diseases. It just so happens that I'm good at the one thing FAH seems to need, and I want to explain why this is a better way, with examples to back it up.