Connor_Horak wrote:It seems some of my projects are not sending and I am not receiving new projects. I have started a new team (243339) and I have processed twice as many Work Units but have half the points of the other team member. I have a significant leg up on him in terms of both CPU and GPU processing power. Is everything just overloaded because of COVID-19, or is there something I can do to increase my machines' productivity?
Unless you see something other than "Failed to get assignment from 'xxx': No WUs available for this configuration" in your logs, I'd say that's it.
Also note that the stats servers are quite behind on the data that they show.
For the log, we do not need the entire thing. The beginning section, about 200 lines, which shows the software, hardware, and folding setup info, along with the section that shows the problem, is enough.
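If it helps, here is a rough Python sketch of how to pull just those parts out of a log before posting (the file name is only an example; point it at wherever your client writes its log):

# Grab the first ~200 lines of a Folding@Home client log plus any later
# WARNING/ERROR lines, so a forum post stays short but still useful.
LOG_PATH = "log.txt"  # example path only; adjust for your own install

with open(LOG_PATH, "r", errors="replace") as f:
    lines = f.readlines()

header = lines[:200]                    # software, hardware, and slot setup info
problems = [l for l in lines[200:]      # anything that looks like trouble later on
            if "WARNING" in l or "ERROR" in l]

print("".join(header + problems))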
There have been some problems with the stats keeping up; a big kick was given to the server yesterday evening to get some things going again, so there may still be some delay in posting. The other question is whether you have a passkey entered into the client, because that does affect the points.
Beyond that, yes, everything is just overloaded. The request rate for WUs was less than 10,000 an hour two weeks ago; now the rate of sending out work can reach 100,000 an hour at times, and even then there are more requests than can be filled. They are working on making more servers available, but that takes a bit of time to set up.
And yes, you can use the same user ID and passkey on multiple machines.
I kind of figured. This is what I'm seeing in the log:
22:15:23:WARNING:WU03:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
22:15:23:ERROR:WU03:FS00:Exception: Could not get an assignment
fangfufu wrote:Have you guys thought about contacting Linus Tech Tips? They are literally holding a competition for running Folding@Home. Linus himself has a few servers with 100TB of storage space.
I'm happy to donate server resources. I can host a work unit server, for instance, on a machine with 40 TB of space and a gigabit of bandwidth. I also have quite a few servers that can handle large amounts of email; I understand that both the foldingforum.org email and the passkey email have had issues, so I'd be happy to host and/or help with email servers.
I don't see any discussions about offering these kinds of services, though... Is this an appropriate place? Is another place more appropriate?
sukritsingh wrote:Hi all! If you are interested in contributing hardware or server side power to our effort, please reach out to me at sukrit.singh@wustl.edu.
Generally the minimum requirements we have for setting up a work server are:
Debian Linux machines with 8GiB RAM, 100TiB SSD storage, and 1GiB/s network.
If you think you could help out please send me an email! Thanks so much for your support and help!
Oof. I only have 1Gb/s and can't afford that much SSD storage. A gibibyte is about 8.59 gigabits, right? Maybe I should sell my car and ask my ISP for 8 additional connections.
I'm not sure what parts of sukritsingh's post were typos, but my first thought was the SSD space. Consumer SSDs run around $100/TB, so 100TB of SSDs would cost around $10,000. Plus, you would need servers to support the roughly 100 SATA connections (say, 4 SANs?). With Azure you would need 4 of their E80 (32 TB Standard SSDs) at a cost of $9830 a month.
Other than that, I have a few dozen TB of space and a 1Gbps connection.
I'm sure sukritsingh is overworked right now! If the reqs are a Debian Linux machine with 8GiB RAM, 100GB SSD storage, and a 1Gb/s network, I could see quite a few more donated WS.
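For anyone who wants to check the numbers above, here is a quick Python sanity check (the $/TB and Azure figures are the estimates from the posts, not official pricing):

# Rough check of the bandwidth and cost arithmetic discussed above.

# 1 GiB/s expressed in gigabits per second (hence the "8 additional connections" joke):
gib_in_bits = 8 * 1024**3            # 8,589,934,592 bits in one gibibyte
print(gib_in_bits / 1e9)             # ~8.59, so ~9 x 1Gb/s links to carry 1 GiB/s

# Consumer SSD estimate: ~$100 per TB for 100 TB of drives
print(100 * 100)                     # ~$10,000 in drives alone

# Azure estimate from the post: 4 x E80 disks of 32 TB each
print(4 * 32)                        # 128 TB of provisioned SSD for the quoted ~$9830/month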
lazyacevw wrote:I'm not sure what parts of sukritsingh's post were typos, but my first thought was the SSD space.
WU response could likely be better handled by Azure's Blob storage APIs. That permits significantly cheaper $/GB and allows you to offload the IO performance scaling onto Azure.
Plugging into their pricing calculator: 100TB with 86,400,000 (120,000 WU assignments/hr * 24 * 30) write operations and twice as many read operations comes out to $3017/mo with $2087 of that being the raw cost of storage itself.
I'm not sure how long WUs are kept around or how frequently they are pulled, but if we can move the storage down to a cold or archival tier, that price can go down a lot.
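For reference, the operation counts in that estimate work out like this (the dollar figures come from the pricing calculator and aren't something a snippet can verify):

# Operation counts behind the estimate above, using the ~120,000 WU/hr
# assignment rate mentioned earlier in the thread.
assignments_per_hour = 120_000
writes_per_month = assignments_per_hour * 24 * 30
reads_per_month = 2 * writes_per_month

print(writes_per_month)   # 86,400,000 write operations per month
print(reads_per_month)    # 172,800,000 read operations per month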
The 100 TB in his post was not a typo, just a simplification of a requirement list I have also seen, which stated 50-100 TB.
Multiple active projects can go through a lot of storage, and quickly. As needed, they transfer the data off to other storage; some of that movement is to other machines for analysis of the data. They do keep the actual WU data around for a long time, partly so that other researchers can look at it if they want.
That was my point exactly. I could put up 100TB, but it would be on 7200 RPM drives. I only have about 10TB of SSD-based storage. I'm not a millionaire!
I would like to throw in my 2 cents here. I know my post may not hold much weight, as it's my first ever post, but I've never needed to post before; Folding@Home has always just worked for me. I have skimmed a majority of this thread, although I will admit I have skipped a few posts, so apologies if this has already been covered before...
I personally think that it's time for a wider re-architecture of Folding@Home's backend to make it more "cloudy". Simply building assignment and work servers in various places is fine, although the requirements for these are quite heavy and expensive; consuming other cloud services may be more effective and would allow Folding@Home to scale horizontally in a better fashion.
My current understanding (a rough sketch of this flow follows below):
- The client makes a request to the assignment server, which provides a work server in round-robin fashion.
- The client requests work from the work server provided by the assignment server.
- The client does the work and sends the results back to the work server.
- Work servers need huge disks to store the work units and the results.
- Work servers at the moment are highly strained due to the number of new donors, which causes work units to stop being handed out while clients sit doing nothing.
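To make that flow concrete, here is a rough Python sketch; the server names and project number are made up for illustration, and this is not how the real client actually talks to the servers:

import random

# Stand-ins for the real assignment server (AS) and work servers (WS).
WORK_SERVERS = ["ws1.example.org", "ws2.example.org", "ws3.example.org"]

def assignment_server_pick():
    # Round-robin-ish: hand the client one work server address.
    return random.choice(WORK_SERVERS)

def work_server_get_wu(ws):
    # The work server hands out a WU, or None when it has nothing for this configuration.
    return {"server": ws, "project": 12345, "payload": b"...protein state..."}

def client_cycle():
    ws = assignment_server_pick()      # 1. ask the AS which WS to use
    wu = work_server_get_wu(ws)        # 2. ask that WS for a work unit
    if wu is None:
        print("No WUs available for this configuration")   # what donors are seeing now
        return
    result = len(wu["payload"])        # 3. "fold" (stand-in for the real computation)
    print(f"Returning result {result} to {wu['server']}")  # 4. upload to the same WS

client_cycle()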
My observations:
- Although Folding@Home is an amazing distributed system from the client/compute perspective, the backend is largely old-school in architecture.
- Although you can scale work servers, they have heavy requirements, making them unattractive.
My thoughts (sketched in code after this list):
- Work units and results need to be stored on some form of distributed, HA object store such as S3, Swift, or Ceph.
- Pending work units should be queued using a proper queue mechanism such as a RabbitMQ cluster.
- Instead of having many work servers, each with distinct jobs and storage, these should be replaced by a stateless work microservice which interfaces with said queue and object store in a load-balanced pool.
- When the load gets heavy, work API nodes can be scaled horizontally and quickly based on demand.
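A minimal Python sketch of that stateless work-API idea follows; an in-memory queue stands in for RabbitMQ and a dict stands in for S3/Swift/Ceph, so this only shows the shape of the proposal, not any real F@H code:

import queue
import uuid

object_store = {}                 # key -> bytes (stand-in for a blob/object store)
pending_wus = queue.Queue()       # stand-in for a durable RabbitMQ queue

def enqueue_work_unit(payload):
    # A researcher-side process drops a WU into the object store and queue.
    key = f"wu/{uuid.uuid4()}"
    object_store[key] = payload
    pending_wus.put(key)
    return key

def assign_work():
    # Stateless handler: any instance behind the load balancer can serve this.
    try:
        key = pending_wus.get_nowait()
    except queue.Empty:
        return None               # same "no WUs available" case, just centralised
    return {"key": key, "payload": object_store[key]}

def return_result(key, result):
    # Results land back in the object store, not on one specific work server.
    object_store[key + "/result"] = result

enqueue_work_unit(b"example WU data")
job = assign_work()
if job:
    return_result(job["key"], b"example result")
print(sorted(object_store))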
Of course, this comes with a cost. I'd hope that the likes of AWS / Azure / GCP would donate the cloud resources to make this happen, but if that isn't possible, these large providers often aren't the most cost-effective, and a small in-house OpenStack cloud could deliver the same thing if enough hardware can be purchased to build it out.
I know the donor spike is likely to tail off after COVID-19 has been resolved; however, a cloud-based work unit management layer that scales properly and in a cloudy fashion would allow Folding@Home to shrink back to its previous size (automatically, if desired) and then scale back up again when we need it in the future.
I have very limited time available at the moment; however, I would be more than happy to provide architectural/cloud advice as required, although I may not be able to do much hands-on work.
robputt wrote:
I personally think that it's time for a wider re-architect of Folding@Home's backend to be more "cloudy". Simply building assignment and work servers in various places is fine although the requirements for these are quite heavy and expensive, consuming other cloud services may be more effective and allow Folding@Home to horizontally scale in a better fashion.
While I definitely agree, and at the same time understand that the team is still putting out fires, I feel that with limited effort it should be possible to add some simple load-balancing logic (to replace the round-robin strategy) to the assignment servers. For example, on the AS, restrict the number of assignments to WS X to Y clients over period Z. The limits could differ for each WS depending on its capabilities, so that, for example, older/more restricted servers would only get say 500 requests per 2 minutes, while higher-performance ones might be able to handle 1000 requests per 2 minutes. (Obviously I don't know actual reasonable numbers here, but the team should have a fair grasp of the average time needed to serve each client and thus be able to calculate the optimal utilization.)
I guess I'm not sure the clients could currently handle a potential "no available WS" reply from an AS, but a hacky way around it could simply be to assign the client to 127.0.0.1 and let it time out/reject itself.
(Then there should really be an upper limit on the client retry timer, but since that would probably require a new client, it sounds like more effort with limited impact in the current situation.)
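To illustrate the per-WS limit idea, here is a rough Python sketch using a fixed window; the server names and capacities are invented, and the real AS would need real numbers:

import time

# Assignments allowed per 120-second window, per work server (made-up values).
WS_CAPACITY = {
    "ws-old.example.org": 500,
    "ws-fast.example.org": 1000,
}
WINDOW_SECONDS = 120
_window_start = time.monotonic()
_counts = {ws: 0 for ws in WS_CAPACITY}

def pick_work_server():
    # Return a WS that still has headroom this window, else the reject address.
    global _window_start, _counts
    now = time.monotonic()
    if now - _window_start >= WINDOW_SECONDS:      # start a fresh window
        _window_start, _counts = now, {ws: 0 for ws in WS_CAPACITY}
    for ws, cap in WS_CAPACITY.items():
        if _counts[ws] < cap:
            _counts[ws] += 1
            return ws
    return "127.0.0.1"                             # let the client time out and retry

print(pick_work_server())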
robputt wrote:I would like to throw in my 2 cents here... Rob
Excellent analysis. Unfortunately, without people forming teams and working over time to implement it, this probably won't happen anytime soon. As alxbelu mentioned, the current team only has enough in-house resources to put out fires. There is some hope that partnerships might be formed with Azure, AWS, etc., where they could donate a few short-term teams to help implement such an architecture, but that would be like winning the lottery. The only surefire way forward is to donate resources as best you can. I am also slammed at work, so I am not in a position to volunteer right now. In the meantime, all I can do is spin up 64-core servers to replace individual clients in an effort to reduce the number of client requests, and try to help out in the forums.
Interested in helping us develop and enhance the F@h architecture? Come join the development team at our F@h Fireside Chat on Thursday, April 9th, 4-5pm EDT! To receive a Discord invite link, fill out this form: https://tinyurl.com/firesidedev