
Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 4:10 pm
by aaronbrighton
TMStech wrote:It's true that 90% of everything - including unsolicited advice from noobs - is crap. Often for reasons that they really would understand, if they had been here for as long as you have. But no system is perfect, and rejecting all input out of hand (because you "know" you're already doing it the best way) means rejecting any chance to make something good into something great.
(clap)

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 4:28 pm
by bruce
The management team at FAH has been in negotiations with various potential resource suppliers. Not surprisingly, they don't talk about that publicly.

I have not seen anything like the AntiMastodon comments you reported. Did I miss them?
Just keep meeting their offers to help scale up with meaningful silence.
I understand that's your perspective.

I do consider it part of my responsibility here to pass on certain personal observations to individuals in FAH's management while not flooding them with unnecessary emails. (I do have personal contacts with many of them and generally understand who's who... but I don't presume that my judgement is perfect. They do have other sources.) Nevertheless, I DO NOT repeatedly say "That's a great offer ... I'll pass that on to management."

It should be noted that "silence" is not the same as "inaction".

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 4:58 pm
by Harmin
Also, it should be kept in mind that I have only seen about 2 admins active these days. They are just human like the rest of us. Keep calm and keep folding; if they need help, they will probably either ask around or contact the people that offered said help.

Personally, I don't know enough about this stuff (I am a chemist, not a biochemist, big difference) and not enough about the tech, so I will just keep my PC on and see what comes my way. If it isn't always active, it just means they are at capacity for the moment, and I see that as a good thing.

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 5:58 pm
by aaronbrighton
bruce wrote:It should be noted that "silence" is not the same as "inaction".
How would outsiders or the individuals offering advice/suggestions have any clue whether silence means inaction or not? It's only natural when someone doesn't get a response from an organization they know little about, that they'd assume the message was either a) not seen, or b) ignored. A simple acknowledgement can go a long way. I'm not saying that's your responsibility, but those who own the project really should acknowledge these offers -- especially if this is the official forum.

Communication and transparency are important, especially when people are offering up their valuable time, resources, and in some cases $$ to help the project. Thus far, officially, the news page on the site has a post only once every 5 days or so, and hasn't addressed the capacity issues and stats API issues many of us have been struggling with -- the communication would indicate "inaction" in those areas.

It's 2020, and from a technologist's perspective, I see a highly distributed computing system (and an extremely valuable resource to many groups who are less fortunate) which seems (looking from the outside) to have a number of technical bottlenecks in core infrastructure compute, bandwidth, and storage capacity, and whose resources are being underutilized. The limited communications that I have been able to find via individuals on Twitter and this forum indicate:

  • Major cloud provider donated enormous GPU resources
  • WU servers can't keep up, trying to provision more -- possible storage limitations as well.
  • Can't produce work fast enough for the network to work through.

The first indicates a willingness on the part of cloud providers to help out. The second really shouldn't be a problem in 2020 if systems are set up in the cloud to scale on demand, and based on some very limited calculations the actual WU servers shouldn't need that much compute/memory (8-16 requests/second by my calculations) if they're detached from the storage/high-bandwidth requirements, which should be offloaded directly to an object store rather than funneled through the WU app servers. The stats API... that should be a really easy problem to solve unless there are some serious issues with how the stats data is actually stored/catalogued.
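
To illustrate what I mean by offloading to an object store directly, here's a rough sketch (not how FAH's servers actually work, just the general pattern; the bucket name and key layout below are made up): the app server's only job is to hand the client a short-lived presigned URL, and the client then pushes its large result straight to the object store, so none of that traffic goes through the app server.

Code: Select all

import boto3

def issue_upload_url(bucket, work_unit_id, expires_seconds=3600):
    # Return a short-lived presigned URL the client can PUT its result to,
    # so the large upload bypasses the app server entirely.
    s3 = boto3.client('s3')
    return s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': bucket, 'Key': 'results/' + work_unit_id + '.dat'},
        ExpiresIn=expires_seconds,
    )

# Hypothetical usage with a made-up bucket and WU id:
print(issue_upload_url('fah-wu-results', 'project1234-run0-clone1-gen2'))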

Anyway, I understand... Growing pains, small team, mostly volunteers, largely focused on what's important: producing work for the network to crunch. Just hoping that moving forward, the resources being allocated to those bottlenecks are deployed in the most optimal way, to make the project much more versatile in the future. The cloud tech options may be unfamiliar to the core team, but that doesn't eliminate their merits, so it's worthwhile engaging (volunteer) experts in this area -- even if it isn't those active on this forum. If the communications indicated this is happening, it would probably satisfy many of us who have concerns about how the project is moving forward in these areas.

I'll leave my thoughts at that. I'm going to shut down my cloud NVIDIA Tesla T4 GPU servers dedicated to this project, as they've been getting "Empty work server assignment" and have been idle for nearly the past 2 days.

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 6:17 pm
by alxbelu
aaronbrighton wrote:I'll leave my thoughts at that. I'm going to shut down my cloud NVIDIA Tesla T4 GPU servers dedicated to this project, as they've been getting "Empty work server assignment" and have been idle for nearly the past 2 days.
I agree with most of what you're saying, though I think it might have been even more important to be able to quickly deploy a comms team to manage the new community, giving the core operations team some breathing room to figure out the technical/practical bottlenecks. As it is, I'm guessing basically the regular team has been under something similar to a DOS attack as well with suggestions and proposals for collaborations...

In any case, what I really wanted to ask you was which cloud provider you used; I too was considering donating some extra power that way, but when I realized how overloaded the infrastructure was, I held off. Once the bottlenecks have been solved though I'd still like to contribute more, and I'd like to spin up such a solution.

edit: also just wanted to clarify that I think the volunteer mods/managers and old-timers have overall done a great job here; however, there are simply too few of them for the time they can collectively muster, and perhaps more importantly, the various channels (website, forum, social media) could have used a more cohesive response.

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 6:47 pm
by aaronbrighton
alxbelu wrote:In any case, what I really wanted to ask you was which cloud provider you used; I too was considering donating some extra power that way, but when I realized how overloaded the infrastructure was, I held off. Once the bottlenecks have been solved though I'd still like to contribute more, and I'd like to spin up such a solution.
Certainly! I'm using AWS. I've been running a few different instance types:

  • g4dn.xlarge - Started with this; Nvidia Tesla T4 GPU -- was netting me around 100,000-150,000 credits/day/server when it was being used.
    Nvidia Tesla T4

  • p2.xlarge - Tried this as it wasn't clear to me whether FAH would benefit from better double-precision operations or not; the answer is no, so I went back to g4dn.xlarge.
    Nvidia K80

  • c5.4xlarge - Switched some to this yesterday as the GPUs have been running idle and there are still some CPU WUs; netting me around 15,000-20,000 credits/day/server.
    16*Cores on 2nd generation Intel Xeon Scalable Processors (Cascade Lake) @ 3.4GHz

  • c5.9xlarge - Switched the rest to this now that the GPUs have been idle for nearly 2 days and there are still some CPU WUs; netting me around 25,000-30,000 credits/day/server.
    32*Cores on 2nd generation Intel Xeon Scalable Processors (Cascade Lake) @ 3.4GHz

I wrote a quick Python script (which I'm running in an AWS Lambda function for simplicity's sake) to ascertain the cheapest spot-price region/availability zone for a given instance type:

Code: Select all

import boto3

def lambda_handler(event, context):

    # Instance type to price; edit as needed (g4dn.xlarge, p2.xlarge, c5.4xlarge, c5.9xlarge, ...)
    instanceTypeToPrice = 'g4dn.xlarge'

    # Get the list of regions
    ec2 = boto3.client('ec2')

    # Map of current spot price -> availability zone
    report = {}
    for region in ec2.describe_regions()['Regions']:
        client = boto3.client('ec2', region_name=region['RegionName'])
        for avz in client.describe_availability_zones()['AvailabilityZones']:
            prices = client.describe_spot_price_history(
                InstanceTypes=[instanceTypeToPrice],
                MaxResults=1,
                ProductDescriptions=['Linux/UNIX (Amazon VPC)'],
                AvailabilityZone=avz['ZoneName'],
            )
            if len(prices['SpotPriceHistory']) > 0:
                report[prices['SpotPriceHistory'][0]['SpotPrice']] = avz['ZoneName']

    # Print cheapest first (sort numerically, not lexicographically)
    for sortedKey in sorted(report, key=float):
        print(sortedKey + " -> " + report[sortedKey])
The Lambda function was given permission to query EC2, and I simply updated the instanceTypeToPrice variable in the code above to whichever instance type I was interested in finding the cheapest price for (g4dn.xlarge, p2.xlarge, c5.4xlarge, c5.9xlarge).

For example, if you run the above script for g4dn.xlarge you'd find that us-east-2c has the cheapest spot price at that time, at roughly $0.16/hour. Not bad: it would cost me close to $3,000-$3,500 to buy the hardware and pay for the electricity, so I could run this perpetually in AWS for nearly 2.5 years before breaking even versus buying the hardware.

Code: Select all

0.157800 -> us-east-2c
0.159500 -> us-east-1f
0.167300 -> us-east-2a
0.167400 -> eu-north-1a
...
Therefore I just set up a spot instance request in us-east-2c for g4dn.xlarge with X number of instances.
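
For anyone wanting to do the same from code rather than the console, here's roughly what that request looks like in boto3 -- just a sketch, and the AMI ID, key pair, and security group below are placeholders you'd swap for your own FAH-configured ones:

Code: Select all

import boto3

ec2 = boto3.client('ec2', region_name='us-east-2')

# Placeholders: substitute your own FAH-configured AMI, key pair, and security group.
response = ec2.request_spot_instances(
    InstanceCount=2,  # the "X number of instances"
    Type='persistent',
    LaunchSpecification={
        'ImageId': 'ami-xxxxxxxxxxxxxxxxx',
        'InstanceType': 'g4dn.xlarge',
        'KeyName': 'my-key-pair',
        'SecurityGroupIds': ['sg-xxxxxxxx'],
        'Placement': {'AvailabilityZone': 'us-east-2c'},
    },
)

for req in response['SpotInstanceRequests']:
    print(req['SpotInstanceRequestId'], req['State'])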

Thought about maybe writing a tutorial and a CloudFormation template to automate all of this based on a simple $ value you want to contribute in CPU or GPU resources, so others could easily do the same and have it optimally allocate the resources and spin them up. Will hold off for now until FAH figures out how to get enough work onto the network and how to scale up their internal systems.

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 6:54 pm
by aaronbrighton
aaronbrighton wrote:Thought about maybe writing a tutorial and a CloudFormation template to automate all of this based on a simple $ value you want to contribute in CPU or GPU resources, so others could easily do the same and have it optimally allocate the resources and spin them up. Will hold off for now until FAH figures out how to get enough work onto the network and how to scale up their internal systems.
If FAH turns into a network that isn't 99-100% utilized 24/7, then I might look into adjusting the above automation to also integrate some serverless triggers so that it only scales up resources as new WUs become available. I might need to reverse engineer or dig through the agent code to figure out how it reaches out for work and how it locks onto that work, so that I can pass it on to an agent on a given instance that gets launched only as needed and torn down when there's no work left.
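
If it comes to that, the scale-up half might look something like the sketch below: a scheduled Lambda that checks some work-availability signal and bumps a spot fleet's target capacity up or down. The work_is_available() function is entirely hypothetical here, since figuring out how to probe for work is exactly the part I'd need to reverse engineer, and the spot fleet request ID is a placeholder.

Code: Select all

import boto3

SPOT_FLEET_REQUEST_ID = 'sfr-xxxxxxxx'  # placeholder for an existing spot fleet

def work_is_available():
    # Hypothetical placeholder: a real version would probe the assignment/work
    # servers the same way the FAH client does.
    return False

def lambda_handler(event, context):
    ec2 = boto3.client('ec2', region_name='us-east-2')

    # Grow the fleet when there are WUs to crunch, shrink it to zero when idle.
    target = 4 if work_is_available() else 0
    ec2.modify_spot_fleet_request(
        SpotFleetRequestId=SPOT_FLEET_REQUEST_ID,
        TargetCapacity=target,
    )
    return {'target_capacity': target}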

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 6:56 pm
by TMStech
aaronbrighton wrote:How would outsiders or the individuals offering advice/suggestions have any clue whether silence means inaction or not? It's only natural when someone doesn't get a response from an organization they know little about, that they'd assume the message was either a) not seen, or b) ignored.
Particularly when
  • the silence is in response to offers of help and technical expertise with removing bottlenecks;
  • some of the responses could be characterized as "no, give us money and we'll fix it our own way";
  • there's no indication that the offers are even being considered;
  • those bottlenecks manifestly still exist.
alxbelu wrote:I'm guessing basically the regular team has been under something similar to a DOS attack as well with suggestions and proposals for collaborations...
That might be true elsewhere, e.g. by email or telephone. That's completely believable if dozens of news organizations (who know jack about any of this and need to be brought up to speed from zero) all want their part in creating clickbait. All we know is what we see here.

And even if not, I'm sure that any degree of questioning truly might seem overwhelming compared to "crickets" a month ago. Perversely, this is particularly true if they're well-informed suggestions that can't be dismissed as easily as a reporter asking "Uh, can't you like speed up the electricity or something".

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 7:05 pm
by alxbelu
aaronbrighton wrote:Thought about maybe writing a tutorial and a CloudFormation template to automate all of this based on a simple $ value you want to contribute in CPU or GPU resources, so others could easily do the same and have it optimally allocate the resources and spin them up. Will hold off for now until FAH figures out how to get enough work onto the network and how to scale up their internal systems.
Many thanks! While I have extremely limited experience with AWS or cloud infrastructure per se, I do think I'd be able to use your instructions after some trial and error (I at least know my way around terminals and *nix environments). However, I would also definitely be helped by a tutorial and any templates, and I'm sure there are others who would appreciate that as well - when the FAH infrastructure can handle it, that is.

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 7:05 pm
by aaronbrighton
TMStech wrote:"Uh, can't you like speed up the electricity or something".
:lol:

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 7:10 pm
by raven562
My proposal for the situation when no new WUs are available is to set up the BOINC client and run rosetta@home simultaneously. That way you can share your computing power with another molecular biology project (COVID-19 related, as I understand it) and still catch new WUs when they become available. Both sides win.

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 8:15 pm
by allanmoller
Love this shit :-) Let's create the biggest bot network in history to tackle this problem. How can I help? Can we donate money so you can buy more compute, maybe a Kickstarter project?

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 8:44 pm
by ftballpack
If the donation page is accurate, folding@home is a 501(c)(3) non-profit entity, meaning all donations are tax deductible.

If folding@home is in need of donations, the main issue appears to be marketing the tax-deductible aspect of donating to folding@home, given all of the current publicity around the project.

Has anyone reached out to the Bill and Melinda Gates Foundation? If any "whale" exists that could help increase funding rapidly, it would be the Gates Foundation.
My guess is the Gates Foundation would likely be ecstatic to give a donation to folding@home if the money is spent renting servers from Microsoft Azure.

idk, food for thought.

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 8:51 pm
by ftballpack
Gates could be donating to fight pandemic diseases, which is something he is passionate about, while Microsoft could get free press for being the organization that helped folding@home scale at a crucial time to fight COVID-19.

On the surface, it appears to be a win-win to me.

Re: Please don't let this go to waste

Posted: Wed Mar 18, 2020 9:06 pm
by aaronbrighton
ftballpack wrote:Has anyone reached out to the Bill and Melinda Gates Foundation? If any "whale" exists that could help increase funding rapidly, it would be the Gates Foundation.
My guess is the Gates Foundation would likely be ecstatic to give a donation to folding@home if the money is spent renting servers from Microsoft Azure.
Before FAH does that, I think they need to present a clear plan that shows that with X resources (whether it be $$, CPUs, GPUs) we can produce Y positive outcomes, based on Z analysis or proofs. From the sounds of things at the moment, they have more than enough CPUs and GPUs. I haven't heard anyone saying that $$ is a constraint for the core bottlenecks -- it seems to be a time constraint in setting up new core equipment or producing more tasks to be computed, which is likely talent/headcount bound. However, from the outside they don't appear to be particularly interested in bringing others on board to help tackle those issues at this time, so I'm not sure how $$ will really help at the moment.

If FAH can provide a wish list (so to speak) that shows what they want/need to do and what the constraints are, then large orgs might be willing to donate more resources (whether it be $$, CPUs, or GPUs). From the sounds of it though, whoever donated the huge amounts of GPU resources has had them going to waste and will probably retract them if it continues. So this really needs to get tackled before they reach out asking for more resources.

/my opinion