Getting rejected by assignment server (GPU)

Post by **PantherX** » Fri Jan 10, 2014 6:42 pm

netblazer wrote:...You need to have a single machine catch all requests and reroute as needed (that's what a router does). So that whenever you need to redirect the traffic elsewhere (downtime, new servers, new types of servers), you just change a redirect on 1 machine and you're done. And you can build one from the scrap hardware you have lying around if you don't have 1-2K to buy one, just Google it...

Originally, that was the plan but it changed. The first announcement stated that the new backup GPU AS Server would be supported by all V7 Clients:

...We architected support for this some time ago, so all v7 clients should be able to use that redundant AS, so if one goes down, one can still get WUs.

However, once things got into motion for setting up the backup GPU AS Server, they discovered that it was new to the F@H infrastructure:

...The use of backup GPU AS is new to the FAH infrastructure and we now see that to make it work, we’ll need to roll out a new client...

What changed in the infrastructure remains unknown to me.

netblazer wrote:...Seriously, now you expect to have 2-3 million people just re-download the latest software?...

While we don't have 2 to 3 million donors yet, you do raise a good point about awareness. For the most part, the update is needed only to use the backup GPU AS Server. Thus, if the primary GPU AS Server remains online and fully functional, there won't be an issue for any donors running V7. If the primary GPU AS Server does go offline, those using the yet-to-be-released V7 should automatically switch over and not be affected. However, those using an older version will notice the issue and, sooner or later, will search the internet for the solution and find it.

Post by **VijayPande** » Fri Jan 10, 2014 6:55 pm

netblazer wrote:
PantherX wrote:The new back-up GPU AS Server is released but will need a new FAHClient to work with (http://folding.stanford.edu/home/new-ba ... er-update/).
No offense but that's certainly one of the stupidest things I've ever heard. You need to have a single machine catch all requests and reroute as needed (that's what a router does).

I think you're missing the point -- the goal here is to remove single points of failure. Your proposed solution creates a new single point of failure. Moreover, with our current plan, we also gain some more flexibility in the future with how we handle the two AS's. One alternative quick fix is to assign the DNS name assign-gpu.stanford.edu to multiple IP's and let DNS round robin it for us. However, I've seen issues with this, especially if DNS gets flaky or screwed up.

Finally, while I appreciate the comparisons to Google and Facebook, we're running on a dramatically smaller hardware and personnel budget than they are. If we had the funds to build what they do, we wouldn't be running a volunteer distributed computing project, but just computing directly on that infrastructure, as we have done in a recent collaboration with Google.

Post by **johnblommers** » Fri Jan 10, 2014 9:02 pm

7im wrote:FAH has to do it with low cost options. It's way easier to hard code a server address in a new client than to put up the kind of multiple Network Operation Centers like Google and your Bank has to monitor, maintain, staff, upgrade, etc. Target just got hacked. 1000s of credit card #s and personal information stolen. When was the last time you heard of FAH getting hacked?

Agreed 100%. Let me weigh in that security and confidence are HUGE for F@H. To my knowledge there has never been a security problem. Cores uploaded by clients have never been tainted by malware. Everything is digitally signed so IP masquerading/spoofing by hacker servers is defeated. If FAH ever gets tainted by a security breach it will devastate the user base's confidence, significantly erode trust, and result in a massive reduction in voluntary CPU and GPU folding. That will directly impact its usefulness to the school.

People choose to Fold. We want that experience to be safe and idiot proof. Let the form follow the function.

Post by **7im** » Fri Jan 10, 2014 9:38 pm

netblazer wrote:...The only real difference here is that you have nothing worth stealing...

True. And yet many DC projects have problems with points cheaters hacking their points system.

Personally, I stopped contributing to SETI@home for a while because of cheating, then later finished out the project when it was fixed.

Post by **warp-9.9** » Sat Jan 11, 2014 5:36 pm

netblazer wrote:...The only real difference here is that you have nothing worth stealing...

Maybe not worth stealing, as the eventual results are to be given away for free. But a malicious person or group may exist and may have other motivations, such as to prevent research of this nature, for no other particular reason. People really are like that, all over the world.

Also, as a fictional example, what if a major drug company had spent billions of dollars and decades of research, was poised to release a breakthrough, and could make trillions of dollars with a cure, but its competitors might make a similar drug based on the results of this research, with minimal investment in their own drug research? The company might not be able to resist the temptation to hinder this research project. It could even outsource the problem to an unpopular country with a reputation for doing such things, and claim plausible deniability. Corporations really are like that, all over the world.

So someone could benefit not by stealing, but by destroying what's there, breaking confidence, slowing contributions and draining resources trying to assess and repair damages.

VijayPande wrote:
netblazer wrote:
PantherX wrote:The new back-up GPU AS Server is released but will need a new FAHClient to work with (http://folding.stanford.edu/home/new-ba ... er-update/).
No offense but that's certainly one of the stupidest things I've ever heard. You need to have a single machine catch all requests and reroute as needed (that's what a router does).
I think you're missing the point -- the goal here is to remove single points of failure. Your proposed solution creates a new single point of failure. Moreover, with our current plan, we also gain some more flexibility in the future with how we handle the two AS's. One alternative quick fix is to assign the DNS name assign-gpu.stanford.edu to multiple IP's and let DNS round robin it for us. However, I've seen issues with this, especially if DNS gets flaky or screwed up.

I tend to strongly side with netblazer on the aspect of designing a fault-tolerant system, with redundancy. It can be done economically and securely.

It's a poor argument to say that having a router is the only single point of failure.

Are all the servers in the same building? On the same campus? Will a natural disaster like an earthquake, wildfire, mudslide, etc. wipe out all of the servers and data?

How about network redundancy? If a construction worker bumps/breaks a conduit, will it take down the project? Or do network cables feed into the building/campus from multiple directions?

How about ISP redundancy? If someone at your ISP, or their upstream ISP, fiddles with a route's netmask, and suddenly 1 million people on the other side of that mask can't access your server, do you have multiple ISP's who also don't use the same upstream ISP?

How about DNS redundancy? Is your DNS company fault-tolerant? Does your DNS company maintain servers around the world? I know of one Dynamic company that provides industry-leading DNS and related services fairly cheaply. The name of the company is hidden in the previous sentence. I am not a stakeholder, have nothing to gain by name dropping, just one company I am aware of that specifically strives to address this problem. Others may exist.

How about electrical redundancy? Does each server have a dual hot swap power supply? Is there a high capacity battery capable of running all the servers for 1 day, with no failover time delay? How about for 1 hour or even 1 minute, while a diesel generator comes online (those can have several second delays). How about electrical grids? Are you supplied electricity by a single grid? Or do multiple grids overlap your area?

How about data redundancy? Are you using RAID1 at a minimum? RAID5 or RAID6 might be better. For instance, RAID5 on RAID5 with hot spare and hot swap drives, I have read, is one way to ensure graceful degradation with 0 downtime, just replace drives and rebuild. How about keeping archived backups? Are they all stored on-site? All in same off-site location?

Routers are just one of several "single points of failure", just like a network cable, power cable, or data storage device. But these technologies can all be utilized in ways which increase redundancy, reduce single points of failure, degrade gracefully, and give extra hours or days (hopefully not that long!) in which to respond and fully repair a problem. Although some more esoteric solutions, especially involving hard drives, can be more costly, it may be possible to solicit donated hardware? If the project was set up to be run by a registered non-profit charitable organization, other companies could even write it off as a tax credit.

I am by no means a networking expert, yet I have been exposed to these problems (as primary tech support for residential and business internet service and data center operations) and solutions when I worked at a small ISP with less than 10 employees. So I know a bit about small staff and budgets too. All of that was over a decade ago, and technology has advanced so much in that time. And yet, sound fundamental design never ages.

Anyways, even if no hardware infrastructure policies or changes are ever going to be made (and I sense this is the stubborn predisposition), there has to be a better way to address the problem than hard coding lists of addresses into clients. Using a secure and authenticated means of transmission, you should have a means of retrieving a list from your servers, to logically decouple the dependency between existence of new backup servers, and the client's knowledge of them.
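As a rough illustration of that decoupling, here is a minimal sketch in which the client fetches a server list and only trusts it if an authentication tag checks out, otherwise falling back to its built-in addresses. Everything here is an assumption made for illustration: the function names, the message layout, and the `example.org` hosts are hypothetical, and HMAC-SHA256 with a shared key is used only as a standard-library stand-in for the public-key signatures a real deployment would more likely use. It is not how the F@H client works.

```python
import hashlib
import hmac
import json


def sign_server_list(servers, key):
    """Server side (hypothetical): serialize the list and attach an HMAC tag."""
    payload = json.dumps(sorted(servers))
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def load_server_list(message, key, fallback):
    """Client side (hypothetical): accept the list only if the tag verifies.

    If verification fails, return the addresses shipped with the client.
    """
    payload = message["payload"].encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, message["tag"]):
        return list(fallback)
    return json.loads(payload)


if __name__ == "__main__":
    key = b"demo-key-not-for-real-use"
    built_in = ["as-primary.example.org:8080"]
    msg = sign_server_list(
        ["as-primary.example.org:8080", "as-backup.example.org:8080"], key
    )
    print(load_server_list(msg, key, built_in))
```

With something along these lines, adding a backup server would mean publishing a new signed list rather than shipping a new client, though the first server the client contacts still has to be known in advance.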

Pushing a new client out every time a new server comes online is idiotic. Having 12 hours/week of downtime hurts credibility almost as much as corrupt data, and is equally unacceptable. Tap into the great corporate and political wealth and power base that calls Stanford its alma mater and get a few people to crack open their piggy banks or lend some brain cells and eyeballs to help come up with more robust yet equally secure solutions.

Anyways, I'll go back to silently crunching data now. Thanks and good luck with the project. Rank 3689 of 1718675 users, having completed 4130 work units. Respectable, considering I only have 1xAMD FX-8150 8-core CPU and 2xnVidia/EVGA 480GTX cards (a.k.a. the space heaters), running for just under 2 years now (22 months, with windows opened in winter even at negative Fahrenheit temperatures, if only 1/8", just enough to keep cards below 80C). If only the GPUs/Folding cores had a process priority system, I'd be able to let the GPUs run while watching videos or playing games or doing heavy surfing...

Post by **Jesse_V** » Sat Jan 11, 2014 6:13 pm

warp-9.9 wrote:If only the GPUs/Folding cores had a process priority system, I'd be able to let the GPUs run while watching videos or playing games or doing heavy surfing...

There's nothing Folding@home can do about that; it's a problem that is outside of its scope. The problem resides with the architecture of GPUs and their drivers. GPU manufacturers have been making the reasonable assumption that applications never fight over the GPU, and that heavy applications have it exclusively. For the majority of their userbase (gamers and casual users) this holds true. However, the assumption breaks down when distributed computing projects come in and start using the GPU in the background. The only way Folding@home has been able to work around this problem is to implement a feature wherein GPU folding waits until the computer is not being used before firing up the GPU cores to do work. This prevents the system from lagging during normal use, and is one option. On my system the lag isn't too bad, so I can live with it just fine. For gaming, I pause the GPU work.

Post by **PantherX** » Sat Jan 11, 2014 7:33 pm

warp-9.9 wrote:...Are all the servers in the same building? On the same campus? Will a natural disaster like an earthquake, wildfire, mudslide, etc. wipe out all of the servers and data?...

Within F@H, there are different universities participating. Thus, you will find that the Servers are scattered across different geographical regions.

warp-9.9 wrote:...How about data redundancy? Are you using RAID1 at a minimum? RAID5 or RAID6 might be better. For instance, RAID5 on RAID5 with hot spare and hot swap drives, I have read, is one way to ensure graceful degradation with 0 downtime, just replace drives and rebuild. How about keeping archived backups? Are they all stored on-site? All in same off-site location?...

At least Bowman lab at the University of California is using RAID 6 (http://folding.stanford.edu/home/under- ... -berkeley/). Not sure what the other Servers are using. Regarding the analyzed data, using the Stanford Digital Repository is being considered (viewtopic.php?p=255197#p255197).

warp-9.9 wrote:...Pushing a new client out every time a new server comes online, is idiotic...

Please note that not all Servers are the same. There are plenty of Work Servers (WS) and Collection Servers (CS) that come and go without any need to update the F@H Client. However, this particular release addresses the redundancy of the Assignment Server (AS) for GPUs, a kind of change that happens rarely.

Post by **johnblommers** » Sat Jan 11, 2014 7:55 pm

One way to achieve additional redundancy is to make the FAH Clients DNS aware. The hostmaster can configure DNS so that an nslookup for gpu.stanford.edu can return a list of valid IP addresses, and the FAH client simply tries each IP in the list until it succeeds. Add a few more servers, or change their IP addresses, and you simply tweak the DNS records at the authoritative DNS server for the domain.
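A minimal sketch of that "try each address until one works" loop, using only the Python standard library. The function name and the injectable `resolve`/`connect` parameters are hypothetical and exist just to make the idea testable without a network; this is not taken from the FAH client.

```python
import socket


def connect_first_available(host, port, timeout=5.0,
                            resolve=socket.getaddrinfo,
                            connect=socket.create_connection):
    """Resolve host to all of its addresses and return the first connection
    that succeeds, trying the addresses in the order DNS returned them."""
    last_error = None
    for _family, _type, _proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        try:
            # sockaddr[:2] is (ip, port) for both IPv4 and IPv6 results.
            return connect(sockaddr[:2], timeout=timeout)
        except OSError as err:
            last_error = err  # remember the failure and try the next address
    raise ConnectionError(
        f"no address for {host}:{port} accepted a connection") from last_error
```

Note that `socket.create_connection` already does this internally when given a hostname, so the explicit loop is mainly useful when the client wants to log or reorder the individual attempts.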

This capability has been available for eons. I fondly remember the day I did a telnet to a troubled router, and watched as the telnet client tried each interface (returned by DNS) until it found one that would allow it to connect. To that point I had no idea telnet could do that.

Good old telnet, when networks were still open, and there was still trust in the air.

Post by **bruce** » Sat Jan 11, 2014 8:48 pm

johnblommers wrote:One way to achieve additional redundancy is to make the FAH Clients DNS aware.

The reliability (and redundancy) for a connection to an Assignment Server (AS) is critical, as we have seen recently. Reliability and redundancy to Work Servers (WS) uses totally different logic (and there are dozens of them to choose from).

The FAH Client is DNS aware. It uses DNS to establish its initial connection to a Stanford AS. The client does not use DNS to connect to a WS, as they often don't have WUs that your hardware can process. The AS itself keeps track of which Work Servers can fulfill your request for a new WU and "assigns" you to the best server choice available, not just to a random choice like DNS would.
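To make the contrast with DNS round robin concrete, here is a toy sketch of capability-aware assignment. The `WorkServer` fields, the server names, and the "most available WUs wins" rule are all assumptions chosen for illustration; the thread does not describe how the real AS ranks Work Servers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkServer:
    """Hypothetical view of what an AS might track per Work Server."""
    name: str
    supported_hardware: frozenset
    available_wus: int


def assign_work_server(servers, hardware):
    """Return the server best able to serve this hardware, or None.

    'Best' here is simply the eligible server with the most available WUs
    (an assumption). A DNS-style random pick would ignore both hardware
    support and whether any WUs are available at all.
    """
    candidates = [
        s for s in servers
        if hardware in s.supported_hardware and s.available_wus > 0
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.available_wus)


if __name__ == "__main__":
    servers = [
        WorkServer("ws-a", frozenset({"cpu"}), 50),
        WorkServer("ws-b", frozenset({"gpu"}), 0),
        WorkServer("ws-c", frozenset({"cpu", "gpu"}), 7),
    ]
    print(assign_work_server(servers, "gpu"))
```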

In those rare instances when DNS needs to be changed, world-wide propagation delays add appreciable lag so it's not as useful a tool as you suggest. (By the way, Stanford was one of the founding universities of the Internet in the 1960s working under DARPA so they do have plenty of network-knowledgeable people.)

In the beginning, telnet could not do that. A new client had to be distributed that could.

Post by **hschulze** » Mon Mar 03, 2014 6:28 am

I have a GTX460 and a few other machines (one with a Titan) getting the server assignment error

02:23:30:WU01:FS01:Connecting to assign-GPU.stanford.edu:80
02:23:30:WU01:FS01:News: Welcome to Folding@Home
02:23:30:WARNING:WU01:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:80': Empty work server assignment
02:23:30:WU01:FS01:Connecting to assign-GPU.stanford.edu:8080
02:23:30:WU01:FS01:News: Welcome to Folding@Home
02:23:30:WARNING:WU01:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:8080': Empty work server assignment
02:23:30:ERROR:WU01:FS01:Exception: Could not get an assignment

I see servers have GPU workunits, but I am not getting any. Fresh 12-hour old install of Win7/64, driver 334.89. client-type=advanced. CPU workunits are ok.

If assign-gpu isn't a router, why is it stuck? Should I try to force the IP to another machine?

Post by **P5-133XL** » Mon Mar 03, 2014 8:21 am

Please show your log, including the system and config portions.

Post by **Joe_H** » Mon Mar 03, 2014 1:23 pm

hschulze wrote:I have a GTX460 and a few other machines (one with a Titan) getting the server assignment error

I see servers have GPU workunits, but I am not getting any. Fresh 12-hour old install of Win7/64, driver 334.89. client-type=advanced. CPU workunits are ok.

If assign-gpu isn't a router, why is it stuck? Should I try to force the IP to another machine?

What is the result of trying to connect to http://assign-gpu.stanford.edu:8080 and http://assign-gpu.stanford.edu in a browser from this system? You should get a blank screen with an "OK" in the top left corner.
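If a browser isn't handy, the same check can be scripted. This is a small sketch under a few assumptions: the helper names are made up, the check treats a body of exactly "OK" (ignoring surrounding whitespace) as healthy, and the two URLs are simply the ones mentioned above, which may not answer from every network.

```python
import urllib.request

# URLs taken from the post above.
AS_URLS = [
    "http://assign-gpu.stanford.edu:8080",
    "http://assign-gpu.stanford.edu",
]


def fetch_body(url, timeout=10.0):
    """Download the page body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_assignment_servers(urls, fetch=fetch_body):
    """Map each URL to True if it answered with 'OK', else False."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url).strip() == "OK"
        except OSError:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            results[url] = False
    return results


if __name__ == "__main__":
    for url, ok in check_assignment_servers(AS_URLS).items():
        print(url, "reachable" if ok else "NOT reachable")
```

A False result for both ports would point toward a firewall, proxy, or DNS problem on the local side rather than toward the client configuration.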
