Re: Failing Server - 128.174.73.74:8080

Posted: Sun Jul 09, 2023 1:23 pm
by [Ars] For Caitlin
I've looked back through my logs for the last few weeks and I don't think I've ever actually gotten a work unit from this server. Based on the hostname, this is Dr. Paul at UI. Is there perhaps an issue with generating work units for these projects? There's obviously a disconnect between what the assignment server thinks is available and what is actually available.

Re: Failing Server - 128.174.73.74:8080

Posted: Mon Jul 10, 2023 2:35 am
by BobWilliams757
SandyG wrote: Sat Jul 08, 2023 8:04 pm
[Ars] For Caitlin wrote: Sat Jul 08, 2023 6:20 pm It would be nice if the "Failed to assign" condition was detected and the assignment server round robin'ed among the other servers. I just got one from .4.210 which isn't even supposed to have work units based on the server status page.
Yeah, it seems to do a progressive delay retry to the same server for a while (it could be random selection, but not sure), then at some point picks up some other server. Not sure what the logic might be, but it should hit a new server each time if there was a problem on the last one. Might be that there really aren't that many active servers, not sure.

Hopefully it's something like just adjusting the allocation to the .74 server, but someone more plugged in will have to comment.

Sandy

In my case, if I get the "Failed to assign" error, it does bounce to another server. I don't recall ever seeing an instance where it gave me the error and then tried the exact same assignment server a second time.
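To make the "bounce" behavior concrete, here is a minimal sketch of what round-robin failover with a progressive delay could look like. This is not the FAH client's actual logic, which nobody in this thread has seen; `request_work`, `try_assign`, and the delay values are all hypothetical names and numbers chosen for illustration.

```python
import itertools
import time

def request_work(servers, try_assign, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    """Try servers in rotation, backing off between failed attempts.

    servers:    list of work-server addresses (hypothetical pool)
    try_assign: callable(server) -> bool, True if a work unit was assigned
    Returns (server, attempts) on success or (None, attempts) if every try fails.
    """
    if not servers:
        return None, 0
    rotation = itertools.cycle(servers)
    for attempt in range(1, max_attempts + 1):
        server = next(rotation)
        if try_assign(server):
            return server, attempt
        if attempt < max_attempts:
            # Progressive delay: 1s, 2s, 4s, ... capped at 60s (assumed values).
            sleep(min(base_delay * 2 ** (attempt - 1), 60.0))
    return None, max_attempts
```

With more than one server in the pool, a failed attempt always moves on to a different server, which matches the behavior described above. With a single server, the rotation just retries it with growing delays.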

It might be either a random "bounce" to other servers that happens to select the same one again, or it could have to do with a shortage of work units suitable for your GPU(s). I do know that on Discord toTOW had inquired about the assignments and returns for the faster GPUs, those considered Species 8 or 9, and how there was some difficulty finding enough suitable work units to keep them fed.

Though just a guess, I'm going to assume that each project is located on a single server, so the projects that fit certain GPUs might only exist on a small number of servers compared to the total number of servers.


In any case, more information is usually better. Please note delay times where you catch them, and what type of hardware is waiting on the work unit.

On my Species 7 GPU, I've had quite a few errors, but the assignment delays are very minor, most often under 10 seconds total compared to the best-case scenario. On uploads my issues have had greater time impacts, but I don't think I've had any upload delays over a few seconds for a week or two now.

Re: Failing Server - 128.174.73.74:8080

Posted: Mon Jul 10, 2023 9:08 pm
by toTOW
I think the main issue here is that 128.174.73.74 is hosting one of the few projects that will assign to big GPUs like yours ... but it is hard to find projects that will scale well on such GPUs ... :(

How much PPD are you prepared to lose in exchange for an increased supply of WUs?

Re: Failing Server - 128.174.73.74:8080

Posted: Tue Jul 11, 2023 3:10 am
by SandyG
toTOW wrote: Mon Jul 10, 2023 9:08 pm I think the main issue here is that 128.174.73.74 is hosting one of the few projects that will assign to big GPUs like yours ... but it is hard to find projects that will scale well on such GPUs ... :(

How much PPD are you prepared to lose in exchange for an increased supply of WUs?

Well, not sure how it works, but from my standpoint, if no larger WUs are available, assign a smaller one. Not at all sure how specific WUs are to GPUs, but it seems odd that it would not just send one that may not be a BIG one.

Makes me not want to continue to invest in any more hardware for FAH if there aren't enough larger WUs to utilize hardware that is pretty common now.
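For what it's worth, the "assign a smaller one" idea could be sketched roughly like this. Everything here is an assumption made for illustration: the `Project` fields, the numeric tiers (loosely modeled on the "Species" numbers mentioned earlier in the thread), and the selection rule are hypothetical and are not how the assignment server is known to work.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    min_tier: int   # hypothetical: lowest GPU tier the project is offered to
    max_tier: int   # hypothetical: highest GPU tier the project is offered to
    has_work: bool  # whether work units are currently available

def pick_project(projects, gpu_tier, allow_fallback=True):
    """Prefer a project sized for this GPU; optionally fall back to a smaller one."""
    available = [p for p in projects if p.has_work]
    suited = [p for p in available if p.min_tier <= gpu_tier <= p.max_tier]
    if suited:
        # Among suitable projects, take the one aimed at the biggest cards.
        return max(suited, key=lambda p: p.min_tier)
    if allow_fallback:
        # Nothing sized for this GPU: take the largest project below its tier.
        smaller = [p for p in available if p.max_tier < gpu_tier]
        if smaller:
            return max(smaller, key=lambda p: p.max_tier)
    return None
```

The fallback keeps the card busy, but on a project that may not scale to it, which is the PPD trade-off toTOW is asking about.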

Re: Failing Server - 128.174.73.74:8080

Posted: Tue Jul 11, 2023 10:01 pm
by BobWilliams757
Without knowing the details of the assignment methods, all of this is just a guess, BUT.....

What is a priority in research today might change tomorrow. Look how quickly COVID changed the landscape of folding. And as those research priorities change, so might the project constraints that currently make projects suited to the power cards few and far between. It could easily be that in a few months, the project shortages will apply to less powerful GPUs and the power GPUs will be in full feast mode.

I think GPUs have advanced so quickly that we might be at the point of "peak GPU" for at least the short-term future. The cards have so many cores now that fewer and fewer projects can fully take advantage of them. I have no idea what the power cards get for assignments, but for me and my little 1660 Super, I get assignments as short as an hour and a half, up to 9 or 10 hours on occasion. And though points really don't matter to me other than as a measure of what I've done, it makes sense to try to assign to the GPUs that will be efficient for any given project.


That said, it would be interesting to know more about what factors have to be considered for the project assignments, and if there is any way to alter projects to suit available hardware as the technology changes.


Either way, I'm glad someone tries to sort it out. If anyone could use a little help at times, I'm game.

Re: Failing Server - 128.174.73.74:8080

Posted: Tue Jul 11, 2023 11:26 pm
by SandyG
In watching logs the last couple of days, seems less noisy on the fails from the .74 server, and faster retry to a different server. Not sure if that's just my imagination...

Saw this today, failed over fast to the .202 server

Code:

22:12:52:WU00:FS03:Requesting new work unit for slot 03: gpu:182:0 AD102 [GeForce RTX 4090] from 128.174.73.74
22:12:52:WU00:FS03:Connecting to 128.174.73.74:8080
22:12:53:WARNING:WU00:FS03:WorkServer connection failed on port 8080 trying 80
22:12:53:WU00:FS03:Connecting to 128.174.73.74:80
22:12:53:ERROR:WU00:FS03:Exception: Failed to connect to 128.174.73.74:80: Connection refused
22:12:53:WU00:FS03:Connecting to assign1.foldingathome.org:80
22:12:53:WU00:FS03:Assigned to work server 129.32.209.202
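
For anyone who wants to note delay times without watching the log by hand, here is a rough sketch that tallies failed requests from lines in the format shown above. It only assumes the message wording visible in this excerpt ("Requesting new work unit", ERROR lines containing "Failed to", "Assigned to work server"); other log messages are ignored, and `assignment_delays` is a made-up helper name, not part of any FAH tool.

```python
import re
from datetime import datetime, timedelta

# HH:MM:SS, optional WARNING/ERROR level, work unit, slot, then the message.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}):(?:(WARNING|ERROR):)?(WU\d+):(FS\d+):(.*)$")

def assignment_delays(lines):
    """Return (slot, new_server, delay_seconds, failures) per failed request."""
    last_request = {}  # slot -> time of the most recent work-unit request
    pending = {}       # slot -> [request time, failure count] after an error
    results = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        stamp, level, _wu, slot, msg = m.groups()
        t = datetime.strptime(stamp, "%H:%M:%S")
        if msg.startswith("Requesting new work unit"):
            last_request[slot] = t
        elif level == "ERROR" and "Failed to" in msg:
            entry = pending.setdefault(slot, [last_request.get(slot, t), 0])
            entry[1] += 1
        elif msg.startswith("Assigned to work server") and slot in pending:
            start, failures = pending.pop(slot)
            delay = (t - start) % timedelta(days=1)  # tolerate a midnight rollover
            results.append((slot, msg.rsplit(" ", 1)[-1], delay.total_seconds(), failures))
    return results
```

Fed the seven lines above, this reports one failure on FS03 and a one-second gap between the request to .74 and the reassignment to 129.32.209.202. Timestamps in this format have no date, so gaps longer than a day cannot be measured this way.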

@BobWilliams757 - the .74 server failed to deliver x86 work as I recall from the past, so I'm not sure this is strictly GPU related. I have turned off all x86 processing, as it just seems to make heat on the motherboard's CPU, and dropping it costs little in points when all is said and done. Helps get rid of a bit of heat in the room during the summer :D

Would love to learn more about how WUs are generated and whether GPU cores should really matter nowadays. I watched the memory use on the GPUs and it's in the 1/2 GB range for the 3090 and 4090s, which all have 24 GB... Likely another thread for this topic.

Sandy