					
				Failing Server - 128.174.73.74:8080
				Posted: Tue Jul 04, 2023 5:26 am
				by SandyG
				From my logs, I am seeing this a lot for this particular server. The client switches to another server and picks up a work unit, but I have seen a few of these while watching today (7/3/23).
Code: Select all
05:19:53:WU02:FS04:Connecting to 128.174.73.74:8080
05:19:55:ERROR:WU02:FS04:Exception: Server did not assign work unit
05:19:55:WU02:FS04:Connecting to assign1.foldingathome.org:80
Sandy
 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Tue Jul 04, 2023 8:49 am
				by BobWilliams757
				Sandy,
I think toTOW is already watching this situation regarding both assignments and returns, as he has asked about it on the Discord channel.  I noticed that lately it has mostly been the server you list, though others have done it for me as well.
If the behavior you are seeing is anything like what I'm getting, the issue resolves quickly once the client is assigned to another server.  In some cases it throws that error in the same second it connects; in your case, two seconds later.  At any rate, most of my assignments still resolve in about 10 seconds total, so at least it's not creating a big delay.
			 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Tue Jul 04, 2023 3:35 pm
				by SandyG
				Seems to resolve quickly in some cases; in others it seems to wait a while before retrying. I'm also seeing this, not sure if it's related (looks to be a different server):
Code: Select all
15:15:48:WU02:FS02:Requesting new work unit for slot 02: gpu:101:0 GA102 [GeForce RTX 3090] from 129.32.209.202
15:15:48:WU02:FS02:Connecting to 129.32.209.202:8080
15:15:50:WU01:FS02:Upload 1.49%
15:15:50:WARNING:WU01:FS02:Exception: Failed to send results to work server: Transfer failed
15:15:50:WU01:FS02:Trying to send results to collection server
15:15:50:WU01:FS02:Uploading 29.46MiB to 158.130.118.26
15:15:50:WU01:FS02:Connecting to 158.130.118.26:8080
15:15:56:WU01:FS02:Upload 15.06%
15:15:58:WU00:FS04:0x22:Completed 2175000 out of 2500000 steps (87%)
15:16:00:WU05:FS01:0x22:Completed 512500 out of 1250000 steps (41%)
15:16:02:WU01:FS02:Upload 33.73%
15:16:06:WU03:FS03:0x22:Completed 1750000 out of 2500000 steps (70%)
15:16:06:WU03:FS03:0x22:Checkpoint completed at step 1750000
15:16:09:WU01:FS02:Upload 49.64%
15:16:15:WU01:FS02:Upload 65.77%
15:16:19:ERROR:WU02:FS02:Exception: 10002: Received short response, expected 512 bytes, got 0
15:16:21:WU01:FS02:Upload 77.65%
Not sure if this is normal, but looking at my numbers for the last couple of days, things seem lower than expected. I'm watching the box with nvidia-smi and it seems OK, but I'm not sure about some of the numbers since I just started running it. It looks like the cards are running at 90%+ utilization.
It might just be another batch of odd work units coming into play and changing the average daily processing. I'm watching it but not seeing much that would make a difference, other than that it seems to have started around 6/28...
Sandy
 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Wed Jul 05, 2023 8:21 pm
				by BobWilliams757
				Sandy,
The upload slowdowns can cost some points, especially with faster GPUs like the ones you are running.  AFAIK this is the first recent report of this issue on this particular server.  I do know that toTOW has been inquiring about assignments and returns for the faster cards.  As for this server, I haven't had any real slowdowns on returns, just the errors and quick recovery on assignments.
If you can, save some log files showing the delay when it happens.  It might help those trying to get the servers working more quickly.  I'll try to let toTOW know on the Discord channel just in case he doesn't notice it here soon.
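For anyone who wants to pull these delays out of a saved log, here is a minimal Python sketch. It assumes the client log format quoted in this thread (`HH:MM:SS:[LEVEL:]WUnn:FSnn:message`); the function name `assignment_gaps` and the sample lines are only illustrative, and it does not handle a log that rolls over midnight.

```python
import re
from datetime import timedelta

# Matches lines such as
#   02:40:48:ERROR:WU01:FS02:Exception: Server did not assign work unit
#   03:05:37:WU01:FS02:Downloading 61.64MiB
LINE_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):(?:(ERROR|WARNING):)?WU\d+:(FS\d+):(.*)$"
)


def assignment_gaps(lines):
    """Return (slot, first_failure, download_start, gap) for each idle window.

    A window opens at the first "did not assign" error seen for a slot and
    closes at that slot's next "Downloading" line. Times are offsets from
    midnight, so a log that crosses midnight is not handled.
    """
    pending = {}  # slot -> time of the first unanswered failure
    gaps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        h, m, s, _level, slot, message = match.groups()
        now = timedelta(hours=int(h), minutes=int(m), seconds=int(s))
        if "Server did not assign work unit" in message:
            pending.setdefault(slot, now)
        elif message.startswith("Downloading") and slot in pending:
            start = pending.pop(slot)
            gaps.append((slot, start, now, now - start))
    return gaps


# Sample lines in the format posted in this thread.
SAMPLE = [
    "02:40:47:WU01:FS02:Connecting to 128.174.73.74:8080",
    "02:40:48:ERROR:WU01:FS02:Exception: Server did not assign work unit",
    "02:40:49:WU04:FS03:Upload 16.32%",
    "02:43:25:ERROR:WU01:FS02:Exception: Server did not assign work unit",
    "02:54:31:ERROR:WU01:FS02:Exception: Server did not assign work unit",
    "03:05:37:WU01:FS02:Downloading 61.64MiB",
]

for slot, start, end, gap in assignment_gaps(SAMPLE):
    print(f"{slot}: idle from {start} to {end} ({gap})")
```

To run it on a real log, read the file with `open(path).read().splitlines()` and pass the lines to `assignment_gaps`.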
			 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Wed Jul 05, 2023 11:16 pm
				by SandyG
				Will keep an eye on it. It does seem to clear up pretty quickly, in a minute or so. If I catch it again I'll capture some of the logs around it.
Thanks
Sandy
			 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Thu Jul 06, 2023 2:35 pm
				by [Ars] For Caitlin
				This has been like this for weeks it seems.
Code: Select all
14:31:26:WU01:FS00:0x22:Completed 990000 out of 1000000 steps (99%)
14:31:26:WU00:FS00:Connecting to assign1.foldingathome.org:80
14:31:26:WU00:FS00:Assigned to work server 128.174.73.74
14:31:26:WU00:FS00:Requesting new work unit for slot 00: gpu:9:0 Navi 21 [Radeon RX 6900 XT] from 128.174.73.74
14:31:26:WU00:FS00:Connecting to 128.174.73.74:8080
14:31:27:ERROR:WU00:FS00:Exception: Server did not assign work unit
14:31:27:WU00:FS00:Connecting to assign1.foldingathome.org:80
14:31:27:WU00:FS00:Assigned to work server 128.174.73.74
14:31:27:WU00:FS00:Requesting new work unit for slot 00: gpu:9:0 Navi 21 [Radeon RX 6900 XT] from 128.174.73.74
14:31:27:WU00:FS00:Connecting to 128.174.73.74:8080
14:31:28:ERROR:WU00:FS00:Exception: Server did not assign work unit
14:31:56:WU01:FS00:0x22:Completed 1000000 out of 1000000 steps (100%)
 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Thu Jul 06, 2023 3:08 pm
				by Joe_H
				Looking at this server's entry on the Server Status page - 
https://apps.foldingathome.org/serverstats - it appears to be getting a relatively high assign rate, but only has about 1000 WUs available.  It is possible that WUs were available when the AS last updated, but that the server had run out by the time a specific client connection reached the WS.
 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Fri Jul 07, 2023 3:20 am
				by SandyG
				Bob/Joe -
Thanks for the info Joe, that seems like a plausible reason. I just got hit again; total downtime on one of the 3090s was about 25 minutes before it finally received a WU from a different server. The back-off (I'm guessing that's the retry mechanism) definitely makes it worse if it hits the same server.
I just happened to catch the start of the mess while tailing the logs and watching GPU stats (slow day). Here are the logs with some noise removed. The FS02 card was the one hit. Looking back, it doesn't seem to be related to a particular card or CPU, but that's only anecdotal from looking at the log and the failed attempts. This was one of the longer ones I caught; sometimes it picks back up in a minute or so.
Code: Select all
02:40:16:WU02:FS03:Assigned to work server 128.174.73.74
02:40:16:WU02:FS03:Requesting new work unit for slot 03: gpu:182:0 AD102 [GeForce RTX 4090] from 128.174.73.74
02:40:16:WU02:FS03:Connecting to 128.174.73.74:8080
02:40:16:WU03:FS04:0x22:Completed 450000 out of 1250000 steps (36%)
02:40:16:ERROR:WU02:FS03:Exception: Server did not assign work unit
02:40:17:WU02:FS03:Connecting to assign1.foldingathome.org:80
02:40:17:WU02:FS03:Assigned to work server 128.174.73.74
02:40:17:WU02:FS03:Requesting new work unit for slot 03: gpu:182:0 AD102 [GeForce RTX 4090] from 128.174.73.74
02:40:17:WU02:FS03:Connecting to 128.174.73.74:8080
02:40:18:ERROR:WU02:FS03:Exception: Server did not assign work unit
02:40:24:WU05:FS00:0xa8:Completed 25000 out of 2500000 steps (1%)
02:40:28:WU00:FS01:0x22:Completed 2025000 out of 2500000 steps (81%)
02:40:39:WU04:FS03:0x22:Completed 2500000 out of 2500000 steps (100%)
02:40:39:WU04:FS03:0x22:Average performance: 180 ns/day
02:40:40:WU04:FS03:0x22:Checkpoint completed at step 2500000
02:40:42:WU04:FS03:0x22:Saving result file ../logfile_01.txt
02:40:42:WU04:FS03:0x22:Saving result file checkpointIntegrator.xml
02:40:42:WU04:FS03:0x22:Saving result file checkpointState.xml
02:40:42:WU04:FS03:0x22:Saving result file positions.xtc
02:40:42:WU04:FS03:0x22:Saving result file science.log
02:40:42:WU04:FS03:0x22:Saving result file xtcAtoms.csv.bz2
02:40:42:WU04:FS03:0x22:Folding@home Core Shutdown: FINISHED_UNIT
02:40:43:WU04:FS03:FahCore returned: FINISHED_UNIT (100 = 0x64)
02:40:43:WU04:FS03:Sending unit results: id:04 state:SEND error:NO_ERROR project:18448 run:7 clone:32 gen:1426 core:0x22 unit:0x00000020000005920000481000000007
02:40:43:WU04:FS03:Uploading 29.48MiB to 129.32.209.202
02:40:43:WU04:FS03:Connecting to 129.32.209.202:8080
02:40:46:WU01:FS02:Connecting to assign1.foldingathome.org:80
02:40:47:WU01:FS02:Assigned to work server 128.174.73.74
02:40:47:WU01:FS02:Requesting new work unit for slot 02: gpu:101:0 GA102 [GeForce RTX 3090] from 128.174.73.74
02:40:47:WU01:FS02:Connecting to 128.174.73.74:8080
02:40:48:ERROR:WU01:FS02:Exception: Server did not assign work unit
02:40:49:WU04:FS03:Upload 16.32%
02:40:51:WU03:FS04:0x22:Completed 462500 out of 1250000 steps (37%)
02:40:55:WU04:FS03:Upload 34.76%
02:41:02:WU04:FS03:Upload 50.66%
02:41:04:WU00:FS01:0x22:Completed 2050000 out of 2500000 steps (82%)
02:41:04:WU00:FS01:0x22:Checkpoint completed at step 2050000
02:41:08:WU04:FS03:Upload 65.92%
02:41:15:WU04:FS03:Upload 83.73%
02:41:17:WU02:FS03:Connecting to assign1.foldingathome.org:80
02:41:19:WU02:FS03:Assigned to work server 129.32.209.202
02:41:19:WU02:FS03:Requesting new work unit for slot 03: gpu:182:0 AD102 [GeForce RTX 4090] from 129.32.209.202
02:41:19:WU02:FS03:Connecting to 129.32.209.202:8080
02:41:21:WU02:FS03:Downloading 61.64MiB
02:41:21:WU04:FS03:Upload 99.84%
02:41:24:WU04:FS03:Upload complete
02:41:24:WU04:FS03:Server responded WORK_ACK (400)
02:41:24:WU04:FS03:Final credit estimate, 403640.00 points
02:41:24:WU04:FS03:Cleaning up
02:41:25:WU03:FS04:0x22:Completed 475000 out of 1250000 steps (38%)
02:41:27:WU02:FS03:Download 7.00%
... these on 128.174.73.74:8080
02:43:25:ERROR:WU01:FS02:Exception: Server did not assign work unit
02:47:41:ERROR:WU01:FS02:Exception: Server did not assign work unit
New try on FS02, different server...
02:54:30:WU01:FS02:Connecting to 140.163.4.210:8080
02:54:31:ERROR:WU01:FS02:Exception: Server did not assign work unit
Finally assigned a work unit at 03:05:30, completed download at approx 03:06:05 where processing restarted
03:05:34:WU01:FS02:Connecting to assign1.foldingathome.org:80
03:05:35:WU01:FS02:Assigned to work server 129.32.209.202
03:05:36:WU01:FS02:Requesting new work unit for slot 02: gpu:101:0 GA102 [GeForce RTX 3090] from 129.32.209.202
03:05:36:WU01:FS02:Connecting to 129.32.209.202:8080
03:05:37:WU01:FS02:Downloading 61.64MiB
03:05:43:WU01:FS02:Download 22.61%
03:05:49:WU01:FS02:Download 43.09%
(loads 100% at 03:06:05)
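The retry spacing for FS02 in the log above can be worked out with a few lines of Python. This is only arithmetic on the timestamps posted here (the four failed attempts and the final successful connection at 03:05:34); the roughly 1.6x growth it shows is what these five timestamps suggest, not a documented client setting.

```python
from datetime import timedelta


def to_seconds(stamp):
    """Convert an HH:MM:SS log timestamp to seconds since midnight."""
    h, m, s = (int(part) for part in stamp.split(":"))
    return int(timedelta(hours=h, minutes=m, seconds=s).total_seconds())


# FS02 timestamps from the log above: four "did not assign" errors,
# then the connection that finally led to a download.
ATTEMPTS = ["02:40:48", "02:43:25", "02:47:41", "02:54:31", "03:05:34"]


def retry_intervals(stamps):
    """Seconds between consecutive attempts."""
    secs = [to_seconds(s) for s in stamps]
    return [later - earlier for earlier, later in zip(secs, secs[1:])]


def growth_ratios(intervals):
    """How much longer each wait was than the one before it."""
    return [round(b / a, 2) for a, b in zip(intervals, intervals[1:])]


intervals = retry_intervals(ATTEMPTS)
print(intervals)                 # [157, 256, 410, 663]
print(growth_ratios(intervals))  # [1.63, 1.6, 1.62]
print(sum(intervals))            # 1486 seconds, i.e. just under 25 minutes
```

Each wait is about 1.6 times the previous one, which is consistent with the guess that the client backs off progressively, and four failed attempts are enough to add up to the roughly 25 minutes of idle time.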
 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Sat Jul 08, 2023 3:47 am
				by SandyG
				Saw a few more today from the .74 server. Still having issues.
Sandy
			 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Sat Jul 08, 2023 4:47 pm
				by toTOW
				02:40:16:ERROR:WU02:FS03:Exception: Server did not assign work unit
This error has two possible meanings:
- the server is not set up correctly and can't send the project you've been assigned. Very unlikely on full FAH.
- the server is out of work for the project you've been assigned. It takes some time to synchronize the WS and the AS on the number of available WUs, so it is possible that the AS directs too many clients to a WS until it knows that the WS is out of work ...
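The second case can be illustrated with a toy simulation in Python. This is not FAH code: the function `simulate`, the request rate, and the sync interval are all made-up assumptions, and only the figure of about 1000 available WUs comes from this thread. It just shows how a stale count on the AS side turns into "did not assign" errors on the WS side.

```python
def simulate(initial_wus, requests_per_tick, sync_every, ticks):
    """Count assigned and rejected requests when the AS view lags the WS.

    All parameters are hypothetical. The AS refreshes its cached count
    only every `sync_every` ticks and keeps routing clients to the WS
    as long as that cached count is above zero.
    """
    actual = initial_wus   # what the WS really has left
    cached = initial_wus   # what the AS believes the WS has
    assigned = rejected = 0
    for tick in range(ticks):
        if tick % sync_every == 0:
            cached = actual          # AS learns the real count
        for _ in range(requests_per_tick):
            if cached <= 0:
                continue             # AS would send the client elsewhere
            if actual > 0:
                actual -= 1
                assigned += 1
            else:
                rejected += 1        # "Server did not assign work unit"
    return assigned, rejected


# Slow sync: the AS never notices within the run that the WS ran dry.
print(simulate(initial_wus=1000, requests_per_tick=50, sync_every=60, ticks=60))
# Sync on every tick: clients stop being sent once the WS is empty.
print(simulate(initial_wus=1000, requests_per_tick=50, sync_every=1, ticks=60))
```

With these made-up numbers the slow-sync run hands out all 1000 WUs and then rejects 2000 requests, while the per-tick sync rejects none. The faster the assign rate relative to the sync interval, the more clients land on an empty server.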
 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Sat Jul 08, 2023 5:47 pm
				by Bsb5068
				Same problem for me. I have to kill and reload the client, then it can get a new WU from the same or another server.....
When I don't keep an eye on the GPU, it sits cooling for 3-12 hours until it gets a new WU.....
It seems like a daemon is needed to control this situation. But the problem is that the other slots on the same PC have to stop too, and after some retries it will be dropped too..... so it's not a good solution.
			 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Sat Jul 08, 2023 5:53 pm
				by SandyG
				toTow -
Seems likely the server is out of work units, and it sometimes takes the client a long while (I've seen 25 min) to get a work unit; meanwhile my processing is idle. This problem is almost always with this particular server. If I'm having it, I'm sure many others are. What needs to be done to increase the share of WUs that are dispensed by this server? It seems that the allocation is ~1000 (as mentioned in an earlier message), but what can be done to help eliminate the idle time waiting for a WU, if that is the issue?
Sandy
			 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Sat Jul 08, 2023 5:55 pm
				by SandyG
				Bsb5068 wrote: ↑Sat Jul 08, 2023 5:47 pm
Same problem for me. I have to kill and reload the client, then it can get a new WU from the same or another server.....
When I don't keep an eye on the GPU, it sits cooling for 3-12 hours until it gets a new WU.....
It seems like a daemon is needed to control this situation. But the problem is that the other slots on the same PC have to stop too, and after some retries it will be dropped too..... so it's not a good solution.
 
Yep, I have been hit on multiple GPUs that requested WUs around the same time.
I guess it's good that it reduces my power bill, but it really doesn't help the needed processing, unless FAH has run out of WUs to dispense...
Sandy
 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Sat Jul 08, 2023 6:20 pm
				by [Ars] For Caitlin
				It would be nice if the "Failed to assign" condition was detected and the assignment server round robin'ed among the other servers.  I just got one from .4.210 which isn't even supposed to have work units based on the server status page.
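A rough sketch of that idea in Python, purely to show the behaviour being asked for. This is not how the assignment server is implemented; the class `RoundRobinAssigner` and its methods are hypothetical, and the server addresses are just the ones that appear in this thread.

```python
from itertools import cycle


class RoundRobinAssigner:
    """Hypothetical round-robin picker that skips servers that just failed."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._ring = cycle(self._servers)
        self._failed = set()

    def report_failure(self, server):
        """Record a "did not assign" result for this server."""
        self._failed.add(server)

    def report_success(self, server):
        """Clear the failure mark once the server hands out work again."""
        self._failed.discard(server)

    def next_server(self):
        """Return the next server in rotation that is not marked failed."""
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if candidate not in self._failed:
                return candidate
        return None  # every known server failed recently


assigner = RoundRobinAssigner(
    ["128.174.73.74", "129.32.209.202", "140.163.4.210"]
)
first = assigner.next_server()      # 128.174.73.74
assigner.report_failure(first)      # it did not assign a work unit
print(assigner.next_server())       # 129.32.209.202
print(assigner.next_server())       # 140.163.4.210
print(assigner.next_server())       # 129.32.209.202 (.74 is skipped)
```

A real version would also need to expire failure marks after some time, otherwise a server that has been refilled would never be tried again.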
			 
			
					
				Re: Failing Server - 128.174.73.74:8080
				Posted: Sat Jul 08, 2023 8:04 pm
				by SandyG
				[Ars] For Caitlin wrote: ↑Sat Jul 08, 2023 6:20 pm
It would be nice if the "Failed to assign" condition was detected and the assignment server round robin'ed among the other servers.  I just got one from .4.210 which isn't even supposed to have work units based on the server status page.
 
Yeah, it seems to do a progressive-delay retry to the same server for a while (it could be random selection, but I'm not sure), then at some point it picks up some other server. Not sure what the logic might be, but it should try a new server each time if there was a problem on the last server hit. It might be that there really aren't that many active servers, not sure.
Hopefully it's something like just adjusting the allocation to the .74 server, but someone more plugged in will have to comment.
Sandy