Multiple WU's Fail downld/upld to 155.247.166.*

Moderators: Site Moderators, FAHC Science Team

JimF
Posts: 651
Joined: Thu Jan 21, 2010 2:03 pm

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by JimF »

My last two machines are now down. If they have to run the work units in order to clear them out, but those work units cause everyone to go down in the process, it appears to be a classic Catch-22.
Let me know when they figure it out.

(I was wondering about Stanford - thanks for the update. I did not know they had turned over the load entirely).
Paragon
Posts: 137
Joined: Fri Oct 21, 2011 3:24 am
Hardware configuration: Rig1 (Dedicated SMP): AMD Phenom II X6 1100T, Gigabyte GA-880GMA-USB3 board, 8 GB Kingston 1333 DDR3 Ram, Seasonic S12 II 380 Watt PSU, Noctua CPU Cooler

Rig2 (Part-Time GPU): Intel Q6600, Gigabyte 965P-S3 Board, EVGA 460 GTX Graphics, 8 GB Kingston 800 DDR2 Ram, Seasonic Gold X-650 PSU, Artic Cooling Freezer 7 CPU Cooler
Location: United States

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by Paragon »

I can confirm that 219 is still down...took out two of my four machines today. I just rebooted one machine 5 times and it got stuck each time...although the last attempt actually threw an error and then it switched servers.

Code: Select all

02:48:54:WU00:FS01:Assigned to work server 155.247.166.219
02:48:54:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580] from 155.247.166.219
02:48:54:WU00:FS01:Connecting to 155.247.166.219:8080
02:48:55:WU00:FS01:Downloading 27.46MiB
02:49:04:WU00:FS01:Download 0.46%
02:50:58:WU00:FS01:Download 0.64%
02:50:58:ERROR:WU00:FS01:Exception: Transfer failed
02:50:58:WU00:FS01:Connecting to 65.254.110.245:8080
02:50:58:WU00:FS01:Assigned to work server 155.247.166.220
02:50:58:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580] from 155.247.166.220
02:50:58:WU00:FS01:Connecting to 155.247.166.220:8080
02:50:59:WU00:FS01:Downloading 15.63MiB
02:51:04:WU00:FS01:Download complete
HaloJones
Posts: 906
Joined: Thu Jul 24, 2008 10:16 am

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by HaloJones »

Strange in this day and age that this isn't all virtualised and running off AWS or Azure.
single 1070

bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by bruce »

That download was from *.220, not *.219 ... but the activity levels look normal on serverstat based on recent updates.
Joe_H
Site Admin
Posts: 7938
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by Joe_H »

HaloJones wrote:Strange in this day and age that this isn't all virtualised and running off AWS or Azure.
Some parts of F@h have already been moved to cloud services, and as I understand it, more will be in the future. But that takes programming, time and money. One example that is already in the cloud is assign2: if you look at the server stats page, the IP address listed for it is in a private address range, and it is actually hosted on one of Amazon's servers.
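
For anyone who wants to check whether an address from the stats page or a client log falls in a private range, here is a minimal sketch using Python's standard `ipaddress` module. The `10.0.0.5` address is a made-up example, since the thread does not quote the address listed for assign2. The other two are the addresses that appear in the logs in this thread.

```python
import ipaddress


def is_private_address(text: str) -> bool:
    """Return True if the address is in a private (non-routable) range."""
    return ipaddress.ip_address(text).is_private


# 10.0.0.5 is a hypothetical example; the other two come from the logs above.
for candidate in ("10.0.0.5", "155.247.166.219", "65.254.110.245"):
    kind = "private" if is_private_address(candidate) else "public"
    print(candidate, kind)
```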

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by MeeLee »

They could reduce the size of WUs from 155.247.*, make it get a lower load.

I think it's valuable to have this server upload as few WUs as possible to consistent clients (clients or users that process many WUs fast).
A slow WU would make little difference for someone who intermittently folds, but is a huge pain for someone with a server to maintain every time one of his GPUs is down.
Joe_H
Site Admin
Posts: 7938
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by Joe_H »

The size of a WU from an existing project cannot be changed. Ultimately it will take more servers to spread the load, and for the WUs that are being processed on the older A7 core to finish being returned.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by bruce »

MeeLee wrote:They could reduce the size of WUs from 155.247.*, make it get a lower load.
No.

We could reduce the load by having the servers tell clients "No WUs for your client's configuration" but that's not an acceptable solution.

Running a single WU that takes (say) 4 hours will download and upload the same amount of data as two WUs that take 2 hours each. Then, too, two WUs have to make another upload connection and another download connection. Nothing is gained by making smaller WUs.
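
A rough sketch of that arithmetic, taking the premise above as given (the total data moved is the same however the work is split). The 30 MiB figure and the function name are made up for illustration and are not real project numbers.

```python
def transfer_cost(wu_count, total_payload_mib, per_connection_overhead_mib=0.0):
    """Return (total MiB moved, connections made) for a fixed amount of work.

    Assumes, as stated in the post, that splitting the work into more WUs
    does not change the total payload. Only the connection count changes.
    """
    connections = 2 * wu_count  # one download and one upload per WU
    total_mib = total_payload_mib + connections * per_connection_overhead_mib
    return total_mib, connections


# Hypothetical 30 MiB of total transfer for the same 4 hours of computing.
one_long = transfer_cost(wu_count=1, total_payload_mib=30.0)
two_short = transfer_cost(wu_count=2, total_payload_mib=30.0)
print("one 4-hour WU :", one_long)
print("two 2-hour WUs:", two_short)
```

With any nonzero per-connection overhead, the split case moves slightly more data, not less.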

Studying half of a protein is useless -- you have to study the whole protein.
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by MeeLee »

Sure, but it prevents you from having GPUs idle.
The size of WUs can't be changed, but larger WUs are assigned to certain GPUs; and smaller WUs are assigned to older GPUs or CPUs.
You just want to prevent your biggest contributors (fast GPUs) from being idle.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by bruce »

A project can have fewer atoms or more atoms. A project can be processed for N steps or for 2N steps or 0.5N steps. Changing the atom count is not possible. Changing the number of steps is possible but only when the project is first constructed.
Changes in atom counts AND changes in steps are both commonly called larger/smaller WUs because changing either one changes the PPD.

Assigning specific projects can be permitted for certain GPU-Species and restricted from other specific GPU-Species. Those restrictions are rigid rules for the Assignment process. If your GPU needs an assignment, it will be assigned something from the pool of active projects permitted for your Species or it won't. There are no second choices that can be assigned only when the list of first choices happens to be empty.
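
The rule described above can be sketched roughly as follows. This is only an illustration of a rigid allow-list with no fallback, not the actual assignment server logic. The project names, species names and the `assign` function are all hypothetical.

```python
from typing import List, Optional

# Hypothetical data: each project lists the GPU species allowed to run it.
PERMITTED = {
    "project_a": {"species_fast"},
    "project_b": {"species_fast", "species_slow"},
}


def assign(species: str, active_projects: List[str]) -> Optional[str]:
    """Return the first active project permitted for this species, or None.

    There is deliberately no second-choice list: if nothing in the active
    pool is permitted for the species, the request gets no assignment.
    """
    for project in active_projects:
        if species in PERMITTED.get(project, set()):
            return project
    return None


print(assign("species_slow", ["project_a", "project_b"]))  # permitted project found
print(assign("species_slow", ["project_a"]))               # nothing permitted
```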
dfgirl12
Posts: 38
Joined: Fri Aug 21, 2009 8:34 am

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by dfgirl12 »

GPU WU downloads from 155.247.166.219 are slow and failing again (hanging folding slots). It looks like it's time for a server reboot, since it's at 5 days of uptime.
HaloJones
Posts: 906
Joined: Thu Jul 24, 2008 10:16 am

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by HaloJones »

Sod's Law says it's my fastest cards that get stuck on these servers. Please get it re-started.
single 1070

DocJonz
Posts: 244
Joined: Thu Dec 06, 2007 6:31 pm
Hardware configuration: Folding with: 4x RTX 4070Ti, 1x RTX 4080 Super
Location: United Kingdom

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by DocJonz »

Yep, I'm seeing the same issues (again) across multiple GPU machines related to this server group.
Folding Stats (HFM.NET): DocJonz Folding Farm Stats
prcowley
Posts: 28
Joined: Thu Jan 03, 2019 11:03 pm
Hardware configuration: Op Sys: Linux Ubuntu Studio 24.04 LTS
Kernel: 6.8.0-45-lowlatency (64-bit)
Proc: 16x AMD Ryzen 7 7800X3D 8-Core Processor
Mem: 32 GB
GPU: NVIDIA GeForce RTX 4080 SUPER/PCIe/SSE2
Location: Gisborne, New Zealand

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by prcowley »

I am also having problems with this IP and it causes the download of a new work unit to start slowly and then stall completely.

This only started yesterday for me but twice now I have had to reboot my Ubuntu machine to get it to download the WU.

Code: Select all

07:56:15:Trying to access database...
07:56:15:Successfully acquired database lock
07:56:15:Enabled folding slot 00: READY gpu:0:GP102 [GeForce GTX 1080 Ti] 11380
07:56:15:ERROR:WU00:FS00:Exception: Could not get IP address for assign1.foldingathome.org: Name or service not known
07:56:15:ERROR:WU00:FS00:Exception: Could not get IP address for assign2.foldingathome.org: Name or service not known
07:56:15:WARNING:WU00:FS00:Exception: Failed to find any IP addresses for assignment servers
07:56:15:ERROR:WU00:FS00:Exception: Could not get an assignment
07:56:15:ERROR:WU00:FS00:Exception: Could not get IP address for assign1.foldingathome.org: Name or service not known
07:56:15:ERROR:WU00:FS00:Exception: Could not get IP address for assign2.foldingathome.org: Name or service not known
07:56:15:WARNING:WU00:FS00:Exception: Failed to find any IP addresses for assignment servers
07:56:15:ERROR:WU00:FS00:Exception: Could not get an assignment
07:57:15:WU00:FS00:Connecting to 65.254.110.245:8080
07:57:16:WU00:FS00:Assigned to work server 155.247.166.220
07:57:16:WU00:FS00:Requesting new work unit for slot 00: READY gpu:0:GP102 [GeForce GTX 1080 Ti] 11380 from 155.247.166.220
07:57:16:WU00:FS00:Connecting to 155.247.166.220:8080
07:57:17:WU00:FS00:Downloading 15.85MiB
07:57:23:WU00:FS00:Download 11.83%
07:57:29:WU00:FS00:Download 80.81%
07:57:30:WU00:FS00:Download complete
07:57:30:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14180 run:10 clone:672 gen:103 core:0x21 unit:0x000000a00002894c5d3b55fff338ea10
07:57:30:WU00:FS00:Starting
07:57:30:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_21.fah/FahCore_21 -dir 00 -suffix 01 -version 705 -lifeline 2088 -checkpoint 5 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
07:57:30:WU00:FS00:Started FahCore on PID 3958
07:57:30:WU00:FS00:Core PID:3962
07:57:30:WU00:FS00:FahCore 0x21 started
07:57:31:WU00:FS00:0x21:*********************** Log Started 2019-11-19T07:57:30Z ***********************
07:57:31:WU00:FS00:0x21:Project: 14180 (Run 10, Clone 672, Gen 103)
07:57:31:WU00:FS00:0x21:Unit: 0x000000a00002894c5d3b55fff338ea10
07:57:31:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
07:57:31:WU00:FS00:0x21:Machine: 0
07:57:31:WU00:FS00:0x21:Reading tar file core.xml
07:57:31:WU00:FS00:0x21:Reading tar file integrator.xml
07:57:31:WU00:FS00:0x21:Reading tar file state.xml
07:57:31:WU00:FS00:0x21:Reading tar file system.xml
07:57:31:WU00:FS00:0x21:Digital signatures verified
07:57:31:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
It is strange that a reboot often fixes it for quite some time and then it happens again intermittently.

Very strange indeed.
Cheers
Pete
Pete Cowley, Gisborne, New Zealand. The first city to see the light of the new day. :D
rwh202
Posts: 410
Joined: Mon Nov 15, 2010 8:51 pm
Hardware configuration: 8x GTX 1080
3x GTX 1080 Ti
3x GTX 1060
Various other bits and pieces
Location: South Coast, UK

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Post by rwh202 »

Code: Select all

11:07:02:WU00:FS01:Connecting to 65.254.110.245:8080
11:07:03:WU00:FS01:Assigned to work server 155.247.166.220
11:07:03:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:TU116 [GeForce GTX 1660] from 155.247.166.220
11:07:03:WU00:FS01:Connecting to 155.247.166.220:8080
11:07:03:WU00:FS01:Downloading 6.06MiB
11:07:10:WU00:FS01:Download 3.09%
11:07:17:WU00:FS01:Download 6.18%
11:07:24:WU00:FS01:Download 9.28%
11:07:30:WU00:FS01:Download 11.34%
11:10:05:WU00:FS01:Download 13.40%
Yep, this is getting boring. Pull the server and fix the client so it aborts stalled downloads.
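
Until something like that exists in the client, a stalled download can at least be spotted from the log. Below is a minimal sketch, assuming the `HH:MM:SS:WUxx:FSxx:Download NN.NN%` line format seen in the logs in this thread. It is not FAHClient code. The `find_stalls` name and the 60-second threshold are arbitrary choices for illustration, and it does not handle a log that wraps past midnight.

```python
import re
from datetime import datetime

# Matches progress lines such as "11:07:30:WU00:FS01:Download 11.34%".
PROGRESS = re.compile(r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:Download (\d+(?:\.\d+)?)%")


def find_stalls(lines, max_gap_seconds=60):
    """Return (percent_before, percent_after, gap_seconds) for each long gap."""
    stalls = []
    previous = None  # (timestamp, percent) of the last progress line seen
    for line in lines:
        match = PROGRESS.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        percent = float(match.group(2))
        if previous is not None:
            gap = (stamp - previous[0]).total_seconds()
            if gap > max_gap_seconds:
                stalls.append((previous[1], percent, gap))
        previous = (stamp, percent)
    return stalls


# Sample input copied from the log posted above.
LOG = """\
11:07:03:WU00:FS01:Downloading 6.06MiB
11:07:10:WU00:FS01:Download 3.09%
11:07:17:WU00:FS01:Download 6.18%
11:07:24:WU00:FS01:Download 9.28%
11:07:30:WU00:FS01:Download 11.34%
11:10:05:WU00:FS01:Download 13.40%
""".splitlines()

for before, after, gap in find_stalls(LOG):
    print(f"stalled {gap:.0f}s between {before}% and {after}%")
```

A watchdog built on the same idea could restart the slot or the client once a gap exceeds the threshold, which is roughly what the manual reboots in this thread are doing by hand.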