Re: Can't connect to 18.188.125.154

Posted: Mon Dec 21, 2020 2:02 pm
by STR1D3R_2
There are a lot of us over at evga forums experiencing this as well. We just started our annual Time Zone Challenge on Sunday 12:00 am. :(

Re: Can't connect to 18.188.125.154

Posted: Mon Dec 21, 2020 3:51 pm
by bruce
I think some of you are looking at some problems from the local perspective that really are best seen from the global perspective of one of the FAH servers.

Each work server is independent of the other work servers. Yes, a single server can have a problem ... like maybe its RAID fills up. From your perspective, that may seem like a local problem since the project(s) running on one of your kits can't upload but another can. In fact, the problem can only be fixed by the server's admin offloading some historic data, leaving empty space for your "failed" kit to upload the completed WUs. Given that most Universities are on holiday schedules, that particular server's admin may not be on-line for a time. The temporary fix may be to suspend new assignments going out from one or two projects ... preventing the problem from growing ... but doing nothing for those who have already completed an assigned WU. :( Collection servers do mitigate certain problems, but they also have finite resources and can be subject to failure, too.

Some problems are, in fact, local problems (like the firewall issue or a blocked outgoing connection) and you can fix those, but if it's a server-based problem, the server's admin may be the only person who can fix it.

Re: Can't connect to 18.188.125.154

Posted: Mon Dec 21, 2020 6:30 pm
by foldingfanmucde
Hi again,

@bruce: thanks for the information and view point.
I guess one of the whole folding community's main strengths may also be one of its main weaknesses - largely voluntary, no huge funding like within industry, dependency on individuals and good will / availability .... (don't get me wrong, it's amazing what can be and is being achieved and I'm happy to be a tiny part of it with my hardware and electricity)

Further update from my side.

1. A new WU from the same project that previously displayed the issue (Project 17319) has also led to another unsuccessful "send" task.
Now there are 2 in the work queue in status "send".

2. However, when I look up those WUs in the WU stats page, apparently both those WU have been returned and credited - quite close to the time when the jobs were finished.
There is no indication of this when viewing the local log files.

Very strange.

Code:

18:08:46:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:17319 run:0 clone:379 gen:11 core:0x22 unit:0x0000017b0000000b000043a700000000
18:08:46:WU00:FS01:Uploading 24.36MiB to 140.163.4.200
18:08:46:WU00:FS01:Connecting to 140.163.4.200:8080
18:08:47:ERROR:WU01:FS01:Exception: 10002: Received short response, expected 512 bytes, got 0
18:09:17:WARNING:WU00:FS01:Exception: Failed to send results to work server: 10002: Received short response, expected 512 bytes, got 0
18:09:17:WU00:FS01:Trying to send results to collection server
18:09:17:WU00:FS01:Uploading 24.36MiB to 18.188.125.154
18:09:17:WU00:FS01:Connecting to 18.188.125.154:8080
18:09:35:WU02:FS01:0x22:Completed 12500 out of 1250000 steps (1%)
18:09:47:ERROR:WU00:FS01:Exception: 10002: Received short response, expected 512 bytes, got 0
18:09:47:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:17319 run:0 clone:379 gen:11 core:0x22 unit:0x0000017b0000000b000043a700000000
18:09:47:WU00:FS01:Uploading 24.36MiB to 140.163.4.200
18:09:47:WU00:FS01:Connecting to 140.163.4.200:8080
18:10:18:WARNING:WU00:FS01:Exception: Failed to send results to work server: 10002: Received short response, expected 512 bytes, got 0
18:10:18:WU00:FS01:Trying to send results to collection server
18:10:18:WU00:FS01:Uploading 24.36MiB to 18.188.125.154
18:10:18:WU00:FS01:Connecting to 18.188.125.154:8080
18:10:48:ERROR:WU00:FS01:Exception: 10002: Received short response, expected 512 bytes, got 0
18:11:04:WU02:FS01:0x22:Completed 25000 out of 1250000 steps (2%)
18:11:05:WU02:FS01:0x22:Checkpoint completed at step 25000
18:11:25:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:17319 run:0 clone:379 gen:11 core:0x22 unit:0x0000017b0000000b000043a700000000
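
For anyone who wants to summarise a long log like the one above before posting it, here is a minimal Python sketch that counts "Exception" lines per endpoint. It assumes the line layout seen in this excerpt ("Connecting to IP:PORT" followed later by a WARNING/ERROR "Exception:" line for the same WU); the function name `count_failed_connections` and the attribution rule (blame the last endpoint that WU connected to) are illustrative assumptions, not part of FAHClient.

```python
import re
from collections import Counter

# Line formats are inferred from the log excerpt in this thread (an assumption).
CONNECT = re.compile(r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:Connecting to (\S+?):(\d+)$")
FAILURE = re.compile(r"^\d\d:\d\d:\d\d:(?:WARNING|ERROR):(WU\d+):FS\d+:Exception: ")


def count_failed_connections(lines):
    """Count Exception lines per endpoint.

    Each failure is attributed to the endpoint the same WU most recently
    connected to; failures with no earlier connection count as "unknown".
    """
    last_endpoint = {}
    failures = Counter()
    for raw in lines:
        line = raw.strip()
        m = CONNECT.match(line)
        if m:
            last_endpoint[m.group(1)] = f"{m.group(2)}:{m.group(3)}"
            continue
        m = FAILURE.match(line)
        if m:
            failures[last_endpoint.get(m.group(1), "unknown")] += 1
    return failures


# Four lines copied from the log above.
sample_log = [
    "18:08:46:WU00:FS01:Connecting to 140.163.4.200:8080",
    "18:09:17:WARNING:WU00:FS01:Exception: Failed to send results to work server: "
    "10002: Received short response, expected 512 bytes, got 0",
    "18:09:17:WU00:FS01:Connecting to 18.188.125.154:8080",
    "18:09:47:ERROR:WU00:FS01:Exception: 10002: Received short response, "
    "expected 512 bytes, got 0",
]

summary = count_failed_connections(sample_log)
```

A count like this only shows which endpoint each error followed in the log; it says nothing about whether the server actually received the upload.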

Re: Can't connect to 18.188.125.154

Posted: Mon Dec 21, 2020 9:01 pm
by Knish
foldingfanmucde wrote: 2. However, when I look up those WUs in the WU stats page, apparently both those WU have been returned and credited - quite close to the time when the jobs were finished.
There is no indication of this when viewing the local log files.
Same here: when I looked back, apps/wu showed credit at 0405 Z, so I checked the logs around that time and found:

Code:

03:58:10:WU01:FS01:Connecting to 18.188.125.154:80
03:58:10:WARNING:WU01:FS01:Exception: Failed to send results to work server: Failed to connect to 18.188.125.154:80: Connection refused
03:58:10:WU01:FS01:Trying to send results to collection server
03:58:10:WU01:FS01:Uploading 8.50MiB to 150.136.14.110
03:58:10:WU01:FS01:Connecting to 150.136.14.110:8080
03:58:21:WU00:FS00:0xa8:Completed 405000 out of 500000 steps (81%)
03:58:41:WARNING:WU01:FS01:WorkServer connection failed on port 8080 trying 80
03:58:41:WU01:FS01:Connecting to 150.136.14.110:80
03:58:57:WU01:FS01:Upload 0.74%
03:59:05:WU00:FS00:0xa8:Completed 410000 out of 500000 steps (82%)
03:59:34:WU01:FS01:Upload 16.18%
That's the last entry from WU01; after that there was nothing but 0xa8 entries until I rebooted.

Re: Can't connect to 18.188.125.154

Posted: Mon Dec 21, 2020 9:08 pm
by foldingfanmucde
@knish: unfortunately rebooting has not changed the situation for me. Those "send" jobs just come back and stay there persistently.
I wonder if they will give up and disappear when the credit has counted down to zero...


... I'm also wondering whether this behaviour might somehow be related to F@HClient V7.6.21, which I have only been using for a couple of days on this one system.
This new version does not allow for changing or self configuring the IDs / addresses of the GPU within the configuration, as is the case for example with 7.6.13 (my other system).
That sometimes proved useful for working around some strange issues, especially when using multiple GPUs in one system.
But it's just a feeling I have - no hard evidence (yet).

Re: Can't connect to 18.188.125.154

Posted: Tue Dec 22, 2020 8:40 am
by PantherX
foldingfanmucde wrote:...[Just a hypothesis, but could it be, that larger result files (24MB) are an issue and smaller ones (7,7MB) are not?
(My internet connection is 250 Mbit/s downlink and 50 Mbit/s uplink so that is not likely an issue)]
=> Edit: I've just seen another folder (rexts217) has reported the same issue with file sizes ~8,5MB in another parallel thread, so it seems my hypothesis is invalid.
Ref: viewtopic.php?f=18&t=36591...
It seems that you might be having the same "issue" as rexts217 where the router/firewall drops the last ACK packet. While I was able to confirm for rexts217, since you haven't posted what donor name you are currently folding under, I assume that these would be yours:
project:17319 run:0 clone:107 gen:20 https://apps.foldingathome.org/wu#proje ... 107&gen=20
project:17319 run:0 clone:379 gen:11 https://apps.foldingathome.org/wu#proje ... 379&gen=11
foldingfanmucde wrote:...2. However, when I look up those WUs in the WU stats page, apparently both those WU have been returned and credited - quite close to the time when the jobs were finished.
There is no indication of this when viewing the local log files...
Is that the log file from as soon as the WU was completed? If you have fast upload speed, you may not get any upload percentages.
foldingfanmucde wrote:...I wonder if they will give up and disappear when the credit has counted down to zero...
The client will delete completed WUs under either of these conditions:
1. It was successfully uploaded to the Server
2. The final deadline was reached

Assuming that those WUs were returned by you, you could delete the correct folder locally and prevent the unwanted uploads.
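
To work out which units are stuck before deleting anything, a small Python sketch like the following can list the units the client keeps trying to send, keyed by queue id. It assumes the "Sending unit results" line format shown in the logs in this thread; `pending_sends` is a hypothetical helper name. It only reads log lines and deletes nothing; checking each unit on the WU status page and removing the folder remain manual steps. The queue id (00, 01, ...) appears to match the work subfolder, judging by the `-dir 01` argument in the FahCore command line posted later in this thread, but treat that mapping as an assumption and verify it locally.

```python
import re

# Format inferred from "Sending unit results" lines in this thread (an assumption).
SEND = re.compile(
    r"Sending unit results: id:(\d+) state:SEND error:(\w+) "
    r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)"
)


def pending_sends(lines):
    """Return {queue id: (project, run, clone, gen)} for units in state SEND."""
    pending = {}
    for line in lines:
        m = SEND.search(line)
        if m:
            # Groups 3..6 hold project, run, clone and gen.
            pending[m.group(1)] = tuple(int(g) for g in m.group(3, 4, 5, 6))
    return pending


# One line copied from foldingfanmucde's log earlier in the thread.
example = [
    "18:08:46:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR "
    "project:17319 run:0 clone:379 gen:11 core:0x22 "
    "unit:0x0000017b0000000b000043a700000000",
]

stuck = pending_sends(example)
```

The resulting project/run/clone/gen values are the ones to look up on the WU status page before deciding whether a folder is safe to remove.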

Re: Can't connect to 18.188.125.154

Posted: Tue Dec 22, 2020 10:48 am
by foldingfanmucde
@PantherX
Assuming that those WUs were returned by you, you could delete the correct folder locally and prevent the unwanted uploads
Thanks for your analysis and the information you provided. Yes, those WUs are the ones.
At the moment there are three in status "send", increasing by about one or two per day. But since the timeout is only a few days, the list will not get too long.
Up until now, for every single "send" job I was able to verify that the WU was returned successfully, by using the WU status checker.

One question: Can the WU folder be deleted whilst actively folding, or should the client be stopped first?

Many thanks!

Re: Can't connect to 18.188.125.154

Posted: Thu Dec 24, 2020 8:04 am
by PantherX
foldingfanmucde wrote:...One question: Can the WU folder be deleted whilst actively folding, or should the client be stopped first?...
You can delete the corresponding folder without the need to stop the client. The error messages that would be printed in the log file can be ignored. Here's an example where I paused the slot, deleted the folder, and unpaused the slot:

Code:

06:13:58:WU01:FS01:0x22:Completed 2350000 out of 5000000 steps (47%)
06:14:01:FS01:Paused
06:14:01:FS01:Shutting core down
06:14:01:WU01:FS01:0x22:WARNING:Console control signal 1 on PID 12576
06:14:01:WU01:FS01:0x22:Exiting, please wait. . .
06:14:01:WU01:FS01:0x22:Folding@home Core Shutdown: INTERRUPTED
06:14:02:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
...
06:53:20:FS01:Unpaused
06:53:20:WU01:FS01:Starting
06:53:20:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\PantherX-H\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 706 -lifeline 11896 -checkpoint 30 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
06:53:20:WU01:FS01:Started FahCore on PID 13612
06:53:20:WU01:FS01:Core PID:17900
06:53:20:WU01:FS01:FahCore 0x22 started
06:53:21:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
06:53:21:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:.... run:1 clone:7 gen:0 core:0x22 unit:0x00000007000000000000347a00000001
06:53:21:WU01:FS01:Uploading 5.00KiB to 18.188.125.154
06:53:21:WU01:FS01:Connecting to 18.188.125.154:8080
06:53:21:WU01:FS01:Upload complete
06:53:21:WU00:FS01:Connecting to assign1.foldingathome.org:80
06:53:22:WU01:FS01:Server responded WORK_QUIT (404)
06:53:22:WARNING:WU01:FS01:Server did not like results, dumping
06:53:22:WU01:FS01:Cleaning up
06:53:23:WU00:FS01:Assigned to work server 206.223.170.146
06:53:23:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:GP102 [GeForce GTX 1080 Ti] 11380 from 206.223.170.146
06:53:23:WU00:FS01:Connecting to 206.223.170.146:8080
06:53:24:WU00:FS01:Downloading 13.74MiB
06:53:30:WU00:FS01:Download 30.01%
06:53:36:WU00:FS01:Download 75.03%
06:53:39:WU00:FS01:Download complete
06:53:39:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:.... run:0 clone:0 gen:2 core:0x22 unit:0x00000000000000020000441700000000
06:53:39:WU00:FS01:Starting
06:53:39:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\PantherX-H\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 11896 -checkpoint 30 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
06:53:39:WU00:FS01:Started FahCore on PID 12376
06:53:39:WU00:FS01:Core PID:19176
06:53:39:WU00:FS01:FahCore 0x22 started
06:53:39:WU00:FS01:0x22:*********************** Log Started 2020-12-24T06:53:39Z ***********************
06:53:39:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
06:53:39:WU00:FS01:0x22:       Core: Core22
06:53:39:WU00:FS01:0x22:       Type: 0x22
06:53:39:WU00:FS01:0x22:    Version: 0.0.13
06:53:39:WU00:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
06:53:39:WU00:FS01:0x22:  Copyright: 2020 foldingathome.org
06:53:39:WU00:FS01:0x22:   Homepage: https://foldingathome.org/
06:53:39:WU00:FS01:0x22:       Date: Sep 19 2020
06:53:39:WU00:FS01:0x22:       Time: 02:35:58
06:53:39:WU00:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
06:53:39:WU00:FS01:0x22:     Branch: core22-0.0.13
06:53:39:WU00:FS01:0x22:   Compiler: Visual C++ 2015
06:53:39:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
06:53:39:WU00:FS01:0x22:             -DOPENMM_GIT_HASH="\"189320d0\""
06:53:39:WU00:FS01:0x22:   Platform: win32 10
06:53:39:WU00:FS01:0x22:       Bits: 64
06:53:39:WU00:FS01:0x22:       Mode: Release
06:53:39:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
06:53:39:WU00:FS01:0x22:             <peastman@stanford.edu>
06:53:39:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 12376 -checkpoint 30
06:53:39:WU00:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
06:53:39:WU00:FS01:0x22:             0 -gpu 0
06:53:39:WU00:FS01:0x22:************************************ libFAH ************************************
06:53:39:WU00:FS01:0x22:       Date: Sep 7 2020
06:53:39:WU00:FS01:0x22:       Time: 19:09:56
06:53:39:WU00:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
06:53:39:WU00:FS01:0x22:     Branch: HEAD
06:53:39:WU00:FS01:0x22:   Compiler: Visual C++ 2015
06:53:39:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
06:53:39:WU00:FS01:0x22:   Platform: win32 10
06:53:39:WU00:FS01:0x22:       Bits: 64
06:53:39:WU00:FS01:0x22:       Mode: Release
06:53:39:WU00:FS01:0x22:************************************ CBang *************************************
06:53:39:WU00:FS01:0x22:       Date: Sep 7 2020
06:53:39:WU00:FS01:0x22:       Time: 19:08:30
06:53:39:WU00:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
06:53:39:WU00:FS01:0x22:     Branch: HEAD
06:53:39:WU00:FS01:0x22:   Compiler: Visual C++ 2015
06:53:39:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
06:53:39:WU00:FS01:0x22:   Platform: win32 10
06:53:39:WU00:FS01:0x22:       Bits: 64
06:53:39:WU00:FS01:0x22:       Mode: Release
06:53:39:WU00:FS01:0x22:************************************ System ************************************
06:53:39:WU00:FS01:0x22:        CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
06:53:39:WU00:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
06:53:39:WU00:FS01:0x22:       CPUs: 8
06:53:39:WU00:FS01:0x22:     Memory: 31.94GiB
06:53:39:WU00:FS01:0x22:Free Memory: 17.53GiB
06:53:39:WU00:FS01:0x22:    Threads: WINDOWS_THREADS
06:53:39:WU00:FS01:0x22: OS Version: 6.2
06:53:39:WU00:FS01:0x22:Has Battery: false
06:53:39:WU00:FS01:0x22: On Battery: false
06:53:39:WU00:FS01:0x22: UTC Offset: 13
06:53:39:WU00:FS01:0x22:        PID: 19176
06:53:39:WU00:FS01:0x22:        CWD: C:\Users\PantherX-H\AppData\Roaming\FAHClient\work
06:53:39:WU00:FS01:0x22:************************************ OpenMM ************************************
06:53:39:WU00:FS01:0x22:   Revision: 189320d0
06:53:39:WU00:FS01:0x22:********************************************************************************
06:53:39:WU00:FS01:0x22:Project: .... (Run 0, Clone 0, Gen 2)
06:53:39:WU00:FS01:0x22:Unit: 0x00000000000000000000000000000000
06:53:39:WU00:FS01:0x22:Reading tar file core.xml
06:53:39:WU00:FS01:0x22:Reading tar file integrator.xml.bz2
06:53:39:WU00:FS01:0x22:Reading tar file state.xml.bz2
06:53:39:WU00:FS01:0x22:Reading tar file system.xml.bz2
06:53:39:WU00:FS01:0x22:Digital signatures verified
06:53:39:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
06:53:39:WU00:FS01:0x22:Version 0.0.13
06:53:39:WU00:FS01:0x22:  Checkpoint write interval: 25000 steps (2%) [50 total]
06:53:39:WU00:FS01:0x22:  JSON viewer frame write interval: 12500 steps (1%) [100 total]
06:53:39:WU00:FS01:0x22:  XTC frame write interval: 10000 steps (0.8%) [125 total]
06:53:39:WU00:FS01:0x22:  Global context and integrator variables write interval: disabled
06:53:39:WU00:FS01:0x22:There are 4 platforms available.
06:53:39:WU00:FS01:0x22:Platform 0: Reference
06:53:39:WU00:FS01:0x22:Platform 1: CPU
06:53:39:WU00:FS01:0x22:Platform 2: OpenCL
06:53:39:WU00:FS01:0x22:  opencl-device 0 specified
06:53:39:WU00:FS01:0x22:Platform 3: CUDA
06:53:39:WU00:FS01:0x22:  cuda-device 0 specified
...
07:00:10:WU00:FS01:0x22:Attempting to create CUDA context:
07:00:10:WU00:FS01:0x22:  Configuring platform CUDA
07:01:04:WU00:FS01:0x22:  Using CUDA and gpu 0
07:01:04:WU00:FS01:0x22:Completed 0 out of 1250000 steps (0%)

Re: Can't connect to 18.188.125.154

Posted: Thu Dec 24, 2020 4:18 pm
by foldingfanmucde
@PantherX
Thank you for the instructions!
Greetings to you in NZ from an ex-pat Aussie in DE. ;-)
Seasons greetings and all the best for 2021.
Stay safe.

Re: Can't connect to 18.188.125.154

Posted: Thu Dec 24, 2020 10:10 pm
by bruce
I can think of two more possibilities which somebody needs to check out.

1) Is this somehow related to an IPv6 change? FAH has been pretty closely tied to IPv4 and I don't see any suggestions associated with a migration any time soon.
2) Is this somehow related to an http: vs https: issue?

Either might be causing the problem, and both fall just outside the visible issues that we typically look for.

Re: Can't connect to 18.188.125.154

Posted: Fri Dec 25, 2020 3:52 am
by JohnChodera
Apologies for all the difficulties here! Our lead developer was working on resolving some weird behavior with the server code on aws3.foldingathome.org, and appears to have solved these issues on Dec 21. aws3 now has a deployed version of the new debug code with the fix.

Please do let us know if you continue to experience difficulties!

~ John Chodera // MSKCC

Re: Can't connect to 18.188.125.154

Posted: Sat Dec 26, 2020 6:31 pm
by Knish
looks like someone on reddit with a 3090 is still having issues

Re: Can't connect to 18.188.125.154

Posted: Sun Dec 27, 2020 12:19 am
by bruce
Knish wrote:looks like someone on reddit with a 3090 is still having issues
Tell him to give us a complete report of the problem.

According to serverstats, the server has no jobs to distribute and it happily opens its landing page for me. Presumably the person(s) with problems is trying to use it as a Collection Server, but a CS is only used as a secondary connection to the primary Work Server. We might be able to help if we knew which WS is failing.

We can't read their mind or help with an incomplete report of a problem. Solving undefined problems with 18.188.125.154, which doesn't happen to be a critical resource, can be very frustrating for us.

Re: Can't connect to 18.188.125.154

Posted: Sun Dec 27, 2020 11:25 am
by Synt0xx
Hello,

this is my first post in this forum!

I'm the guy with the 3090 you are talking about. My issue is that I see a lot of packet loss to this server and to several of the others.

This is from a test yesterday in which I pinged your WS to check for packet loss, with the results below; the other WS had no packet loss:

--- 129.213.40.229 ping statistics ---

100 packets transmitted, 0 received, 100% packet loss, time 98899ms


--- 129.32.209.207 ping statistics ---

100 packets transmitted, 99 received, 1% packet loss, time 99105ms

rtt min/avg/max/mdev = 114.448/119.532/206.739/15.322 ms


--- 150.136.14.110 ping statistics ---

100 packets transmitted, 0 received, 100% packet loss, time 99007ms


--- 128.174.73.74 ping statistics ---

100 packets transmitted, 0 received, 100% packet loss, time 99083ms


--- 140.163.4.210 ping statistics ---

100 packets transmitted, 0 received, 100% packet loss, time 99005ms


This server in particular causes a lot of trouble; I lose a lot of packets to it:

--- 18.188.125.154 ping statistics ---

100 packets transmitted, 88 received, 12% packet loss, time 99088ms

rtt min/avg/max/mdev = 121.814/122.896/135.077/2.106 ms


--- 140.163.4.200 ping statistics ---

100 packets transmitted, 0 received, 100% packet loss, time 99015ms

Today this happens:

11:22:49:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:17316 run:0 clone:2592 gen:40 core:0x22 unit:0x00000a2000000028000043a400000000
11:22:49:WU00:FS01:Uploading 18.97MiB to 140.163.4.200
11:22:49:WU00:FS01:Connecting to 140.163.4.200:8080
11:23:20:WARNING:WU00:FS01:Exception: Failed to send results to work server: 10002: Received short response, expected 512 bytes, got 0
11:23:20:WU00:FS01:Trying to send results to collection server
11:23:20:WU00:FS01:Uploading 18.97MiB to 140.163.4.210
11:23:20:WU00:FS01:Connecting to 140.163.4.210:8080

and the packet loss here is 100% on both CS and WS:

syntox@Elaine:/mnt/c/Users/Fabia$ ping -c 100 140.163.4.200
PING 140.163.4.200 (140.163.4.200) 56(84) bytes of data.
^C
--- 140.163.4.200 ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11007ms

syntox@Elaine:/mnt/c/Users/Fabia$ ping -c 100 140.163.4.210
PING 140.163.4.210 (140.163.4.210) 56(84) bytes of data.
^C
--- 140.163.4.210 ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 8086ms
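
A note on reading these numbers: many hosts drop ICMP echo requests, so 100% ping loss does not by itself show that the upload port is unreachable. A rough cross-check is to try opening a TCP connection to the same port the client uses (8080 in the log above). The Python sketch below does only that; `tcp_reachable` is a hypothetical helper name, and a successful connect only shows the port accepts connections, not that an upload would complete or be acknowledged.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example call (not run here): tcp_reachable("140.163.4.200", 8080)
```

If this returns True while ping still shows 100% loss, the ping result most likely reflects ICMP filtering rather than a dead route.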

Re: Can't connect to 18.188.125.154

Posted: Sun Dec 27, 2020 11:33 am
by Synt0xx
I have had exactly the same issues for a week now! Every upload gets flagged as failed and packet loss to the servers is high. Eventually the WU gets uploaded successfully, but I never get the WORK_ACK... so FAHClient resends endlessly unless I delete the WUs (after checking that they were received).