3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 10:21 am
by Hopfgeist
I am having big troubles with 3.21.157.11.

One client is not getting any work units from it, despite initially connecting. After establishing the connection it sits there literally for hours and nothing happens.

Code:

*********************** Log Started 2020-08-11T08:38:46Z ***********************
08:38:46:Trying to access database...
08:38:46:Successfully acquired database lock
08:38:46:Read GPUs.txt
08:38:46:Enabled folding slot 00: READY cpu:3
08:38:46:****************************** FAHClient ******************************
08:38:46:    Version: 7.6.13
08:38:46:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:38:46:  Copyright: 2020 foldingathome.org
08:38:46:   Homepage: https://foldingathome.org/
08:38:46:       Date: Apr 27 2020
08:38:46:       Time: 21:20:45
08:38:46:   Revision: 5a652817f46116b6e135503af97f18e094414e3b
08:38:46:     Branch: master
08:38:46:   Compiler: GNU 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)
08:38:46:    Options: -std=c++11 -O3 -funroll-loops -mmacosx-version-min=10.7
08:38:46:             -Wno-unused-local-typedefs -stdlib=libc++
08:38:46:   Platform: darwin 19.2.0
08:38:46:       Bits: 64
08:38:46:       Mode: Release
08:38:46:       Args: --user=XXXXX --team=XXXXX
08:38:46:             --passkey=******************************** --gpu=false --smp=true
08:38:46:             --cpus=3 --chdir /Users/bernd/FAH --log-color=false --password
08:38:46:             ******** --pause-on-start=false
08:38:46:     Config: /Users/bernd/FAH/config.xml
08:38:46:******************************** CBang ********************************
08:38:46:       Date: Apr 24 2020
08:38:46:       Time: 17:07:50
08:38:46:   Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
08:38:46:     Branch: master
08:38:46:   Compiler: GNU 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)
08:38:46:    Options: -std=c++11 -O3 -funroll-loops -mmacosx-version-min=10.7
08:38:46:             -Wno-unused-local-typedefs -stdlib=libc++ -fPIC
08:38:46:   Platform: darwin 19.2.0
08:38:46:       Bits: 64
08:38:46:       Mode: Release
08:38:46:******************************* System ********************************
08:38:46:        CPU: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
08:38:46:     CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
08:38:46:       CPUs: 4
08:38:46:     Memory: 8.00GiB
08:38:46:Free Memory: 18.86MiB
08:38:46:    Threads: POSIX_THREADS
08:38:46: OS Version: 10.13
08:38:46:Has Battery: false
08:38:46: On Battery: false
08:38:46: UTC Offset: 2
08:38:46:        PID: 18009
08:38:46:        CWD: /Users/bernd
08:38:46:         OS: Darwin 17.7.0 x86_64
08:38:46:    OS Arch: AMD64
08:38:46:       GPUs: 0
08:38:46:       CUDA: Not detected: Failed to open dynamic library 'libcuda.dylib':
08:38:46:             dlopen(libcuda.dylib, 1): image not found
08:38:46:     OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.dylib':
08:38:46:             dlopen(libOpenCL.dylib, 1): image not found
08:38:46:******************************* libFAH ********************************
08:38:46:       Date: Apr 15 2020
08:38:46:       Time: 14:43:28
08:38:46:   Revision: 216968bc7025029c841ed6e36e81a03a316890d3
08:38:46:     Branch: master
08:38:46:   Compiler: GNU 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)
08:38:46:    Options: -std=c++11 -O3 -funroll-loops -mmacosx-version-min=10.7
08:38:46:             -Wno-unused-local-typedefs -stdlib=libc++
08:38:46:   Platform: darwin 19.2.0
08:38:46:       Bits: 64
08:38:46:       Mode: Release
08:38:46:***********************************************************************
08:38:46:<config>
08:38:46:  <!-- Network -->
08:38:46:  <proxy v=':8080'/>
08:38:46:
08:38:46:  <!-- Work Unit Control -->
08:38:46:  <next-unit-percentage v='100'/>
08:38:46:
08:38:46:  <!-- Folding Slots -->
08:38:46:  <slot id='0' type='CPU'/>
08:38:46:</config>
08:38:46:WU00:FS00:Connecting to assign1.foldingathome.org:80
08:38:47:WU00:FS00:Assigned to work server 69.94.66.7
08:38:47:WU00:FS00:Requesting new work unit for slot 00: READY cpu:3 from 69.94.66.7
08:38:47:WU00:FS00:Connecting to 69.94.66.7:8080
08:38:47:ERROR:WU00:FS00:Exception: Server did not assign work unit
08:38:48:WU00:FS00:Connecting to assign1.foldingathome.org:80
08:38:48:WU00:FS00:Assigned to work server 3.21.157.11
08:38:48:WU00:FS00:Requesting new work unit for slot 00: READY cpu:3 from 3.21.157.11
08:38:48:WU00:FS00:Connecting to 3.21.157.11:8080
08:39:08:ERROR:WU00:FS00:Exception: 10002: Received short response, expected 512 bytes, got 0
08:39:48:WU00:FS00:Connecting to assign1.foldingathome.org:80
08:39:48:WARNING:WU00:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
08:39:48:WU00:FS00:Connecting to assign2.foldingathome.org:80
08:39:49:WU00:FS00:Assigned to work server 3.21.157.11
08:39:49:WU00:FS00:Requesting new work unit for slot 00: READY cpu:3 from 3.21.157.11
08:39:49:WU00:FS00:Connecting to 3.21.157.11:8080
The log file was grabbed at 10:15Z, so it had been sitting in this state for over an hour and a half.
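
As a rough check of that figure, here is a minimal sketch that subtracts the timestamp of the last log entry (08:39:49) from the time the log was copied. The post only says "10:15Z", so the seconds value of :00 is an assumption, and `stall_duration` is just an illustrative helper name.

```python
from datetime import datetime, timedelta


def stall_duration(last_log_time: str, grabbed_at: str) -> timedelta:
    """Return the time between the last log entry and the moment the log was copied.

    Both arguments are HH:MM:SS strings in UTC on the same day.
    """
    fmt = "%H:%M:%S"
    return datetime.strptime(grabbed_at, fmt) - datetime.strptime(last_log_time, fmt)


# 08:39:49 is the last "Connecting to 3.21.157.11:8080" entry in the log above.
# 10:15:00 is an assumption: the post gives the time only to the minute.
stalled = stall_duration("08:39:49", "10:15:00")
print(stalled)  # 1:35:11
```

So the connection had been idle for roughly 95 minutes when the log was copied.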


Another client is unable to upload a finished unit to 3.21.157.11:

Code:

*********************** Log Started 2020-08-11T06:22:00Z ***********************
06:22:00:Trying to access database...
06:22:00:Successfully acquired database lock
06:22:00:Read GPUs.txt
06:22:00:WARNING:Exception: Failed to open '/proc/bus/pci/devices': Failed to open '/proc/bus/pci/devices': No such file or directory: iostream error: No such file or directory
06:22:00:Enabled folding slot 00: READY cpu:24
06:22:00:****************************** FAHClient ******************************
06:22:00:    Version: 7.6.13
06:22:00:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
06:22:00:  Copyright: 2020 foldingathome.org
06:22:00:   Homepage: https://foldingathome.org/
06:22:00:       Date: Apr 28 2020
06:22:00:       Time: 04:20:16
06:22:00:   Revision: 5a652817f46116b6e135503af97f18e094414e3b
06:22:00:     Branch: master
06:22:00:   Compiler: GNU 8.3.0
06:22:00:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
06:22:00:             -fno-pie
06:22:00:   Platform: linux2 4.19.0-5-amd64
06:22:00:       Bits: 64
06:22:00:       Mode: Release
06:22:00:       Args: --user=XXXXX --team=XXXXX
06:22:00:             --passkey=******************************** --gpu=false --smp=true
06:22:00:             --cpus=24 --log-color=false --allow=127.0.0.1 192.168.1.0/24
06:22:00:             --web-allow=127.0.0.1 192.168.1.0/24 --chdir /home/bernd/FAH
06:22:00:     Config: /home/bernd/FAH/config.xml
06:22:00:******************************** CBang ********************************
06:22:00:       Date: Apr 25 2020
06:22:00:       Time: 00:07:53
06:22:00:   Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
06:22:00:     Branch: master
06:22:00:   Compiler: GNU 8.3.0
06:22:00:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
06:22:00:             -fno-pie -fPIC
06:22:00:   Platform: linux2 4.19.0-5-amd64
06:22:00:       Bits: 64
06:22:00:       Mode: Release
06:22:00:******************************* System ********************************
06:22:00:        CPU: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
06:22:00:     CPU ID: GenuineIntel Family 6 Model 44 Stepping 2
06:22:00:       CPUs: 24
06:22:00:     Memory: 39.99GiB
06:22:00:Free Memory: 2.22GiB
06:22:00:    Threads: POSIX_THREADS
06:22:00: OS Version: 3.11
06:22:00:Has Battery: false
06:22:00: On Battery: false
06:22:00: UTC Offset: 2
06:22:00:        PID: 26760
06:22:00:        CWD: /home/bernd
06:22:00:         OS: Linux 3.11.6 x86_64
06:22:00:    OS Arch: AMD64
06:22:00:       GPUs: 0
06:22:00:       CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
06:22:00:             libcuda.so: cannot open shared object file: No such file or
06:22:00:             directory
06:22:00:     OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
06:22:00:             libOpenCL.so: cannot open shared object file: No such file or
06:22:00:             directory
06:22:00:******************************* libFAH ********************************
06:22:00:       Date: Apr 15 2020
06:22:00:       Time: 21:43:24
06:22:00:   Revision: 216968bc7025029c841ed6e36e81a03a316890d3
06:22:00:     Branch: master
06:22:00:   Compiler: GNU 8.3.0
06:22:00:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
06:22:00:             -fno-pie
06:22:00:   Platform: linux2 4.19.0-5-amd64
06:22:00:       Bits: 64
06:22:00:       Mode: Release
06:22:00:***********************************************************************
06:22:00:<config>
06:22:00:  <!-- Network -->
06:22:00:  <proxy v=':8080'/>
06:22:00:
06:22:00:  <!-- Slot Control -->
06:22:00:  <power v='full'/>
06:22:00:
06:22:00:  <!-- Folding Slots -->
06:22:00:  <slot id='0' type='CPU'/>
06:22:00:</config>
06:22:00:WU00:FS00:Starting
06:22:00:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /home/USERNAME/FAH/cores/cores.foldingathome.org/lin/64bit-sse2/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 26760 -checkpoint 15 -np 24
06:22:00:WU00:FS00:Started FahCore on PID 9768
06:22:00:WU00:FS00:Core PID:7241
06:22:00:WU00:FS00:FahCore 0xa7 started
06:22:01:WU00:FS00:0xa7:*********************** Log Started 2020-08-11T06:22:00Z ***********************
06:22:01:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
06:22:01:WU00:FS00:0xa7:       Type: 0xa7
06:22:01:WU00:FS00:0xa7:       Core: Gromacs
06:22:01:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 9768 -checkpoint 15 -np
06:22:01:WU00:FS00:0xa7:             24
06:22:01:WU00:FS00:0xa7:************************************ CBang *************************************
06:22:01:WU00:FS00:0xa7:       Date: Nov 27 2019
06:22:01:WU00:FS00:0xa7:       Time: 11:26:54
06:22:01:WU00:FS00:0xa7:   Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
06:22:01:WU00:FS00:0xa7:     Branch: master
06:22:01:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
06:22:01:WU00:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
06:22:01:WU00:FS00:0xa7:             -fno-pie -fPIC
06:22:01:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
06:22:01:WU00:FS00:0xa7:       Bits: 64
06:22:01:WU00:FS00:0xa7:       Mode: Release
06:22:01:WU00:FS00:0xa7:************************************ System ************************************
06:22:01:WU00:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
06:22:01:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 44 Stepping 2
06:22:01:WU00:FS00:0xa7:       CPUs: 24
06:22:01:WU00:FS00:0xa7:     Memory: 39.99GiB
06:22:01:WU00:FS00:0xa7:Free Memory: 2.22GiB
06:22:01:WU00:FS00:0xa7:    Threads: POSIX_THREADS
06:22:01:WU00:FS00:0xa7: OS Version: 3.11
06:22:01:WU00:FS00:0xa7:Has Battery: false
06:22:01:WU00:FS00:0xa7: On Battery: false
06:22:01:WU00:FS00:0xa7: UTC Offset: 2
06:22:01:WU00:FS00:0xa7:        PID: 7241
06:22:01:WU00:FS00:0xa7:        CWD: /home/bernd/FAH/work
06:22:01:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
06:22:01:WU00:FS00:0xa7:    Version: 0.0.19
06:22:01:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
06:22:01:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
06:22:01:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
06:22:01:WU00:FS00:0xa7:       Date: Nov 26 2019
06:22:01:WU00:FS00:0xa7:       Time: 00:41:43
06:22:01:WU00:FS00:0xa7:   Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
06:22:01:WU00:FS00:0xa7:     Branch: master
06:22:01:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
06:22:01:WU00:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
06:22:01:WU00:FS00:0xa7:             -fno-pie
06:22:01:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
06:22:01:WU00:FS00:0xa7:       Bits: 64
06:22:01:WU00:FS00:0xa7:       Mode: Release
06:22:01:WU00:FS00:0xa7:************************************ Build *************************************
06:22:01:WU00:FS00:0xa7:       SIMD: sse2
06:22:01:WU00:FS00:0xa7:********************************************************************************
06:22:01:WU00:FS00:0xa7:Project: 14703 (Run 213, Clone 0, Gen 122)
06:22:01:WU00:FS00:0xa7:Unit: 0x0000008503159d0b5eb159232f247f05
06:22:01:WU00:FS00:0xa7:Digital signatures verified
06:22:01:WU00:FS00:0xa7:Calling: mdrun -s frame122.tpr -o frame122.trr -cpi state.cpt -cpt 15 -nt 24
06:22:01:WU00:FS00:0xa7:Steps: first=0 total=250000
06:22:05:WU00:FS00:0xa7:Completed 10552 out of 250000 steps (4%)
[...]
07:41:23:WU00:FS00:0xa7:Completed 250000 out of 250000 steps (100%)
07:41:25:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
07:41:25:WU00:FS00:0xa7:Saving result file dhdl.xvg
07:41:25:WU00:FS00:0xa7:Saving result file frame122.trr
07:41:25:WU00:FS00:0xa7:Saving result file md.log
07:41:25:WU00:FS00:0xa7:Saving result file pullf.xvg
07:41:25:WU00:FS00:0xa7:Saving result file pullx.xvg
07:41:25:WU00:FS00:0xa7:Saving result file science.log
07:41:25:WU00:FS00:0xa7:Saving result file traj_comp.xtc
07:41:25:WU00:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
07:41:25:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
07:41:26:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
07:41:26:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
07:41:26:WU00:FS00:Connecting to 3.21.157.11:8080
07:42:41:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:42:41:WU00:FS00:Connecting to 3.21.157.11:80
07:42:59:WU00:FS00:Upload 0.92%
07:42:59:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
07:42:59:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
07:42:59:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
07:42:59:WU00:FS00:Connecting to 3.21.157.11:8080
07:44:14:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:44:14:WU00:FS00:Connecting to 3.21.157.11:80
07:46:49:WU00:FS00:Upload 0.92%
07:46:49:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
07:46:49:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
07:46:49:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
07:46:49:WU00:FS00:Connecting to 3.21.157.11:8080
07:48:04:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:48:04:WU00:FS00:Connecting to 3.21.157.11:80
07:49:19:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
07:49:19:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
07:49:19:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
07:49:19:WU00:FS00:Connecting to 3.21.157.11:8080
08:00:26:WU00:FS00:Upload 0.92%
08:00:26:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
08:00:26:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
08:00:26:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
08:00:26:WU00:FS00:Connecting to 3.21.157.11:8080
08:01:41:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
08:01:41:WU00:FS00:Connecting to 3.21.157.11:80
08:02:56:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
08:04:40:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
08:04:40:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
08:04:40:WU00:FS00:Connecting to 3.21.157.11:8080
08:05:55:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
08:05:55:WU00:FS00:Connecting to 3.21.157.11:80
08:07:11:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
08:11:32:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
08:11:32:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
08:11:32:WU00:FS00:Connecting to 3.21.157.11:8080
08:12:47:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
08:12:47:WU00:FS00:Connecting to 3.21.157.11:80
08:14:02:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
08:22:37:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
08:22:37:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
08:22:37:WU00:FS00:Connecting to 3.21.157.11:8080
08:22:56:WU00:FS00:Upload 0.92%
08:22:56:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
08:40:34:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
08:40:34:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
08:40:34:WU00:FS00:Connecting to 3.21.157.11:8080
08:41:49:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
08:41:49:WU00:FS00:Connecting to 3.21.157.11:80
08:43:04:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
09:09:36:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
09:09:36:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
09:09:36:WU00:FS00:Connecting to 3.21.157.11:8080
09:10:51:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
09:10:51:WU00:FS00:Connecting to 3.21.157.11:80
09:12:06:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
09:56:35:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14703 run:213 clone:0 gen:122 core:0xa7 unit:0x0000008503159d0b5eb159232f247f05
09:56:35:WU00:FS00:Uploading 6.82MiB to 3.21.157.11
09:56:35:WU00:FS00:Connecting to 3.21.157.11:8080
09:57:50:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
09:57:50:WU00:FS00:Connecting to 3.21.157.11:80
09:59:05:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
It looks like the server is overloaded, or doesn't have enough bandwidth. Connecting to it with a web browser sometimes gives me the "Work Server Version something" page, but only after a very long wait, and sometimes the connection times out.

Is there anything else I should try, or is this a problem with the server that is known already?


Cheers,
HG

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 10:28 am
by Neil-B
Encountered what might be a similar issue in the past with this server: viewtopic.php?f=18&t=35201 ... Use whatever the Linux equivalent of TCPView is to see if there is a hanging established connection, and if so kill it ... iirc the Linux version of TCPView doesn't help you kill connections.
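
One way to look for such a hanging connection without a GUI tool is to filter the output of `netstat -tn`. The sketch below is only an illustration: it assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), the helper name `established_to` is made up, and the sample text is invented rather than taken from a real machine. Other platforms may format addresses differently. It only finds the connection; it does not kill it.

```python
def established_to(netstat_output: str, remote_ip: str):
    """Return (local, remote) address pairs for ESTABLISHED TCP connections to remote_ip.

    Expects text in the Linux `netstat -tn` layout (an assumption).
    """
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row with a State column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        host, _, _port = foreign.rpartition(":")
        if host == remote_ip and state == "ESTABLISHED":
            hits.append((local, foreign))
    return hits


# Invented sample output, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.168.1.10:51234      3.21.157.11:8080        ESTABLISHED
tcp        0      0 192.168.1.10:51300      69.94.66.7:8080         TIME_WAIT
tcp        0      0 192.168.1.10:22         192.168.1.20:40022      ESTABLISHED
"""

print(established_to(SAMPLE, "3.21.157.11"))
```

On a live system the same function could be fed the text captured from `netstat -tn`; an entry that stays ESTABLISHED for a long time while the client log shows no progress is a candidate for the stalled connection.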

Something up with 3.21.157.11 ???

Posted: Tue Aug 11, 2020 11:27 am
by JPetovello
It seems that 3.21.157.11 is having issues

Code:

*********************** Log Started 2020-08-11T11:11:56Z ***********************
11:11:56:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14726 run:458 clone:2 gen:1 core:0xa7 unit:0x0000000103159d0b5ebcaba2941b3bfe
11:11:56:WU02:FS00:Uploading 6.79MiB to 3.21.157.11
11:11:56:WU02:FS00:Connecting to 3.21.157.11:8080
11:12:11:WU02:FS00:Upload 0.92%
11:12:11:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
11:12:12:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14726 run:458 clone:2 gen:1 core:0xa7 unit:0x0000000103159d0b5ebcaba2941b3bfe
11:12:12:WU02:FS00:Uploading 6.79MiB to 3.21.157.11
11:12:12:WU02:FS00:Connecting to 3.21.157.11:8080
11:12:33:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
11:12:33:WU02:FS00:Connecting to 3.21.157.11:80
11:12:54:WARNING:WU02:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
11:13:12:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14726 run:458 clone:2 gen:1 core:0xa7 unit:0x0000000103159d0b5ebcaba2941b3bfe
11:13:12:WU02:FS00:Uploading 6.79MiB to 3.21.157.11
11:13:12:WU02:FS00:Connecting to 3.21.157.11:8080
11:13:33:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
11:13:33:WU02:FS00:Connecting to 3.21.157.11:80
11:13:34:WU02:FS00:Upload 0.92%
11:13:43:WU02:FS00:Upload 1.84%
11:13:43:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
11:14:49:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14726 run:458 clone:2 gen:1 core:0xa7 unit:0x0000000103159d0b5ebcaba2941b3bfe
11:14:49:WU02:FS00:Uploading 6.79MiB to 3.21.157.11
11:14:49:WU02:FS00:Connecting to 3.21.157.11:8080
11:15:14:WU02:FS00:Upload 1.84%
11:15:14:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
11:17:26:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14726 run:458 clone:2 gen:1 core:0xa7 unit:0x0000000103159d0b5ebcaba2941b3bfe
11:17:26:WU02:FS00:Uploading 6.79MiB to 3.21.157.11
11:17:26:WU02:FS00:Connecting to 3.21.157.11:8080
11:17:47:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
11:17:47:WU02:FS00:Connecting to 3.21.157.11:80
11:18:08:WARNING:WU02:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
11:21:40:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14726 run:458 clone:2 gen:1 core:0xa7 unit:0x0000000103159d0b5ebcaba2941b3bfe
11:21:40:WU02:FS00:Uploading 6.79MiB to 3.21.157.11
11:21:40:WU02:FS00:Connecting to 3.21.157.11:8080
11:21:47:WU02:FS00:Upload 0.92%
11:22:12:WU02:FS00:Upload 1.84%
11:22:12:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 11:29 am
by Hopfgeist
Neil-B wrote: Encountered what might be a similar issue in the past with this server: viewtopic.php?f=18&t=35201 ... Use whatever the Linux equivalent of TCPView is to see if there is a hanging established connection, and if so kill it ... iirc the Linux version of TCPView doesn't help you kill connections.
Yes, thanks. I have these troubles with multiple clients, but some eventually succeed in getting a work unit. The one from which I posted the log did after I terminated and restarted it. (A gentle "stop" request didn't break the hung TCP connection.)

The TCP connection was established, so it looks like the server's low-level TCP/IP stack is working properly, but the request never gets serviced by the work server application.

My working hypothesis is that more RAM might help :)

Maybe with the demise of all the Azure servers, some of the other servers are not quite up to the load. But that's pure speculation; I have no further insight.

Cheers,
HG

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 11:57 am
by Neil-B
If you have multiple slots, then just dropping the stalled connection will get the client to reconnect to the AS again and saves you having to restart the client ... but anything that works is a charm :)

Reckon the server is struggling due to overloaded comms ... hopefully workloads can be rebalanced a bit.

Re: Something up with 3.21.157.11 ???

Posted: Tue Aug 11, 2020 12:01 pm
by Neil-B
Looks like it is having comms overload issues .. You're the 2nd person to post on this :) .. with a bit of luck someone can rebalance the workloads a bit .. basically just let the client keep retrying and hope it uploads ... these things can take a bit of time to improve as more and more clients start trying to grab the comms - someone may be able to kick something, but in the meantime patience is the answer.

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 1:09 pm
by Juggy
I am having the same issue with this server; in fact, I had 3 completed WUs waiting to upload until I exited and restarted the client.

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 1:24 pm
by Neil-B
... restarting really shouldn't be necessary - the current client handles retries properly ... I guess you may have got lucky and got a clear run at a comms slot, but for the most part this can actually add to the load on the server and slow things down more :(

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 2:10 pm
by Nuitari
No, the client does NOT handle retries properly. It might eventually time out. But it will NOT try to go to a different server until you restart the client.

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 2:21 pm
by Hopfgeist
Nuitari wrote: No, the client does NOT handle retries properly. It might eventually time out. But it will NOT try to go to a different server until you restart the client.
It handles it fine for uploads, as far as I can see. Uploads always have to go to a specific server, and there is no choice. The client times out and retries after a while with the same server, as it should.

However, you are right, it does not handle stuck download connections gracefully. As Neil-B himself said a few months ago, an established TCP connection can be stuck, apparently indefinitely. Now you may be able to reset the connection at the operating-system level and get the client to continue, but the client itself does not handle it. It also appears to ignore any timeout values, one of which (max-connect-time) is described as "The maximum amount of time, in seconds, a client can be connected to the server". It defaults to 900, which is pretty long, but is evidently sometimes exceeded significantly (I have seen on the order of 5000 seconds before I shut it down).
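
To illustrate what an overall limit on connection time would look like, here is a minimal sketch of a read with a hard deadline. It is not how FAHClient is implemented; `read_with_deadline` and its parameters are made up for illustration, with `max_connect_time` borrowing the option name only to show the idea. The point is that a per-call socket timeout is recomputed from one fixed deadline, so a server that accepts the TCP connection but never answers cannot hold the client forever.

```python
import socket
import time


def read_with_deadline(host, port, nbytes, connect_timeout=5.0, max_connect_time=900.0):
    """Read exactly nbytes from host:port, or raise once the overall deadline passes.

    Hypothetical helper: the deadline covers the whole exchange, not each recv().
    """
    deadline = time.monotonic() + max_connect_time
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        chunks = []
        remaining = nbytes
        while remaining > 0:
            left = deadline - time.monotonic()
            if left <= 0:
                raise TimeoutError("connection exceeded max_connect_time")
            # Each recv() may wait only for whatever is left of the overall budget.
            sock.settimeout(left)
            try:
                chunk = sock.recv(remaining)
            except socket.timeout:
                raise TimeoutError("connection exceeded max_connect_time") from None
            if not chunk:
                raise ConnectionError(
                    f"short response, expected {nbytes} bytes, got {nbytes - remaining}"
                )
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
```

With a deadline like this, the caller gets an exception after at most `max_connect_time` seconds and can go back to the assignment server, instead of waiting on an established but silent connection.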


Cheers,
HG

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 2:27 pm
by Neil-B
@Nuitari ... For returning WUs to WS/CS (which is what Juggy and JPetovello were talking about) a restart will simply (if you are lucky) jump the queue and upload ... It doesn't send WUs to a different server !!! ... the upload retry process to the best of my knowledge works fine.

If you are talking about the OP issues trying to get new WUs that is a different issue which is no doubt being worked on - and as I have already answered on that part of the thread in many cases it doesn't need the client restarted just the hanging/stalled connection to be killed/reset :)

This issue on getting new WUs seems to occur on this server when very heavily loaded - which it is at the moment - hopefully something can be done to alleviate that, but with the (5?) Azure WSs no longer in play that might be quite hard to do given the load this server handles.

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 3:30 pm
by Hopfgeist
Neil-B wrote: [...]
in many cases it doesn't need the client restarted just the hanging/stalled connection to be killed/reset :)
Sure, but for most people, stopping and restarting the client is a lot easier than resetting a TCP connection. ;)

HG

Re: 3.21.157.11 overloaded?

Posted: Tue Aug 11, 2020 3:37 pm
by Neil-B
That's why I have TCPView - 3 clicks iirc - and given that at the time I was getting this issue on every WU, it was the quickest, easiest way, as it didn't impact the other slots folding on the same machine ... but yes, for many people a restart might be easiest :)

... thinking about it, for most people (not the dev effort) the easiest way would be for it not to hang in the first place :shock: :lol: but these types of intermittent server issues can be a real pain to resolve :cry:

Re: 3.21.157.11 overloaded?

Posted: Wed Aug 12, 2020 3:17 am
by PantherX
Nuitari wrote:...But it will NOT try to go to a different server until you restart the client.
Just to clarify, the client has no say/choice when it comes to downloading the WU from a Server. That decision is made by the Assignment Server (AS) based on the information that the client has provided and the availability of Work Servers (WS) for specific WUs. Thus, you can restart your client as many times as you would like, but if there's only one WS that meets your client's requirements, you will always be directed to it.
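
A toy model of that last point, and nothing more: the real AS logic is not described in this thread, so the server names, the `configs` field and the `assign` function below are all hypothetical. It only shows why repeated requests land on the same WS when it is the only one with matching work.

```python
def assign(client_config, work_servers):
    """Return the first available work server that has work for this client, or None.

    Toy stand-in for an assignment decision; not the actual AS algorithm.
    """
    for ws in work_servers:
        if ws["available"] and client_config in ws["configs"]:
            return ws["name"]
    return None


# Hypothetical server list: only "ws-b" has work for a 3-thread CPU slot.
servers = [
    {"name": "ws-a", "available": True, "configs": {"gpu"}},
    {"name": "ws-b", "available": True, "configs": {"cpu:3", "cpu:24"}},
    {"name": "ws-c", "available": False, "configs": {"cpu:3"}},
]

# "Restarting" five times changes nothing: the answer is always the same server.
print([assign("cpu:3", servers) for _ in range(5)])
```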

Re: 3.21.157.11 overloaded?

Posted: Wed Aug 12, 2020 9:08 am
by Juggy
Neil-B wrote: ... restarting really shouldn't be necessary - the current client handles retries properly ... I guess you may have got lucky and got a clear run at a comms slot, but for the most part this can actually add to the load on the server and slow things down more :(
Maybe properly by its design, but if it handled them efficiently it would a) redirect to a less busy server (pretty sure having a table of available upload servers loaded on the client when it downloads a WU would not be out of the question) or b) retry more often under specific circumstances instead of extending the retry interval and as a result decimating the client's credit.

Or, to negate the need for the client to make such decisions, why is there not an upload director like the assignment servers that would automatically direct the client to the least busy server?
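
The growing retry interval can be read straight off the upload log in the first post. The sketch below measures the wait between each "Failed to send results" line and the next "Sending unit results" line. The excerpt holds real timestamps from that log, but the "Sending unit results" lines are trimmed for length, and `retry_waits` is an illustrative helper, not part of any F@H tool.

```python
import re
from datetime import datetime

# A log line starts with an HH:MM:SS timestamp followed by a colon.
LINE = re.compile(r"^(\d\d:\d\d:\d\d):(.*)$")


def retry_waits(log_text):
    """Return the seconds between each failed upload and the next upload attempt."""
    waits = []
    failed_at = None
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%H:%M:%S")
        rest = m.group(2)
        if "Failed to send results" in rest:
            failed_at = stamp
        elif "Sending unit results" in rest and failed_at is not None:
            waits.append(int((stamp - failed_at).total_seconds()))
            failed_at = None
    return waits


# Timestamps from the first post's upload log; "Sending" lines are trimmed.
EXCERPT = """\
08:02:56:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
08:04:40:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR
08:07:11:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
08:11:32:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR
08:14:02:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
08:22:37:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR
08:22:56:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
08:40:34:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR
08:43:04:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
09:09:36:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR
09:12:06:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.21.157.11:80: Connection timed out
09:56:35:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR
"""

print(retry_waits(EXCERPT))  # [104, 261, 515, 1058, 1592, 2669]
```

In that excerpt the pause grows from under two minutes to about 44 minutes within two hours, which is the effect on turnaround time being complained about here.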