WS 140.163.4.200 Upload blocked by Untangle
- Posts: 5
- Joined: Sat Apr 04, 2020 10:27 am
WS 140.163.4.200 Upload blocked by Untangle
Hello,
The upload to WS 140.163.4.200 is not possible at the moment:
12:13:50:WU00:FS01:Trying to send results to collection server
12:13:50:WU00:FS01:Uploading 8.45MiB to 140.163.4.200
12:13:50:WU00:FS01:Connecting to 140.163.4.200:8080
12:14:24:WU00:FS01:Upload 2.22%
12:14:24:ERROR:WU00:FS01:Exception: Transfer failed
That has happened several times now.
Please check this, thx!
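For anyone who wants to check from their own connection whether the server itself is answering (as opposed to a local firewall issue, which comes up later in this thread), a minimal probe along these lines can help. It is only an illustration, not part of the FAH client; the host and port are taken from the log above, everything else is assumed.
# Minimal reachability probe for the upload server (illustrative only, not FAH client code).
import socket
import http.client

HOST, PORT = "140.163.4.200", 8080   # taken from the log above

def probe(host: str, port: int, timeout: float = 10.0) -> None:
    # Step 1: plain TCP connect, to separate "host unreachable" from an HTTP-level failure.
    with socket.create_connection((host, port), timeout=timeout):
        print(f"TCP connect to {host}:{port} succeeded")
    # Step 2: a basic GET on the same port; any HTTP status line shows the service is up.
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        print(f"HTTP response: {resp.status} {resp.reason}")
    finally:
        conn.close()

if __name__ == "__main__":
    try:
        probe(HOST, PORT)
    except (OSError, http.client.HTTPException) as exc:
        print(f"Probe failed: {exc}")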
Re: WS 140.163.4.200 No Upload possible
Same here, so it's probably a general issue.
In fact, this server (140.163.4.200) is the collection server linked with the work server 18.188.125.154, which is also inaccessible. I currently have 3 WUs stuck on upload, all on project 13444.
Nvidia RTX 3060 Ti & GTX 1660 Super - AMD Ryzen 7 5800X - MSI MEG X570 Unify - 16 GB RAM - Ubuntu 20.04.2 LTS - Nvidia drivers 460.56
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21 - Location: UK
Re: WS 140.163.4.200 No Upload possible
May well be overloaded comms trying to cope with the offload from aws3 ... I believe from a post in Discord that it may be up and running again, or at least should be soon, so things may start to sort themselves out.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
- Site Admin
- Posts: 7927
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2 - Location: W. MA
Re: WS 140.163.4.200 No Upload possible
WS aws3 is currently down; it ran out of space. They started work to free up space yesterday evening and had freed up some by around midnight, but that all filled up again.
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Re: WS 140.163.4.200 No Upload possible
I don't know the constraints and I don't want to annoy people, but I see that there is well over 1,500 TiB free on available servers (pllwskifah1.mskcc.org: 233, pllwskifah2.mskcc.org: 233, highland1.engr.wustl.edu: 545, vav17: 100, vav21: 100, vav19: 100, vav22: 100, vav23: 100, vav24: 100).
It might be worth programming something to better balance the load, no?
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21 - Location: UK
Re: WS 140.163.4.200 No Upload possible
Work is, iirc, placed on kit normally located/managed by the labs running the projects ... I don't think it is as simple as just load balancing across what may look like a single extended capability but is actually made up of locally managed/controlled kit.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: WS 140.163.4.200 No Upload possible
I understand that it may be difficult to share server space among different labs and projects.
But here, with aws3 (no space available), pllwskifah1 (233 TiB free), and pllwskifah2 (233 TiB free), it would be the same lab, the same person, and the same project type.
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21 - Location: UK
Re: WS 140.163.4.200 No Upload possible
If my geolocation work is correct, then aws3 is hosted out of Columbus, Ohio on Amazon infrastructure ... the two pllws are hosted out of New York City through Memorial Sloan Kettering's ISP ... so very different locations to start with ... and looking at the projects currently associated with each of those WSs, it looks like there might be different groups/system maintainers for each.
There is also the situation that, with the way the code works, I believe a project has to be hosted from a single WS - yes, one or more CSs can be associated with it, but the results still need to get back to the original WS for the science to continue - CSs are simply buffers for certain conditions - therefore wherever projects are hosted there can be issues ... Yes, today you could move projects from one infrastructure to another and use up the space there, but then that other infrastructure (at least one of the servers in question already has a whole series of projects associated with it) may fill up ... but even so, relocation of projects has iirc been done - it just takes a large amount of effort and can break things such as the scripts that generate the new WUs.
... and it isn't all about the disk space - throughput and bandwidth may well play into it - which server hosts which projects, and the decisions around that, may well take into account the size and number of the project's WUs and the amount of computational effort needed to generate new WUs from the returned ones.
I think what I am trying to say is that it may not be as simple as just load balancing or relocating data (which in itself is non-trivial given the volumes) ... I deal occasionally with relocating similar amounts of data and, tbh, pulling the disks from one location and physically shipping them to the new one - even with the pain and cost in time and security - is still in many cases the easiest, quickest and least painful approach ... Just adding more disk capacity can be a short-term solution, but tbh in most cases the data needs to be triaged down to much smaller quantities by onward processing and then moved off the generation/collection systems into longer-term analytic/storage/retrieval capabilities ... The challenge is that triaging the data itself requires significant compute/disk access, so in many cases it is simpler to shut down the front end whilst dealing with the data manipulation and offload (which I guess is what is happening).
My gut also tells me (though it may well be wrong) that the two pllws may be using some form of shared network-attached storage, hence reporting the same figure (which might even be a "duplicate"), and one may be a new build, since it is running a newer server build than the other - aws3 matches the server build of the older of the two pllws builds ... I am also not sure how vanilla the server builds are - they may be based on a common build and then bespoke-tailored to local kit/conditions - which may mean that relocating and/or load sharing could get tripped up by slight differences in configuration.
The researchers/admins will be as disappointed/stressed/overloaded by the current issues as the contributing folders are, since this will be impeding their research ... I am sure that the FAH Consortium regularly reviews such issues and is trying to move the infrastructure and software into a better place/configuration, and the current efforts/responses will be the best they can achieve at the moment ... I for one would not want to have to coordinate coherent development across multiple academic institutions in different countries and funding/authority regimes.
... and to go back to the topic title ... .200 is the older build of the two pllws machines - it has disk space but is possibly overloaded with traffic, both because the AS is farming more requests at it while aws3 is down and because it is acting as the CS for aws3? ... ouch
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: WS 140.163.4.200 No Upload possible
Thank you Neil! Great post! I understand the situation much better now.
Alright, best wishes and Godspeed to the team that has to sort this out!
Re: WS 140.163.4.200 No Upload possible
The upload issue has been carefully explained (above), and ideally the WUs destined to be returned to Work Server A are temporarily uploaded to server B, which is acting as a Collection Server for WS A. The projects based at A will be interrupted until service at A is restored, but you don't see that issue as long as service at B can buffer the uploads. Completed WUs for A can also be buffered on your HD, but you'll start getting warning messages.
A secondary issue is whether your kit can be assigned some OTHER project (based at any OTHER server C1, C2, C3...) that can distribute work for you to do.
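As a rough illustration of that buffering path, here is a hypothetical sketch in Python, not the actual FAH client code; the names upload_to, WORK_SERVER and COLLECTION_SERVERS are made up for this example. The fallback can be thought of as: try the work server first, then each collection server, and keep the result queued on local disk until someone accepts it.
# Hypothetical sketch of the WS/CS fallback described above; not the real FAHClient implementation.
from typing import Iterable

WORK_SERVER = "18.188.125.154"          # "server A": owns the project
COLLECTION_SERVERS = ["140.163.4.200"]  # "server B": buffers results for A

def upload_to(server: str, payload: bytes) -> bool:
    """Placeholder: a real client would transfer the result here. Always fails in this sketch."""
    return False

def return_result(payload: bytes, ws: str, cs_list: Iterable[str]) -> bool:
    # Prefer the work server: only it can spawn the next WU in the trajectory.
    for server in [ws, *cs_list]:
        try:
            if upload_to(server, payload):
                return True   # accepted; a CS forwards the result to the WS once it is back
        except OSError:
            continue          # server unreachable, try the next one
    return False              # nothing accepted; the WU stays queued on local disk

if __name__ == "__main__":
    ok = return_result(b"fake result data", WORK_SERVER, COLLECTION_SERVERS)
    print("accepted" if ok else "still queued locally")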
Posting FAH's log:
How to provide enough info to get helpful support.
Re: WS 140.163.4.200 No Upload possible
It's still having problems, I'm afraid. I'm sure that for most folders, uploading multiple times just means things take a bit longer. For me, and the others like me with a data limit, uploading 5 times instead of 1 means 2 WUs I'll never be able to start. The bandwidth for them is gone. I hope it gets fixed eventually.
Re: WS 140.163.4.200 No Upload possible
I can upload to 140.163.4.200 but I think that there is something wrong or delayed with the stats again.
I haven't checked all WUs to isolate the faulty server, but for at least 36 hours my EOC account has been getting only some 60-70% of the Total Estimated Points Per Day announced by FAHControl.
And I observe the same gap lately in the EOC aggregate summary https://folding.extremeoverclocking.com ... ary.php?s= as well as in several team summaries I checked.
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21 - Location: UK
Re: WS 140.163.4.200 No Upload possible
For anyone with this "No Upload" issue, you may want to double-check your firewall/AV settings (Untangle was the firewall in question below), especially if the message is along the lines of "Received short response, expected 8 bytes, got 0" ... I'll post below a Discord post from someone with similar issues ... even though the firewall was allowing all traffic, it was inspecting the flow, and that caused this issue ... once he turned that off, all was fine.
Jason Ellingson, Today at 11:24
Well... gee... figured out the problem. I set my firewall to pass all traffic... but apparently it still "inspects" port 80 traffic for compliance. Disabled that and voila! We're in business.
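If you suspect this kind of inline inspection rather than the server itself, one quick test is to push a larger dummy POST through the same port and see whether the transfer gets cut short the same way. This is a sketch only: the path "/" and the dummy payload are assumptions, and it is not the protocol the FAH client actually speaks.
# Illustrative check for an inline device interfering with larger uploads; not FAH client code.
import http.client

HOST, PORT = "140.163.4.200", 8080   # the server from this thread
payload = b"x" * (256 * 1024)        # a dummy body large enough to trip content inspection

conn = http.client.HTTPConnection(HOST, PORT, timeout=30)
try:
    conn.request("POST", "/", body=payload,
                 headers={"Content-Type": "application/octet-stream"})
    resp = conn.getresponse()
    print(f"Server answered: {resp.status} {resp.reason}")
except (OSError, http.client.HTTPException) as exc:
    # A reset part-way through, while small requests succeed, points at inspection
    # (e.g. Untangle scanning port 80/8080 traffic) rather than the server being down.
    print(f"Transfer interrupted: {exc}")
finally:
    conn.close()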
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)