128.59.74.4 in Reject

Moderators: Site Moderators, FAHC Science Team

Mactin
Posts: 222
Joined: Sun Dec 02, 2007 1:08 pm
Location: Côte-des-Neiges, Montréal, Québec

128.59.74.4 in Reject

Post by Mactin »

I have two WUs trying to upload results to 128.59.74.4, which is in Reject mode. The associated CS will not accept them either. I have three more coming due this morning for the same server.

Project: 3856 (Run 240, Clone 0, Gen 22)

Code:

[13:12:18] + Attempting to send results [February 24 13:12:18 UTC]
[13:12:18] - Reading file work/wuresults_01.dat from core
[13:12:18]   (Read 1356362 bytes from disk)
[13:12:18] Connecting to http://128.59.74.4:8080/
[13:12:19] - Couldn't send HTTP request to server
[13:12:19] + Could not connect to Work Server (results)
[13:12:19]     (128.59.74.4:8080)
[13:12:19] + Retrying using alternative port
[13:12:19] Connecting to http://128.59.74.4:80/
[13:12:20] - Couldn't send HTTP request to server
[13:12:20] + Could not connect to Work Server (results)
[13:12:20]     (128.59.74.4:80)
[13:12:20] - Error: Could not transmit unit 01 (completed February 24) to work server.
[13:12:20] - 675 failed uploads of this unit.


[13:12:20] + Attempting to send results [February 24 13:12:20 UTC]
[13:12:20] - Reading file work/wuresults_01.dat from core
[13:12:20]   (Read 1356362 bytes from disk)
[13:12:20] Connecting to http://171.65.103.100:8080/
[13:12:20] - Couldn't send HTTP request to server
[13:12:20]   (Got status 503)
[13:12:20] + Could not connect to Work Server (results)
[13:12:20]     (171.65.103.100:8080)
[13:12:20] + Retrying using alternative port
[13:12:20] Connecting to http://171.65.103.100:80/
[13:12:23] Posted data.
[13:12:23] Initial: 0000; - Server reports packet it received specified a data size of 0.
[13:12:23]   (May be due to corruption during network transmission or a corrupted file.)
[13:12:23]   Could not transmit unit 01 to Collection server; keeping in queue.
[13:12:23] + Sent 0 of 1 completed units to the server
[13:12:23] - Failed to send all units to server
[13:12:23] ***** Got a SIGTERM signal (2)
[13:12:23] Killing all core threads
and
Project: 3859 (Run 7985, Clone 0, Gen 8)

Code:

[13:10:44] + Attempting to send results [February 24 13:10:44 UTC]
[13:10:44] - Reading file work/wuresults_01.dat from core
[13:10:44]   (Read 871488 bytes from disk)
[13:10:44] Connecting to http://128.59.74.4:8080/
[13:10:45] - Couldn't send HTTP request to server
[13:10:45] + Could not connect to Work Server (results)
[13:10:45]     (128.59.74.4:8080)
[13:10:45] + Retrying using alternative port
[13:10:45] Connecting to http://128.59.74.4:80/
[13:10:46] - Couldn't send HTTP request to server
[13:10:46] + Could not connect to Work Server (results)
[13:10:46]     (128.59.74.4:80)
[13:10:46] - Error: Could not transmit unit 01 (completed February 24) to work server.
[13:10:46] - 748 failed uploads of this unit.


[13:10:46] + Attempting to send results [February 24 13:10:46 UTC]
[13:10:46] - Reading file work/wuresults_01.dat from core
[13:10:46]   (Read 871488 bytes from disk)
[13:10:46] Connecting to http://171.65.103.100:8080/
[13:10:46] - Couldn't send HTTP request to server
[13:10:46]   (Got status 503)
[13:10:46] + Could not connect to Work Server (results)
[13:10:46]     (171.65.103.100:8080)
[13:10:46] + Retrying using alternative port
[13:10:46] Connecting to http://171.65.103.100:80/
[13:10:46] - Couldn't send HTTP request to server
[13:10:46]   (Got status 503)
[13:10:46] + Could not connect to Work Server (results)
[13:10:46]     (171.65.103.100:80)
[13:10:46]   Could not transmit unit 01 to Collection server; keeping in queue.
[13:10:46] + Sent 0 of 1 completed units to the server
[13:10:46] - Failed to send all units to server
[13:10:46] ***** Got a SIGTERM signal (2)
[13:10:46] Killing all core threads
After 675 and 748 attempts to send overnight, I stopped them when I came into the office.
Username for these WUs is "Martin_UM_P4", team 96377 "Mactin".
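For anyone who wants to tally this from their own FAHlog rather than count by eye, here is a minimal sketch. It assumes the log lines look like the excerpts above (a timestamp followed by an indented "(host:port)" line after each failed connection, and a "N failed uploads of this unit" line). The function name `summarize_upload_failures` is made up for illustration and is not part of the client.

```python
import re
from collections import Counter

# Matches the "(host:port)" line the client prints after a failed connection.
ENDPOINT = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\]\s+\((\d{1,3}(?:\.\d{1,3}){3}):(\d+)\)\s*$"
)
# Matches the running total the client prints for a unit it cannot upload.
FAILED = re.compile(r"-\s+(\d+) failed uploads of this unit")


def summarize_upload_failures(log_text):
    """Return (failures per endpoint, highest 'failed uploads' count seen)."""
    endpoints = Counter()
    max_failed = 0
    for line in log_text.splitlines():
        m = ENDPOINT.match(line)
        if m:
            endpoints[f"{m.group(1)}:{m.group(2)}"] += 1
            continue
        m = FAILED.search(line)
        if m:
            max_failed = max(max_failed, int(m.group(1)))
    return endpoints, max_failed
```

Feeding it the text of a log file gives a per-server count of refused connections, which makes it easier to see whether the work server, the collection server, or both are failing.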

BTW, I have been having trouble obtaining WUs for the past few weeks. Nothing dramatic, but it can take some time to reach the AS and, once assigned, some more time to get a WU from busy servers. This behaviour was only seen with classic WUs until yesterday, when it spread to my SMP machine at home.

Thanks
mrshirts
Pande Group Member
Posts: 54
Joined: Sat Apr 26, 2008 4:32 am

Re: 128.59.74.4 in Reject

Post by mrshirts »

There's a filesystem problem; I didn't notice it earlier because it was possible to log in (hence psummary worked), but no assignments are being accepted. I'm restarting and diagnosing it now.
mrshirts
Pande Group Member
Posts: 54
Joined: Sat Apr 26, 2008 4:32 am

Re: 128.59.74.4 in Reject

Post by mrshirts »

Bad news. The RAID array may need to be rebuilt. I'll keep posting here, and I'll try to figure out the problem with the collection server as well.
toTOW
Site Moderator
Posts: 6334
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 128.59.74.4 in Reject

Post by toTOW »

Vijay made an announcement about this server: viewtopic.php?f=24&t=8601

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
Teddy
Posts: 134
Joined: Tue Feb 12, 2008 3:05 am
Location: Canberra, Australia

Re: 128.59.74.4 in Reject

Post by Teddy »

That would explain the 3 work units I can't return to that server.

Teddy
toTOW
Site Moderator
Posts: 6334
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 128.59.74.4 in Reject

Post by toTOW »

Indeed ©
JanQ
Posts: 2
Joined: Wed Jul 09, 2008 9:24 am

Re: 128.59.74.4 in Reject

Post by JanQ »

[11:36:07] Completed 130780 out of 300000 steps (43%)
[11:36:16] - Couldn't send HTTP request to server
[11:36:16] + Could not connect to Work Server (results)
[11:36:16] (128.59.74.4:8080)
[11:36:16] + Retrying using alternative port
[11:36:37] - Couldn't send HTTP request to server
[11:36:37] + Could not connect to Work Server (results)
[11:36:37] (128.59.74.4:80)
[11:36:37] - Error: Could not transmit unit 02 (completed February 24) to work server.


[11:36:37] + Attempting to send results [February 26 11:36:37 UTC]
[11:36:53] - Couldn't send HTTP request to server
[11:36:53] + Could not connect to Work Server (results)
[11:36:53] (171.65.103.100:8080)
[11:36:53] + Retrying using alternative port
[11:37:09] - Couldn't send HTTP request to server
[11:37:09] + Could not connect to Work Server (results)
[11:37:09] (171.65.103.100:80)
[11:37:09] Could not transmit unit 02 to Collection server; keeping in queue.

:( :( :(
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.59.74.4 in Reject

Post by bruce »

JanQ wrote:[11:36:37] (128.59.74.4:80)
[11:36:37] - Error: Could not transmit unit 02 (completed February 24) to work server.
[11:37:09] (171.65.103.100:80)
[11:37:09] Could not transmit unit 02 to Collection server; keeping in queue.
Yes, that's the issue mentioned in the announcement referenced above. There's nothing you can do except be patient.
Mactin
Posts: 222
Joined: Sun Dec 02, 2007 1:08 pm
Location: Côte-des-Neiges, Montréal, Québec

Re: 128.59.74.4 in Reject

Post by Mactin »

As a short-term measure, would it be possible to enable the CSs to receive these WUs?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.59.74.4 in Reject

Post by bruce »

Mactin wrote:As a short term measure, would it be possible to enable the CS's to receive these WUs ?
If you check http://fah-web.stanford.edu/serverstat.html you'll see that the CS (.100) is enabled but extremely busy. Unfortunately, the WU that has been assigned to you can only be uploaded to a specific CS, hence my "be patient" comment. It's impossible to tell how many WUs are waiting to upload to the same overtaxed CS, but I'd guess it's a lot. Nothing can be done except wait.
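To make the upload order in the logs above concrete, here is a rough sketch of what the client appears to do: try the work server on port 8080, then on port 80, then the one collection server tied to that WU on the same two ports, and keep the unit queued if all four fail. This is inferred from the log excerpts only, not taken from the client's source. `upload_attempt_order`, `try_upload`, and the `send` callable are hypothetical names used for illustration.

```python
def upload_attempt_order(work_server, collection_server, ports=(8080, 80)):
    """List (role, host, port) in the order the logs show the client trying them."""
    order = []
    for role, host in (("work", work_server), ("collection", collection_server)):
        for port in ports:
            order.append((role, host, port))
    return order


def try_upload(order, send):
    """Call send(host, port) for each endpoint until one returns True.

    `send` is a hypothetical stand-in for the real HTTP upload.
    Returns the endpoint that accepted the unit, or None, meaning the
    unit stays in the queue for the next attempt.
    """
    for role, host, port in order:
        if send(host, port):
            return (role, host, port)
    return None
```

With the work server down and the CS overloaded, every `send` fails and the function returns None, which corresponds to the "keeping in queue" line in the logs.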
BABackman
Posts: 8
Joined: Wed Dec 17, 2008 2:41 am

Re: 128.59.74.4 in Reject

Post by BABackman »

toTOW wrote:Vijay made an announcement about this server : viewtopic.php?f=24&t=8601
The Feb 24 message said a RAID rebuild would take "a few days." Any word on the patient's progress or prognosis? Hopefully, it's not as grim as it looks?
Robby_Firefox
Posts: 75
Joined: Sun Jan 20, 2008 9:18 pm
Hardware configuration: Homebuilt Windows 10
Intel Core i7-4770 ~ started in early 2014 or 2015
32 GB RAM (up from 8 GB in 2018)
64-Bit Operating System
On 31 Mar 2020, installed a GigaByte GEFORCE GTX 1660 GPU
Location: Madison, AL

Re: 128.59.74.4 in Reject

Post by Robby_Firefox »

I hope the repairs to this server are going well. Any news on when it will be returned to service?

I am currently using the classic client 6.23 on Windows XP Pro and have Project: 3859 -- Run 8092 waiting. About every six hours, it attempts to send this completed project back to FAH. Will 3859 stay in the client's queue until it gets successfully sent to FAH?

Thanks,
Robby / Team Firefox
Processor: I7-4770 Memory: 32Gig RAM GPU: GeForce GTX 1660 MB: GigaByte GA-B85M-D3H
toTOW
Site Moderator
Posts: 6334
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 128.59.74.4 in Reject

Post by toTOW »

Yes, it will stay in the queue until the server gets back online or the final deadline passes.
mrshirts
Pande Group Member
Posts: 54
Joined: Sat Apr 26, 2008 4:32 am

Re: 128.59.74.4 in Reject

Post by mrshirts »

Update on 128.59.74.4:
The good news: all the data (2 TB) is safe. I was able to rebuild and mount the RAID. The bad news: the server won't boot normally. Since it's actually at Columbia (where I don't work anymore), I'm a bit at the mercy of the IT support staff there in terms of getting it up and running again. The current plan is therefore to copy off the data that is needed to continue the projects, and try to relay the IP to a different machine, putting it into accept-only mode. The timeline is probably going to be about a week, unfortunately.
kelliegang
Posts: 90
Joined: Wed Mar 04, 2009 4:30 am
Hardware configuration: L1:Dual Core 1.6ghz, 1GB Ram
L2:
PC1&2:P2 3.2ghz, 1GB Ram, Gforce 6800 :(
PS3
Location: Australia

Re: 128.59.74.4 in Reject

Post by kelliegang »

Thanks for the update, mrshirts, really appreciate it :) Glad nothing was lost, too.