
175.65.103.160

Posted: Mon Jul 27, 2015 7:16 pm
by RABishop
I have two jobs on one of my computers waiting to be sent to the server above. I have no idea how long they've been waiting, but points are certainly peeling away from the finished products. I have tried stopping and restarting the client, as well as rebooting the computer. Running Win 7 Pro, 64-bit, with an Intel 4790K processor at 4.0 GHz and 16 gigs of DDR3 memory. I'm running 2 NVIDIA GTX 970s. All GeForce drivers are up to date.

Apparently either the restart of the client or the reboot of the computer has truncated the log, so I can't look back more than a few minutes there. The two jobs currently running on the GTX 970s are due to try to send their results to this same collection server in about 51 minutes and in around 7 hours. So it appears possible (but unknown due to lack of log data) that these two stuck jobs were produced by the GTX cards; the CPU is using another collection server entirely. If so, judging from points left, they've been stuck quite a while. The work server for the two stuck jobs was 171.67.108.31, with different work servers for the jobs now running. Any ideas?

Re: 175.65.103.160

Posted: Tue Jul 28, 2015 5:45 am
by RABishop
Well, it appears there is no help on getting these finished results to upload to FAH. On the brighter side, the client continues to function, downloading and uploading finished work as if these two finished jobs weren't sitting there waiting to upload at all. At some point, I'll remove the client without saving data, then download it, configure it, and start again. I'm just a bit too "Sheldonish" to look forever at two jobs down there in that bottom window, waiting forever to upload. Maybe I'll wait until after the expiration date. If they haven't disappeared by then, THEN I'll expel them by the means mentioned. The same has happened (but only with ONE job) on another of my computers. We'll see. Cheers, M8s.

Re: 175.65.103.160

Posted: Tue Jul 28, 2015 12:22 pm
by slymer
Same problem here. 2 WUs waiting to be sent. And the server status page is down too.

Anyone? Anyone? Bueller?

Re: 175.65.103.160

Posted: Tue Jul 28, 2015 12:39 pm
by bollix47
The problem has been reported to the appropriate people but it may be beyond their ability to help at this time due to other activities: viewtopic.php?p=277951#p277951

I do have 3 WUs waiting to return to WS 171.67.108.31 or CS 171.65.103.160, but neither of those destinations appears to be working for me at this time. The CS has no record of the WUs and the WS is unreachable. The WUs will time out later today, so here's hoping the problems will be 'fixed' when the systems return at 8 am PDT.

Re: 175.65.103.160

Posted: Tue Jul 28, 2015 1:04 pm
by Joe_H
Without information about which WUs are attempting to upload, there is not much anyone can do to help. Look in the data directory for the folder of older logs; the client keeps the last 16 there by default.

It is entirely possible the WUs already uploaded and the acknowledgement message was not received by your client. There are other possible reasons for a WU not to upload, and the information in the logs can help determine which applies. Checks on individual WUs or the server status will have to wait until fah-web.stanford.edu is restarted; it is currently having problems, as it has a few times before.
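
For anyone digging through those older logs, here is a rough sketch of how the WU identifiers could be pulled out so they can be posted here. It assumes the "Sending unit results" line format that the client writes (the same format as the log excerpts quoted further down this thread). The name find_sent_units and the sample list are made up for illustration; they are not part of the FAH client, and the log file path is left for you to supply.

```python
import re

# Matches lines such as:
# 05:37:57:WU02:FS02:Sending unit results: id:02 state:SEND ... project:9617 run:6 clone:44 gen:8 ...
# (format assumed from the log excerpts in this thread)
SEND_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?P<wu>WU\d+):(?P<slot>FS\d+):Sending unit results:"
    r".*?project:(?P<project>\d+) run:(?P<run>\d+) clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)

def find_sent_units(lines):
    """Return {(wu, slot): (project, run, clone, gen)} for every send attempt seen.

    Hypothetical helper: a later attempt for the same WU/slot overwrites an earlier one.
    """
    units = {}
    for line in lines:
        m = SEND_RE.match(line.strip())
        if m:
            key = (m["wu"], m["slot"])
            units[key] = tuple(int(m[k]) for k in ("project", "run", "clone", "gen"))
    return units

# Sample lines copied from the log excerpt posted in this thread.
sample = [
    "05:37:57:WU02:FS02:FahCore returned: FINISHED_UNIT (100 = 0x64)",
    "05:37:57:WU02:FS02:Sending unit results: id:02 state:SEND error:NO_ERROR "
    "project:9617 run:6 clone:44 gen:8 core:0x18 unit:0x000000090a3b1e815546f120ae6717ef",
    "05:37:57:WU02:FS02:Uploading 2.04MiB to 171.67.108.31",
]

for (wu, slot), prcg in find_sent_units(sample).items():
    print(wu, slot, "Project %d (Run %d, Clone %d, Gen %d)" % prcg)
```

To use it on a real log, read the file into a list of lines (for example with open(path).readlines()) and pass that list in place of sample.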

Re: 175.65.103.160

Posted: Tue Jul 28, 2015 3:47 pm
by bruce
With a scheduled outage of 3½ days, it's not surprising that they seem to have missed their 8am deadline for at least some of the systems at Stanford. I expect things to be back to normal "soon"™

Re: 175.65.103.160

Posted: Tue Jul 28, 2015 10:36 pm
by slymer
Still can't pull up the server stats page... I'm going to assume that the university server work is affecting many systems. I'm cruising along with other WUs though, just the 2 lagging on upload. Looks like they started, but never got an ack and then I haven't been able to connect to that server since. Here's the tail info from my logs of the affected WUs anyhow... I'll just wait and see if this clears itself out or not. I'm wagering it will clear up once all the servers are back up. Looks like all the stats servers are down too, so I can't even look up if I have credit for these or not.

01:40:19:WU02:FS02:0x18:Project: 9617 (Run 6, Clone 44, Gen 8)
...
05:37:48:WU02:FS02:0x18:Completed 2000000 out of 2000000 steps (100%)
05:37:55:WU02:FS02:0x18:Saving result file logfile_01.txt
05:37:55:WU02:FS02:0x18:Saving result file checkpointState.xml
05:37:56:WU02:FS02:0x18:Saving result file checkpt.crc
05:37:56:WU02:FS02:0x18:Saving result file log.txt
05:37:56:WU02:FS02:0x18:Saving result file positions.xtc
05:37:57:WU02:FS02:0x18:Folding@home Core Shutdown: FINISHED_UNIT
05:37:57:WU02:FS02:FahCore returned: FINISHED_UNIT (100 = 0x64)
05:37:57:WU02:FS02:Sending unit results: id:02 state:SEND error:NO_ERROR project:9617 run:6 clone:44 gen:8 core:0x18 unit:0x000000090a3b1e815546f120ae6717ef
05:37:57:WU02:FS02:Uploading 2.04MiB to 171.67.108.31
05:37:57:WU02:FS02:Connecting to 171.67.108.31:8080
05:38:03:WU02:FS02:Upload 52.10%
******************************* Date: 2015-07-27 *******************************
just date lines past this.

01:40:19:WU03:FS01:0x18:Project: 9617 (Run 0, Clone 39, Gen 208)
...
04:01:04:WU03:FS01:0x18:Completed 2000000 out of 2000000 steps (100%)
04:01:11:WU03:FS01:0x18:Saving result file logfile_01.txt
04:01:11:WU03:FS01:0x18:Saving result file checkpointState.xml
04:01:11:WU03:FS01:0x18:Saving result file checkpt.crc
04:01:11:WU03:FS01:0x18:Saving result file log.txt
04:01:11:WU03:FS01:0x18:Saving result file positions.xtc
04:01:12:WU03:FS01:0x18:Folding@home Core Shutdown: FINISHED_UNIT
04:01:12:WU03:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
04:01:12:WU03:FS01:Sending unit results: id:03 state:SEND error:NO_ERROR project:9617 run:0 clone:39 gen:208 core:0x18 unit:0x000000ff0a3b1e815546e654c0f76ef4
04:01:12:WU03:FS01:Uploading 2.11MiB to 171.67.108.31
04:01:12:WU03:FS01:Connecting to 171.67.108.31:8080
04:01:18:WU03:FS01:Upload 26.70%
04:01:24:WU03:FS01:Upload 56.37%
04:01:30:WU03:FS01:Upload 86.03%
******************************* Date: 2015-07-27 *******************************
again... just date lines past this.

Re: 175.65.103.160

Posted: Wed Jul 29, 2015 3:56 am
by slymer
Server finally came back and one WU sent, but one transfer keeps stalling at 95%:
03:52:01:WU02:FS02:Sending unit results: id:02 state:SEND error:NO_ERROR project:9617 run:6 clone:44 gen:8 core:0x18 unit:0x000000090a3b1e815546f120ae6717ef
03:52:01:WU02:FS02:Uploading 2.04MiB to 171.67.108.31
03:52:01:WU02:FS02:Connecting to 171.67.108.31:8080
03:52:07:WU02:FS02:Upload 45.97%
03:52:13:WU02:FS02:Upload 95.01%
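
In case it helps anyone else measure how far an upload got and how long it took, here is a small sketch that reads the timestamps off log lines like the ones above. It assumes the "Uploading ..." and "Upload NN.NN%" line formats shown in this thread. The function name upload_progress is made up for illustration, and the sketch does not handle a log that rolls over midnight.

```python
import re

# HH:MM:SS:WUxx:FSxx:<message>  (format assumed from the log lines in this thread)
LINE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):(WU\d+):(FS\d+):(.*)$")
UPLOAD_RE = re.compile(r"^Upload (\d+(?:\.\d+)?)%$")

def upload_progress(lines):
    """Return {wu: (seconds_since_upload_started, last_percent_reported)}.

    Hypothetical helper: each new "Uploading ..." line restarts the clock for that WU.
    """
    started = {}
    progress = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # date separators and other non-WU lines
        h, mi, s, wu, _slot, msg = m.groups()
        t = int(h) * 3600 + int(mi) * 60 + int(s)
        if msg.startswith("Uploading "):
            started[wu] = t
            progress[wu] = (0, 0.0)
        else:
            up = UPLOAD_RE.match(msg)
            if up and wu in started:
                progress[wu] = (t - started[wu], float(up.group(1)))
    return progress

# Lines copied from the excerpt above.
sample = [
    "03:52:01:WU02:FS02:Uploading 2.04MiB to 171.67.108.31",
    "03:52:01:WU02:FS02:Connecting to 171.67.108.31:8080",
    "03:52:07:WU02:FS02:Upload 45.97%",
    "03:52:13:WU02:FS02:Upload 95.01%",
]

for wu, (secs, pct) in upload_progress(sample).items():
    print("%s reached %.2f%% after %d s" % (wu, pct, secs))
```

On the excerpt above this reports that WU02 reached 95.01% about 12 seconds after the upload started, and nothing was logged after that.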

Re: 175.65.103.160

Posted: Wed Jul 29, 2015 1:20 pm
by slymer
All WUs finally uploaded. I had to restart the client 2 more times to get it to finally send that one completely.

Re: 175.65.103.160

Posted: Wed Jul 29, 2015 4:57 pm
by bruce
How long did it take to get to 95%? How long did it finally take to upload?

I suspect you were competing with a backlog of uploads from other people and that restarting had nothing to do with it except to give you something to try.