171.64.65.64 overloaded

GreyWhiskers · Post by **GreyWhiskers** » Sun May 22, 2011 5:12 am

BTW, server has been back up since Sat May 21 20:25:10 PDT 2011 - I was on the next to last queue item before it wrapped around - so I never saw the skipping of the occupied WU.

NicoPalm · Post by **NicoPalm** » Mon May 23, 2011 4:56 pm

Hi,

I have a WU that's been stuck trying to upload for about 48 hours now. I was told to post here by more experienced folders. I'm running just my processor in SMP2 using the V7 client. I do not have the full log, as I rebooted my system to see if it would help, which it didn't.

Many Thanks,
Nico

14:24:33:Connecting to 171.67.108.25:8080
14:24:34:WARNING: WorkServer connection failed on port 8080 trying 80
14:24:34:Connecting to 171.67.108.25:80
14:24:36:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
14:31:24:Sending unit results: id:00 state:SEND project:6054 run:1 clone:80 gen:230 core:0xa3 unit:0x12f940e64dd7de9900e60050000117a6
14:31:24:Unit 00: Uploading 17.17KiB
14:31:24:Connecting to 171.64.65.54:8080
14:31:24:WARNING: Exception: Failed to send results to work server: Upload failed
14:31:24:Trying to send results to collection server
14:31:24:Unit 00: Uploading 17.17KiB
14:31:24:Connecting to 171.67.108.25:8080
14:31:25:WARNING: WorkServer connection failed on port 8080 trying 80
14:31:25:Connecting to 171.67.108.25:80
14:31:27:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
14:33:14:Unit 01:Completed 475000 out of 500000 steps  (95%)
14:42:29:Sending unit results: id:00 state:SEND project:6054 run:1 clone:80 gen:230 core:0xa3 unit:0x12f940e64dd7de9900e60050000117a6
14:42:29:Unit 00: Uploading 17.17KiB
14:42:29:Connecting to 171.64.65.54:8080
14:42:30:WARNING: Exception: Failed to send results to work server: Upload failed
14:42:30:Trying to send results to collection server
14:42:30:Unit 00: Uploading 17.17KiB
14:42:30:Connecting to 171.67.108.25:8080
14:42:31:WARNING: WorkServer connection failed on port 8080 trying 80
14:42:31:Connecting to 171.67.108.25:80
14:42:32:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
14:46:23:Unit 01:Completed 480000 out of 500000 steps  (96%)
14:59:41:Unit 01:Completed 485000 out of 500000 steps  (97%)
15:00:26:Sending unit results: id:00 state:SEND project:6054 run:1 clone:80 gen:230 core:0xa3 unit:0x12f940e64dd7de9900e60050000117a6
15:00:26:Unit 00: Uploading 17.17KiB
15:00:26:Connecting to 171.64.65.54:8080
15:00:26:WARNING: Exception: Failed to send results to work server: Upload failed
15:00:26:Trying to send results to collection server
15:00:27:Unit 00: Uploading 17.17KiB
15:00:27:Connecting to 171.67.108.25:8080
15:00:28:WARNING: WorkServer connection failed on port 8080 trying 80
15:00:28:Connecting to 171.67.108.25:80
15:00:29:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
15:15:01:Unit 01:Completed 490000 out of 500000 steps  (98%)
15:28:39:Unit 01:Completed 495000 out of 500000 steps  (99%)
15:28:40:Connecting to assign3.stanford.edu:8080
15:28:40:News: Welcome to Folding@Home
15:28:40:Assigned to work server 128.143.199.96
15:28:40:Requesting new work unit for slot 00: RUNNING smp:2 from 128.143.199.96
15:28:40:Connecting to 128.143.199.96:8080
15:28:41:Slot 00: Downloading 1.69MiB
15:28:47:Slot 00: 72.94%
15:28:48:Slot 00: Download complete
15:28:48:Received Unit: id:02 state:DOWNLOAD project:6974 run:0 clone:88 gen:60 core:0xa3 unit:0x00000047fbcb017c4d80d1ae092d4ace
15:29:28:Sending unit results: id:00 state:SEND project:6054 run:1 clone:80 gen:230 core:0xa3 unit:0x12f940e64dd7de9900e60050000117a6
15:29:28:Unit 00: Uploading 17.17KiB
15:29:28:Connecting to 171.64.65.54:8080
15:29:29:WARNING: Exception: Failed to send results to work server: Upload failed
15:29:29:Trying to send results to collection server
15:29:29:Unit 00: Uploading 17.17KiB
15:29:29:Connecting to 171.67.108.25:8080
15:29:30:WARNING: WorkServer connection failed on port 8080 trying 80
15:29:30:Connecting to 171.67.108.25:80
15:29:31:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
15:56:31:Unit 01:Completed 500000 out of 500000 steps  (100%)
15:56:33:Unit 01:DynamicWrapper: Finished Work Unit: sleep=10000
15:56:43:Unit 01:
15:56:43:Unit 01:Finished Work Unit:
15:56:43:Unit 01:- Reading up to 3701664 from "01/wudata_01.trr": Read 3701664
15:56:43:Unit 01:trr file hash check passed.
15:56:43:Unit 01:edr file hash check passed.
15:56:43:Unit 01:logfile size: 70584
15:56:43:Unit 01:Leaving Run
15:56:47:Unit 01:- Writing 3807184 bytes of core data to disk...
15:56:48:Unit 01:Done: 3806672 -> 3523779 (compressed to 92.5 percent)
15:56:48:Unit 01:  ... Done.
15:56:49:Unit 01:- Shutting down core
15:56:49:Unit 01:
15:56:49:Unit 01:Folding@home Core Shutdown: FINISHED_UNIT
15:56:50:FahCore, running Unit 01, returned: FINISHED_UNIT (100)
15:56:50:Sending unit results: id:01 state:SEND project:6962 run:0 clone:97 gen:83 core:0xa3 unit:0x00000056fbcb017c4d80cdd821d875da
15:56:50:Unit 01: Uploading 3.36MiB
15:56:50:Starting Unit 02
15:56:50:Connecting to 128.143.199.96:8080
15:56:50:Running core: C:/Users/Nico/FaHBeta/cores/www.stanford.edu/~pande/Win32/x86/Core_a3.fah/FahCore_a3.exe -dir 02 -suffix 01 -lifeline 3180 -version 701 -checkpoint 15 -np 2
15:56:50:Started core on PID 1524
15:56:50:FahCore 0xa3 started
15:56:50:Started thread 9 on PID 3180
15:56:50:Unit 02:
15:56:50:Unit 02:*------------------------------*
15:56:50:Unit 02:Folding@Home Gromacs SMP Core
15:56:50:Unit 02:Version 2.27 (Dec. 15, 2010)
15:56:50:Unit 02:
15:56:50:Unit 02:Preparing to commence simulation
15:56:50:Unit 02:- Looking at optimizations...
15:56:50:Unit 02:- Created dyn
15:56:50:Unit 02:- Files status OK
15:56:50:Unit 02:- Expanded 1768493 -> 1957708 (decompressed 110.6 percent)
15:56:51:Unit 02:Called DecompressByteArray: compressed_data_size=1768493 data_size=1957708, decompressed_data_size=1957708 diff=0
15:56:51:Unit 02:- Digital signature verified
15:56:51:Unit 02:
15:56:51:Unit 02:Project: 6974 (Run 0, Clone 88, Gen 60)
15:56:51:Unit 02:
15:56:51:Unit 02:Assembly optimizations on if available.
15:56:51:Unit 02:Entering M.D.
15:56:56:Unit 01: 9.53%
15:56:57:Unit 02:Mapping NT from 2 to 2 
15:56:57:Unit 02:Completed 0 out of 500000 steps  (0%)
15:57:02:Unit 01: 19.87%
15:57:08:Unit 01: 29.99%
15:57:14:Unit 01: 40.45%
15:57:20:Unit 01: 50.79%
15:57:26:Unit 01: 60.90%
15:57:32:Unit 01: 70.90%
15:57:38:Unit 01: 81.24%
15:57:44:Unit 01: 91.47%
15:57:49:Unit 01: Upload complete
15:57:49:Server responded WORK_ACK (400)
15:57:49:Final credit estimate, 1458.00 points
15:57:50:Cleaning up Unit 01
16:09:58:Unit 02:Completed 5000 out of 500000 steps  (1%)
16:16:27:Sending unit results: id:00 state:SEND project:6054 run:1 clone:80 gen:230 core:0xa3 unit:0x12f940e64dd7de9900e60050000117a6
16:16:27:Unit 00: Uploading 17.17KiB
16:16:27:Connecting to 171.64.65.54:8080
16:16:27:WARNING: Exception: Failed to send results to work server: Upload failed
16:16:27:Trying to send results to collection server
16:16:28:Unit 00: Uploading 17.17KiB
16:16:28:Connecting to 171.67.108.25:8080
16:16:29:WARNING: WorkServer connection failed on port 8080 trying 80
16:16:29:Connecting to 171.67.108.25:80
16:16:30:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
16:22:50:Unit 02:Completed 10000 out of 500000 steps  (2%)
16:36:09:Unit 02:Completed 15000 out of 500000 steps  (3%)
16:48:58:Unit 02:Completed 20000 out of 500000 steps  (4%)

Post by **bruce** » Mon May 23, 2011 7:09 pm

Welcome to foldingforum.org, NicoPalm

First of all, ignore the messages about 171.67.108.25. There are known issues with that server.

Second, notice that each time the client tries to upload the WU it tries four times ... once or twice to another server, then twice to 171.67.108.25. It's the other server that we need to pay attention to. In your case, we need to focus on 171.64.65.54. Those uploads are failing, too, but that is the server that will eventually accept the WU. I'll change the title of the topic to reflect that fact.
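To see which server a stuck upload is really aimed at, it can help to tally the "Connecting to" lines per host. The following is only a minimal sketch: the function name `count_connection_attempts` and the regular expression are assumptions made for illustration, not part of the client, and the sample lines are copied from the log posted above.

```python
import re
from collections import Counter

# Matches V7-style lines such as "14:24:33:Connecting to 171.67.108.25:8080"
CONNECT_RE = re.compile(r"^\d{2}:\d{2}:\d{2}:Connecting to ([\w.\-]+):(\d+)$")


def count_connection_attempts(log_text):
    """Return a Counter mapping each server host to its number of connection attempts."""
    attempts = Counter()
    for line in log_text.splitlines():
        match = CONNECT_RE.match(line.strip())
        if match:
            attempts[match.group(1)] += 1
    return attempts


sample = """\
14:31:24:Connecting to 171.64.65.54:8080
14:31:24:WARNING: Exception: Failed to send results to work server: Upload failed
14:31:24:Trying to send results to collection server
14:31:24:Connecting to 171.67.108.25:8080
14:31:25:WARNING: WorkServer connection failed on port 8080 trying 80
14:31:25:Connecting to 171.67.108.25:80
"""

counts = count_connection_attempts(sample)
for host, n in counts.most_common():
    print(host, n)
```

In this excerpt the work server 171.64.65.54 appears once per cycle and 171.67.108.25 twice (ports 8080 and 80), so the noisier host is not necessarily the one that matters.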

The server log shows that the server had some problems during the night (California time), but I don't see any problems right now. Has your WU finally uploaded?

l67swap · Post by **l67swap** » Wed May 25, 2011 2:19 am

Looks like the server is down once again; I came home to find all 3 of my 580s stuck trying to upload.

codysluder · Post by **codysluder** » Wed May 25, 2011 3:53 am

l67swap wrote: Looks like the server is down once again; I came home to find all 3 of my 580s stuck trying to upload.

Are you getting new WUs from another server?

jclu52 · Post by **jclu52** » Wed May 25, 2011 4:02 am

According to server log @ http://fah-web.stanford.edu/logs/171.64.65.64.log.html

Tue May 24 18:40:10 PDT 2011	171.64.65.64	GPU	vspg2v	lin5	full	Reject
Tue May 24 19:15:10 PDT 2011	171.64.65.64	GPU	vspg2v	lin5	full	Accepting

Server 171.64.65.64 is working again since May 24, 19:15:10 PDT. So no worry. Folding along.
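For anyone who wants to pull the recovery time out of such status rows automatically, here is a rough sketch. It assumes the rows are tab-separated with the report time first, the server IP second, and the state (Reject or Accepting) last, as in the two rows quoted above; the helper names `parse_status_line` and `first_recovery` are made up for this example.

```python
from datetime import datetime


def parse_status_line(line):
    """Split one tab-separated status row into (timestamp, server, state)."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: field 0 is the report time, field 1 the server IP,
    # and the last field the state. The zone label is dropped before parsing.
    stamp = fields[0].replace(" PDT", "")
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return when, fields[1], fields[-1]


def first_recovery(lines):
    """Return the time of the first Reject -> Accepting transition, or None."""
    previous_state = None
    for line in lines:
        when, _server, state = parse_status_line(line)
        if previous_state == "Reject" and state == "Accepting":
            return when
        previous_state = state
    return None


rows = [
    "Tue May 24 18:40:10 PDT 2011\t171.64.65.64\tGPU\tvspg2v\tlin5\tfull\tReject",
    "Tue May 24 19:15:10 PDT 2011\t171.64.65.64\tGPU\tvspg2v\tlin5\tfull\tAccepting",
]

print(first_recovery(rows))
```

The returned time is a naive local (PDT) timestamp, so convert it before comparing it with client logs, which are stamped in UTC.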

l67swap · Post by **l67swap** » Wed May 25, 2011 5:38 am

just started the clients back up and grabbed some 900 pointers on all 3 cards , so hopefully all goes well during the night

GreyWhiskers · Post by **GreyWhiskers** » Sun May 29, 2011 4:41 am

Back down again. I noticed it about 30 minutes ago (5/29 at 0406 GMT). One of the p6801 WUs is waiting for upload. One of the 925-point Project: 10952 (Run 2, Clone 71, Gen 23) WUs was assigned after several failed attempts to get a new WU. These are Nvidia GPU WUs.

Report initiated on Sat May 28 21:00:09 PDT 2011.
171.64.65.64 GPU vspg2v lin5 full Reject 0.76 43 3 32 17883 4597 1 0 135443 135443 135443

EDIT: back up sooner than I expected -

[04:43:48] + Results successfully sent
[04:43:48] Thank you for your contribution to Folding@Home.
[04:43:48] + Number of Units Completed: 638

Xavier Zepherious · Post by **Xavier Zepherious** » Tue Jun 21, 2011 3:22 am

it's down again today - rejecting

it won't take completed jobs
[02:48:56] Connecting to http://171.64.65.64:8080/
[02:48:57] - Couldn't send HTTP request to server
[02:48:57] + Could not connect to Work Server (results)
[02:48:57] (171.64.65.64:8080)
[02:48:57] + Retrying using alternative port
[02:48:57] Connecting to http://171.64.65.64:80/
[02:48:59] - Couldn't send HTTP request to server
[02:48:59] + Could not connect to Work Server (results)
[02:48:59] (171.64.65.64:80)
[02:48:59] - Error: Could not transmit unit 01 (completed June 21) to work server.
[02:48:59] - 9 failed uploads of this unit.
[02:48:59] - Read packet limit of 540015616... Set to 524286976.

[02:48:59] + Attempting to send results [June 21 02:48:59 UTC]
[02:48:59] - Reading file work/wuresults_01.dat from core
[02:48:59] (Read 2506610 bytes from disk)
[02:48:59] Gpu type=3 species=30.
[02:48:59] Connecting to http://171.67.108.26:8080/
[02:49:41] Posted data.
[02:49:41] Initial: 0000; - Uploaded at ~58 kB/s
[02:49:41] - Averaged speed for that direction ~47 kB/s
[02:49:41] - Server does not have record of this unit. Will try again later.
[02:49:41] Could not transmit unit 01 to Collection server; keeping in queue.
[02:49:41] + Sent 0 of 1 completed units to the server
[02:50:11] Trying to send all finished work units
[02:50:11] Project: 6806 (Run 3435, Clone 1, Gen 44)
[02:50:11] - Read packet limit of 540015616... Set to 524286976.

[02:50:11] + Attempting to send results [June 21 02:50:11 UTC]
[02:50:11] - Reading file work/wuresults_01.dat from core
[02:50:11] (Read 2506610 bytes from disk)
[02:50:11] Gpu type=3 species=30.
[02:50:11] Connecting to http://171.64.65.64:8080/
[02:50:12] - Couldn't send HTTP request to server
[02:50:12] + Could not connect to Work Server (results)
[02:50:12] (171.64.65.64:8080)
[02:50:12] + Retrying using alternative port
[02:50:12] Connecting to http://171.64.65.64:80/
[02:50:13] - Couldn't send HTTP request to server
[02:50:13] + Could not connect to Work Server (results)
[02:50:13] (171.64.65.64:80)
[02:50:13] - Error: Could not transmit unit 01 (completed June 21) to work server.
[02:50:13] - 10 failed uploads of this unit.

Connecting to http://171.67.108.26:8080/
[02:54:35] - Couldn't send HTTP request to server
[02:54:35] + Could not connect to Work Server (results)
[02:54:35] (171.67.108.26:8080)
[02:54:35] + Retrying using alternative port
[02:54:35] Connecting to http://171.67.108.26:80/
[02:54:36] - Couldn't send HTTP request to server
[02:54:36] + Could not connect to Work Server (results)
[02:54:36] (171.67.108.26:80)
[02:54:36] Could not transmit unit 01 to Collection server; keeping in queue.

GreyWhiskers · Post by **GreyWhiskers** » Tue Jun 21, 2011 4:10 am

Server stats show it's been rejecting since the report Mon Jun 20 17:35:10 PDT 2011 continuing through the latest report (Mon Jun 20 19:55:10 PDT 2011).

For myself, it first affected me at Mon Jun 20 17:59:00 PDT 2011, but the client recovered quickly: it got assigned a WU from 171.67.108.32.

At 20:08 it must have been back up, because the pending WUs got uploaded, and I'm now processing a new download from 171.64.65.64.

Xavier Zepherious · Post by **Xavier Zepherious** » Tue Jun 21, 2011 4:49 am

yea it's back up

when it goes down like that, FAH tracker gets stuck in a loop trying to upload and cannot go on
then no work gets done; it can't progress until the upload clears

it requires that I stop the GPU client in the tracker and restart it; it then proceeds to get a new WU
it stores the finished WU until the next time, and if the server is still down then I have 2 WUs in the queue and I have to do it again, until the server is up
annoying at best

I wish Jedi would fix this issue with FAH gpu tracker v2

Post by **bruce** » Tue Jun 21, 2011 5:25 am

It's not too likely that the 3rd party developers will be doing much with software written for V6. The V7 client is in open beta and it's a total rewrite. Most of the methods used to support V6 will no longer work in V7 (without a total rewrite). You will have to talk to the developer, himself, though.

xposer · Post by **xposer** » Tue Jun 21, 2011 5:42 am

Since you locked my previous post before I could post more examples ...... have a look where this one stops at 9:57 with
[09:57:14] + Attempting to send results [June 20 09:57:14 UTC]
[09:57:14] Gpu type=3 species=30.
And the next thing it shows is this
[12:34:56] + Could not connect to Work Server (results)
Which is roughly two and a half hours that the gpu 3 client has remained idle, neither trying to send in the finished wu nor trying to download a new wu.
So please do not tell me, "but you chopped off the log before the important part". That was where it stopped. Since it was going to be idle for a few hours, I restarted the client.
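A gap like this is easy to miss by eye in a long log. The sketch below scans V6-style lines that start with `[HH:MM:SS]` and reports any pair of consecutive stamped lines that are further apart than a threshold. The function name `find_idle_gaps` and the 30-minute default are assumptions chosen for illustration; the three sample lines are the ones quoted above.

```python
import re
from datetime import timedelta

STAMP_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")


def find_idle_gaps(log_lines, threshold=timedelta(minutes=30)):
    """Yield (previous_line, next_line, gap) for stamped lines further apart than threshold."""
    previous = None  # (time of day as timedelta, line text)
    for line in log_lines:
        match = STAMP_RE.match(line)
        if not match:
            continue  # blank or unstamped line
        hours, minutes, seconds = map(int, match.groups())
        now = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        if previous is not None:
            gap = now - previous[0]
            if gap < timedelta(0):
                gap += timedelta(days=1)  # log rolled past midnight
            if gap > threshold:
                yield previous[1], line, gap
        previous = (now, line)


sample = [
    "[09:57:14] + Attempting to send results [June 20 09:57:14 UTC]",
    "[09:57:14] Gpu type=3 species=30.",
    "[12:34:56] + Could not connect to Work Server (results)",
]

gaps = list(find_idle_gaps(sample))
for before, after, gap in gaps:
    print(gap, "between", before[:10], "and", after[:10])
```

Because the log carries only a time of day, the sketch cannot see gaps longer than 24 hours; it treats any backwards jump as a single midnight rollover.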

I believe this is a client code error, not a server error (as you suggested). Remaining idle for two and a half hours doesn't sound normal. I hope you will bring this situation to the attention of the Pande Group programmers.

[09:56:47] Completed  49999999 out of 50000000 steps (100%).
[09:56:48] Finished fah_main
[09:56:48] 
[09:56:48] Successful run
[09:56:48] DynamicWrapper: Finished Work Unit: sleep=10000
[09:56:57] Reserved 2335128 bytes for xtc file; Cosm status=0
[09:56:57] Allocated 2335128 bytes for xtc file
[09:56:57] - Reading up to 2335128 from "work/wudata_08.xtc": Read 2335128
[09:56:57] Read 2335128 bytes from xtc file; available packet space=784095336
[09:56:57] xtc file hash check passed.
[09:56:57] Reserved 72360 72360 784095336 bytes for arc file=<work/wudata_08.trr> Cosm status=0
[09:56:57] Allocated 72360 bytes for arc file
[09:56:57] - Reading up to 72360 from "work/wudata_08.trr": Read 72360
[09:56:57] Read 72360 bytes from arc file; available packet space=784022976
[09:56:57] trr file hash check passed.
[09:56:57] Allocated 544 bytes for edr file
[09:56:57] Read bedfile
[09:56:57] edr file hash check passed.
[09:56:57] Allocated 120132 bytes for logfile
[09:56:57] Read logfile
[09:56:57] GuardedRun: success in DynamicWrapper
[09:56:57] GuardedRun: done
[09:56:57] Run: GuardedRun completed.
[09:57:01] + Opened results file
[09:57:01] - Writing 2528676 bytes of core data to disk...
[09:57:02] Done: 2528164 -> 2369426 (compressed to 93.7 percent)
[09:57:02]   ... Done.
[09:57:02] DeleteFrameFiles: successfully deleted file=work/wudata_08.ckp
[09:57:05] Shutting down core 
[09:57:05] 
[09:57:05] Folding@home Core Shutdown: FINISHED_UNIT
[09:57:08] CoreStatus = 64 (100)
[09:57:08] Sending work to server
[09:57:08] Project: 6805 (Run 9428, Clone 1, Gen 36)
[09:57:08] - Read packet limit of 540015616... Set to 524286976.


[09:57:08] + Attempting to send results [June 20 09:57:08 UTC]
[09:57:08] Gpu type=3 species=30.
[09:57:10] - Couldn't send HTTP request to server
[09:57:10] + Could not connect to Work Server (results)
[09:57:10]     (171.64.65.64:8080)
[09:57:10] + Retrying using alternative port
[09:57:11] - Couldn't send HTTP request to server
[09:57:11] + Could not connect to Work Server (results)
[09:57:11]     (171.64.65.64:80)
[09:57:11] - Error: Could not transmit unit 08 (completed June 20) to work server.
[09:57:11]   Keeping unit 08 in queue.
[09:57:11] Project: 6805 (Run 9428, Clone 1, Gen 36)
[09:57:11] - Read packet limit of 540015616... Set to 524286976.


[09:57:11] + Attempting to send results [June 20 09:57:11 UTC]
[09:57:11] Gpu type=3 species=30.
[09:57:12] - Couldn't send HTTP request to server
[09:57:12] + Could not connect to Work Server (results)
[09:57:12]     (171.64.65.64:8080)
[09:57:12] + Retrying using alternative port
[09:57:14] - Couldn't send HTTP request to server
[09:57:14] + Could not connect to Work Server (results)
[09:57:14]     (171.64.65.64:80)
[09:57:14] - Error: Could not transmit unit 08 (completed June 20) to work server.
[09:57:14] - Read packet limit of 540015616... Set to 524286976.


[09:57:14] + Attempting to send results [June 20 09:57:14 UTC]
[09:57:14] Gpu type=3 species=30.
[12:34:56] + Could not connect to Work Server (results)
[12:34:56]     (171.67.108.26:8080)
[12:34:56] + Retrying using alternative port
[12:34:57] - Couldn't send HTTP request to server
[12:34:57] + Could not connect to Work Server (results)
[12:34:57]     (171.67.108.26:80)
[12:34:57]   Could not transmit unit 08 to Collection server; keeping in queue.
[12:34:57] - Preparing to get new work unit...
[12:34:57] Cleaning up work directory
[12:34:57] + Attempting to get work packet
[12:34:57] Passkey found
[12:34:57] Gpu type=3 species=30.
[12:34:57] - Connecting to assignment server

P5-133XL · Post by **P5-133XL** » Tue Jun 21, 2011 7:21 am

Did you even read the post that locked the thread? It actually gave you a link to where you could post further comments rather than just start a new thread. The point was to keep related posts together rather than have a whole bunch of related but independent threads. By not posting where he instructed, all you've done is just produce another thread that will be locked when Bruce gets to it ...

xposer · Post by **xposer** » Tue Jun 21, 2011 2:55 pm

Thanks for the reply, P5-133XL.
It's too bad that I read the original response before it was edited and the link was added.
The post is about the client's actions, not a down server.

The cpu and gpu clients will normally keep trying to send in work units at slowly increasing time intervals.

In my case, that's not happening. It tries sending the WU a couple of times and then sits idle for a couple of hours.
It doesn't retry sending the WU at slowly increasing intervals.
I thought this was a problem the Pande Group might like to be made aware of.
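For readers unfamiliar with the "slowly increasing intervals" behaviour, this is roughly what such a retry schedule looks like. It is a generic exponential back-off sketch only: the base delay, growth factor, and cap are hypothetical values picked for illustration, and nothing here is taken from the actual client code.

```python
def backoff_delays(base_seconds=60, factor=2.0, cap_seconds=6 * 3600, attempts=8):
    """Return the wait before each retry: base, base*factor, ..., capped at cap_seconds."""
    delays = []
    delay = float(base_seconds)
    for _ in range(attempts):
        delays.append(min(delay, float(cap_seconds)))
        delay *= factor  # grow the wait after every failed attempt
    return delays


schedule = backoff_delays()
for attempt, wait in enumerate(schedule, start=1):
    print(f"retry {attempt}: wait {wait / 60:.0f} min")
```

With a schedule like this, a log should show retries spreading out gradually, not a couple of quick attempts followed by a silent gap of hours.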

So, lock it, delete it, do whatever you want, I did my part in trying to point out a glitch in the gpu3 code.
