
Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 10:09 am
by uncle_fungus
It's likely to be a good few hours before anyone at Stanford can do anything, given it's still 3 AM there (although researchers have been known to reboot servers in the middle of the night).

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 10:19 am
by billford
Oddly, the Linux clients have been intermittently getting work from 171.67.108.204 so the loss isn't as great as it could be.

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 10:21 am
by SantaFe
I've no problem waiting. It's just that, after a recent Linux update blowup/reinstall, for a second I thought it might be my hardware. ;)

I guess I could go there & hold up a sign reading "Will FOLD for WU'S" :D

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 10:25 am
by EXT64
I also hit this problem on my 4P/Linux machine; however, it is at a new house, so I thought it was related to the internet connection there. I'll be glad if it is in fact just a server glitch :D

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 12:56 pm
by aoeu
******************************* Date: 2015-05-06 *******************************
12:18:20:WU01:FS02:0x18:WARNING:Console control signal 1 on PID 1736
12:18:20:WU02:FS01:0x18:WARNING:Console control signal 1 on PID 4664
******************************* Date: 2015-05-06 *******************************
13:13:10:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': Failed to connect to 171.67.108.200:8080: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
13:13:11:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
13:13:11:ERROR:WU00:FS00:Exception: Could not get an assignment
13:13:32:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': Failed to connect to 171.67.108.200:8080: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
13:13:33:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
13:13:33:ERROR:WU00:FS00:Exception: Could not get an assignment
13:14:32:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': Failed to connect to 171.67.108.200:8080: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
13:14:33:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
13:14:33:ERROR:WU00:FS00:Exception: Could not get an assignment
13:16:10:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': Failed to connect to 171.67.108.200:8080: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
13:16:10:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
13:16:10:ERROR:WU00:FS00:Exception: Could not get an assignment

******************************* Date: 2015-05-07 *******************************
09:00:44:WARNING:WU02:FS02:Failed to get assignment from '171.67.108.200:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
09:39:36:WARNING:WU04:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
09:39:37:WARNING:WU04:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
09:39:37:ERROR:WU04:FS00:Exception: Could not get an assignment
14 attempts, so far.
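
For anyone who would rather tally these failures from a log than count them by eye, here is a minimal Python sketch. It assumes the "Failed to get assignment" lines look exactly like the ones pasted above; the names `FAILED` and `count_failed_assignments` are made up for illustration, and the pattern may not match logs from other client versions.

```python
import re
from collections import Counter

# Pattern inferred only from the log lines quoted in this post, e.g.
# 09:39:37:WARNING:WU04:FS00:Failed to get assignment from '171.67.108.204:80': <reason>
FAILED = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"Failed to get assignment from '(?P<server>[^']+)': (?P<reason>.*)$"
)


def count_failed_assignments(lines):
    """Return a Counter mapping 'host:port' to the number of failed requests."""
    counts = Counter()
    for line in lines:
        match = FAILED.match(line.strip())
        if match:
            counts[match.group("server")] += 1
    return counts


# The four 2015-05-07 lines from the log above.
sample = [
    "09:00:44:WARNING:WU02:FS02:Failed to get assignment from '171.67.108.200:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR",
    "09:39:36:WARNING:WU04:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR",
    "09:39:37:WARNING:WU04:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment",
    "09:39:37:ERROR:WU04:FS00:Exception: Could not get an assignment",
]

counts = count_failed_assignments(sample)
for server, n in sorted(counts.items()):
    print(server, n)
```

The ERROR line is deliberately not counted, since it summarises the attempt rather than naming a server.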

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 1:33 pm
by VijayPande
Sorry for the delay here. We’re on it. This one is taking longer than we expected.

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 1:35 pm
by VijayPande
We’re also looking into why the failover didn’t work. In principle, this shouldn’t happen unless something serious is going on.

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 1:37 pm
by uncle_fungus
I think the failover is working (we go from .200 to .204); however, the compounding problem here is that the SMP server that clients are normally assigned to has gone into REJECT, so most clients don't get any WUs.
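
To make that distinction concrete, here is a toy Python model of what the logs in this thread suggest. It is not the FAHClient code, and the class, its fields and the function name are all hypothetical. The client tries each assignment server in turn. An assignment server that errors out is skipped (the failover). One that answers but has no work server to hand out still leaves the client without a WU.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AssignmentServer:
    address: str
    up: bool                    # False models an error or no response
    work_server: Optional[str]  # None models "Empty work server assignment"


def request_assignment(servers: List[AssignmentServer]) -> Tuple[Optional[str], List[str]]:
    """Try each assignment server in order; return (work_server, log)."""
    log = []
    for server in servers:
        if not server.up:
            log.append(f"{server.address}: server error")
            continue  # fail over to the next assignment server
        if server.work_server is None:
            log.append(f"{server.address}: empty work server assignment")
            continue  # reachable, but nothing to assign
        log.append(f"{server.address}: assigned to {server.work_server}")
        return server.work_server, log
    return None, log


# Assumed state during the outage: .200 is failing, .204 answers
# but has no usable work server because the SMP server is in REJECT.
servers = [
    AssignmentServer("171.67.108.200", up=False, work_server=None),
    AssignmentServer("171.67.108.204", up=True, work_server=None),
]
work_server, log = request_assignment(servers)
print(work_server)
for entry in log:
    print(entry)
```

In this model the failover itself succeeds, since .204 is reached, yet the client still ends up with no assignment.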

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 2:10 pm
by VijayPande
OK, thanks. The main server serving SMP WUs is having issues and we’ve escalated the issue to Joseph. It’s 7:10 AM Pacific time, so I expect he’ll take a look at this first thing in the morning, around 9 AM Pacific time.

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 5:27 pm
by TomJohnson
At 10:25 AM Pacific time I got it working again by doing a Pause then a Fold on all of my Macs. :)

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 5:40 pm
by bruce
The same worked for me on Windows, so at least part of the problem is resolved.

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 5:42 pm
by sortofageek
I only had one slot, on a Windows box, unable to send/receive last night. This morning I tried Pause/Fold and also was able to get a new WU. Project: 9016 (Run 601, Clone 5, Gen 94) remains queued to send, but has not done so yet. (Correction below. It sent, just hasn't credited yet.)

The fact I'm able to get work is good news to me. :)


Correction: It appears that WU I mentioned actually did return with no error even though it shows up in my client as waiting to send.

Code:

*********************** Log Started 2015-05-05T20:20:06Z ***********************
20:20:32:WU02:FS01:Starting
20:20:32:WU02:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/ProgramData/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 02 -suffix 01 -version 704 -lifeline 896 -checkpoint 15 -np 7
20:20:32:WU02:FS01:Started FahCore on PID 5752
20:20:33:WU02:FS01:Core PID:5764
20:20:33:WU02:FS01:FahCore 0xa4 started
20:20:33:WU02:FS01:0xa4:
20:20:33:WU02:FS01:0xa4:*------------------------------*
20:20:33:WU02:FS01:0xa4:Folding@Home Gromacs GB Core
20:20:33:WU02:FS01:0xa4:Version 2.27 (Dec. 15, 2010)
20:20:33:WU02:FS01:0xa4:
20:20:33:WU02:FS01:0xa4:Preparing to commence simulation
20:20:33:WU02:FS01:0xa4:- Looking at optimizations...
20:20:33:WU02:FS01:0xa4:- Files status OK
20:20:33:WU02:FS01:0xa4:- Expanded 826594 -> 1397548 (decompressed 169.0 percent)
20:20:33:WU02:FS01:0xa4:Called DecompressByteArray: compressed_data_size=826594 data_size=1397548, decompressed_data_size=1397548 diff=0
20:20:33:WU02:FS01:0xa4:- Digital signature verified
20:20:33:WU02:FS01:0xa4:
20:20:33:WU02:FS01:0xa4:Project: 9016 (Run 601, Clone 5, Gen 94)
20:20:33:WU02:FS01:0xa4:
20:20:33:WU02:FS01:0xa4:Assembly optimizations on if available.
20:20:33:WU02:FS01:0xa4:Entering M.D.
20:20:39:WU02:FS01:0xa4:Using Gromacs checkpoints
20:20:39:WU02:FS01:0xa4:Mapping NT from 7 to 7 
20:20:39:WU02:FS01:0xa4:Resuming from checkpoint
20:20:39:WU02:FS01:0xa4:Verified 02/wudata_01.log
20:20:39:WU02:FS01:0xa4:Verified 02/wudata_01.trr
20:20:39:WU02:FS01:0xa4:Verified 02/wudata_01.xtc
20:20:39:WU02:FS01:0xa4:Verified 02/wudata_01.edr
20:20:39:WU02:FS01:0xa4:Completed 133150 out of 250000 steps  (53%)
20:21:45:WU02:FS01:0xa4:Completed 135000 out of 250000 steps  (54%)
20:23:15:WU02:FS01:0xa4:Completed 137500 out of 250000 steps  (55%)
20:24:41:WU02:FS01:0xa4:Completed 140000 out of 250000 steps  (56%)
20:26:06:WU02:FS01:0xa4:Completed 142500 out of 250000 steps  (57%)
20:27:29:WU02:FS01:0xa4:Completed 145000 out of 250000 steps  (58%)
20:28:54:WU02:FS01:0xa4:Completed 147500 out of 250000 steps  (59%)
20:30:17:WU02:FS01:0xa4:Completed 150000 out of 250000 steps  (60%)
20:31:42:WU02:FS01:0xa4:Completed 152500 out of 250000 steps  (61%)
20:33:06:WU02:FS01:0xa4:Completed 155000 out of 250000 steps  (62%)
20:34:29:WU02:FS01:0xa4:Completed 157500 out of 250000 steps  (63%)
20:35:52:WU02:FS01:0xa4:Completed 160000 out of 250000 steps  (64%)
20:37:17:WU02:FS01:0xa4:Completed 162500 out of 250000 steps  (65%)
20:38:40:WU02:FS01:0xa4:Completed 165000 out of 250000 steps  (66%)
20:40:06:WU02:FS01:0xa4:Completed 167500 out of 250000 steps  (67%)
20:41:30:WU02:FS01:0xa4:Completed 170000 out of 250000 steps  (68%)
20:42:54:WU02:FS01:0xa4:Completed 172500 out of 250000 steps  (69%)
20:44:18:WU02:FS01:0xa4:Completed 175000 out of 250000 steps  (70%)
20:45:41:WU02:FS01:0xa4:Completed 177500 out of 250000 steps  (71%)
20:47:06:WU02:FS01:0xa4:Completed 180000 out of 250000 steps  (72%)
20:48:30:WU02:FS01:0xa4:Completed 182500 out of 250000 steps  (73%)
20:49:55:WU02:FS01:0xa4:Completed 185000 out of 250000 steps  (74%)
20:51:20:WU02:FS01:0xa4:Completed 187500 out of 250000 steps  (75%)
20:52:44:WU02:FS01:0xa4:Completed 190000 out of 250000 steps  (76%)
20:54:10:WU02:FS01:0xa4:Completed 192500 out of 250000 steps  (77%)
20:55:35:WU02:FS01:0xa4:Completed 195000 out of 250000 steps  (78%)
20:57:01:WU02:FS01:0xa4:Completed 197500 out of 250000 steps  (79%)
20:58:27:WU02:FS01:0xa4:Completed 200000 out of 250000 steps  (80%)
20:59:53:WU02:FS01:0xa4:Completed 202500 out of 250000 steps  (81%)
21:01:18:WU02:FS01:0xa4:Completed 205000 out of 250000 steps  (82%)
21:02:42:WU02:FS01:0xa4:Completed 207500 out of 250000 steps  (83%)
21:04:09:WU02:FS01:0xa4:Completed 210000 out of 250000 steps  (84%)
21:05:36:WU02:FS01:0xa4:Completed 212500 out of 250000 steps  (85%)
21:06:59:WU02:FS01:0xa4:Completed 215000 out of 250000 steps  (86%)
21:08:22:WU02:FS01:0xa4:Completed 217500 out of 250000 steps  (87%)
21:09:50:WU02:FS01:0xa4:Completed 220000 out of 250000 steps  (88%)
21:11:13:WU02:FS01:0xa4:Completed 222500 out of 250000 steps  (89%)
21:12:36:WU02:FS01:0xa4:Completed 225000 out of 250000 steps  (90%)
21:13:59:WU02:FS01:0xa4:Completed 227500 out of 250000 steps  (91%)
21:15:22:WU02:FS01:0xa4:Completed 230000 out of 250000 steps  (92%)
21:16:44:WU02:FS01:0xa4:Completed 232500 out of 250000 steps  (93%)
21:18:07:WU02:FS01:0xa4:Completed 235000 out of 250000 steps  (94%)
21:19:30:WU02:FS01:0xa4:Completed 237500 out of 250000 steps  (95%)
21:21:15:WU02:FS01:0xa4:Completed 240000 out of 250000 steps  (96%)
21:22:48:WU02:FS01:0xa4:Completed 242500 out of 250000 steps  (97%)
21:24:25:WU02:FS01:0xa4:Completed 245000 out of 250000 steps  (98%)
21:25:54:WU02:FS01:0xa4:Completed 247500 out of 250000 steps  (99%)
21:27:17:WU02:FS01:0xa4:Completed 250000 out of 250000 steps  (100%)
21:27:17:WU02:FS01:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
21:27:27:WU02:FS01:0xa4:
21:27:27:WU02:FS01:0xa4:Finished Work Unit:
21:27:27:WU02:FS01:0xa4:- Reading up to 811488 from "02/wudata_01.trr": Read 811488
21:27:27:WU02:FS01:0xa4:trr file hash check passed.
21:27:27:WU02:FS01:0xa4:- Reading up to 746112 from "02/wudata_01.xtc": Read 746112
21:27:27:WU02:FS01:0xa4:xtc file hash check passed.
21:27:27:WU02:FS01:0xa4:edr file hash check passed.
21:27:27:WU02:FS01:0xa4:logfile size: 24615
21:27:27:WU02:FS01:0xa4:Leaving Run
21:27:32:WU02:FS01:0xa4:- Writing 1584703 bytes of core data to disk...
21:27:32:WU02:FS01:0xa4:Done: 1584191 -> 1538483 (compressed to 97.1 percent)
21:27:32:WU02:FS01:0xa4:  ... Done.
21:27:32:WU02:FS01:0xa4:- Shutting down core
21:27:32:WU02:FS01:0xa4:
21:27:32:WU02:FS01:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
21:27:32:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
21:27:32:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:9016 run:601 clone:5 gen:94 core:0xa4 unit:0x00000070664f2de45491db6d99b9e9e1
21:27:32:WU02:FS01:Uploading 1.47MiB to 171.64.65.124
21:27:32:WU02:FS01:Connecting to 171.64.65.124:8080
Edits are mine. ~sorto'

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 5:59 pm
by billford
The error messages are different, but at least work is getting through after a few retries :)

Code:

17:32:51:WU01:FS00:Connecting to 171.67.108.200:8080
17:32:52:WU01:FS00:Assigned to work server 171.64.65.124
17:32:52:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:3 from 171.64.65.124
17:32:52:WU01:FS00:Connecting to 171.64.65.124:8080
17:32:54:WU01:FS00:Downloading 902.04KiB
.
.
17:33:54:WU01:FS00:Download 1.77%
17:33:54:ERROR:WU01:FS00:Exception: Transfer failed
17:33:54:WU01:FS00:Connecting to 171.67.108.200:8080
17:33:55:WU01:FS00:Assigned to work server 171.64.65.124
17:33:55:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:3 from 171.64.65.124
17:33:55:WU01:FS00:Connecting to 171.64.65.124:8080
17:33:57:WU01:FS00:Downloading 902.10KiB
.
.
17:34:57:WU01:FS00:Download 1.77%
17:34:57:ERROR:WU01:FS00:Exception: Transfer failed
17:34:57:WU01:FS00:Connecting to 171.67.108.200:8080
17:34:58:WU01:FS00:Assigned to work server 171.64.65.124
17:34:58:WU01:FS00:Requesting new work unit for slot 00: READY cpu:3 from 171.64.65.124
17:34:58:WU01:FS00:Connecting to 171.64.65.124:8080
17:35:00:WU01:FS00:Downloading 903.08KiB
17:36:00:WU01:FS00:Download 1.77%
17:36:00:ERROR:WU01:FS00:Exception: Transfer failed
17:36:35:WU01:FS00:Connecting to 171.67.108.200:8080
17:36:35:WU01:FS00:Assigned to work server 155.247.166.220
17:36:35:WU01:FS00:Requesting new work unit for slot 00: READY cpu:3 from 155.247.166.220
17:36:35:WU01:FS00:Connecting to 155.247.166.220:8080
17:36:36:WU01:FS00:Downloading 199.11KiB
17:36:36:WU01:FS00:Download complete
17:36:36:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:6385 run:5 clone:3 gen:93 core:0xa4 unit:0x0000005d0002894c5417553beb2673b9
17:36:36:WU01:FS00:Starting
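
A rough Python sketch for summarising a log like the one above: it pairs each "Assigned to work server" line with the outcome that follows it, so you can see which work server each attempt went to and whether the download completed. The names `ASSIGNED` and `summarize_attempts` are hypothetical, and the matching relies only on the message texts shown in this log.

```python
import re

# Matches the "Assigned to work server <address>" message seen in the log above.
ASSIGNED = re.compile(r"Assigned to work server (\S+)")


def summarize_attempts(lines):
    """Return a list of (work_server, outcome) pairs, one per assignment."""
    attempts = []
    current = None  # work server of the attempt currently in progress
    for line in lines:
        match = ASSIGNED.search(line)
        if match:
            current = match.group(1)
        elif current and "Exception: Transfer failed" in line:
            attempts.append((current, "failed"))
            current = None
        elif current and "Download complete" in line:
            attempts.append((current, "ok"))
            current = None
    return attempts


# The last two attempts from the log above.
sample = [
    "17:34:58:WU01:FS00:Assigned to work server 171.64.65.124",
    "17:35:00:WU01:FS00:Downloading 903.08KiB",
    "17:36:00:WU01:FS00:Download 1.77%",
    "17:36:00:ERROR:WU01:FS00:Exception: Transfer failed",
    "17:36:35:WU01:FS00:Assigned to work server 155.247.166.220",
    "17:36:36:WU01:FS00:Downloading 199.11KiB",
    "17:36:36:WU01:FS00:Download complete",
]

attempts = summarize_attempts(sample)
for server, outcome in attempts:
    print(server, outcome)
```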

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 6:23 pm
by SantaFe
It took me 6 tries before it finally filled both slots. But it's working....... for NOW! ;)

Re: 171.67.108.200 - Internal Server Error

Posted: Thu May 07, 2015 7:51 pm
by uncle_fungus
We're probably battering the available work servers (WS) now that the assignment server (AS) is working again.