Re: 171.67.108.200 - Internal Server Error

Posted: Sun May 17, 2015 12:59 pm
by aoeu
First, the Warnings and Errors log, starting just before the current problem.

Code:

******************************* Date: 2015-05-16 *******************************
******************************* Date: 2015-05-16 *******************************
******************************* Date: 2015-05-16 *******************************
16:09:51:WARNING:WU03:FS02:Failed to get assignment from '171.67.108.200:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
17:45:10:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
******************************* Date: 2015-05-16 *******************************
21:51:44:WARNING:WU02:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
23:29:18:WARNING:WU00:FS01:Failed to get assignment from '171.67.108.200:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
00:54:43:WARNING:WU01:FS02:Failed to get assignment from '171.67.108.200:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
02:01:30:WARNING:WU03:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
******************************* Date: 2015-05-17 *******************************
05:52:39:WARNING:WU02:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
07:51:55:WARNING:WU03:FS01:Failed to get assignment from '171.67.108.200:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
******************************* Date: 2015-05-17 *******************************
09:47:46:WU03:FS01:0x18:WARNING:Console control signal 1 on PID 3388
09:47:46:WU01:FS02:0x18:WARNING:Console control signal 1 on PID 5204
09:47:46:WU03:FS01:0x18:ERROR:103: Lost client lifeline
12:33:28:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
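For anyone who wants to see at a glance which assignment server and port is failing most often, here is a minimal Python sketch that tallies the "Failed to get assignment" warnings. The regular expression is written against the line format shown in the log above; the function name `count_failed_assignments` and the `SAMPLE` text are illustrative assumptions, not part of FAHClient.

```python
import re
from collections import Counter

# Matches warnings of the form shown in the log above, e.g.
# 16:09:51:WARNING:WU03:FS02:Failed to get assignment from '171.67.108.200:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
FAILED_ASSIGNMENT = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"Failed to get assignment from '(?P<server>[^']+)': "
    r"(?P<code>\d+): Server responded: (?P<status>\w+)$"
)


def count_failed_assignments(log_text):
    """Return a Counter mapping 'host:port' to the number of failed assignment requests."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_ASSIGNMENT.match(line.strip())
        if match:
            counts[match.group("server")] += 1
    return counts


# A few lines copied from the log above (hypothetical sample, not the full log).
SAMPLE = """\
16:09:51:WARNING:WU03:FS02:Failed to get assignment from '171.67.108.200:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
17:45:10:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
09:47:46:WU03:FS01:0x18:ERROR:103: Lost client lifeline
12:33:28:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
"""

if __name__ == "__main__":
    for server, failures in count_failed_assignments(SAMPLE).most_common():
        print(server, failures)
```

Lines that are not assignment failures (date separators, the "Lost client lifeline" error) are simply skipped, so the whole log can be pasted in unchanged.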
Here is the log from just before the last error.

Code:

12:33:27:WU02:FS00:0xa4:Completed 4950000 out of 5000000 steps  (99%)
12:33:28:WU00:FS00:Connecting to 171.67.108.200:8080
12:33:28:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
12:33:28:WU00:FS00:Connecting to 171.67.108.204:80
12:33:29:WU00:FS00:Assigned to work server 155.247.166.219
12:33:29:WU00:FS00:Requesting new work unit for slot 00: RUNNING cpu:6 from 155.247.166.219
12:33:29:WU00:FS00:Connecting to 155.247.166.219:8080
12:33:29:WU00:FS00:Downloading 116.65KiB
12:33:29:WU00:FS00:Download complete
12:33:29:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:6398 run:143 clone:22 gen:85 core:0xa4 unit:0x0000005f0002894b5462cbc3412244b2
12:35:17:WU01:FS02:0x18:Completed 2650000 out of 5000000 steps (53%)
12:35:51:WU03:FS01:0x18:Completed 1650000 out of 5000000 steps (33%)
12:37:18:WU02:FS00:0xa4:Completed 5000000 out of 5000000 steps  (100%)
12:37:18:WU02:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
12:37:28:WU02:FS00:0xa4:
12:37:28:WU02:FS00:0xa4:Finished Work Unit:
12:37:28:WU02:FS00:0xa4:- Reading up to 1256376 from "02/wudata_01.trr": Read 1256376
12:37:28:WU02:FS00:0xa4:trr file hash check passed.
12:37:28:WU02:FS00:0xa4:- Reading up to 111504 from "02/wudata_01.xtc": Read 111504
12:37:28:WU02:FS00:0xa4:xtc file hash check passed.
12:37:28:WU02:FS00:0xa4:edr file hash check passed.
12:37:28:WU02:FS00:0xa4:logfile size: 88690
12:37:28:WU02:FS00:0xa4:Leaving Run
12:37:31:WU02:FS00:0xa4:- Writing 1528470 bytes of core data to disk...
12:37:32:WU02:FS00:0xa4:Done: 1527958 -> 1293596 (compressed to 84.6 percent)
12:37:32:WU02:FS00:0xa4:  ... Done.
12:37:33:WU02:FS00:0xa4:- Shutting down core
12:37:33:WU02:FS00:0xa4:
12:37:33:WU02:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
12:37:33:WU02:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
12:37:33:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:6395 run:41 clone:38 gen:114 core:0xa4 unit:0x000000780002894b5462c743312d9110
12:37:33:WU02:FS00:Uploading 1.23MiB to 155.247.166.219
12:37:33:WU02:FS00:Connecting to 155.247.166.219:8080
12:37:33:WU00:FS00:Starting
12:37:33:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/aoeu/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 00 -suffix 01 -version 704 -lifeline 2204 -checkpoint 15 -np 6
12:37:33:WU00:FS00:Started FahCore on PID 5964
12:37:33:WU00:FS00:Core PID:1772
12:37:33:WU00:FS00:FahCore 0xa4 started
12:37:34:WU00:FS00:0xa4:
12:37:34:WU00:FS00:0xa4:*------------------------------*
12:37:34:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
12:37:34:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
12:37:34:WU00:FS00:0xa4:
12:37:34:WU00:FS00:0xa4:Preparing to commence simulation
12:37:34:WU00:FS00:0xa4:- Looking at optimizations...
12:37:34:WU00:FS00:0xa4:- Created dyn
12:37:34:WU00:FS00:0xa4:- Files status OK
12:37:34:WU00:FS00:0xa4:- Expanded 118937 -> 270464 (decompressed 227.4 percent)
12:37:34:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=118937 data_size=270464, decompressed_data_size=270464 diff=0
12:37:34:WU00:FS00:0xa4:- Digital signature verified
12:37:34:WU00:FS00:0xa4:
12:37:34:WU00:FS00:0xa4:Project: 6398 (Run 143, Clone 22, Gen 85)
12:37:34:WU00:FS00:0xa4:
12:37:34:WU00:FS00:0xa4:Assembly optimizations on if available.
12:37:34:WU00:FS00:0xa4:Entering M.D.
12:37:36:WU02:FS00:Upload complete
12:37:36:WU02:FS00:Server responded WORK_ACK (400)
12:37:36:WU02:FS00:Final credit estimate, 1515.00 points
12:37:36:WU02:FS00:Cleaning up
12:37:39:WU00:FS00:0xa4:Mapping NT from 6 to 6 
12:37:39:WU00:FS00:0xa4:Completed 0 out of 5000000 steps  (0%)
12:41:10:WU00:FS00:0xa4:Completed 50000 out of 5000000 steps  (1%)
12:41:55:WU03:FS01:0x18:Completed 1700000 out of 5000000 steps (34%)
12:44:42:WU00:FS00:0xa4:Completed 100000 out of 5000000 steps  (2%)
12:47:02:WU01:FS02:0x18:Completed 2700000 out of 5000000 steps (54%)
12:47:58:WU03:FS01:0x18:Completed 1750000 out of 5000000 steps (35%)
12:48:03:WU00:FS00:0xa4:Completed 150000 out of 5000000 steps  (3%)
12:51:36:WU00:FS00:0xa4:Completed 200000 out of 5000000 steps  (4%)
I have three slots; all have WUs and are folding.
I hope this is helpful.
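As a rough check that a slot really is making progress, the "Completed N out of M steps" lines can be turned into an estimate of the time remaining. The sketch below is only an illustration: the function name `estimate_remaining_seconds` is an assumption, it uses only the first and last progress lines for one work unit, and it assumes both fall on the same day, since the log timestamps carry no date.

```python
import re

# Matches progress lines such as
# 12:41:10:WU00:FS00:0xa4:Completed 50000 out of 5000000 steps  (1%)
PROGRESS = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):(WU\d+):FS\d+:0x[0-9a-f]+:"
    r"Completed (\d+) out of (\d+) steps"
)


def estimate_remaining_seconds(log_text, wu):
    """Estimate seconds left for work unit `wu` (e.g. "WU00"), or None if unknown."""
    samples = []  # (seconds since midnight, steps completed)
    total_steps = None
    for line in log_text.splitlines():
        match = PROGRESS.match(line.strip())
        if match and match.group(4) == wu:
            hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
            samples.append((hours * 3600 + minutes * 60 + seconds, int(match.group(5))))
            total_steps = int(match.group(6))
    if len(samples) < 2:
        return None
    (t_first, steps_first), (t_last, steps_last) = samples[0], samples[-1]
    elapsed = t_last - t_first
    done = steps_last - steps_first
    if elapsed <= 0 or done <= 0:
        return None  # no measurable progress, or the samples span midnight
    # remaining steps divided by the observed rate (done / elapsed)
    return (total_steps - steps_last) * elapsed / done


# Two WU00 lines copied from the log above, plus one line from another WU.
SAMPLE = """\
12:37:39:WU00:FS00:0xa4:Completed 0 out of 5000000 steps  (0%)
12:41:55:WU03:FS01:0x18:Completed 1700000 out of 5000000 steps (34%)
12:51:36:WU00:FS00:0xa4:Completed 200000 out of 5000000 steps  (4%)
"""

if __name__ == "__main__":
    remaining = estimate_remaining_seconds(SAMPLE, "WU00")
    print(f"WU00: about {remaining / 3600:.1f} hours remaining")
```

This is a straight-line extrapolation, so it ignores checkpoints, pauses, and other load on the machine.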

Re: 171.67.108.200 - Internal Server Error

Posted: Sun May 17, 2015 2:18 pm
by VijayPande
We had an electrical storm (with several brief power outages) which put some of our servers and the Stanford net into a weird state. I think we have the WS’s back in shape, but we’ll keep an eye on this.

Re: 171.67.108.200 - Internal Server Error

Posted: Sun May 17, 2015 2:18 pm
by VijayPande
As for the project descriptions, they are on fah-web which is one of the key machines having problems.

I’ve taken a look at it and there’s a more serious issue with a particular server than I can take care of myself. I’ve filed a ticket with the sysadmin team. Best guess ETA on this being fixed is Monday at noon (assuming this is something simple).

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 7:46 am
by billford
VijayPande wrote:I think we have the WS’s back in shape, but we’ll keep an eye on this.
Somewhat belated response, I've been otherwise occupied.

Thanks for the info, WS assignment seems OK now.

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 2:34 pm
by Simplex0
VijayPande wrote:We had an electrical storm (with several brief power outages) which put some of our servers and the Stanford net into a weird state. I think we have the WS’s back in shape, but we’ll keep an eye on this.
Really? ........

"NO IMPACT: A CME expected to sideswipe Earth's magnetic field on May 17th either missed or its impact was undetectable. As a result, geomagnetic activity is low and likely to remain so for the next 24 to 48 hours."

http://spaceweather.com/

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 3:30 pm
by DemonfangArun
VijayPande wrote:We had an electrical storm (with several brief power outages) which put some of our servers and the Stanford net into a weird state. I think we have the WS’s back in shape, but we’ll keep an eye on this.
Nothing in the budget for at least some cheap consumer-grade line-interactive UPSs (or some proper server-grade online units)? I would think a place like Stanford could at least get something to keep servers online through power blips (and shut down servers properly!).

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 4:31 pm
by 7im
Simplex0 wrote:
VijayPande wrote:We had an electrical storm (with several brief power outages) which put some of our servers and the Stanford net into a weird state. I think we have the WS’s back in shape, but we’ll keep an eye on this.
Really? ........

"NO IMPACT: A CME expected to sideswipe Earth's magnetic field on May 17th either missed or its impact was undetectable. As a result, geomagnetic activity is low and likely to remain so for the next 24 to 48 hours."

http://spaceweather.com/
The late to arrive El Nino is fueling larger and more severe storms in the Southwest.
http://www.noaa.gov

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 4:35 pm
by billford
Stats (and project descriptions) would seem to be back again :)

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 6:35 pm
by bruce
DemonfangArun wrote:
VijayPande wrote:We had an electrical storm (with several brief power outages) which put some of our servers and the Stanford net into a weird state. I think we have the WS’s back in shape, but we’ll keep an eye on this.
Nothing in the budget for at least some cheap consumer-grade line-interactive UPSs (or some proper server-grade online units)? I would think a place like Stanford could at least get something to keep servers online through power blips (and shut down servers properly!).
Apparently the servers were protected by (I suppose: server-grade) UPSs but something else wasn't (like maybe a key router). See Update on fah-web May 2015

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 6:57 pm
by Simplex0
7im wrote:
Simplex0 wrote:
VijayPande wrote:We had an electrical storm (with several brief power outages) which put some of our servers and the Stanford net into a weird state. I think we have the WS’s back in shape, but we’ll keep an eye on this.
Really? ........

"NO IMPACT: A CME expected to sideswipe Earth's magnetic field on May 17th either missed or its impact was undetectable. As a result, geomagnetic activity is low and likely to remain so for the next 24 to 48 hours."

http://spaceweather.com/
The late to arrive El Nino is fueling larger and more severe storms in the Southwest.
http://www.noaa.gov
In this case the words were "electrical storm", which the sun is able to trigger but not, as far as I know, El Nino.

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 7:05 pm
by DemonfangArun
bruce wrote:
DemonfangArun wrote:
VijayPande wrote:We had an electrical storm (with several brief power outages) which put some of our servers and the Stanford net into a weird state. I think we have the WS’s back in shape, but we’ll keep an eye on this.
Nothing in the budget for at least some cheap consumer-grade line-interactive UPSs (or some proper server-grade online units)? I would think a place like Stanford could at least get something to keep servers online through power blips (and shut down servers properly!).
Apparently the servers were protected by (I suppose: server-grade) UPSs but something else wasn't (like maybe a key router). See Update on fah-web May 2015
That would've been my next guess. "UPS all the things" is my typical strategy, because putting a computer on a UPS to prevent gaming interruptions does no good if the network goes off, for example. Anyway, glad it's fixed, and it should be interesting to see how the 3PM EOC update fares... chokes... jk

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 7:06 pm
by billford
Simplex0 wrote: In this case the words were "electrical storm", which the sun is able to trigger but not, as far as I know, El Nino.
CMEs cause geomagnetic storms, not electrical storms; electrical storms are just common-or-garden thunderstorms.

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 8:39 pm
by sco01
20:36:10:WARNING:WU01:FS01:Failed to get assignment from '171.67.108.200:8080': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
20:36:10:WU01:FS01:Connecting to 171.67.108.204:80
20:36:11:WARNING:WU01:FS01:Failed to get assignment from '171.67.108.204:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
20:36:11:ERROR:WU01:FS01:Exception: Could not get an assignment
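The pattern in these four lines (try 171.67.108.200:8080, fall back to 171.67.108.204:80, give up only when both fail) can be sketched roughly as below. This is an illustration of the fallback order visible in the log, not FAHClient's actual code: `AssignmentError`, `get_assignment`, and the two stand-in request functions are hypothetical names, and no network traffic is involved.

```python
class AssignmentError(Exception):
    """Raised when an assignment server cannot hand out work (hypothetical name)."""


def get_assignment(servers, request_assignment):
    """Try each assignment server in order and return the first work server offered.

    `request_assignment` is any callable that takes 'host:port' and either
    returns a work server address or raises AssignmentError.
    """
    failures = []
    for server in servers:
        try:
            return request_assignment(server)
        except AssignmentError as err:
            # Corresponds to the WARNING lines in the log
            failures.append(f"Failed to get assignment from '{server}': {err}")
    # Corresponds to "ERROR: Exception: Could not get an assignment"
    raise AssignmentError("Could not get an assignment; " + "; ".join(failures))


# Order taken from the log lines above.
SERVERS = ["171.67.108.200:8080", "171.67.108.204:80"]
SERVER_ERROR = "10001: Server responded: HTTP_INTERNAL_SERVER_ERROR"


def always_failing(server):
    """Stand-in for the situation in this post: every assignment server errors out."""
    raise AssignmentError(SERVER_ERROR)


def second_server_works(server):
    """Stand-in for the earlier log in this thread, where the fallback server answered."""
    if server == "171.67.108.200:8080":
        raise AssignmentError(SERVER_ERROR)
    return "155.247.166.219"


if __name__ == "__main__":
    print(get_assignment(SERVERS, second_server_works))
    try:
        get_assignment(SERVERS, always_failing)
    except AssignmentError as err:
        print("ERROR:", err)
```

The first call mirrors the earlier log in this thread, where the second assignment server answered and handed out a work server; the second call mirrors the log in this post, where both failed.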

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 8:49 pm
by billford
OK here, but different port:

Code:

20:42:36:WU00:FS01:0x17:Completed 4950000 out of 5000000 steps (99%)
20:44:01:WU00:FS01:0x17:Completed 5000000 out of 5000000 steps (100%)
20:44:01:WU01:FS01:Connecting to 171.67.108.200:80
20:44:02:WU01:FS01:Assigned to work server 171.64.65.56
20:44:02:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GM204 [GeForce GTX 980] from 171.64.65.56
20:44:02:WU01:FS01:Connecting to 171.64.65.56:8080
20:44:03:WU01:FS01:Downloading 889.82KiB
20:44:03:WU00:FS01:Connecting to 171.67.108.52:8080
20:44:05:WU01:FS01:Download complete
20:44:05:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9411 run:715 clone:0 gen:79 core:0x17 unit:0x00000061ab40413854d27c3131671be8
20:44:05:WU01:FS01:Starting

Re: 171.67.108.200 - Internal Server Error

Posted: Mon May 18, 2015 8:53 pm
by 7im
bruce wrote:
Apparently the servers were protected by (I suppose: server-grade) UPSs but something else wasn't (like maybe a key router). See Update on fah-web May 2015
Even server-grade UPS systems eventually shut down if the main power is out for too long and there is no generator to fall back on. That also explains the heavy load when power was restored. And I think Dr. Pande meant PERC Reset, as in a Dell PowerEdge Expandable RAID Controller. ;)