
Re: Bigadv Collection and or Assignment server is broken

Posted: Mon Oct 01, 2012 9:34 am
by kasson
We are continuing to investigate but think it may be a WS-CS communication issue. More updates to come.

128.143.231.201

Posted: Mon Oct 01, 2012 3:04 pm
by TheWolf
Mod Edit: Post Merged.

No points for the work unit below, please fix. It took me almost 2 full days to work through this one.

Completed: 110
[04:41:29] - Preparing to get new work unit...
[04:41:29] Cleaning up work directory
[04:41:29] + Attempting to get work packet
[04:41:29] Passkey found
[04:41:29] - Connecting to assignment server
[04:41:30] - Successful: assigned to (128.143.231.201)
[04:41:30] + News From Folding@Home: Welcome to Folding@Home
[04:41:30] Loaded queue successfully.
[04:42:21] + Closed connections
[04:42:21]
[04:42:21] + Processing work unit
[04:42:21] Core required: FahCore_a5.exe
[04:42:21] Core found.
[04:42:21] Working on queue slot 02 [September 29 04:42:21 UTC]
[04:42:21] + Working ...
[04:42:21]
[04:42:21] *------------------------------*
[04:42:21] Folding@Home Gromacs SMP Core
[04:42:21] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[04:42:21]
[04:42:21] Preparing to commence simulation
[04:42:21] - Looking at optimizations...
[04:42:21] - Created dyn
[04:42:21] - Files status OK
[04:42:23] - Expanded 30305526 -> 33158020 (decompressed 109.4 percent)
[04:42:23] Called DecompressByteArray: compressed_data_size=30305526 data_size=33158020, decompressed_data_size=33158020 diff=0
[04:42:24] - Digital signature verified
[04:42:24]
[04:42:24] Project: 8101 (Run 24, Clone 1, Gen 39)
[04:42:24]
[04:42:24] Assembly optimizations on if available.
[04:42:24] Entering M.D.
[04:42:31] Mapping NT from 18 to 18
[04:42:48] Completed 0 out of 250000 steps (0%)
[05:15:17] Completed 2500 out of 250000 steps (1%)
[05:47:10] Completed 5000 out of 250000 steps (2%)
[06:19:00] Completed 7500 out of 250000 steps (3%)
[...]
[05:55:54] Completed 235000 out of 250000 steps (94%)
[06:25:57] Completed 237500 out of 250000 steps (95%)
[06:55:58] Completed 240000 out of 250000 steps (96%)
[07:26:05] Completed 242500 out of 250000 steps (97%)
[07:56:12] Completed 245000 out of 250000 steps (98%)
[08:26:16] Completed 247500 out of 250000 steps (99%)
[08:56:20] Completed 250000 out of 250000 steps (100%)
[08:56:30] DynamicWrapper: Finished Work Unit: sleep=10000
[08:56:40]
[08:56:40] Finished Work Unit:
[08:56:40] - Reading up to 64340496 from "work/wudata_02.trr": Read 64340496
[08:56:40] trr file hash check passed.
[08:56:40] - Reading up to 31618496 from "work/wudata_02.xtc": Read 31618496
[08:56:41] xtc file hash check passed.
[08:56:41] edr file hash check passed.
[08:56:41] logfile size: 219703
[08:56:41] Leaving Run
[08:56:44] - Writing 96339571 bytes of core data to disk...
[08:56:59] Done: 96339059 -> 91562584 (compressed to 5.8 percent)
[08:56:59] ... Done.
[08:57:10] - Shutting down core
[08:57:10]
[08:57:10] Folding@home Core Shutdown: FINISHED_UNIT
[08:57:12] CoreStatus = 64 (100)
[08:57:12] Sending work to server
[08:57:12] Project: 8101 (Run 24, Clone 1, Gen 39)


[08:57:12] + Attempting to send results October 1 08:57:12 UTC
[09:19:29] - Server reports problem with unit.

[09:19:29] - Preparing to get new work unit...
[09:19:29] Cleaning up work directory
[09:19:29] + Attempting to get work packet
[09:19:29] Passkey found
[09:19:29] - Connecting to assignment server
[09:19:30] - Successful: assigned to (128.143.199.96).
[09:19:30] + News From Folding@Home: Welcome to

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Tue Oct 02, 2012 12:19 pm
by kasson
We have identified and fixed a WS-CS communication issue. This problem should be taken care of going forward; we are continuing to review the logs to analyze the impact of the problem on rejected work units.

128.143.231.201 or Bigadv Collection server broken #2

Posted: Wed Oct 03, 2012 11:55 am
by Grandpa_01
Since the other thread is locked and this may be related to it, I started another one. It appears that one of the members on the team I fold for may still be having the problem. Below is the quote from his report; someone may want to check it out.
Mod Edit: Post Merged.
freeloader1969 wrote: I've had two 8101s go bad for half a million points. I'll let this one finish and if it fails, I'll be shutting down my folding rigs until Stanford fixes their "problem". My latest one just failed this morning.
Grandpa_01 wrote: freeloader1969, what do you mean by "fails"? Are you getting the "server has a problem with the unit" message, or are you getting EUE errors or 0x8b errors? You should not be getting the server error. If you are, it needs to be reported over at the FF; they cannot fix an issue if they do not know about it. All of the messed-up WUs should have been completed by now.
freeloader1969 wrote: I got the "server has a problem with the unit" last night.

Code:

[02:16:55] Completed 242500 out of 250000 steps  (97%)
[02:45:20] Completed 245000 out of 250000 steps  (98%)
[03:13:48] Completed 247500 out of 250000 steps  (99%)
[03:42:18] Completed 250000 out of 250000 steps  (100%)
[03:42:31] DynamicWrapper: Finished Work Unit: sleep=10000
[03:42:41] 
[03:42:41] Finished Work Unit:
[03:42:41] - Reading up to 64340496 from "work/wudata_04.trr": Read 64340496
[03:42:42] trr file hash check passed.
[03:42:42] - Reading up to 31616784 from "work/wudata_04.xtc": Read 31616784
[03:42:42] xtc file hash check passed.
[03:42:42] edr file hash check passed.
[03:42:42] logfile size: 203100
[03:42:42] Leaving Run
[03:42:42] - Writing 96321256 bytes of core data to disk...
[03:43:14] Done: 96320744 -> 91568336 (compressed to 5.8 percent)
[03:43:14]   ... Done.
[03:43:25] - Shutting down core
[03:43:25] 
[03:43:25] Folding@home Core Shutdown: FINISHED_UNIT
[03:43:27] CoreStatus = 64 (100)
[03:43:27] Sending work to server
[03:43:27] Project: 8101 (Run 22, Clone 1, Gen 60)


[03:43:27] + Attempting to send results [October 2 03:43:27 UTC]
[04:01:56] - Server reports problem with unit.
[04:01:56] - Preparing to get new work unit...
[04:01:56] Cleaning up work directory
http://hardforum.com/showthread.php?t=1719949

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Wed Oct 03, 2012 7:14 pm
by bruce
I'm going to unlock this topic. Note that
kasson wrote: This problem should be taken care of going forward; we are continuing to review the logs to analyze the impact of the problem on rejected work units.
That means several things.
1) WUs downloaded AFTER the problem was fixed should have no problem.
2) The Pande Group already has enough information about the WUs which were rejected to analyze the impact, so they DO NOT need posts from everybody saying "It happened to me and here are the details...." (I suspect that's the main reason the topic was locked, and it's still a valid reason NOT to see that type of post.)
3) Until that analysis has been completed, don't expect any more information.

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Thu Oct 04, 2012 2:00 am
by Joe_H
Grandpa_01 wrote: It appears that one of the members on the team I fold for may still be having the problem. Below is the quote from his report; someone may want to check it out.
There is insufficient information in the quoted material to determine one way or the other whether the problem still exists. Estimating backwards from the TPF of the last couple of frames shown in the log excerpt, the WU could have been downloaded while the problem was still occurring. Or it could have been a re-download of one of the problem WUs issued during that period. If freeloader is still having this problem, have him post about it in this forum.
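For what it's worth, here is a rough sketch of that backwards estimate (the ~28.5 minute TPF and the October 2 03:43 UTC upload time are read off the log excerpt above; the arithmetic is only an illustration of the reasoning, not an exact reconstruction):

Code:

from datetime import datetime, timedelta

# Time per frame (1% = 2500 steps), estimated from the last frames in the excerpt:
# 02:16:55 -> 02:45:20 -> 03:13:48 -> 03:42:18 is roughly 28.5 minutes per frame.
tpf = timedelta(minutes=28.5)

# The upload attempt in the log is stamped October 2, 03:43 UTC.
upload_attempt = datetime(2012, 10, 2, 3, 43)

# Working backwards over the 100 frames of the WU gives an approximate download time.
estimated_download = upload_attempt - 100 * tpf
print(estimated_download)  # about September 30, 04:13 UTC - i.e. before the fix was announced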

As for the WU, two others have successfully turned in WUs with the same PRCG as the one shown being uploaded in the log.

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Thu Oct 04, 2012 10:19 pm
by PinHead
All of my rejected WUs were refolded and uploaded successfully. All new WUs downloaded after the fix (since it takes a day to refold a WU) processed normally. I did not experience multiple rejects of the same WU.

Thanks for the quick response, Kasson.

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Wed Oct 31, 2012 9:42 pm
by 3.0charlie
Reviving an old thread... I started folding again on the 4P rig doing -bigadv WUs. As of today, the first 2 WUs have failed to upload. The servers are 128.143.231.201 and 128.143.199.97.

The WUs are:

Project: 8101 (Run 3, Clone 5, Gen 90)
Project: 8101 (Run 3, Clone 2, Gen 94)

I am now folding Project: 8101 (Run 16, Clone 1, Gen 57). I have very little hope that this third WU will upload successfully.

Code:

[17:41:29] - Preparing to get new work unit...
[17:41:29] Cleaning up work directory
[17:41:29] + Attempting to get work packet
[17:41:29] Passkey found
[17:41:29] - Connecting to assignment server
[17:41:39] - Successful: assigned to (128.143.231.201).
[17:41:39] + News From Folding@Home: Welcome to Folding@Home
[17:41:39] Loaded queue successfully.
[17:41:53] + Closed connections
[17:41:53] 
[17:41:53] + Processing work unit
[17:41:53] Core required: FahCore_a5.exe
[17:41:53] Core found.
[17:41:53] Working on queue slot 03 [October 31 17:41:53 UTC]
[17:41:53] + Working ...
thekraken: The Kraken 0.7-pre15 (compiled Sun Oct 28 20:27:39 EDT 2012 by folding@Linux-Server)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 4582
thekraken: Logging to thekraken.log
[17:41:53] 
[17:41:53] *------------------------------*
[17:41:53] Folding@Home Gromacs SMP Core
[17:41:53] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[17:41:53] 
[17:41:53] Preparing to commence simulation
[17:41:53] - Looking at optimizations...
[17:41:53] - Created dyn
[17:41:53] - Files status OK
[17:41:57] - Expanded 30305865 -> 33158020 (decompressed 109.4 percent)
[17:41:57] Called DecompressByteArray: compressed_data_size=30305865 data_size=33158020, decompressed_data_size=33158020 diff=0
[17:41:58] - Digital signature verified
[17:41:58] 
[17:41:58] Project: 8101 (Run 16, Clone 1, Gen 57)
[17:41:58] 
[17:41:58] Assembly optimizations on if available.
[17:41:58] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra, 
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_03.tpr, VERSION 4.5.5-dev-20120903-d64b9e3 (single precision)
[17:42:05] Mapping NT from 48 to 48 
Starting 48 threads
Making 2D domain decomposition 12 x 4 x 1
starting mdrun 'FP_membrane in water'
14500000 steps,  58000.0 ps (continuing from step 14250000,  57000.0 ps).
[17:42:10] Completed 0 out of 250000 steps  (0%)

NOTE: Turning on dynamic load balancing

[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09]     (128.143.231.201:80)
[17:53:09] - Error: Could not transmit unit 01 (completed October 30) to work server.


[17:53:09] + Attempting to send results [October 31 17:53:09 UTC]
[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09]     (128.143.199.97:8080)
[17:53:09] + Retrying using alternative port
[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09]     (128.143.199.97:80)
[17:53:09]   Could not transmit unit 01 to Collection server; keeping in queue.
[17:53:09] Project: 8101 (Run 3, Clone 5, Gen 90)


[17:53:09] + Attempting to send results [October 31 17:53:09 UTC]
[18:05:00] Completed 2500 out of 250000 steps  (1%)
[18:10:22] - Couldn't send HTTP request to server
[18:10:22] + Could not connect to Work Server (results)
[18:10:22]     (128.143.231.201:8080)
[18:10:22] + Retrying using alternative port
[18:27:37] - Couldn't send HTTP request to server
[18:27:37] + Could not connect to Work Server (results)
[18:27:37]     (128.143.231.201:80)
[18:27:37] - Error: Could not transmit unit 02 (completed October 31) to work server.
[18:27:37]   Keeping unit 02 in queue.
[18:47:21] Completed 5000 out of 250000 steps  (2%)
[19:22:46] Completed 7500 out of 250000 steps  (3%)
[20:03:08] Completed 10000 out of 250000 steps  (4%)
[20:34:26] Completed 12500 out of 250000 steps  (5%)
[21:06:07] Completed 15000 out of 250000 steps  (6%)
[21:33:16] Completed 17500 out of 250000 steps  (7%)

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Wed Oct 31, 2012 10:15 pm
by bruce
Are you able to open these in your browser?
http://128.143.231.201:8080
http://128.143.231.201
http://128.143.199.97:8080
http://128.143.199.97

I get a blank page (no error messages) from each of them.
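If a browser isn't handy on the folding rig, a short script along these lines runs the same check (just a sketch; the addresses and ports are the ones listed above, and any response at all, even an empty page, means contact was made):

Code:

import urllib.request

# Work server and collection server addresses/ports from this thread.
urls = [
    "http://128.143.231.201:8080",
    "http://128.143.231.201",
    "http://128.143.199.97:8080",
    "http://128.143.199.97",
]

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=15) as resp:
            # A blank body is fine; getting any HTTP status back means the server answered.
            print(url, "-> reachable, HTTP status", resp.status)
    except Exception as exc:
        print(url, "-> NOT reachable:", exc)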

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Thu Nov 01, 2012 2:19 am
by 3.0charlie
Negative on all four - being servers, I should see 'OK', right?
I actually tried it from my desktop (read: daily computer), and I see a blank page too.

Firewall? I'm using pfSense.

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Thu Nov 01, 2012 2:30 am
by bruce
Only some servers will show the "OK"; others will return an empty page.

What's important is whether you do or do not get an error message. The blank page confirms you made contact, even without the OK.

Yes, it most certainly could be your firewall. Pause the slot. Close any program that might connect to the internet. Turn OFF your firewall (briefly). Start folding again. Does anything upload? If so, figure out how to allow FAH to penetrate your firewall.
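If you want a quicker check than waiting for the client's next upload attempt, a plain TCP connection test to the upload ports will show whether the firewall is the blocker; run it once with the pfSense rules active and once with them disabled (a rough sketch, using the server addresses from the log above):

Code:

import socket

# Work server and collection server from the log, on the two ports the client tries.
targets = [
    ("128.143.231.201", 8080),
    ("128.143.231.201", 80),
    ("128.143.199.97", 8080),
    ("128.143.199.97", 80),
]

for host, port in targets:
    try:
        # If this succeeds, outbound TCP to that port is not being blocked locally.
        with socket.create_connection((host, port), timeout=10):
            print(host, port, "-> TCP connect OK")
    except OSError as exc:
        print(host, port, "-> connect failed:", exc)

If these only succeed with the firewall disabled, the pfSense rules (or a proxy/traffic shaper in front of the rig) are the place to look.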

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Thu Nov 01, 2012 2:33 am
by 3.0charlie
Just did a traceroute of 128.143.231.201. Traceroute output:

1 0.0.0.0 (0.0.0.0) 8.011 ms 7.610 ms 6.801 ms
2 10.170.178.49 (10.170.178.49) 12.048 ms 17.905 ms 12.106 ms
3 10.170.168.194 (10.170.168.194) 12.998 ms 15.562 ms 11.820 ms
4 216.113.123.69 (216.113.123.69) 14.607 ms 11.342 ms 12.148 ms
5 216.113.124.90 (216.113.124.90) 29.638 ms 27.815 ms 23.801 ms
6 * equinix-ash.ntelos.net (206.223.115.156) 48.613 ms 20.409 ms
7 216-24-99-62.unassigned.ntelos.net (216.24.99.62) 29.199 ms 27.991 ms 28.642 ms
8 206-248-255-146.unassigned.ntelos.net (206.248.255.146) 35.746 ms 47.179 ms 28.119 ms
9 carruthers-6509a-x.misc.Virginia.EDU (128.143.222.92) 28.150 ms 28.444 ms 28.155 ms
10 * * *


It ends there at the 18th hop. Is it normal that it ends at 128.143.222.92?

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Thu Nov 01, 2012 2:58 am
by PinHead
You seem to be dying one hop short, missing pmks04.med.Virginia.EDU [128.143.231.201]. I get 97 ms on 128.143.222.92.

But they are in Virginia, which was also smacked by Sandy.

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Thu Nov 01, 2012 4:34 pm
by bruce
PinHead wrote: You seem to be dying one hop short, missing pmks04.med.Virginia.EDU [128.143.231.201]. I get 97 ms on 128.143.222.92.

But they are in Virginia, which was also smacked by Sandy.
True, and I was concerned about that, too, but it doesn't seem to have taken the server out of service.

Any router/server can be configured NOT to respond to pings, so a traceroute may show timeouts for individual hops from time to time. I can't speak for pmks04 specifically.

There are two important facts:
1) You can open the (blank) web page at that URL, so the server is responding properly to HTTP requests.
2) Serverstat shows variations in WUs Rcv and in WUs To Go, so it's actually doing what it's supposed to be doing, just not for 3.0charlie.

Re: 128.143.231.201 or Bigadv Collection server broken

Posted: Thu Nov 01, 2012 8:30 pm
by 3.0charlie
Thank you both for the feedback. I'll try taking pfSense offline temporarily to confirm that one of my settings is wrong.