128.143.231.201 or Bigadv Collection server broken

Moderators: Site Moderators, FAHC Science Team

kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: Bigadv Collection and/or Assignment server is broken

Post by kasson »

We are continuing to investigate but think it may be a WS-CS communication issue. More updates to come.
TheWolf
Posts: 288
Joined: Thu Jan 24, 2008 10:34 am

128.143.231.201

Post by TheWolf »

Mod Edit: Post Merged.

No points for the work unit below; please fix. It took me almost 2 full days to complete it.

Completed: 110
[04:41:29] - Preparing to get new work unit...
[04:41:29] Cleaning up work directory
[04:41:29] + Attempting to get work packet
[04:41:29] Passkey found
[04:41:29] - Connecting to assignment server
[04:41:30] - Successful: assigned to (128.143.231.201)
[04:41:30] + News From Folding@Home: Welcome to Folding@Home
[04:41:30] Loaded queue successfully.
[04:42:21] + Closed connections
[04:42:21]
[04:42:21] + Processing work unit
[04:42:21] Core required: FahCore_a5.exe
[04:42:21] Core found.
[04:42:21] Working on queue slot 02 [September 29 04:42:21 UTC]
[04:42:21] + Working ...
[04:42:21]
[04:42:21] *------------------------------*
[04:42:21] Folding@Home Gromacs SMP Core
[04:42:21] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[04:42:21]
[04:42:21] Preparing to commence simulation
[04:42:21] - Looking at optimizations...
[04:42:21] - Created dyn
[04:42:21] - Files status OK
[04:42:23] - Expanded 30305526 -> 33158020 (decompressed 109.4 percent)
[04:42:23] Called DecompressByteArray: compressed_data_size=30305526 data_size=33158020, decompressed_data_size=33158020 diff=0
[04:42:24] - Digital signature verified
[04:42:24]
[04:42:24] Project: 8101 (Run 24, Clone 1, Gen 39)
[04:42:24]
[04:42:24] Assembly optimizations on if available.
[04:42:24] Entering M.D.
[04:42:31] Mapping NT from 18 to 18
[04:42:48] Completed 0 out of 250000 steps (0%)
[05:15:17] Completed 2500 out of 250000 steps (1%)
[05:47:10] Completed 5000 out of 250000 steps (2%)
[06:19:00] Completed 7500 out of 250000 steps (3%)
[...]
[05:55:54] Completed 235000 out of 250000 steps (94%)
[06:25:57] Completed 237500 out of 250000 steps (95%)
[06:55:58] Completed 240000 out of 250000 steps (96%)
[07:26:05] Completed 242500 out of 250000 steps (97%)
[07:56:12] Completed 245000 out of 250000 steps (98%)
[08:26:16] Completed 247500 out of 250000 steps (99%)
[08:56:20] Completed 250000 out of 250000 steps (100%)
[08:56:30] DynamicWrapper: Finished Work Unit: sleep=10000
[08:56:40]
[08:56:40] Finished Work Unit:
[08:56:40] - Reading up to 64340496 from "work/wudata_02.trr": Read 64340496
[08:56:40] trr file hash check passed.
[08:56:40] - Reading up to 31618496 from "work/wudata_02.xtc": Read 31618496
[08:56:41] xtc file hash check passed.
[08:56:41] edr file hash check passed.
[08:56:41] logfile size: 219703
[08:56:41] Leaving Run
[08:56:44] - Writing 96339571 bytes of core data to disk...
[08:56:59] Done: 96339059 -> 91562584 (compressed to 5.8 percent)
[08:56:59] ... Done.
[08:57:10] - Shutting down core
[08:57:10]
[08:57:10] Folding@home Core Shutdown: FINISHED_UNIT
[08:57:12] CoreStatus = 64 (100)
[08:57:12] Sending work to server
[08:57:12] Project: 8101 (Run 24, Clone 1, Gen 39)


[08:57:12] + Attempting to send results October 1 08:57:12 UTC
[09:19:29] - Server reports problem with unit.

[09:19:29] - Preparing to get new work unit...
[09:19:29] Cleaning up work directory
[09:19:29] + Attempting to get work packet
[09:19:29] Passkey found
[09:19:29] - Connecting to assignment server
[09:19:30] - Successful: assigned to (128.143.199.96).
[09:19:30] + News From Folding@Home: Welcome to
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: 128.143.231.201 or Bigadv Collection server broken

Post by kasson »

We have identified and fixed a WS-CS communication issue. This problem should be taken care of going forward; we are continuing to review the logs to analyze the impact of the problem on rejected work units.
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5GHz, 96GB G.Skill DDR3 1333MHz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4GHz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III i7 970 4.3GHz DDR3 2000 2-500GB Seagate 7200.11 RAID 0 Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86GHz ATI 5870M

128.143.231.201 or Bigadv Collection server broken #2

Post by Grandpa_01 »

Since the other thread is locked and this may be related to it, I started another one. It appears that one of the members on the team I fold for may still be having the problem. Below is the quote from his report; someone may want to check it out.
Mod Edit: Post Merged.
freeloader1969;1039194919 wrote: I've had two 8101's go bad for half a million points. I'll let this one finish and if it fails, I'll be shutting down my folding rigs until Stanford fixes their "problem". My latest one just failed this morning.
Quote:
Originally Posted by Grandpa_01
freeloader1969, what do you mean by fails? Are you getting the "server has a problem with the unit" message, or are they getting EUE errors or 0x8b errors? You should not be getting the server error. If you are, it needs to be reported over at the FF; they can not fix an issue they do not know about. All of the messed-up WUs should have been completed by now.
freeloader1969;1039196005 wrote: I got the "server has a problem with the unit" last night.

Code:

[02:16:55] Completed 242500 out of 250000 steps  (97%)
[02:45:20] Completed 245000 out of 250000 steps  (98%)
[03:13:48] Completed 247500 out of 250000 steps  (99%)
[03:42:18] Completed 250000 out of 250000 steps  (100%)
[03:42:31] DynamicWrapper: Finished Work Unit: sleep=10000
[03:42:41] 
[03:42:41] Finished Work Unit:
[03:42:41] - Reading up to 64340496 from "work/wudata_04.trr": Read 64340496
[03:42:42] trr file hash check passed.
[03:42:42] - Reading up to 31616784 from "work/wudata_04.xtc": Read 31616784
[03:42:42] xtc file hash check passed.
[03:42:42] edr file hash check passed.
[03:42:42] logfile size: 203100
[03:42:42] Leaving Run
[03:42:42] - Writing 96321256 bytes of core data to disk...
[03:43:14] Done: 96320744 -> 91568336 (compressed to 5.8 percent)
[03:43:14]   ... Done.
[03:43:25] - Shutting down core
[03:43:25] 
[03:43:25] Folding@home Core Shutdown: FINISHED_UNIT
[03:43:27] CoreStatus = 64 (100)
[03:43:27] Sending work to server
[03:43:27] Project: 8101 (Run 22, Clone 1, Gen 60)


[03:43:27] + Attempting to send results [October 2 03:43:27 UTC]
[04:01:56] - Server reports problem with unit.
[04:01:56] - Preparing to get new work unit...
[04:01:56] Cleaning up work directory
http://hardforum.com/showthread.php?t=1719949
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9GHz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15GHz
2 - i7 980X 4.4GHz 2-GTX680
1 - 2700k 4.4GHz GTX680
Total = 464 cores folding
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.143.231.201 or Bigadv Collection server broken

Post by bruce »

I'm going to unlock this topic. Note that
kasson wrote:This problem should be taken care of going forward; we are continuing to review the logs to analyze the impact of the problem on rejected work units.
That means several things.
1) WUs downloaded AFTER the problem was fixed should have no problem.
2) The Pande Group already has enough information about WUs which were rejected for them to analyze the impact, so they DO NOT need posts from everybody saying "It happened to me and here are the details...." (I suspect that's the main reason the topic was locked, and it's still a valid reason NOT to see that type of post.)
3) Until that analysis has been completed, don't expect any more information.
Joe_H
Site Admin
Posts: 7929
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 128.143.231.201 or Bigadv Collection server broken

Post by Joe_H »

Grandpa_01 wrote:It appears that one of the members on the team I fold for may still be having the problem. Below is the quote from his report; someone may want to check it out.
There is insufficient information in the quoted material to determine one way or the other whether the problem still exists. Estimating backwards from the TPF of the last couple of frames shown in the log excerpt, the WU could have been downloaded while the problem was still occurring. Or it could have been a re-download of one of the problem WUs issued during that period. If freeloader is still having this problem, have him post about it in this forum.
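(For the curious, the back-of-the-envelope arithmetic: the last frames in the log excerpt take roughly 28.5 minutes each, so 100 frames is about 47.5 hours, and counting back from the finish timestamp puts the download around September 30, while the problem was still open. A quick sketch, assuming the year is 2012, consistent with the Sandy reference later in this thread:)

Code:

from datetime import datetime, timedelta

tpf = timedelta(minutes=28, seconds=30)    # TPF read off the last frames in the log
finish = datetime(2012, 10, 2, 3, 42, 18)  # "Completed 250000 ... (100%)" timestamp
download_estimate = finish - 100 * tpf     # 100 frames back from 100%
print(download_estimate)                   # ~2012-09-30 04:12 UTC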

As for the WU itself, two others have successfully turned in WUs with the same PRCG as the one shown being uploaded in the log.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
PinHead
Posts: 285
Joined: Tue Jan 24, 2012 3:43 am
Hardware configuration: Quad Q9550 2.83 contains the GPU 57xx - running SMP and GPU
Quad Q6700 2.66 running just SMP
2P 32core Interlagos SMP on linux

Re: 128.143.231.201 or Bigadv Collection server broken

Post by PinHead »

All of my rejected WUs were refolded and uploaded successfully. All new WUs downloaded after the fix (since it takes a day to refold a WU) processed normally. I did not experience multiple rejections of the same WU.

Thanks for the quick response, kasson.
3.0charlie
Posts: 13
Joined: Wed Jul 29, 2009 4:34 pm

Re: 128.143.231.201 or Bigadv Collection server broken

Post by 3.0charlie »

Reviving an old thread... I started folding again on the 4P rig doing -bigadv WUs. As of today, the first 2 WUs have failed to upload. The servers are 128.143.231.201 and 128.143.199.97.

WUs are :

Project: 8101 (Run 3, Clone 5, Gen 90)
Project: 8101 (Run 3, Clone 2, Gen 94)

And I am now folding Project: 8101 (Run 16, Clone 1, Gen 57). I have very little hope that this third WU will upload successfully.

Code:

[17:41:29] - Preparing to get new work unit...
[17:41:29] Cleaning up work directory
[17:41:29] + Attempting to get work packet
[17:41:29] Passkey found
[17:41:29] - Connecting to assignment server
[17:41:39] - Successful: assigned to (128.143.231.201).
[17:41:39] + News From Folding@Home: Welcome to Folding@Home
[17:41:39] Loaded queue successfully.
[17:41:53] + Closed connections
[17:41:53] 
[17:41:53] + Processing work unit
[17:41:53] Core required: FahCore_a5.exe
[17:41:53] Core found.
[17:41:53] Working on queue slot 03 [October 31 17:41:53 UTC]
[17:41:53] + Working ...
thekraken: The Kraken 0.7-pre15 (compiled Sun Oct 28 20:27:39 EDT 2012 by folding@Linux-Server)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 4582
thekraken: Logging to thekraken.log
[17:41:53] 
[17:41:53] *------------------------------*
[17:41:53] Folding@Home Gromacs SMP Core
[17:41:53] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[17:41:53] 
[17:41:53] Preparing to commence simulation
[17:41:53] - Looking at optimizations...
[17:41:53] - Created dyn
[17:41:53] - Files status OK
[17:41:57] - Expanded 30305865 -> 33158020 (decompressed 109.4 percent)
[17:41:57] Called DecompressByteArray: compressed_data_size=30305865 data_size=33158020, decompressed_data_size=33158020 diff=0
[17:41:58] - Digital signature verified
[17:41:58] 
[17:41:58] Project: 8101 (Run 16, Clone 1, Gen 57)
[17:41:58] 
[17:41:58] Assembly optimizations on if available.
[17:41:58] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra, 
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_03.tpr, VERSION 4.5.5-dev-20120903-d64b9e3 (single precision)
[17:42:05] Mapping NT from 48 to 48 
Starting 48 threads
Making 2D domain decomposition 12 x 4 x 1
starting mdrun 'FP_membrane in water'
14500000 steps,  58000.0 ps (continuing from step 14250000,  57000.0 ps).
[17:42:10] Completed 0 out of 250000 steps  (0%)

NOTE: Turning on dynamic load balancing

[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09]     (128.143.231.201:80)
[17:53:09] - Error: Could not transmit unit 01 (completed October 30) to work server.


[17:53:09] + Attempting to send results [October 31 17:53:09 UTC]
[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09]     (128.143.199.97:8080)
[17:53:09] + Retrying using alternative port
[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09]     (128.143.199.97:80)
[17:53:09]   Could not transmit unit 01 to Collection server; keeping in queue.
[17:53:09] Project: 8101 (Run 3, Clone 5, Gen 90)


[17:53:09] + Attempting to send results [October 31 17:53:09 UTC]
[18:05:00] Completed 2500 out of 250000 steps  (1%)
[18:10:22] - Couldn't send HTTP request to server
[18:10:22] + Could not connect to Work Server (results)
[18:10:22]     (128.143.231.201:8080)
[18:10:22] + Retrying using alternative port
[18:27:37] - Couldn't send HTTP request to server
[18:27:37] + Could not connect to Work Server (results)
[18:27:37]     (128.143.231.201:80)
[18:27:37] - Error: Could not transmit unit 02 (completed October 31) to work server.
[18:27:37]   Keeping unit 02 in queue.
[18:47:21] Completed 5000 out of 250000 steps  (2%)
[19:22:46] Completed 7500 out of 250000 steps  (3%)
[20:03:08] Completed 10000 out of 250000 steps  (4%)
[20:34:26] Completed 12500 out of 250000 steps  (5%)
[21:06:07] Completed 15000 out of 250000 steps  (6%)
[21:33:16] Completed 17500 out of 250000 steps  (7%)
Folding for Hardware Canucks
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.143.231.201 or Bigadv Collection server broken

Post by bruce »

Are you able to open these in your browser?
http://128.143.231.201:8080
http://128.143.231.201
http://128.143.199.97:8080
http://128.143.199.97

I get a blank page (no error messages) from each of them.
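For anyone who would rather script that check than click through a browser, here is a minimal Python sketch that fetches each of the four URLs and distinguishes "connected but blank page" from a genuine connection failure (illustrative only, not an official FAH tool):

Code:

import urllib.request

URLS = [
    "http://128.143.231.201:8080",
    "http://128.143.231.201",
    "http://128.143.199.97:8080",
    "http://128.143.199.97",
]

for url in URLS:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            # A 200 with an empty (or "OK") body still means the server answered.
            print(url, "->", "HTTP", resp.status, len(resp.read()), "bytes")
    except Exception as exc:
        # An exception here is the real failure case (timeout, refused, DNS).
        print(url, "-> FAILED:", exc)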
3.0charlie
Posts: 13
Joined: Wed Jul 29, 2009 4:34 pm

Re: 128.143.231.201 or Bigadv Collection server broken

Post by 3.0charlie »

Negative on all four - being servers, I should see 'OK', right?
I actually tried it from my desktop (read: daily computer), and I see a blank page too.

Firewall? I'm using pfSense.
Folding for Hardware Canucks
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.143.231.201 or Bigadv Collection server broken

Post by bruce »

Only some servers will show the "OK"; others will return an empty page.

What's important is whether you do or do not get an error message. The blank page confirms you made contact, even without the OK.

Yes, it most certainly could be your firewall. Pause the slot, close any program that might connect to the internet, and turn OFF your firewall (briefly). Start folding again. Does anything upload? If so, figure out how to allow FAH through your firewall.
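Another way to rule the firewall in or out without folding blind is a raw TCP connect test against the same hosts and ports the client uses, run once with pfSense up and once with it briefly disabled. A minimal Python sketch (hosts and ports are the ones from the logs in this thread; the script is illustrative, not an official FAH tool):

Code:

import socket

# Work server and collection server ports seen in the log above.
TARGETS = [
    ("128.143.231.201", 80), ("128.143.231.201", 8080),
    ("128.143.199.97", 80), ("128.143.199.97", 8080),
]

for host, port in TARGETS:
    try:
        # A successful connect means the path is open at the TCP level.
        with socket.create_connection((host, port), timeout=10):
            print(host, port, "TCP connect OK")
    except OSError as exc:
        # Refused/timeout here points at the network or the firewall.
        print(host, port, "FAILED:", exc)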
3.0charlie
Posts: 13
Joined: Wed Jul 29, 2009 4:34 pm

Re: 128.143.231.201 or Bigadv Collection server broken

Post by 3.0charlie »

Just did a traceroute to 128.143.231.201. Output:

1 0.0.0.0 (0.0.0.0) 8.011 ms 7.610 ms 6.801 ms
2 10.170.178.49 (10.170.178.49) 12.048 ms 17.905 ms 12.106 ms
3 10.170.168.194 (10.170.168.194) 12.998 ms 15.562 ms 11.820 ms
4 216.113.123.69 (216.113.123.69) 14.607 ms 11.342 ms 12.148 ms
5 216.113.124.90 (216.113.124.90) 29.638 ms 27.815 ms 23.801 ms
6 * equinix-ash.ntelos.net (206.223.115.156) 48.613 ms 20.409 ms
7 216-24-99-62.unassigned.ntelos.net (216.24.99.62) 29.199 ms 27.991 ms 28.642 ms
8 206-248-255-146.unassigned.ntelos.net (206.248.255.146) 35.746 ms 47.179 ms 28.119 ms
9 carruthers-6509a-x.misc.Virginia.EDU (128.143.222.92) 28.150 ms 28.444 ms 28.155 ms
10 * * *


It ends there, with the remaining hops timing out through the 18th. Is it normal that the last responding hop is 128.143.222.92?
Folding for Hardware Canucks
PinHead
Posts: 285
Joined: Tue Jan 24, 2012 3:43 am
Hardware configuration: Quad Q9550 2.83 contains the GPU 57xx - running SMP and GPU
Quad Q6700 2.66 running just SMP
2P 32core Interlagos SMP on linux

Re: 128.143.231.201 or Bigadv Collection server broken

Post by PinHead »

You seem to be dying one hop short - missing pmks04.med.Virginia.EDU [128.143.231.201]. I get 97 ms to 128.143.222.92.

But they are in Virginia, which was also smacked by Sandy.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.143.231.201 or Bigadv Collection server broken

Post by bruce »

PinHead wrote:You seem to be dying one hop short - missing pmks04.med.Virginia.EDU [128.143.231.201]. I get 97 ms to 128.143.222.92.

But they are in Virginia, which was also smacked by Sandy.
True, and I was concerned about that, too, but it doesn't seem to have taken the server out of service.

Any router/server can be configured NOT to respond to pings, so traceroute might show individual hops as missing from time to time. I can't speak for pmks04 specifically.

There are two important facts:
1) You can open the (blank) web page at that URL, so the server is responding properly to HTTP requests.
2) Serverstat shows variations in WUs Rcv and in WUs To Go, so it's actually doing what it's supposed to be doing, just not for 3.0charlie.
3.0charlie
Posts: 13
Joined: Wed Jul 29, 2009 4:34 pm

Re: 128.143.231.201 or Bigadv Collection server broken

Post by 3.0charlie »

Thank you both for the feedback. I'll try taking pfSense offline temporarily to confirm that one of my settings is wrong.
Folding for Hardware Canucks