171.64.65.64 overloaded

Moderators: Site Moderators, FAHC Science Team

noorman
Posts: 270
Joined: Sun Dec 02, 2007 2:26 pm
Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+
Location: Belgium, near the International Sea-Port of Antwerp

Re: 171.64.65.64 overloaded

Post by noorman »

Same reports from my friend; it's not possible to get a constant stream of Work and swift uploads.
Production is seriously hampered and very inefficient that way; the donor systems are sitting idle for too long on average!

He runs a farm with a mix of SMP and multi-GPU machines, almost all of them exactly the same systems (hardware and software).
The configs are the same as well.
This is hurting production really badly!

Can someone take control of that server (171.64.65.64) and repair it, please?

.
- stopped Linux SMP w. HT on i7-860@3.5 GHz
....................................
Folded since 10-06-04 till 09-2010
noorman
Posts: 270
Joined: Sun Dec 02, 2007 2:26 pm
Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+
Location: Belgium, near the International Sea-Port of Antwerp

Re: 171.64.65.64 overloaded

Post by noorman »

.
Now it's 171.64.65.64 GPU vspg2v lin5 full Reject 1.92 0 0 6 17883 5049 in REJECT again! :cry:
.
- stopped Linux SMP w. HT on i7-860@3.5 GHz
....................................
Folded since 10-06-04 till 09-2010
GreyWhiskers
Posts: 660
Joined: Mon Oct 25, 2010 5:57 am
Hardware configuration: a) Main unit
Sandybridge in HAF922 w/200 mm side fan
--i7 2600K@4.2 GHz
--ASUS P8P67 DeluxeB3
--4GB ADATA 1600 RAM
--750W Corsair PS
--2Seagate Hyb 750&500 GB--WD Caviar Black 1TB
--EVGA 660GTX-Ti FTW - Signature 2 GPU@ 1241 Boost
--MSI GTX560Ti @900MHz
--Win7Home64; FAH V7.3.2; 327.23 drivers

b) 2004 HP a475c desktop, 1 core Pent 4 HT@3.2 GHz; Mem 2GB;HDD 160 GB;Zotac GT430PCI@900 MHz
WinXP SP3-32 FAH v7.3.6 301.42 drivers - GPU slot only

c) 2005 Toshiba M45-S551 laptop w/2 GB mem, 160GB HDD;Pent M 740 CPU @ 1.73 GHz
WinXP SP3-32 FAH v7.3.6 [Receiving Core A4 work units]
d) 2011 lappy-15.6"-1920x1080;i7-2860QM,2.5;IC Diamond Thermal Compound;GTX 560M 1,536MB u/c@700;16GB-1333MHz RAM;HDD:500GBHyb w/ 4GB SSD;Win7HomePrem64;320.18 drivers FAH 7.4.2ß
Location: Saratoga, California USA

Re: 171.64.65.64 overloaded

Post by GreyWhiskers »

I've extracted selected columns from the server log for the 77 times between May 29 and today, June 22, when the CONNECT status was anything but Accepting.
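For anyone who wants to reproduce that extract, here is a minimal sketch in Python (the local file name is an assumption; the source is the plain-text server-status log):

Code: Select all

# Minimal sketch: keep the rows for 171.64.65.64 whose CONNECT status
# is anything but "Accepting" (i.e. Reject, Not Accept, DOWN).
# Assumes the server-status log has been saved locally as serverstat.log.
with open("serverstat.log") as log:
    for line in log:
        if "171.64.65.64" in line and "Accepting" not in line:
            print(line.rstrip())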

As a "frequent flyer" for 171.64.65.64, I'm wondering if there is something systemic that makes this server down so much - is it the server code, the hardware, the WUs that are loaded, or what. Is that likely to be fixed anytime soon? Or should we adjust our expectations that it will be periodically off-line?

Since this server feeds WUs to very fast Fermi GPUs that come back to the well every couple of hours, there seems to be a lot of dead time: clients keep trying to upload completed WUs, and can't get new Fermi WUs from one of the other servers that are still up until X number of failed attempts to connect to 171.64.65.64.

I appreciate the updates that members of PG, including Dr. Pande, have given in this thread. Maybe it's time for another brief one.

Re: 171.64.65.64 overloaded
by yslin » Sat May 21, 2011 1:49 pm

Hi,

I've been working on this server, but it might take more time to fix. Sorry for the inconvenience!


yslin

yslin
Pande Group Member
Re: 171.64.65.64 overloaded
by VijayPande » Sat May 21, 2011 3:56 pm

It's still having problems, so we're doing a hard reboot. The machine will likely fsck for a while. We'll give you an update when we know more.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University

VijayPande
Pande Group Member

Code: Select all

                             SERVER IP     WHO  STATUS     CONNECT    CPU LOAD  NET LOAD  DL  WUs AVAIL   WUs to go  WUs WAIT
Sun May 29 17:05:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.64         44 31      135700     135700     135700
Sun May 29 20:00:10 PDT 2011 171.64.65.64  lin5 full       Reject             0.7         40 44      135596     135596     135596
Mon May 30 12:00:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.02         26 36      135674     135674     135674
Tue May 31 08:05:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.74         52 36      135886     135886     135886
Tue May 31 11:35:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.59         44 36      135978     135978     135978
Tue May 31 19:10:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.66         44 29      135973     135973     135973
Wed Jun 1 10:30:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.67         37 25      135771     135771     135771
Wed Jun 1 14:35:11 PDT 2011  171.64.65.64  lin5 full       Reject            0.76         37 31      135856     135856     135856
Thu Jun 2 01:45:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.62         45 33      135894     135894     135894
Thu Jun 2 07:10:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.83         54 25      136006     136006     136006
Thu Jun 2 12:25:10 PDT 2011  171.64.65.64  -    full       DOWN              -          -  34           -          -          -
Thu Jun 2 20:40:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.42         43 33      135845     135845     135845
Fri Jun 3 04:30:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.59         69 27      135852     135852     135852
Fri Jun 3 16:10:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.67         80 30      135683     135683     135683
Sun Jun 5 08:55:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.61         52 22      135881     135881     135881
Sun Jun 5 14:45:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.97         33 26      135884     135884     135884
Sun Jun 5 17:05:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.87         19 30      135759     135759     135759
Mon Jun 6 05:35:10 PDT 2011  171.64.65.64  lin5 full       Reject            1.02          0 30      135671     135671     135671
Mon Jun 6 09:45:10 PDT 2011  171.64.65.64  lin5 full       Reject            1.08          0 23      135712     135712     135712
Mon Jun 6 15:35:10 PDT 2011  171.64.65.64  lin5 full       Reject             0.6         52 32      135820     135820     135820
Mon Jun 6 22:00:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.99         34 31           0          0          0
Tue Jun 7 19:15:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.86         45 32      135975     135975     135975
Tue Jun 7 21:35:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.62         25 25      135738     135738     135738
Wed Jun 8 13:40:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.68         43 40      135959     135959     135959
Wed Jun 8 17:45:10 PDT 2011  171.64.65.64  lin5 full       Reject            1.19         40 37      135953     135953     135953
Thu Jun 9 03:50:10 PDT 2011  171.64.65.64  lin5 full       Reject            1.04         41 30      135867     135867     135867
Thu Jun 9 12:00:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.97         39 25           0          0          0
Thu Jun 9 17:15:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.92         61 20      135880     135880     135880
Thu Jun 9 19:35:10 PDT 2011  171.64.65.64  lin5 full       Reject            0.87         41 20      135973     135973     135973
Thu Jun 9 21:55:10 PDT 2011  171.64.65.64  lin5 full       Reject            1.03         46 25      135907     135907     135907
Thu Jun 9 23:40:11 PDT 2011  171.64.65.64  lin5 full       Reject               1         58 29      135722     135722     135722
Fri Jun 10 03:55:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.09        102 29      135833     135833     135833
Fri Jun 10 14:25:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.98         47 20      136055     136055     136055
Fri Jun 10 19:05:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.62         47 21      136026     136026     136026
Fri Jun 10 21:25:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.66         41 28      135969     135969     135969
Sat Jun 11 08:35:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.56         58 18      136001     136001     136001
Sat Jun 11 12:40:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.96         42 21      135974     135974     135974
Sat Jun 11 20:15:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.01         75 25           0          0          0
Sun Jun 12 04:35:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.85         53 18      135859     135859     135859
Sun Jun 12 12:10:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.84         50 23      136084     136084     136084
Sun Jun 12 13:55:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.36         50 23      135951     135951     135951
Sun Jun 12 23:15:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.09          0 28      135780     135780     135780
Mon Jun 13 01:00:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.73         63 17      136218     136218     136218
Mon Jun 13 05:40:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.03          0 39      135797     135797     135797
Mon Jun 13 08:35:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.92         45 39      135854     135854     135854
Mon Jun 13 15:00:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.65         61 26      136054     136054     136054
Tue Jun 14 02:55:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.75         56 15      136042     136042     136042
Wed Jun 15 03:05:10 PDT 2011 171.64.65.64  lin5 standby    Not Accept        0.61         41 25      135992     135992     135992
Wed Jun 15 05:25:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.89         21 20      135759     135759     135759
Wed Jun 15 08:20:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.74         89 23      135805     135805     135805
Wed Jun 15 10:40:10 PDT 2011 171.64.65.64  lin5 full       Reject             0.7         69 23      135897     135897     135897
Thu Jun 16 06:55:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.91         57 11      119531     119531     119531
Thu Jun 16 13:55:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.73         41 15      112457     112457     112457
Thu Jun 16 15:40:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.89         81 16           0          0          0
Fri Jun 17 07:40:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.83         28 28       99541      99541      99541
Fri Jun 17 11:10:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.55         66 28       98062      98062      98062
Sat Jun 18 10:45:11 PDT 2011 171.64.65.64  lin5 full       Reject            0.49         50 14       96491      96491      96491
Sat Jun 18 14:20:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.02        134 16           0          0          0
Sat Jun 18 16:05:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.44         38 17       96379      96379      96379
Sat Jun 18 21:55:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.01         31 19           0          0          0
Sun Jun 19 04:00:11 PDT 2011 171.64.65.64  lin5 full       Reject            1.06         36 14       96604      96604      96604
Sun Jun 19 11:00:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.49         53 13       96562      96562      96562
Mon Jun 20 02:55:10 PDT 2011 171.64.65.64  lin5 standby    Not Accept        1.49         58 19       96788      96788      96788
Mon Jun 20 04:45:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.08          0 12       96622      96622      96622
Mon Jun 20 05:20:11 PDT 2011 171.64.65.64  lin5 full       Reject            0.84         45 12       96904      96904      96904
Mon Jun 20 07:05:11 PDT 2011 171.64.65.64  lin5 full       Reject            0.97        305 12           0          0          0
Mon Jun 20 10:00:10 PDT 2011 171.64.65.64  lin5 full       Reject               1          0 17       96553      96553      96553
Mon Jun 20 10:35:11 PDT 2011 171.64.65.64  lin5 full       Reject            0.61         45 17       96740      96740      96740
Mon Jun 20 18:10:10 PDT 2011 171.64.65.64  lin5 full       Reject            2.61          0 33       96348      96348      96348
Mon Jun 20 18:45:11 PDT 2011 171.64.65.64  lin5 full       Reject             2.6          0 33       96348      96348      96348
Mon Jun 20 19:20:10 PDT 2011 171.64.65.64  lin5 full       Reject            2.65          0 33       96348      96348      96348
Mon Jun 20 19:55:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.67        971  1           0          0          0
Mon Jun 20 22:15:10 PDT 2011 171.64.65.64  lin5 full       Reject            0.31         68  1       96951      96951      96951
Tue Jun 21 14:20:10 PDT 2011 171.64.65.64  lin5 full       Reject            2.02        479  9           0          0          0
Tue Jun 21 16:05:10 PDT 2011 171.64.65.64  lin5 full       Reject            2.14        212  7           0          0          0
Wed Jun 22 09:05:10 PDT 2011 171.64.65.64  lin5 full       Reject            1.92          0  6       96429      96429      96429
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 171.64.65.64 overloaded

Post by bruce »

The Pande Group is aware of the problem. The ultimate fix is complex and will probably not happen quickly. :(

I don't think that taking the server off-line is what anybody wants to do.
noorman
Posts: 270
Joined: Sun Dec 02, 2007 2:26 pm
Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+
Location: Belgium, near the International Sea-Port of Antwerp

Re: 171.64.65.64 overloaded

Post by noorman »

Where is the redundancy?
- stopped Linux SMP w. HT on i7-860@3.5 GHz
....................................
Folded since 10-06-04 till 09-2010
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: 171.64.65.64 overloaded

Post by VijayPande »

We've continued to have trouble with the CS code. Joe is in the process of overhauling it. The current code works well under medium loads, but doesn't scale well. Joe's new scheme greatly simplifies how the CS works to help it scale better. Joe has been working on this the last few weeks, which has slowed down his v7 client work, but this is a very high priority in my mind for situations like this one.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: 171.64.65.64 overloaded

Post by VijayPande »

PS I've taken this server weight down to try to help balance it with the other GPU servers.

Also, I emailed Dr. Lin to have her push to get new projects going on a new server which has been assigned to her projects. That new server is much more powerful so it can take a much greater load.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 171.64.65.64 overloaded

Post by bruce »

noorman wrote: Where is the redundancy?
There are three different answers to your question, depending on which problem you're referring to. The title of the topic says "overloaded" but a few posts up, GreyWhiskers asked a question about the server status being other than Accepting and that's an entirely different question.

Redundancy for downloading new WUs comes from the Assignment Server sending requests for new work to another Work Server. This concept works fine when one work server is down or is out of work. In that regard, it's better for the server to take itself off-line rather than to be overloaded. You'll notice that there are other GPU servers, so that doesn't seem to be a problem. Dr. Pande's statement about adjusting the weights is important, too. This helps balance the load when there are several Work Servers with WUs available (helping to keep any one of them from being overloaded unless they're all overloaded). Adding more projects to a different Work Server helps, too, but can't be done as quickly.
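As a rough illustration of the weight idea (the server names and weights below are invented, and this is not the actual assignment-server code), the balancing can be thought of as a weighted random draw among the work servers that still have WUs:

Code: Select all

import random

# Invented weights for illustration only; not real assignment-server data.
work_servers = {
    "171.64.65.64":     0.5,  # weight lowered to shed load, as described above
    "gpu-ws-2.example": 1.0,
    "gpu-ws-3.example": 1.0,
}

def pick_server(servers):
    """Weighted random choice among work servers that still have WUs."""
    names = list(servers)
    return random.choices(names, weights=[servers[n] for n in names], k=1)[0]

print(pick_server(work_servers))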

When WUs are finished, they need to find their way back to the same server that assigned them. If that server is overloaded with uploads or is off-line, the upload redundancy comes from the Collection Servers. That's an entirely different question from redundancy for downloads, but it is clearly being worked on. Although many people may gripe about it, it's not really a problem from an individual's point of view, since there is no QRB (yet) for GPUs and all clients manage un-uploaded WUs by retrying later. Yes, this does delay the science, but that's a problem for the Pande Group, not for you. Please recognize that they need to address it based on PG priorities, not based on donor opinions.
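A rough sketch of that upload path (the function names, server names, and retry delay are invented for illustration; the real client logic is more involved):

Code: Select all

import random
import time

def try_upload(server, wu):
    # Stand-in for the real network upload; invented for illustration.
    # It fails randomly here just to exercise the fallback path.
    return random.random() < 0.3

def return_wu(wu, work_server, collection_servers, retry_delay=1):
    """Try the Work Server first, then each Collection Server; if nothing
    accepts the result, hold the WU and retry later, as the clients do."""
    while True:
        for server in [work_server] + collection_servers:
            if try_upload(server, wu):
                return server
        time.sleep(retry_delay)  # all attempts failed; retry later

print(return_wu("wu-1234", "171.64.65.64", ["cs-1.example", "cs-2.example"]))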
noorman
Posts: 270
Joined: Sun Dec 02, 2007 2:26 pm
Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+
Location: Belgium, near the International Sea-Port of Antwerp

Re: 171.64.65.64 overloaded

Post by noorman »

.
Sorry, it delays the science, but it also leaves lots of systems idle ...
That doesn't help anyone and is very inefficient (costs for the donors that bring no benefit to anyone).
.
- stopped Linux SMP w. HT on i7-860@3.5 GHz
....................................
Folded since 10-06-04 till 09-2010
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 171.64.65.64 overloaded

Post by bruce »

No, it does NOT leave systems idle as long as there are other servers that have WUs to assign -- and that appears to be the case.

Which of the three problems I mentioned are you having?
noorman
Posts: 270
Joined: Sun Dec 02, 2007 2:26 pm
Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+
Location: Belgium, near the International Sea-Port of Antwerp

Re: 171.64.65.64 overloaded

Post by noorman »

.
I'm not having trouble myself (I stopped F@H because energy prices are too high over here, plus my financial situation); my friend is.
He's regularly unable to get WUs or to send back results.
That is a consequence of the overload (and of the server being in REJECT).
.
- stopped Linux SMP w. HT on i7-860@3.5 GHz
....................................
Folded since 10-06-04 till 09-2010
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: 171.64.65.64 overloaded

Post by VijayPande »

Sorry to hear you're still having problems. The server is not in REJECT and has been working pretty well today. Right now, it only has 23 connections, which is pretty low. I wonder if there's something else going on other than the machine being loaded? Any chance it's an issue with your friend's ISP? At times, we see weird things with ISPs.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
GreyWhiskers
Posts: 660
Joined: Mon Oct 25, 2010 5:57 am
Hardware configuration: a) Main unit
Sandybridge in HAF922 w/200 mm side fan
--i7 2600K@4.2 GHz
--ASUS P8P67 DeluxeB3
--4GB ADATA 1600 RAM
--750W Corsair PS
--2Seagate Hyb 750&500 GB--WD Caviar Black 1TB
--EVGA 660GTX-Ti FTW - Signature 2 GPU@ 1241 Boost
--MSI GTX560Ti @900MHz
--Win7Home64; FAH V7.3.2; 327.23 drivers

b) 2004 HP a475c desktop, 1 core Pent 4 HT@3.2 GHz; Mem 2GB;HDD 160 GB;Zotac GT430PCI@900 MHz
WinXP SP3-32 FAH v7.3.6 301.42 drivers - GPU slot only

c) 2005 Toshiba M45-S551 laptop w/2 GB mem, 160GB HDD;Pent M 740 CPU @ 1.73 GHz
WinXP SP3-32 FAH v7.3.6 [Receiving Core A4 work units]
d) 2011 lappy-15.6"-1920x1080;i7-2860QM,2.5;IC Diamond Thermal Compound;GTX 560M 1,536MB u/c@700;16GB-1333MHz RAM;HDD:500GBHyb w/ 4GB SSD;Win7HomePrem64;320.18 drivers FAH 7.4.2ß
Location: Saratoga, California USA

Re: 171.64.65.64 overloaded

Post by GreyWhiskers »

I wanted to toss a few numbers out to show that, while the server in question has had a lot of downtime, the effect on my folding over the last month has been minimal. Hats off to PG and to the flexibility of the system.

I'm running one GTX 560Ti, still with v6, so I can, and do, track all of the WU-by-WU stats in the HFM WU history log. I typically run DatAdmin 3 to export the MySQL DB to a CSV file for massaging with Excel.

Bottom line: during the month of June, I have 9 instances out of 226 completed GPU WUs where the "turn-around time" (the time from the completion of one WU until the start of the next) was anomalous.
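For reference, those gaps can be computed from the exported CSV roughly like this (the file name, date format, and the DownloadDateTime/CompletedDateTime column names are assumptions about the HFM export, so adjust them to the actual headers):

Code: Select all

import csv
from datetime import datetime, timedelta

# Column names, date format and file name are assumptions about the HFM
# CSV export; adjust them to match the actual headers.
FMT = "%Y-%m-%d %H:%M:%S"
THRESHOLD = timedelta(minutes=1)  # slower than this counts as anomalous here

rows = []
with open("wuhistory.csv", newline="") as f:
    for row in csv.DictReader(f):
        rows.append((datetime.strptime(row["DownloadDateTime"], FMT),
                     datetime.strptime(row["CompletedDateTime"], FMT)))

rows.sort()  # order WUs by download time
for (start, _), (_, prev_done) in zip(rows[1:], rows[:-1]):
    gap = start - prev_done  # finish-to-start time between consecutive WUs
    if gap > THRESHOLD:
        print(f"anomalous finish-to-start gap: {gap}")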

158 of the 226 June WUs were P6801, which involve 171.64.65.64. In the last couple of weeks, more and more of the WUs are from other projects.

96% of my WUs in June turned around in 10-30 seconds. The 9 anomalous instances range from 1:02 (mm:ss) to 17:35, with one outlier at over 4 hours. This is the period where so many of the 171.64.65.64 REJECT periods occurred.

The anomalous turn-arounds were mostly correlated either with 171.64.65.64 REJECT periods or with heavy "WU Received" periods, as reported in the Server Stats. Some of these WU RCV loads have been heavy - see the charts at the bottom of this post.

That outlier occurred during one of the server REJECT periods. I can't remember exactly, but I may have stopped the GPU folding for a while to play with my SMP settings.

My 96% quick turn-around could have been a little better under v7, since v6 won't attempt to get a new WU until after several failed attempts to upload the just-completed one. v7 separates the upload and download, so if the assignment server knows not to assign me to the downed work server, I could pick up a new WU sooner.
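In deliberately simplified terms (a sketch of the two behaviours, not actual client source), the difference looks roughly like this:

Code: Select all

# Simplified sketch of the v6 vs. v7 behaviour described above;
# not actual client code.

def v6_cycle(upload, download, max_tries=3):
    """v6-style: the finished WU's upload must fail max_tries times
    before the client moves on and asks for new work."""
    tries = 0
    while not upload() and tries < max_tries:
        tries += 1          # the folding hardware sits idle during these retries
    return download()       # new work fetched only after the upload attempts

def v7_cycle(upload, download):
    """v7-style: upload and download are decoupled, so a stuck upload
    does not block fetching the next WU."""
    next_wu = download()    # fetched immediately
    upload()                # retried independently in the background
    return next_wu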

Bottom line, for me at least: good job, PG. The SYSTEM seems to be working well. Once I looked at the actual data, I was surprised how good the overall turn-around for my series of Core 16 Nvidia GPU projects was.

Code: Select all

"anomalous" finish-to-start times
hh:mm:ss
00:05:09
00:04:53
00:13:04
00:12:34
00:01:02
04:13:30
00:01:59
00:02:36
00:17:25

While I was looking at the server stats, I put together a couple of Excel spreadsheets and charts. These two charts show how busy this server has been. No deep message or analysis here, just some interesting collateral information. The timescale is each individual half-hour update to the stats page, which I pulled a couple of days ago: http://fah-web.stanford.edu/logs/171.64.65.64.log.html

NETLOAD tells how busy the server is by netstat (i.e. how many current connections the server is handling). Too many connections means that the server is heavily loaded. How many are "too many" depends on the server, but most of our servers can now handle a couple hundred connections without a problem.
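On the server itself, a count like that can be approximated with netstat; a minimal sketch (port 8080 is an assumption, substitute whatever port the work server actually listens on):

Code: Select all

import subprocess

# Rough NETLOAD-style figure: count established connections to one port.
# Port 8080 is an assumption; substitute the work server's actual port.
out = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
count = sum(1 for line in out.splitlines()
            if ":8080" in line and "ESTABLISHED" in line)
print(f"current connections: {count}")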

NOTE LOG SCALE
[chart: NETLOAD per half-hour update, log scale]

WUS RCVD shows how many WUs have been received since the last time the server's WUs were folded into the stats. This shows the relative number of WUs being received on the different servers (if all is fine), or which servers are not being entered into the stats DB if there is some problem.

NOTE LINEAR SCALE
[chart: WUs received per half-hour update, linear scale]
noorman
Posts: 270
Joined: Sun Dec 02, 2007 2:26 pm
Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+
Location: Belgium, near the International Sea-Port of Antwerp

Re: 171.64.65.64 overloaded

Post by noorman »

VijayPande wrote: Sorry to hear you're still having problems. The server is not in REJECT and has been working pretty well today. Right now, it only has 23 connections, which is pretty low. I wonder if there's something else going on other than the machine being loaded? Any chance it's an issue with your friend's ISP? At times, we see weird things with ISPs.
.

I'm very sure my friend is quite capable of distinguishing local network (or ISP) problems from server problems.
He has many systems running, all with multi-GPU setups.
He'd just like his systems to run F@H without these near-regular interruptions where a finished WU (the results) cannot be returned, no new Work is available, or the server is overloaded.
This server in particular, carrying Fermi Work, has to be a heavy-duty system because of the very fast turnaround of the Work coming from it.
If, in future, Fermi is used more effectively, the rate of returned Work might increase further, which would load that server even more than it is already.

Another point: I'm sure a network or ISP problem at my friend's end could not explain a 'low' connection count on the named server ...
.
- stopped Linux SMP w. HT on i7-860@3.5 GHz
....................................
Folded since 10-06-04 till 09-2010
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 171.64.65.64 overloaded

Post by bruce »

noorman wrote: I'm sure a network or ISP problem at my friend's end could not explain a 'low' connection count on the named server ...
.
Minor technicality: 171.64.65.64 is NOT a named server as far as FAH is concerned. Yes, the server has a name, but FAH does not reference DNS; it uses the IP address.
GreyWhiskers wrote: WUS RCVD shows how many WUs have been received since the last time the server's WUs were folded into the stats. This shows the relative number of WUs being received on the different servers (if all is fine), or which servers are not being entered into the stats DB if there is some problem.
Minor question: how do you account for the WUs Received count being reset when the stats are uploaded to the stats server?