171.67.108.12 No Record Of Unit [resolved]
Posted: Sun Feb 01, 2009 8:13 pm
by Desertfox80
And so it continues with upload failures. Besides the five failed uploads mentioned in my previous post for a different IP, now I'm getting "Server does not have record of this unit. Will try again later." from this IP when attempting to upload completed WU's.
[19:51:49] Project: 5113 (Run 94, Clone 29, Gen 17)
[19:51:49] - Read packet limit of 540015616... Set to 524286976.
[19:51:49] + Attempting to send results [February 1 19:51:49 UTC]
[19:52:03] - Couldn't send HTTP request to server
[19:52:03] + Could not connect to Work Server (results)
[19:52:03] (171.67.108.12:8080)
[19:52:03] + Retrying using alternative port
[19:52:11] - Couldn't send HTTP request to server
[19:52:11] + Could not connect to Work Server (results)
[19:52:11] (171.67.108.12:80)
[19:52:11] - Error: Could not transmit unit 03 (completed February 1) to work server.
[19:52:11] - Read packet limit of 540015616... Set to 524286976.
[19:52:11] + Attempting to send results [February 1 19:52:11 UTC]
[19:53:11] - Server does not have record of this unit. Will try again later.
I put together a small farm of 10 computers to help with this project, but if I'm going to have to constantly baby the FAH software and chase after failures, this isn't worth the time.
Re: 171.67.108.12 No Record Of Unit
Posted: Sun Feb 01, 2009 9:45 pm
by toTOW
Server 171.67.108.12 looks fine according to the server status page.
"Server does not have record of this unit. Will try again later." is a message issued by a Collection Server:
http://fahwiki.net/index.php/Common_Err ... _this_unit
Re: 171.67.108.12 No Record Of Unit
Posted: Sun Feb 01, 2009 9:46 pm
by 7im
The fah client is designed with several redundancies. If the servers are busy or offline when you complete a work unit, the client will put the completed work unit in a queue, and download a new work unit to stay busy. The client will keep trying to upload the completed work unit every 6 hours automatically, until the servers are ready again.
With a typical CPU client, I don't even begin to worry for at least 24 hours, which gives Stanford a chance to fix any problems with the servers. And the work unit doesn't expire for many weeks yet.
A few failed upload attempts are not out of the ordinary from time to time. The problem usually corrects itself without any need for intervention and without losing any points (they are simply delayed a bit).
You should also note, there is little we can do to help you troubleshoot the issue without more details about your system, which client you are running, how you connect to the internet, etc. And posting the first 35 lines of your fahlog.txt file would be a very good start. (it contains the answers to about 10 of the first 20 questions we're likely to ask you, and saves a lot of time and back and forth posting)
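The queue-and-retry behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual client code: the `UploadQueue` class, its method names, and the `try_upload` callback are hypothetical, and only the 6-hour interval comes from the description in this post.

```python
from dataclasses import dataclass, field

# Interval stated in the post above; everything else here is an assumption.
RETRY_INTERVAL_HOURS = 6


@dataclass
class UploadQueue:
    """Hypothetical model of the client's queue of finished work units."""
    pending: list = field(default_factory=list)

    def finish_unit(self, unit, try_upload):
        # Try once right away; if the upload fails, keep the unit queued
        # so the client can move on to a new work unit.
        if not try_upload(unit):
            self.pending.append(unit)

    def retry_all(self, try_upload):
        # Meant to be called every RETRY_INTERVAL_HOURS. Units that upload
        # successfully are dropped; the rest stay queued for the next round.
        self.pending = [u for u in self.pending if not try_upload(u)]
        return len(self.pending)
```

A unit that fails to upload simply stays in `pending` until a later `retry_all` call succeeds, which is why a few failed attempts do not lose the work.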
Re: 171.67.108.12 No Record Of Unit
Posted: Sun Feb 01, 2009 10:41 pm
by Desertfox80
Well I have WU's that date back two weeks now that have not uploaded. I've posted the specs before but...
FAH CPU Client Version 6.23 Built November 26, 2008
There are three different computers with failed WU's, however the system specs for this particular one are:
Windows Vista Home Premium 32-bit SP1
AMD Athlon 64 X2 5000+ (2X 2.6GHz)
3GB RAM
Nvidia GeForce 9500GT
The FAH farm computers are all connected to a dedicated 3MBps DSL connection. I guess the info you are looking for in the start of the log is:
Core required: FahCore_a0.exe
Core found.
Folding@Home Gromacs 3.3 Core
Version 1.93 (July 23, 2008)
Re: 171.67.108.12 No Record Of Unit
Posted: Mon Feb 02, 2009 3:57 am
by 7im
Sorry, hadn't seen your other thread.
2 weeks IS a problem. v6.23 client version answers one big question and eliminates one big problem, and one or two other questions. (the fahlog.txt would have been better though)
toTOW was partly correct: the server status page does show that work server (171.67.108.12) as being up and functional. However, the small portion of the log posted above shows the work server posting that WU not found message, which is a slightly bigger problem than the Collection Server (171.67.108.25) not knowing about the WU.
Which raises a question... if you check through the fahlog.txt file, does it show the client trying to connect to 171.67.108.25 as well as 171.67.108.12? (another reason fahlog.txt content is helpful) Even if the work server gets busy, the client should try sending the work unit back to a collection server, as another level of redundancy.
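As a rough sketch of that redundancy, the order of attempts would look something like the code below. It assumes the client tries port 8080 and then port 80 on each server, as the log excerpt earlier in the thread suggests. The function names and the `send` callback are hypothetical and are not part of the real client.

```python
def upload_order(work_server, collection_server, ports=(8080, 80)):
    """Return (host, port) pairs in the order they would be tried.

    Assumed order: work server first, then the collection server,
    each on the primary port followed by the alternative port.
    """
    return [(host, port)
            for host in (work_server, collection_server)
            for port in ports]


def send_results(unit, targets, send):
    """Try each target in turn; return the first that accepts the unit.

    `send` is a hypothetical callback returning True on success.
    Returns None if every target fails, in which case the unit would
    stay in the queue for a later retry.
    """
    for host, port in targets:
        if send(unit, host, port):
            return (host, port)
    return None
```

For the unit discussed here, `upload_order("171.67.108.12", "171.67.108.25")` gives the four attempts a complete fahlog.txt should show if the fallback is working.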
toTOW or another forum staffer might help you by checking to see if that work unit has been returned by anyone. And maybe one of them could bring this problem to the attention of a project staffer after the weekend.
And another question on connectivity... have you customized the settings in that DSL much, or is it pretty much standard out of the box? If customized, have you changed things like Stateful Packet Inspection, or any of the other firewall settings? And how much spare bandwidth do you have with that many systems in your fah farm?
Re: 171.67.108.12 No Record Of Unit
Posted: Mon Feb 02, 2009 4:32 am
by Desertfox80
I've copied the log for the last upload attempt by machine 1. This machine has two units that have failed to upload: one from Jan 27 and one from Feb 1. The latest attempts seem to show "error 503" rather than no record of the unit. I did a tracert on this particular IP and put the results at the end of this entry. Much like the problems I'm having with the other machines and the other failed upload on this one, it seems to time out at step 10 every time.
The bandwidth should be just fine, the DSL connection is dedicated for these machines only, and no other traffic other than FAH goes over that connection. I have a separate cable internet connection for my home use so I don't clog up the tubes.
[22:38:42] - Machine ID: 1
[22:38:42]
[22:38:42] Loaded queue successfully.
[22:38:42] Initialization complete
[22:38:42]
[22:38:42] + Processing work unit
[22:38:42] Core required: FahCore_a0.exe
[22:38:42] Core found.
[22:38:42] Project: 4102 (Run 83, Clone 1, Gen 22)
[22:38:42] - Read packet limit of 540015616... Set to 524286976.
[22:38:42] + Attempting to send results [February 1 22:38:42 UTC]
[22:38:42] Working on queue slot 04 [February 1 22:38:42 UTC]
[22:38:42] + Working ...
[22:38:42]
[22:38:42] *------------------------------*
[22:38:42] Folding@Home Gromacs 3.3 Core
[22:38:42] Version 1.93 (July 23, 2008)
[22:38:42]
[22:38:42] Preparing to commence simulation
[22:38:42] - Looking at optimizations...
[22:38:42] - Files status OK
[22:38:43] - Expanded 1159730 -> 6173133 (decompressed 532.2 percent)
[22:38:43]
[22:38:43] Project: 5113 (Run 81, Clone 56, Gen 21)
[22:38:43]
[22:38:43] Assembly optimizations on if available.
[22:38:43] Entering M.D.
[22:38:49] FAH Init
[22:38:49] Checkpoint file:
[22:38:49] - Couldn't send HTTP request to server
[22:38:49] + Could not connect to Work Server (results)
[22:38:49] (171.64.65.111:8080)
[22:38:49] + Retrying using alternative port
[22:38:55] (Starting from checkpoint)
[22:38:55] Read checkpoint
[22:38:55] Protein: Calmodulin in water
[22:38:55] Writing local files
[22:38:56] Completed 22187 out of 500000 steps (4 percent)
[22:38:56] Extra SSE boost OK.
[22:38:57] - Couldn't send HTTP request to server
[22:38:57] + Could not connect to Work Server (results)
[22:38:57] (171.64.65.111:80)
[22:38:57] - Error: Could not transmit unit 00 (completed January 27) to work server.
[22:38:57] - Read packet limit of 540015616... Set to 524286976.
[22:38:57] + Attempting to send results [February 1 22:38:57 UTC]
[22:38:57] - Couldn't send HTTP request to server
[22:38:57] (Got status 503)
[22:38:57] + Could not connect to Work Server (results)
[22:38:57] (171.67.108.17:8080)
[22:38:57] + Retrying using alternative port
[22:38:57] - Couldn't send HTTP request to server
[22:38:57] (Got status 503)
[22:38:57] + Could not connect to Work Server (results)
[22:38:57] (171.67.108.17:80)
[22:38:57] Could not transmit unit 00 to Collection server; keeping in queue.
[22:38:57] Project: 5113 (Run 94, Clone 29, Gen 17)
[22:38:57] - Read packet limit of 540015616... Set to 524286976.
[22:38:57] + Attempting to send results [February 1 22:38:57 UTC]
[22:39:06] - Couldn't send HTTP request to server
[22:39:06] + Could not connect to Work Server (results)
[22:39:06] (171.67.108.12:8080)
[22:39:06] + Retrying using alternative port
[22:39:17] - Couldn't send HTTP request to server
[22:39:17] + Could not connect to Work Server (results)
[22:39:17] (171.67.108.12:80)
[22:39:17] - Error: Could not transmit unit 03 (completed February 1) to work server.
[22:39:17] - Read packet limit of 540015616... Set to 524286976.
[22:39:17] + Attempting to send results [February 1 22:39:17 UTC]
[22:39:17] - Couldn't send HTTP request to server
[22:39:17] (Got status 503)
[22:39:17] + Could not connect to Work Server (results)
[22:39:17] (171.67.108.25:8080)
[22:39:17] + Retrying using alternative port
[22:39:17] - Couldn't send HTTP request to server
[22:39:17] (Got status 503)
[22:39:17] + Could not connect to Work Server (results)
[22:39:17] (171.67.108.25:80)
[22:39:17] Could not transmit unit 03 to Collection server; keeping in queue
TRACE RESULTS
C:\>tracert 171.67.108.12
Tracing route to vsp22v.Stanford.EDU [171.67.108.12] over a maximum of 30 hops:
1 <1 ms 1 ms <1 ms home [192.168.1.254]
2 12 ms 9 ms 9 ms adsl-76-246-63-254.dsl.scrm01.sbcglobal.net [76.246.63.254]
3 12 ms 9 ms 11 ms dist2-vlan50.scrm01.pbi.net [64.171.152.67]
4 12 ms 11 ms 11 ms 151.164.93.214
5 16 ms 15 ms 15 ms 69.220.8.31
6 15 ms 15 ms 13 ms po5-2.core01.sjc04.atlas.cogentco.com [154.54.13.93]
7 13 ms 15 ms 15 ms te3-2.mpd01.sjc04.atlas.cogentco.com [66.28.4.49]
8 15 ms 15 ms 15 ms Stanford_University2.demarc.cogentco.com [66.250.7.138]
9 15 ms 15 ms 15 ms bbra-rtr.Stanford.EDU [171.64.1.151]
10 * * * Request timed out.
11 16 ms 15 ms 15 ms vsp22v.Stanford.EDU [171.67.108.12]
Trace complete.
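For anyone comparing logs across several machines, the failed attempts in an excerpt like the one above can be tallied with a short script. This is a minimal sketch that assumes the line layout shown in this post: an optional "(Got status NNN)" line followed by an "(ip:port)" line. The `summarize_failures` function is a made-up helper, not a FAH tool.

```python
import re
from collections import Counter

# Patterns based only on the log lines quoted in this post.
ENDPOINT = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3}):(\d+)\)")
STATUS = re.compile(r"\(Got status (\d+)\)")


def summarize_failures(lines):
    """Count failed connection attempts per (ip, port, http_status).

    http_status is None when the log shows no "(Got status ...)" line
    before the endpoint, i.e. no HTTP reply was logged.
    """
    failures = Counter()
    status = None
    for line in lines:
        m = STATUS.search(line)
        if m:
            status = m.group(1)  # remember it for the next endpoint line
            continue
        m = ENDPOINT.search(line)
        if m:
            failures[(m.group(1), int(m.group(2)), status)] += 1
            status = None  # a status applies to one endpoint only
    return failures


if __name__ == "__main__":
    with open("fahlog.txt") as log:  # path is an assumption
        for key, count in sorted(summarize_failures(log).items(), key=str):
            print(key, count)
```

On the excerpt above, this separates the attempts to 171.64.65.111 and 171.67.108.12, which log no HTTP status, from the attempts to 171.67.108.17 and 171.67.108.25, which log status 503.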
Re: 171.67.108.12 No Record Of Unit
Posted: Tue Feb 03, 2009 5:26 am
by Desertfox80
The last two upload attempts to this server are back to "Server does not have record of this unit."
Re: 171.67.108.12 No Record Of Unit
Posted: Thu Feb 05, 2009 6:08 am
by Desertfox80
It resolved itself this afternoon. The first attempt got a message that the server does not have a record of this unit. Thirty seconds later, on its second attempt, it uploaded just fine.