RESOLVED: 171.64.65.54 overloaded / NOT accepting
Posted: Mon Jun 14, 2010 3:26 pm
by noorman
.
Code: Select all
8 cores detected
If you see this twice, MPI is working
If you see this twice, MPI is working
--- Opening Log file [June 14 15:20:45 UTC]
# Windows SMP Console Edition #################################################
###############################################################################
Folding@Home Client Version 6.29
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\Folding-at-Home
Executable: C:\Folding-at-Home\FaH6.exe
Arguments: -verbosity 9 -smp -send all
[15:20:45] - Ask before connecting: No
[15:20:45] - User name: noorman (Team 734)
[15:20:45] - User ID: 7D1BA32F532694B4
[15:20:45] - Machine ID: 1
[15:20:45]
[15:20:45] Loaded queue successfully.
[15:20:45] Attempting to return result(s) to server...
[15:20:45] Trying to send all finished work units
[15:20:45] Project: 6041 (Run 0, Clone 51, Gen 30)
[15:20:45] + Attempting to send results [June 14 15:20:45 UTC]
[15:20:45] - Reading file work/wuresults_05.dat from core
[15:20:45] (Read 63945075 bytes from disk)
[15:20:45] Connecting to http://171.64.65.54:8080/
It just hangs there; I wasn't able to send my recently finished A3 core results when the client was running 'normally' either!
I just tried it with -send all; same story (of course)
(but this way I can restart it faster to try to get it sent off)
.
Re: 171.64.65.54 NOT Accepting
Posted: Mon Jun 14, 2010 4:26 pm
by kasson
It's assigning and accepting right now. The server is under fairly heavy load--it's possible that all the work threads were busy when you tried to connect.
Re: 171.64.65.54 NOT Accepting
Posted: Mon Jun 14, 2010 4:40 pm
by noorman
.
Code: Select all
8 cores detected
If you see this twice, MPI is working
If you see this twice, MPI is working
--- Opening Log file [June 14 16:37:54 UTC]
# Windows SMP Console Edition #################################################
###############################################################################
Folding@Home Client Version 6.29
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\Folding-at-Home
Executable: C:\Folding-at-Home\FaH6.exe
Arguments: -verbosity 9 -smp -send all
[16:37:54] - Ask before connecting: No
[16:37:54] - User name: noorman (Team 734)
[16:37:54] - User ID: 7D1BA32F532694B4
[16:37:54] - Machine ID: 1
[16:37:54]
[16:37:54] Loaded queue successfully.
[16:37:54] Attempting to return result(s) to server...
[16:37:54] Trying to send all finished work units
[16:37:54] Project: 6041 (Run 0, Clone 51, Gen 30)
[16:37:54] + Attempting to send results [June 14 16:37:54 UTC]
[16:37:54] - Reading file work/wuresults_05.dat from core
[16:37:57] (Read 63945075 bytes from disk)
[16:37:57] Connecting to http://171.64.65.54:8080/
[16:38:03] Posted data.
[16:38:03] Initial: 683C; + Could not connect to Work Server (results)
[16:38:03] (171.64.65.54:8080)
[16:38:03] + Retrying using alternative port
[16:38:03] Connecting to http://171.64.65.54:80/
[16:38:08] Posted data.
[16:38:19] Initial: 683C; + Could not connect to Work Server (results)
[16:38:19] (171.64.65.54:80)
[16:38:19] - Error: Could not transmit unit 05 (completed June 14) to work server.
[16:38:19] - 2 failed uploads of this unit.
[16:38:19] + Attempting to send results [June 14 16:38:19 UTC]
[16:38:19] - Reading file work/wuresults_05.dat from core
[16:38:19] (Read 63945075 bytes from disk)
[16:38:19] Connecting to http://171.67.108.25:8080/
.
Still not doing it for me ...
.
Re: 171.64.65.54 NOT Accepting
Posted: Mon Jun 14, 2010 8:20 pm
by noorman
.
After about 7 hours, my SMP Results finally got uploaded
.
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Mon Jun 14, 2010 10:11 pm
by Davabled
Still not accepting results for me as of 3 p.m. Pacific; I've been trying for a couple of hours. Same for server 171.67.108.25.
Snippet from the log file:
Code: Select all
[22:06:37] + Attempting to send results [June 14 22:06:37 UTC]
[22:06:58] - Couldn't send HTTP request to server
[22:06:58] + Could not connect to Work Server (results)
[22:06:58] (171.64.65.54:8080)
[22:06:58] + Retrying using alternative port
[22:07:19] - Couldn't send HTTP request to server
[22:07:19] + Could not connect to Work Server (results)
[22:07:19] (171.64.65.54:80)
[22:07:19] - Error: Could not transmit unit 01 (completed June 14) to work server.
[22:10:02] + Attempting to send results [June 14 22:10:02 UTC]
[22:10:06] + Could not connect to Work Server (results)
[22:10:06] (171.67.108.25:8080)
[22:10:06] + Retrying using alternative port
[22:10:15] + Could not connect to Work Server (results)
[22:10:15] (171.67.108.25:80)
[22:10:15] Could not transmit unit 01 to Collection server; keeping in queue.
[22:10:45] Project: 6041 (Run 0, Clone 68, Gen 23)
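The attempt order visible in the logs above (work server on port 8080, then port 80, then server 171.67.108.25 on the same two ports) can be sketched roughly as follows. This is only an illustration of the order seen in the logs, not the client's actual code; `upload_order` and `try_upload` are hypothetical names, and the sender is passed in as a callable so nothing here touches the network.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Endpoint = Tuple[str, int]


def upload_order(work_server: str, collection_server: str) -> List[Endpoint]:
    """Endpoints in the order the logs show them being tried."""
    ports = (8080, 80)  # primary port first, then the "alternative port"
    return [(host, port)
            for host in (work_server, collection_server)
            for port in ports]


def try_upload(endpoints: Iterable[Endpoint],
               send: Callable[[str, int], bool]) -> Optional[Endpoint]:
    """Return the first endpoint that accepts the upload, or None.

    None corresponds to the "keeping in queue" outcome in the logs.
    `send` is a hypothetical stand-in for the real HTTP upload.
    """
    for host, port in endpoints:
        if send(host, port):
            return (host, port)
    return None
```

With both servers refusing connections, `try_upload` returns None for every endpoint, which matches the "Sent 0 of 1 completed units" result reported in this thread.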
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 1:00 am
by kasson
server .54 is currently down for maintenance on a RAID. Our sysadmins are aware this is a time-critical issue, and we'll get this up as soon as we can. No ETA, though.
@noorman, glad it worked. The system logs were a bit weird--in the middle of a bunch of accepts and assigns, there were connection attempts from your IP with nothing to follow. I'm not sure what to make of that.
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 5:00 am
by Datsun 1600
No joy here either returning WUs. At least with only one box on ATM, I am not overloading the system. I will see how your points allocation on the P6701 is and decide whether I will continue racking up a large power bill.
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 6:05 am
by noorman
.
kasson wrote:server .54 is currently down for maintenance on a RAID. Our sysadmins are aware this is a time-critical issue, and we'll get this up as soon as we can. No ETA, though.
@noorman, glad it worked. The system logs were a bit weird--in the middle of a bunch of accepts and assigns, there were connection attempts from your IP with nothing to follow. I'm not sure what to make of that.
.
Here's where my log indicated that it was uploaded:
Code: Select all
[17:19:08] + Attempting to send results [June 14 17:19:08 UTC]
[17:19:08] Core found.
[17:19:08] - Reading file work/wuresults_05.dat from core
[17:19:08] Working on queue slot 06 [June 14 17:19:08 UTC]
[17:19:10] + Working ...
[17:19:10] (Read 63945075 bytes from disk)
[17:19:10] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 06 -np 8 -nocpulock -checkpoint 3 -verbose -lifeline 596 -version 629'
[17:19:10] Connecting to http://171.64.65.54:8080/
[17:19:12]
[17:19:12] *------------------------------*
[17:19:12] Folding@Home Gromacs SMP Core
[17:19:12] Version 2.19 (Mar 12, 2010)
[17:19:12]
[17:19:12] Preparing to commence simulation
[17:19:12] - Ensuring status. Please wait.
[17:19:21] - Looking at optimizations...
[17:19:21] - Working with standard loops on this execution.
[17:19:21] - Previous termination of core was improper.
[17:19:21] - Going to use standard loops.
[17:19:21] - Files status OK
[17:19:22] - Expanded 1795892 -> 2078149 (decompressed 115.7 percent)
[17:19:22] Called DecompressByteArray: compressed_data_size=1795892 data_size=2078149, decompressed_data_size=2078149 diff=0
[17:19:22] - Digital signature verified
[17:19:22]
[17:19:22] Project: 6012 (Run 2, Clone 319, Gen 125)
[17:19:22]
[17:19:22] Entering M.D.
[17:19:28] Using Gromacs checkpoints
[17:19:30] Resuming from checkpoint
[17:19:30] Verified work/wudata_06.log
[17:19:30] Verified work/wudata_06.trr
[17:19:30] Verified work/wudata_06.edr
[17:19:31] Completed 44426 out of 500000 steps (8%)
[17:21:55] Completed 45000 out of 500000 steps (9%)
[17:43:53] Completed 50000 out of 500000 steps (10%)
[17:55:19] Completed 55000 out of 500000 steps (11%)
[17:59:29] Completed 60000 out of 500000 steps (12%)
[18:03:21] Completed 65000 out of 500000 steps (13%)
[18:07:11] Completed 70000 out of 500000 steps (14%)
[18:11:02] Completed 75000 out of 500000 steps (15%)
[18:14:57] Completed 80000 out of 500000 steps (16%)
[18:18:52] Completed 85000 out of 500000 steps (17%)
[18:22:44] Completed 90000 out of 500000 steps (18%)
[18:26:36] Completed 95000 out of 500000 steps (19%)
[18:30:23] Completed 100000 out of 500000 steps (20%)
[18:30:45] Posted data.
[18:30:46] Initial: 0000; + Could not connect to Work Server (results)
[18:30:46] (171.64.65.54:8080)
[18:30:46] + Retrying using alternative port
[18:30:46] Connecting to http://171.64.65.54:80/
[18:34:16] Completed 105000 out of 500000 steps (21%)
[18:38:06] Posted data.
[18:38:06] Initial: 0000; Completed 110000 out of 500000 steps (22%)
[18:38:07] + Results successfully sent
[18:38:07] Thank you for your contribution to Folding@Home.
[18:38:07] + Number of Units Completed: 15
[18:38:09] + Sent 1 of 1 completed units to the server
[18:38:09] - Autosend completed
[18:41:52] Completed 115000 out of 500000 steps (23%)
[18:45:39] Completed 120000 out of 500000 steps (24%)
.
.
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 6:49 am
by noorman
kasson wrote:server .54 is currently down for maintenance on a RAID. Our sysadmins are aware this is a time-critical issue, and we'll get this up as soon as we can. No ETA, though.
@noorman, glad it worked. The system logs were a bit weird--in the middle of a bunch of accepts and assigns, there were connection attempts from your IP with nothing to follow. I'm not sure what to make of that.
.
I had been trying to send those results by using a shortcut with the -send all switch.
Since that didn't work either, I gave up and relaunched F@H to try to fold some more,
BUT, as someone else also reported, folding seemed to get stuck when an automatic send sequence was initiated during folding!
Why would this happen, and is this a BUG?
I have never known a folding run to be 'sort of' paused for an attempt to upload previous results from the queue ...
In those cases too, I then stopped the client and restarted it shortly afterwards!
.
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 8:40 am
by AlanH
My system returned a unit to this server yesterday but it is not shown as credited.
Code: Select all
[19:05:46] Folding@home Core Shutdown: FINISHED_UNIT
[19:05:47] CoreStatus = 64 (100)
[19:05:47] Unit 6 finished with 90 percent of time to deadline remaining.
[19:05:47] Updated performance fraction: 0.877645
[19:05:47] Sending work to server
[19:05:47] Project: 6060 (Run 0, Clone 4, Gen 80)
[19:05:47] + Attempting to send results [June 14 19:05:47 UTC]
[19:05:47] - Reading file work/wuresults_06.dat from core
[19:05:47] (Read 3801645 bytes from disk)
[19:05:47] Connecting to http://171.64.65.54:8080/
[19:07:08] Posted data.
[19:07:09] Initial: 0000; - Uploaded at ~45 kB/s
[19:07:09] - Averaged speed for that direction ~22 kB/s
[19:07:09] + Results successfully sent
[19:07:09] Thank you for your contribution to Folding@Home.
[19:07:09] + Number of Units Completed: 87
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 1:35 pm
by Mactin
I've been trying to send work since last evening (10h00 Eastern, 02h00 GMT).
I came into work and saw:
Code: Select all
[02:12:19] Folding@home Core Shutdown: FINISHED_UNIT
[02:12:22] CoreStatus = 64 (100)
[02:12:22] Unit 7 finished with 86 percent of time to deadline remaining.
[02:12:22] Updated performance fraction: 0.865615
[02:12:22] Sending work to server
[02:12:22] Project: 6053 (Run 0, Clone 59, Gen 51)
[02:12:22] + Attempting to send results [June 15 02:12:22 UTC]
[02:12:22] - Reading file work/wuresults_07.dat from core
[02:12:22] (Read 3799048 bytes from disk)
[02:12:22] Connecting to http://171.64.65.54:8080/
[02:12:43] - Couldn't send HTTP request to server
[02:12:43] + Could not connect to Work Server (results)
[02:12:43] (171.64.65.54:8080)
[02:12:43] + Retrying using alternative port
[02:12:43] Connecting to http://171.64.65.54:80/
[02:13:05] - Couldn't send HTTP request to server
[02:13:05] + Could not connect to Work Server (results)
[02:13:05] (171.64.65.54:80)
[02:13:05] - Error: Could not transmit unit 07 (completed June 15) to work server.
[02:13:05] - 1 failed uploads of this unit.
[02:13:05] Keeping unit 07 in queue.
...
[13:10:34] Completed 480000 out of 2000000 steps (24%)
[13:12:29] - Autosending finished units... [June 15 13:12:29 UTC]
[13:12:29] Trying to send all finished work units
[13:12:29] Project: 6053 (Run 0, Clone 59, Gen 51)
[13:12:29] + Attempting to send results [June 15 13:12:29 UTC]
[13:12:29] - Reading file work/wuresults_07.dat from core
[13:12:29] (Read 3799048 bytes from disk)
[13:12:29] Connecting to http://171.64.65.54:8080/
[13:12:51] - Couldn't send HTTP request to server
[13:12:51] + Could not connect to Work Server (results)
[13:12:51] (171.64.65.54:8080)
[13:12:51] + Retrying using alternative port
[13:12:51] Connecting to http://171.64.65.54:80/
[13:13:12] - Couldn't send HTTP request to server
[13:13:12] + Could not connect to Work Server (results)
[13:13:12] (171.64.65.54:80)
[13:13:12] - Error: Could not transmit unit 07 (completed June 15) to work server.
[13:13:12] - 5 failed uploads of this unit.
[13:13:12] + Attempting to send results [June 15 13:13:12 UTC]
[13:13:12] - Reading file work/wuresults_07.dat from core
[13:13:12] (Read 3799048 bytes from disk)
[13:13:12] Connecting to http://171.67.108.25:8080/
[13:13:15] Posted data.
[13:13:15] Initial: 0000; + Could not connect to Work Server (results)
[13:13:15] (171.67.108.25:8080)
[13:13:15] + Retrying using alternative port
[13:13:15] Connecting to http://171.67.108.25:80/
[13:13:19] Posted data.
[13:13:19] Initial: 0000; + Could not connect to Work Server (results)
[13:13:19] (171.67.108.25:80)
[13:13:19] Could not transmit unit 07 to Collection server; keeping in queue.
[13:13:19] + Sent 0 of 1 completed units to the server
[13:13:19] - Autosend completed
Like I said before, my head does not care, but my heart desperately cares about the points that I'm losing!
This is the reason that I HATE the bonus scheme. In the past, I could not have cared less about this; now I care a lot, because for every second that a PG server is down, PG takes points away. All the other positives go out the door.
Keep on folding
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 3:13 pm
by Grandpa_01
Just remember it is not just you losing; we all are. The server does not care who you are or who I am, so we all get to lose equally.
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 3:15 pm
by lanbrown
Mactin,
I agree. Points are being lost and there is no recourse. For the bonus scheme to actually work, they needed to change the collection method. There are two ways I can think of off the top of my head:
1) Redundant servers. That could be harder than it sounds though and could require major code changes. A load balancer could be an option, but both servers would need to be in constant communication with each other so that each knows of every WU that has been assigned.
2) If the collection server is offline, the client opens an encrypted session to another server, which could even be the assignment server. In this secure session, a timestamp or hash key is provided and added to the WU. When the server is back online, the WU gets sent and the server takes the timestamp/hash key into account to determine when the WU was actually completed. This prevents someone from changing the time on the machine to get higher bonus points.
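One way to picture option 2 is a completion timestamp signed with a key only the servers hold. The sketch below is an assumption about how such a scheme might look, not how Folding@home works; `issue_token`, `verify_token`, the shared secret and the WU id string are all hypothetical.

```python
import hashlib
import hmac


def issue_token(secret: bytes, wu_id: str, completed_at: int) -> str:
    """Run by a server that is still reachable.

    `completed_at` is the server's own clock (Unix seconds), so the
    client's local time never enters the signed message.
    """
    message = f"{wu_id}|{completed_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_token(secret: bytes, wu_id: str, completed_at: int, token: str) -> bool:
    """Run by the work server once it is back online.

    The claimed timestamp is accepted only if the signature matches.
    """
    expected = issue_token(secret, wu_id, completed_at)
    return hmac.compare_digest(expected, token)
```

Because the client never sees the secret, changing the timestamp it reports afterwards makes verification fail, which is the tamper resistance the proposal asks for.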
There are times when servers are taken offline for maintenance, which causes bonus point issues. The timelines are set short for SMP units, which are the only ones currently eligible for bonus points. So if maintenance is planned and the final deadline is six days, then six days before the planned maintenance the assignment server should stop sending clients to that server. This gives every client the full amount of time to complete the WU and send it back before it times out. It also gives the admins as much time as they require to finish the work. The same should go for problem servers as well.
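The drain rule described here comes down to one subtraction: stop assigning at the maintenance start minus the final deadline. A minimal sketch, with hypothetical function names and made-up dates in the usage note:

```python
from datetime import datetime, timedelta


def assignment_cutoff(maintenance_start: datetime, final_deadline: timedelta) -> datetime:
    """Last moment a WU can be assigned and still get its full deadline."""
    return maintenance_start - final_deadline


def may_assign(now: datetime, maintenance_start: datetime, final_deadline: timedelta) -> bool:
    """True while a new WU from this server would still have its full deadline."""
    return now < assignment_cutoff(maintenance_start, final_deadline)
```

For example, with maintenance planned for 20 June at 00:00 and a six-day final deadline, `assignment_cutoff` gives 14 June at 00:00, and `may_assign` returns False from that moment on.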
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 4:02 pm
by lanbrown
Grandpa_01 wrote:Just remember it is not just you losing; we all are. The server does not care who you are or who I am, so we all get to lose equally.
Not true at all. Let's say it takes 26 hours to complete a WU and the server is down for 24-hours.
Contributor A gets a WU 25 hours before the server goes down. So that means an hour before the WU is completed, the server goes down and will be down for a day.
Contributor B has a system with equivalent performance and gets a new WU an hour before the server goes down. The server will be back on-line before the WU is completed.
Contributor A loses bonus points, contributor B does not.
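The arithmetic behind this example can be checked with a few lines. The function name is hypothetical, and the model assumes the WU uploads immediately whenever the server is reachable:

```python
def upload_delay_hours(assigned_before_outage: float,
                       wu_hours: float,
                       outage_hours: float) -> float:
    """Hours a finished WU waits because the server is down.

    `assigned_before_outage` is how long before the outage began
    the WU was received.
    """
    # When the WU finishes, measured in hours after the outage starts.
    finish = wu_hours - assigned_before_outage
    if 0 <= finish < outage_hours:
        return outage_hours - finish  # finished mid-outage: wait for recovery
    return 0.0  # finished before the outage began or after it ended


delay_a = upload_delay_hours(25, 26, 24)  # Contributor A
delay_b = upload_delay_hours(1, 26, 24)   # Contributor B
```

Contributor A finishes one hour into the outage and waits 23 hours to upload; Contributor B finishes 25 hours after the outage began, one hour after the server is back, and waits zero.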
Re: 171.64.65.54 overloaded / NOT accepting
Posted: Tue Jun 15, 2010 4:14 pm
by noorman
lanbrown wrote:Grandpa_01 wrote:Just remember it is not just you losing; we all are. The server does not care who you are or who I am, so we all get to lose equally.
Not true at all. Let's say it takes 26 hours to complete a WU and the server is down for 24-hours.
Contributor A gets a WU 25 hours before the server goes down. So that means an hour before the WU is completed, the server goes down and will be down for a day.
Contributor B has a system with equivalent performance and gets a new WU an hour before the server goes down. The server will be back on-line before the WU is completed.
Contributor A loses bonus points, contributor B does not.
.
It's indeed all about timing!
The same goes for stats; Stanford now has a 2-hour refresh rate (I believe) and, for example, EOC Stats still has the 3-hour cycle, which also skews data and points!
.