128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 7:03 am
by Zagen30
I noticed it was too quiet in my living room, where my BA server is located. When I remotely checked the logs, I saw that it had finished a BA WU but appeared to have gotten stuck at around 97% of the upload. It also hadn't gotten any new work for about 10 minutes. Pausing and unpausing the client didn't change anything. I hard rebooted the machine, and after it came back online it picked up the WU it had been working on at around 96%. When it got to 100% the second time, it again hung near the end of the upload and was unresponsive, even when I entered 'service FAHClient restart' directly into the machine.
The server status page says 128.143.231.201 is Full/Accepting, but the psummary pages are missing all five BA projects. Normally in this case the servers should have handed out regular SMP work, but something about the hung upload seems to have prevented that. I had to remove client-type:bigadv and reboot the machine for the client to get some regular SMP work to do in the meantime.
There were a couple of other reports of lack of BA work over on the EVGA forum, so I don't think I'm alone in this. It should be noted that this machine is running 7.3.6.
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 7:35 am
by -alias-
You are not alone! The same happened with 4 of my BA servers; see the log for the first server that had problems!
Code:
[00:19:23] Completed 250000 out of 250000 steps (100%)
[00:19:35] DynamicWrapper: Finished Work Unit: sleep=10000
[00:19:45]
[00:19:45] Finished Work Unit:
[00:19:45] - Reading up to 64407792 from "work/wudata_07.trr": Read 64407792
[00:19:45] trr file hash check passed.
[00:19:45] - Reading up to 31622092 from "work/wudata_07.xtc": Read 31622092
[00:19:45] xtc file hash check passed.
[00:19:45] edr file hash check passed.
[00:19:45] logfile size: 196066
[00:19:45] Leaving Run
[00:19:49] - Writing 96386826 bytes of core data to disk...
[00:20:04] Done: 96386314 -> 91635114 (compressed to 5.9 percent)
[00:20:04] ... Done.
[00:20:04] - Shutting down core
[00:20:04]
[00:20:04] Folding@home Core Shutdown: FINISHED_UNIT
[00:20:04] CoreStatus = 64 (100)
[00:20:04] Unit 7 finished with 88 percent of time to deadline remaining.
[00:20:04] Updated performance fraction: 0.878883
[00:20:04] Sending work to server
[00:20:04] Project: 8102 (Run 0, Clone 0, Gen 566)
[00:20:04] + Attempting to send results [March 26 00:20:04 UTC]
[00:20:04] - Reading file work/wuresults_07.dat from core
[00:20:04] (Read 91635626 bytes from disk)
[00:20:04] Connecting to http://128.143.231.201:8080/
[02:34:04] - Autosending finished units... [March 26 02:34:04 UTC]
[02:34:04] Trying to send all finished work units
[02:34:04] - Already sending work
[02:34:04] + Sent 0 of 1 completed units to the server
[02:34:04] - Autosend completed
I restarted the server when I woke up, but there was no change, so the problem has to be on the Stanford side!
Code:
[07:02:07] + Attempting to send results [March 26 07:02:07 UTC]
[07:02:07] - Reading file work/wuresults_07.dat from core
[07:02:07] + Attempting to get work packet
[07:02:07] (Read 91635626 bytes from disk)
[07:02:07] Connecting to http://128.143.231.201:8080/
[07:02:07] Passkey found
[07:02:07] - Will indicate memory of 64396 MB
[07:02:07] - Connecting to assignment server
[07:02:07] Connecting to http://assign.stanford.edu:8080/
[07:02:09] Posted data.
[07:02:09] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[07:02:09] + News From Folding@Home: Welcome to Folding@Home
[07:02:09] Loaded queue successfully.
[07:02:09] Sent data
[07:02:09] Connecting to http://128.143.231.201:8080/
[07:03:27] - Couldn't send HTTP request to server
[07:03:27] + Could not connect to Work Server
[07:03:27] - Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
[07:03:43] + Attempting to get work packet
[07:03:43] Passkey found
[07:03:43] - Will indicate memory of 64396 MB
[07:03:43] - Connecting to assignment server
[07:03:43] Connecting to http://assign.stanford.edu:8080/
[07:03:44] Posted data.
[07:03:44] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[07:03:44] + News From Folding@Home: Welcome to Folding@Home
[07:03:44] Loaded queue successfully.
[07:03:44] Sent data
[07:03:44] Connecting to http://128.143.231.201:8080/
[07:05:45] Posted data.
It cannot get work either!
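For anyone who wants to check their own logs for the same hang, here is a minimal sketch in Python. It assumes v6-style log lines like the ones above (a `[HH:MM:SS]` prefix with no date) and flags any "Connecting to" line that is followed by a long silence. The function names and the 10-minute threshold are my own choices for illustration, not anything the client provides.

```python
import re

# v6 log lines look like "[00:20:04] Connecting to http://...".
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s?(.*)$")


def parse_line(line):
    """Return (seconds_since_midnight, message), or None if there is no timestamp."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    h, mi, s, msg = m.groups()
    return int(h) * 3600 + int(mi) * 60 + int(s), msg


def find_stalls(lines, threshold_s=600):
    """Return a list of (message, gap_seconds) for each 'Connecting to' line
    that is followed by at least threshold_s seconds of silence."""
    stalls = []
    prev = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # continuation lines such as "Waiting before retry."
        if prev is not None and prev[1].startswith("Connecting to"):
            # Timestamps carry no date, so assume every gap is under 24 hours
            # and let the modulo handle a wrap past midnight.
            gap = (parsed[0] - prev[0]) % 86400
            if gap >= threshold_s:
                stalls.append((prev[1], gap))
        prev = parsed
    return stalls


if __name__ == "__main__":
    excerpt = [
        "[00:20:04] (Read 91635626 bytes from disk)",
        "[00:20:04] Connecting to http://128.143.231.201:8080/",
        "[02:34:04] - Autosending finished units... [March 26 02:34:04 UTC]",
    ]
    for message, gap in find_stalls(excerpt):
        print(f"{message}: silent for {gap} s")
```

On the first excerpt above, the gap between 00:20:04 and 02:34:04 is 8040 seconds, so that connection attempt gets flagged. The shorter retries in the second excerpt (under two minutes each) stay below the threshold.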
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 8:33 am
by bollix47
Thank you for your reports ... PG notified.
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 10:16 am
by -alias-
Thanks for the answer.
Now all of my servers are stuck like this!
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 10:20 am
by Buck Nasty
Same here with both my BA servers. Been hanging at upload for 6+ hrs.
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 11:05 am
by EXT64
I just stopped (exited) the client (v6) and restarted it, and it was able to download a new WU and successfully upload the stuck one, so this may be resolved now (give it another try and report back).
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 12:42 pm
by -alias-
Thanks, EXT64.
When I restarted all of my servers, they all started folding again, so PG has done something right!
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 1:31 pm
by bcavnaugh
I had this issue as well, stuck at 97% all day. I restarted my 4P rig and it did complete the upload, but I never did get a new project after that.
This was last night, so I was offline for almost 20 hours, but I did get a P8101 about 30 minutes ago.
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 4:18 pm
by Buck Nasty
Shut the rigs down this morning, as I didn't want them idling all day. Just restarted them and everything uploaded/downloaded fine and back to full production with P8102 & P8105.
Project 8101 (R9, C9, G433)
Posted: Wed Mar 26, 2014 5:07 pm
by Nathan_P
Found this this afternoon when I noticed that one of my rigs was idle:-
It had been doing that in a loop for 20 hours! It has cleared now, but the WU may need purging from the server. The PRCG is gleaned from HFM.
Code:
[20:00:37] Completed 240000 out of 250000 steps (96%)
[20:08:38] Completed 242500 out of 250000 steps (97%)
[20:16:41] Completed 245000 out of 250000 steps (98%)
[20:24:41] Completed 247500 out of 250000 steps (99%)
[20:32:44] Completed 250000 out of 250000 steps (100%)
[20:32:58] DynamicWrapper: Finished Work Unit: sleep=10000
[20:33:08]
[20:33:08] Finished Work Unit:
[20:33:08] - Reading up to 64206000 from "work/wudata_02.trr": Read 64206000
[20:33:09] trr file hash check passed.
[20:33:09] - Reading up to 31545708 from "work/wudata_02.xtc": Read 31545708
[20:33:09] xtc file hash check passed.
[20:33:09] edr file hash check passed.
[20:33:09] logfile size: 190203
[20:33:09] Leaving Run
[20:33:09] - Writing 96102787 bytes of core data to disk...
[20:33:30] Done: 96102275 -> 91390239 (compressed to 5.7 percent)
[20:33:30] ... Done.
[20:33:30] - Shutting down core
[20:33:30]
[20:33:30] Folding@home Core Shutdown: FINISHED_UNIT
[20:33:30] CoreStatus = 64 (100)
[20:33:30] Unit 2 finished with 81 percent of time to deadline remaining.
[20:33:30] Updated performance fraction: 0.794234
[20:33:30] Sending work to server
[20:33:30] Project: 8104 (Run 0, Clone 44, Gen 378)
[20:33:30] + Attempting to send results [March 25 20:33:30 UTC]
[20:33:30] - Reading file work/wuresults_02.dat from core
[20:33:31] (Read 91390751 bytes from disk)
[20:33:31] Connecting to http://128.143.231.201:8080/
[21:10:51] Posted data.
[21:10:51] Initial: 0000; - Uploaded at ~39 kB/s
[21:10:51] - Averaged speed for that direction ~39 kB/s
[21:10:51] + Results successfully sent
[21:10:51] Thank you for your contribution to Folding@Home.
[21:10:51] + Number of Units Completed: 142
[21:10:52] Trying to send all finished work units
[21:10:52] + No unsent completed units remaining.
[21:10:52] - Preparing to get new work unit...
[21:10:52] Cleaning up work directory
[21:10:52] + Attempting to get work packet
[21:10:52] Passkey found
[21:10:52] - Will indicate memory of 15998 MB
[21:10:52] - Connecting to assignment server
[21:10:52] Connecting to http://assign.stanford.edu:8080/
[21:10:53] Posted data.
[21:10:53] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[21:10:53] + News From Folding@Home: Welcome to Folding@Home
[21:10:53] Loaded queue successfully.
[21:10:53] Sent data
[21:10:53] Connecting to http://128.143.231.201:8080/
[21:10:53] Posted data.
[21:10:53] Initial: 0000; - Receiving payload (expected size: 512)
[21:10:53] Conversation time very short, giving reduced weight in bandwidth avg
[21:10:53] - Downloaded at ~1 kB/s
[21:10:53] - Averaged speed for that direction ~189 kB/s
[21:10:53] + Received work.
[21:10:53] Trying to send all finished work units
[21:10:53] + No unsent completed units remaining.
[21:10:53] + Closed connections
[21:10:53]
[21:10:53] + Processing work unit
[21:10:53] Core required: FahCore_a5.exe
[21:10:53] Core found.
[21:10:53] Working on queue slot 03 [March 25 21:10:53 UTC]
[21:10:53] + Working ...
[21:10:53] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 03 -np 48 -checkpoint 15 -verbose -lifeline 3796 -version 634'
[21:10:54]
[21:10:54] *------------------------------*
[21:10:54] Folding@Home Gromacs SMP Core
[21:10:54] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[21:10:54]
[21:10:54] Preparing to commence simulation
[21:10:54] - Looking at optimizations...
[21:10:54] - Created dyn
[21:10:54] - Files status OK
[21:10:54] Couldn't Decompress
[21:10:54] Called DecompressByteArray: compressed_data_size=0 data_size=0, decompressed_data_size=0 diff=0
[21:10:54] -Error: Couldn't update checksum variables
[21:10:54] Error: Could not open work file
[21:10:54]
[21:10:54] Folding@home Core Shutdown: FILE_IO_ERROR
[21:10:54] CoreStatus = 75 (117)
[21:10:54] Error opening or reading from a file.
[21:10:54] Deleting current work unit & continuing...
[21:10:54] Trying to send all finished work units
[21:10:54] + No unsent completed units remaining.
[21:10:54] - Preparing to get new work unit...
[21:10:54] Cleaning up work directory
[21:10:54] + Attempting to get work packet
[21:10:54] Passkey found
[21:10:54] - Will indicate memory of 15998 MB
[21:10:54] - Connecting to assignment server
[21:10:54] Connecting to http://assign.stanford.edu:8080/
[21:10:55] Posted data.
[21:10:55] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[21:10:55] + News From Folding@Home: Welcome to Folding@Home
[21:10:55] Loaded queue successfully.
[21:10:55] Sent data
[21:10:55] Connecting to http://128.143.231.201:8080/
[21:10:55] Posted data.
[21:10:55] Initial: 0000; - Receiving payload (expected size: 512)
[21:10:55] Conversation time very short, giving reduced weight in bandwidth avg
[21:10:55] - Downloaded at ~1 kB/s
[21:10:55] - Averaged speed for that direction ~168 kB/s
[21:10:55] + Received work.
[21:10:55] + Closed connections
[21:11:00]
[21:11:00] + Processing work unit
[21:11:00] Core required: FahCore_a5.exe
[21:11:00] Core found.
[21:11:00] Working on queue slot 04 [March 25 21:11:00 UTC]
[21:11:00] + Working ...
[21:11:00] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 04 -np 48 -checkpoint 15 -verbose -lifeline 3796 -version 634'
[21:11:00]
[21:11:00] *------------------------------*
[21:11:00] Folding@Home Gromacs SMP Core
[21:11:00] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[21:11:00]
[21:11:00] Preparing to commence simulation
[21:11:00] - Looking at optimizations...
[21:11:00] - Created dyn
[21:11:00] - Files status OK
[21:11:00] Couldn't Decompress
[21:11:00] Called DecompressByteArray: compressed_data_size=0 data_size=0, decompressed_data_size=0 diff=0
[21:11:00] -Error: Couldn't update checksum variables
[21:11:00] Error: Could not open work file
[21:11:00]
[21:11:00] Folding@home Core Shutdown: FILE_IO_ERROR
[21:11:00] CoreStatus = 75 (117)
[21:11:00] Error opening or reading from a file.
[21:11:00] Deleting current work unit & continuing...
[21:11:00] Trying to send all finished work units
[21:11:00] + No unsent completed units remaining.
[21:11:00] - Preparing to get new work unit...
[21:11:00] Cleaning up work directory
[21:11:01] + Attempting to get work packet
[21:11:01] Passkey found
[21:11:01] - Will indicate memory of 15998 MB
[21:11:01] - Connecting to assignment server
[21:11:01] Connecting to http://assign.stanford.edu:8080/
[21:11:01] Posted data.
[21:11:01] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[21:11:01] + News From Folding@Home: Welcome to Folding@Home
[21:11:02] Loaded queue successfully.
[21:11:02] Sent data
[21:11:02] Connecting to http://128.143.231.201:8080/
[21:11:02] Posted data.
[21:11:02] Initial: 0000; - Receiving payload (expected size: 512)
[21:11:02] Conversation time very short, giving reduced weight in bandwidth avg
[21:11:02] - Downloaded at ~1 kB/s
[21:11:02] - Averaged speed for that direction ~150 kB/s
[21:11:02] + Received work.
[21:11:02] + Closed connections
[21:11:07]
[21:11:07] + Processing work unit
[21:11:07] Core required: FahCore_a5.exe
[21:11:07] Core found.
[21:11:07] Working on queue slot 05 [March 25 21:11:07 UTC]
[21:11:07] + Working ...
[21:11:07] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 05 -np 48 -checkpoint 15 -verbose -lifeline 3796 -version 634'
[21:11:07]
[21:11:07] *------------------------------*
[21:11:07] Folding@Home Gromacs SMP Core
[21:11:07] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[21:11:07]
[21:11:07] Preparing to commence simulation
[21:11:07] - Looking at optimizations...
[21:11:07] - Created dyn
[21:11:07] - Files status OK
[21:11:07] Couldn't Decompress
[21:11:07] Called DecompressByteArray: compressed_data_size=0 data_size=0, decompressed_data_size=0 diff=0
[21:11:07] -Error: Couldn't update checksum variables
[21:11:07] Error: Could not open work file
[21:11:07]
[21:11:07] Folding@home Core Shutdown: FILE_IO_ERROR
[21:11:07] CoreStatus = 75 (117)
[21:11:07] Error opening or reading from a file.
[21:11:07] Deleting current work unit & continuing...
[21:11:07] Trying to send all finished work units
[21:11:07] + No unsent completed units remaining.
[21:11:07] - Preparing to get new work unit...
[21:11:07] Cleaning up work directory
[21:11:07] + Attempting to get work packet
[21:11:07] Passkey found
[21:11:07] - Will indicate memory of 15998 MB
[21:11:07] - Connecting to assignment server
[21:11:07] Connecting to http://assign.stanford.edu:8080/
[21:11:13] Posted data.
[21:11:13] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[21:11:13] + News From Folding@Home: Welcome to Folding@Home
[21:11:13] Loaded queue successfully.
[21:11:13] Sent data
[21:11:13] Connecting to http://128.143.231.201:8080/
[21:11:14] Posted data.
[21:11:14] Initial: 0000; - Receiving payload (expected size: 512)
[21:11:14] Conversation time very short, giving reduced weight in bandwidth avg
[21:11:14] - Downloaded at ~1 kB/s
[21:11:14] - Averaged speed for that direction ~133 kB/s
[21:11:14] + Received work.
[21:11:14] + Closed connections
[21:11:19]
[21:11:19] + Processing work unit
[21:11:19] Core required: FahCore_a5.exe
[21:11:19] Core found.
[21:11:19] Working on queue slot 06 [March 25 21:11:19 UTC]
[21:11:19] + Working ...
[21:11:19] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 06 -np 48 -checkpoint 15 -verbose -lifeline 3796 -version 634'
[21:11:19]
[21:11:19] *------------------------------*
[21:11:19] Folding@Home Gromacs SMP Core
[21:11:19] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[21:11:19]
[21:11:19] Preparing to commence simulation
[21:11:19] - Looking at optimizations...
[21:11:19] - Created dyn
[21:11:19] - Files status OK
[21:11:19] Couldn't Decompress
[21:11:19] Called DecompressByteArray: compressed_data_size=0 data_size=0, decompressed_data_size=0 diff=0
[21:11:19] -Error: Couldn't update checksum variables
[21:11:19] Error: Could not open work file
[21:11:19]
[21:11:19] Folding@home Core Shutdown: FILE_IO_ERROR
[21:11:19] CoreStatus = 75 (117)
[21:11:19] Error opening or reading from a file.
[21:11:19] Deleting current work unit & continuing...
[21:11:19] Trying to send all finished work units
[21:11:19] + No unsent completed units remaining.
[21:11:19] - Preparing to get new work unit...
[21:11:19] Cleaning up work directory
[21:11:19] + Attempting to get work packet
[21:11:19] Passkey found
[21:11:19] - Will indicate memory of 15998 MB
[21:11:19] - Connecting to assignment server
[21:11:19] Connecting to http://assign.stanford.edu:8080/
[21:11:35] Posted data.
[21:11:35] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[21:11:35] + News From Folding@Home: Welcome to Folding@Home
[21:11:35] Loaded queue successfully.
[21:11:35] Sent data
[21:11:35] Connecting to http://128.143.231.201:8080/
[21:11:36] Posted data.
[21:11:36] Initial: 0000; - Receiving payload (expected size: 512)
[21:11:36] Conversation time very short, giving reduced weight in bandwidth avg
[21:11:36] - Downloaded at ~1 kB/s
[21:11:36] - Averaged speed for that direction ~118 kB/s
[21:11:36] + Received work.
[21:11:36] + Closed connections
[21:11:41]
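The loop in the log above (a 512-byte payload, then FILE_IO_ERROR, then a fresh assignment) can be counted automatically. This is a minimal sketch, assuming the v6 log wording shown above; `count_bad_assignments` and the `max_payload` cutoff are names I made up for illustration.

```python
import re

# Matches e.g. "Receiving payload (expected size: 512)" from the v6 log.
PAYLOAD_RE = re.compile(r"Receiving payload \(expected size: (\d+)\)")


def count_bad_assignments(lines, max_payload=512):
    """Count downloads of at most max_payload bytes that were followed
    by a FILE_IO_ERROR core shutdown before the next download."""
    pending = False  # True while the most recent payload was suspiciously small
    failures = 0
    for line in lines:
        m = PAYLOAD_RE.search(line)
        if m:
            pending = int(m.group(1)) <= max_payload
        elif pending and "FILE_IO_ERROR" in line:
            failures += 1
            pending = False
    return failures


if __name__ == "__main__":
    excerpt = [
        "[21:10:53] Initial: 0000; - Receiving payload (expected size: 512)",
        "[21:10:54] Folding@home Core Shutdown: FILE_IO_ERROR",
        "[21:10:55] Initial: 0000; - Receiving payload (expected size: 512)",
        "[21:11:00] Folding@home Core Shutdown: FILE_IO_ERROR",
    ]
    print(count_bad_assignments(excerpt))
```

A count that keeps climbing while no real WU starts would suggest the rig is stuck in this loop rather than simply waiting for work.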
Re: 128.143.231.201 (BA) acting up?
Posted: Wed Mar 26, 2014 10:00 pm
by Nathan_P
More digging in the full log indicates that the machine above eventually connected to one of the SMP servers and processed a WU before trying to reconnect to the BA server. A restart of the client did not work, so I had to delete unitinfo.txt, queue.dat and machinedependant.dat and restart the client (v6) before it would properly connect and download a BA WU. It is now happily working on an 8104 WU. My other machine had just hung halfway through the upload; a restart of the client fixed things and it has downloaded an 8103 WU.
For me things are back to normal; as always, YMMV.
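If you would rather not delete those files outright, here is a minimal sketch that moves them into a backup folder instead, so they can be put back if needed. It assumes the client has already been stopped and that you pass in your own client directory. The function name and backup folder name are hypothetical. Both spellings of the machine-dependent file are listed because I can't confirm from this thread which one the client actually uses. Bear in mind that, as far as I know, queue.dat tracks the work queue, so setting it aside means the client forgets any unfinished or unsent WU.

```python
import shutil
import time
from pathlib import Path

# File names as given in the posts above. The second spelling of the
# machine-dependent file is an assumption, included in case that is
# the name actually on disk.
STATE_FILES = (
    "unitinfo.txt",
    "queue.dat",
    "machinedependant.dat",
    "machinedependent.dat",
)


def set_aside_state_files(client_dir, names=STATE_FILES):
    """Move the named files from client_dir into a timestamped backup
    folder inside client_dir. Returns the list of file names moved."""
    client_dir = Path(client_dir)
    backup_dir = client_dir / f"state-backup-{time.strftime('%Y%m%d-%H%M%S')}"
    moved = []
    for name in names:
        src = client_dir / name
        if src.is_file():
            backup_dir.mkdir(exist_ok=True)  # created only if something is moved
            shutil.move(str(src), str(backup_dir / name))
            moved.append(name)
    return moved
```

Stop the client first, call `set_aside_state_files()` with the path to your client folder, then restart the client as described above.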
Re: 128.143.231.201 (BA) acting up?
Posted: Thu Mar 27, 2014 4:05 pm
by EXT64
Thanks for the heads up, Nathan_P. I had to delete those files on one of my bigadv rigs as well. It was weird: it would fail the download repeatedly, then randomly run an SMP WU. After deleting the files and restarting the client, I got an 8101 (so not quite as lucky as you).
Re: 128.143.231.201 (BA) acting up?
Posted: Sun Mar 30, 2014 7:41 pm
by Nathan_P
Is the BA server acting up again? I've just had a 22-hour rerun of my issue from the other day, trying to download a 512-byte WU, this time with 81002 0,10,583. It's the same rig as last time, so I'm not ruling out the rig as the culprit, except that it is now working on an 8574 with a TPF of 2:02.
Re: 128.143.231.201 (BA) acting up?
Posted: Sun Mar 30, 2014 8:44 pm
by PinHead
Just found mine doing the same thing on an 8101.
Receiving payload (expected size: 512)
Couldn't shake it by deleting the unitinfo.txt, queue.dat and machinedependant.dat files. Gave up and set it to SMP.
Re: 128.143.231.201 (BA) acting up?
Posted: Sun Mar 30, 2014 11:50 pm
by -alias-
I have been away for a few days. Now that I'm home again, I find that 4 out of 6 servers are behaving the way Nathan_P's log above describes! What is going on?