Page 4 of 4

Re: Lost Time

Posted: Wed Jul 02, 2008 3:21 pm
by noorman
leexgx wrote: But that's a known problem in the SMP client. (Checking the first post, I understood this was about time wasted when the client starts sending the work unit, when it could be downloading a new project.)

----------------------------

The only times SMP was talked about were in the second post and later on pages 2 and 3; I guess I was not looking at each post that closely. I am talking about when it gets to the FINISHED_UNIT stage (no error).

Bruce and I seem to be talking about the point when it's sending the work unit.

You're still thinking about SMP; that's more a limitation of the MPI that they use. To me this topic is about the delay when it starts sending the unit, but I see it has turned more into the MPI SMP problem.

I still do not see why it is so hard to add a FINISHED_UNIT trigger that makes it start a download at the same time as the upload if the unit completed with no error, as it already does this if the client is restarted.

Could well be that this thread started off with another client in mind, but I narrowed down to SMP from the 1st post I entered here.

I went with the title of this thread and then with the item that costs me (and others) the most time. I can't do anything about the 4 minute pause, but I can reduce the upload time by choosing another ISP, or another of its products with a higher upload speed ...

That's why I went on about the time 'lost' after the upload is done (after a 100% successful finish), between it and a download, when the Client is really twiddling its thumbs, freewheeling.
At that time, I could only try to prove things with log texts that I had from recent interventions on stalled SMP WUs ...


I've done a little programming in my school days, and I don't know what can and cannot be done with the MPI, but I would try to stop the client after a 100% finish (after the last data block is written to the hard disk) and then restart it, without searching for a bug that I might not be able to fix (because it may be part of the MPI).

Re: Lost Time

Posted: Fri Jul 04, 2008 2:29 am
by codysluder
I don't see the "4 minutes" that you're talking about. This is about 17 seconds.

Code:

[05:36:59] Leaving Run
[05:36:59] - Writing 399561 bytes of core data to disk...
[05:36:59]   ... Done.
[05:36:59] - Shutting down core
[05:36:59] 
[05:36:59] Folding@home Core Shutdown: FINISHED_UNIT
[05:37:02] CoreStatus = 64 (100)
[05:37:02] Unit 3 finished with 92 percent of time to deadline remaining.
[05:37:02] Updated performance fraction: 0.954642
[05:37:02] Sending work to server


[05:37:02] + Attempting to send results
[05:37:02] - Reading file work/wuresults_03.dat from core
[05:37:02]   (Read 399561 bytes from disk)
[05:37:02] Connecting to http://169.230.26.30:8080/
[05:37:11] Posted data.
[05:37:12] Initial: 0000; - Uploaded at ~39 kB/s
[05:37:12] - Averaged speed for that direction ~37 kB/s
[05:37:12] + Results successfully sent
[05:37:12] Thank you for your contribution to Folding@Home.
[05:37:12] + Number of Units Completed: 183

[05:37:16] Trying to send all finished work units
[05:37:16] + No unsent completed units remaining.
[05:37:16] - Preparing to get new work unit...
[05:37:16] + Attempting to get work packet
[05:37:16] - Will indicate memory of 511 MB
[05:37:16] - Detect CPU. Vendor: GenuineIntel, Family: 15, Model: 2, Stepping: 9
[05:37:16] - Connecting to assignment server
[05:37:16] Connecting to http://assign.stanford.edu:8080/
[05:37:16] Posted data.
[05:37:16] Initial: E6A9; - Successful: assigned to (169.230.26.30).
[05:37:16] + News From Folding@Home: Welcome to Folding@Home
[05:37:16] Loaded queue successfully.
[05:37:16] Connecting to http://169.230.26.30:8080/
[05:37:16] Posted data.
[05:37:16] Initial: 0000; - Receiving payload (expected size: 16796)
[05:37:16] Conversation time very short, giving reduced weight in bandwidth avg
[05:37:16] - Downloaded at ~32 kB/s
[05:37:16] - Averaged speed for that direction ~24 kB/s
[05:37:16] + Received work.
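
For anyone who wants to check that figure rather than eyeball it, here is a minimal Python sketch that measures the time between two markers in a client log. It assumes every line starts with a `[HH:MM:SS]` stamp as in the log above; the function names `stamp` and `elapsed_seconds` are made up for this example, and the excerpt is trimmed from the log in this post.

```python
from datetime import datetime, timedelta

# Trimmed excerpt of the log above (only the lines that matter here).
LOG = """\
[05:36:59] Folding@home Core Shutdown: FINISHED_UNIT
[05:37:02] Sending work to server
[05:37:12] + Results successfully sent
[05:37:16] - Preparing to get new work unit...
[05:37:16] + Received work.
"""

def stamp(line):
    """Parse the leading [HH:MM:SS] of a log line."""
    return datetime.strptime(line[1:9], "%H:%M:%S")

def elapsed_seconds(log, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the
    first later line containing end_marker, or None if not found."""
    start = None
    for line in log.splitlines():
        if start is None and start_marker in line:
            start = stamp(line)
        elif start is not None and end_marker in line:
            delta = stamp(line) - start
            # The stamps carry no date, so assume at most one midnight rollover.
            if delta < timedelta(0):
                delta += timedelta(days=1)
            return int(delta.total_seconds())
    return None

total = elapsed_seconds(LOG, "FINISHED_UNIT", "Received work")
print(total)  # 17
```

The same function can be pointed at other marker pairs, for example "Sending work to server" and "Results successfully sent", to time the upload on its own.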

Re: Lost Time

Posted: Fri Jul 04, 2008 9:44 am
by noorman
.

I've never posted an example of the 4 minute pause (yet), so here goes:
[16:17:35] Completed 500000 out of 500000 steps (100 percent)
[16:17:35] Writing final coordinates.
[16:17:35] Past main M.D. loop
[16:17:35] Will end MPI now
[16:18:35]
[16:18:35] Finished Work Unit:
[16:18:35] - Reading up to 3721056 from "work/wudata_09.arc": Read 3721056
[16:18:35] - Reading up to 1775588 from "work/wudata_09.xtc": Read 1775588
[16:18:35] goefile size: 0
[16:18:35] logfile size: 18167
[16:18:35] Leaving Run
[16:18:39] - Writing 5519211 bytes of core data to disk...
[16:18:39] ... Done.
[16:18:39] - Shutting down core
[16:18:39]
[16:18:39] Folding@home Core Shutdown: FINISHED_UNIT
[16:18:44] CoreStatus = 64 (100)
[16:18:44] Unit 9 finished with 80 percent of time to deadline remaining.
[16:18:44] Updated performance fraction: 0.801272
[16:18:44] Sending work to server


[16:18:44] + Attempting to send results
[16:18:44] - Reading file work/wuresults_09.dat from core
[16:18:44] (Read 5519211 bytes from disk)
[16:18:44] Connecting to http://171.64.65.56:8080/
[16:20:16] Posted data.
[16:20:16] Initial: 0000; - Uploaded at ~57 kB/s
[16:20:17] - Averaged speed for that direction ~52 kB/s
[16:20:17] + Results successfully sent
[16:20:17] Thank you for your contribution to Folding@Home.
[16:20:17] + Number of Units Completed: 29

[16:24:21] - Warning: Could not delete all work unit files (9): Core returned invalid code
[16:24:21] Trying to send all finished work units
[16:24:21] + No unsent completed units remaining.
[16:24:21] - Preparing to get new work unit...
[16:24:21] + Attempting to get work packet
[16:24:21] - Connecting to assignment server
[16:24:21] Connecting to http://assign.stanford.edu:8080/
[16:24:22] Posted data.
[16:24:22] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[16:24:22] + News From Folding@Home: Welcome to Folding@Home
[16:24:22] Loaded queue successfully.
[16:24:22] Connecting to http://171.64.65.56:8080/
[16:24:25] Posted data.
[16:24:25] Initial: 0000; - Receiving payload (expected size: 2421110)
[16:24:29] - Downloaded at ~591 kB/s
[16:24:29] - Averaged speed for that direction ~566 kB/s
[16:24:29] + Received work.
[16:24:29] Trying to send all finished work units
[16:24:29] + No unsent completed units remaining.
[16:24:29] + Closed connections
[16:24:29]
[16:24:29] + Processing work unit
[16:24:29] Core required: FahCore_a1.exe
[16:24:29] Core found.
[16:24:29] Working on Unit 00 [June 27 16:24:29]
[16:24:29] + Working ...
[16:24:29] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 00 -checkpoint 3 -verbose -lifeline 7054 -version 602'

[16:24:30]
[16:24:30] *------------------------------*
[16:24:30] Folding@Home Gromacs SMP Core
[16:24:30] Version 1.74 (November 27, 2006)
[16:24:30]
[16:24:30] Preparing to commence simulation
[16:24:30] - Ensuring status. Please wait.
[16:24:30] - Starting from initial work packet
[16:24:30]
[16:24:30] Project: 2605 (Run 17, Clone 101, Gen 66)
.
[11:21:55] Completed 500000 out of 500000 steps (100 percent)
[11:21:55] Writing final coordinates.
[11:21:55] Past main M.D. loop
[11:21:55] Will end MPI now
[11:22:55]
[11:22:55] Finished Work Unit:
[11:22:55] - Reading up to 3721200 from "work/wudata_00.arc": Read 3721200
[11:22:55] - Reading up to 1774092 from "work/wudata_00.xtc": Read 1774092
[11:22:55] goefile size: 0
[11:22:55] logfile size: 16918
[11:22:55] Leaving Run
[11:22:56] - Writing 5516610 bytes of core data to disk...
[11:22:56] ... Done.
[11:22:56] - Shutting down core
[11:22:56]
[11:22:56] Folding@home Core Shutdown: FINISHED_UNIT
[11:23:01] CoreStatus = 64 (100)
[11:23:01] Unit 0 finished with 80 percent of time to deadline remaining.
[11:23:01] Updated performance fraction: 0.801485
[11:23:01] Sending work to server


[11:23:01] + Attempting to send results
[11:23:01] - Reading file work/wuresults_00.dat from core
[11:23:01] (Read 5516610 bytes from disk)
[11:23:01] Connecting to http://171.64.65.56:8080/
[11:24:32] Posted data.
[11:24:32] Initial: 0000; - Uploaded at ~58 kB/s
[11:24:33] - Averaged speed for that direction ~53 kB/s
[11:24:33] + Results successfully sent
[11:24:33] Thank you for your contribution to Folding@Home.
[11:24:33] + Number of Units Completed: 30

[11:28:37] - Warning: Could not delete all work unit files (0): Core returned invalid code
[11:28:37] Trying to send all finished work units
[11:28:37] + No unsent completed units remaining.
[11:28:37] - Preparing to get new work unit...
[11:28:37] + Attempting to get work packet
[11:28:37] - Connecting to assignment server
[11:28:37] Connecting to http://assign.stanford.edu:8080/
[11:28:38] Posted data.
[11:28:38] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[11:28:38] + News From Folding@Home: Welcome to Folding@Home
[11:28:38] Loaded queue successfully.
[11:28:38] Connecting to http://171.64.65.56:8080/
[11:28:42] Posted data.
[11:28:42] Initial: 0000; - Receiving payload (expected size: 2441603)
[11:28:46] - Downloaded at ~596 kB/s
[11:28:46] - Averaged speed for that direction ~572 kB/s
[11:28:46] + Received work.
[11:28:46] Trying to send all finished work units
[11:28:46] + No unsent completed units remaining.
[11:28:46] + Closed connections
[11:28:46]
[11:28:46] + Processing work unit
[11:28:46] Core required: FahCore_a1.exe
[11:28:46] Core found.
[11:28:46] Working on Unit 01 [June 28 11:28:46]
[11:28:46] + Working ...
[11:28:46] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 3 -verbose -lifeline 7054 -version 602'

[11:28:46]
[11:28:46] *------------------------------*
[11:28:46] Folding@Home Gromacs SMP Core
[11:28:46] Version 1.74 (November 27, 2006)
[11:28:46]
[11:28:46] Preparing to commence simulation
[11:28:46] - Ensuring status. Please wait.
[11:28:46] - Starting from initial work packet
[11:28:46]
[11:28:46] Project: 2605 (Run 8, Clone 523, Gen 67)
.

.
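
To make the pause easy to spot in a long log, a small Python sketch can report the largest silence between two consecutive log lines. It assumes the `[HH:MM:SS]` prefix seen in the logs above; `largest_gap` is a made-up helper name, and the excerpt is trimmed from the first log in this post.

```python
import re
from datetime import datetime, timedelta

# Trimmed excerpt of the first log above.
EXCERPT = """\
[16:18:44] Connecting to http://171.64.65.56:8080/
[16:20:16] Posted data.
[16:20:17] + Number of Units Completed: 29
[16:24:21] - Warning: Could not delete all work unit files (9): Core returned invalid code
[16:24:21] - Preparing to get new work unit...
[16:24:29] + Received work.
"""

STAMP = re.compile(r"^\[(\d\d:\d\d:\d\d)\]")

def largest_gap(log):
    """Return (gap, line_before, line_after) for the longest silence
    between two consecutive timestamped lines."""
    previous = None
    best = (timedelta(0), None, None)
    for line in log.splitlines():
        match = STAMP.match(line)
        if not match:
            continue  # skip blank or unstamped lines
        current = datetime.strptime(match.group(1), "%H:%M:%S")
        if previous is not None:
            gap = current - previous[0]
            # No date in the stamps, so assume at most one midnight rollover.
            if gap < timedelta(0):
                gap += timedelta(days=1)
            if gap > best[0]:
                best = (gap, previous[1], line)
        previous = (current, line)
    return best

gap, before, after = largest_gap(EXCERPT)
print(int(gap.total_seconds()))  # 244
print(before)
print(after)
```

On this excerpt the longest silence is the 244 seconds between the "Number of Units Completed" line and the "Could not delete all work unit files" warning, which is the pause in question; the upload itself accounts for only about 92 seconds.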

Re: Lost Time

Posted: Sat Jul 05, 2008 5:44 am
by bruce
noorman wrote: I've never posted an example of the 4 minute pause (yet), so here goes:
I said earlier that people were mixing up issues, and I thought we all agreed that we were not talking about problems with SMP in this discussion. The four minute pause is an issue for the SMP beta. That's how long it takes the SMP cores to clean up the work files at the end of any WU. Development is aware of this issue and it will be corrected in some future beta version. That case should be excluded from the rest of the discussion since it has nothing to do with the sequence of uploading or downloading. Read the list of SMP Known Problems.

Re: Lost Time

Posted: Sat Jul 05, 2008 10:02 am
by noorman
bruce wrote:
noorman wrote: I've never posted an example of the 4 minute pause (yet), so here goes:
I said earlier that people were mixing up issues, and I thought we all agreed that we were not talking about problems with SMP in this discussion. The four minute pause is an issue for the SMP beta. That's how long it takes the SMP cores to clean up the work files at the end of any WU. Development is aware of this issue and it will be corrected in some future beta version. That case should be excluded from the rest of the discussion since it has nothing to do with the sequence of uploading or downloading. Read the list of SMP Known Problems.
.


Sorry, NOT true, again!

This happens in the Linux SMP client I'm running; that is now a non-beta release, and I think that this issue hasn't been fixed at all!

I'll retract this when I see evidence to the contrary! I've seen no reports of a fix for this or anything else at all.
Stanford deemed this Client to have been working well for long enough and changed its status accordingly. I don't agree that it is working well enough, notably because I have to intervene on a regular basis to fix the queue on both my Linux SMP rigs, and because the 4 minute time loss hasn't been lowered to an acceptable amount in reference to the deadlines of SMP WUs.
(The WU always finishes to 100%, then the status gets in a knot and it can't be sent, and nothing can be downloaded.)

It may be that the issue is known, as is the more visible fault of the error message, which indicates that not all (if any) of the 'old' work files have been deleted!
The issue of a 4 minute loss is still there, and I doubt that any modern PC needs 4 minutes to clean up some files!
The fact that they are not cleaned up now, and that this doesn't affect the running of FaH in any (significant) way, demonstrates that.
In the end, when a WU has been finished or has even crashed, its work files can just be deleted (as the error message tells us it wants to do)!
That can happen in a flash ....


.

Re: Lost Time

Posted: Sat Jul 05, 2008 10:23 pm
by leexgx
Bruce did not state whether it was the Windows SMP or the Linux SMP client; the problem is there for both clients, beta or not.
Please stop bringing the SMP client up, as it's a known problem; it has something to do with the MPI, which is not open source. The fix for it, and for all Pande Group F@H client software, is to start the download of a new work unit when { Folding@home Core Shutdown: FINISHED_UNIT / CoreStatus = 64 (100) } happens. If an error happens, there will be no CoreStatus 100 or FINISHED_UNIT, so the client would do it the normal way, sending and then getting new work, for bugged-out computers. This would resolve your problem, noorman.
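
As a rough illustration of that idea (not how the real client is written), here is a minimal Python sketch. `upload_results` and `fetch_work` are hypothetical stand-ins that only sleep, the status string "ERROR" is a placeholder for any non-clean finish, and the durations are made up.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

events = []
events_lock = threading.Lock()

def record(event):
    """Append an event name to the shared list in a thread-safe way."""
    with events_lock:
        events.append(event)

def upload_results(seconds):
    # Hypothetical stand-in for sending the finished unit to the server.
    record("upload start")
    time.sleep(seconds)
    record("upload done")

def fetch_work(seconds):
    # Hypothetical stand-in for downloading the next work unit.
    record("download start")
    time.sleep(seconds)
    record("download done")

def finish_unit(core_status, upload_s=0.2, download_s=0.05):
    """Overlap upload and download only after a clean finish."""
    if core_status == "FINISHED_UNIT":
        with ThreadPoolExecutor(max_workers=2) as pool:
            up = pool.submit(upload_results, upload_s)
            down = pool.submit(fetch_work, download_s)
            up.result()    # re-raises any exception from the upload
            down.result()  # re-raises any exception from the download
    else:
        # Any other status: keep the usual order, send first, then fetch.
        upload_results(upload_s)
        fetch_work(download_s)

finish_unit("FINISHED_UNIT")
overlapped = list(events)
events.clear()
finish_unit("ERROR")  # placeholder status for a unit that did not finish cleanly
sequential = list(events)
print(overlapped)
print(sequential)
```

In the overlapped case the download starts while the upload is still running, so the total wait is roughly the longer of the two transfers instead of their sum; in the error case nothing changes from the current send-then-get behaviour.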

Re: Lost Time

Posted: Sat Jul 05, 2008 11:38 pm
by noorman
leexgx wrote: Bruce did not state whether it was the Windows SMP or the Linux SMP client; the problem is there for both clients, beta or not.
Please stop bringing the SMP client up, as it's a known problem; it has something to do with the MPI, which is not open source. The fix for it, and for all Pande Group F@H client software, is to start the download of a new work unit when { Folding@home Core Shutdown: FINISHED_UNIT / CoreStatus = 64 (100) } happens. If an error happens, there will be no CoreStatus 100 or FINISHED_UNIT, so the client would do it the normal way, sending and then getting new work, for bugged-out computers. This would resolve your problem, noorman.
.


That's indeed a solution I would have expected in a final Client version, whatever Folding type it is ...

Till further notice, this hasn't been resolved, and it stays a troublesome (irritant) fault condition!
I find it sad that this was not resolved before the client was declared 'final'.

If the energy bills keep going up, I might have to stop Folding anyway; then it will not matter anymore, not to me anyway :(
The A64X2 isn't really efficient anymore; that would be the 1st to go down.
It would be a good main rig replacement for the current Athlon Barton 2500+ ...
(the only 2 duallies I have have never been used for anything but FaH)


.

Re: Lost Time

Posted: Sun Jul 06, 2008 5:34 am
by 7im
Until noorman can decide to stay ON TOPIC, or creates his own SMP thread, he stays a troublesome irritant as well, so this thread is closed. ;)