
Re: Lost Time

Posted: Mon Jun 30, 2008 10:13 am
by noorman
.

Thanks !

I see that there 's improvement !



I hope that SMP gets fixed ...


.

Re: Lost Time

Posted: Mon Jun 30, 2008 10:42 am
by ^w^ing
Actually, I think that only SMP has that prolonged delay between WUs, because all the cores need to shut down and other things need to be cleaned up. There's no such long delay in the regular CPU clients.

Re: Lost Time

Posted: Mon Jun 30, 2008 11:12 am
by noorman
^w^ing wrote:Actually, I think that only SMP has that prolonged delay between WUs, because all the cores need to shut down and other things need to be cleaned up. There's no such long delay in the regular CPU clients.
.

The shutdown of the 4 FahCore_ax.exe takes seconds, not minutes and that shutdown occurs before the results are sent back (upload) !

It 's after that that the client starts idling, not running anything !

I recently closed down the Client with a queue.dat fault which prohibited the upload of the finished unit.
I fixed that problem,
then re-started the Client.

Immediately it started the 'upload' sequence & after that it also started the 'download' sequence.
So, whilst it was uploading, it connected to the work server and downloaded a fresh WU without any delay !
A bit after the download had finished I got the message that the 'upload' had finished.
( I have Asymmetric Cable connection; slow upload, fast download )

Even if Stanford wanted to protect against WAN & LAN communication problems by keeping 'up' and 'down' apart, the upload could finish and straight afterwards the download could be started.
Even better; the download could happen and straight after it an upload transfer could be done.
That already happens regularly: when, for any reason, the Results cannot be accepted by Stanford, they stay in the queue and are transmitted later, during a Client run ...


That doesn't happen though; the Client uploads the results and then takes a Coffee break ...


.

Re: Lost Time

Posted: Mon Jun 30, 2008 2:35 pm
by ^w^ing
I was referring to numbers 1 & 2 in the list of known SMP bugs and issues ( viewtopic.php?f=8&t=50 )
1) The core needs some time (often as much as 4 minutes) between work units to finalize work and move on. It may also need some time AFTER you stop the client. Do not restart the client without checking if the copies of FahCore_a1 have stopped.

2) There is a brief pause (15-20 seconds) at the end of each WU. This is so we can make sure all the threads sync up. This is not a bug, as much as a limitation of SMP needing to synchronize the threads before moving on to the next WU.
Possibly the same as #1
The point is, even if the client downloaded the next WU before or right after a WU finished, it couldn't start processing it right away, because the WU would eventually fail. The pause was prolonged on purpose; there's a reason why the SMP client doesn't do anything for a few minutes. For other clients, downloading before or whilst uploading would make sense because IT WOULD (probably) WORK. For the current SMP client, that's not true.

edit: also, the current assignment system wouldn't let you download a new WU before handing over your previous one. It would just send you the same WU again, just like when you have an SMP EUE without sending back partial results.

Re: Lost Time

Posted: Mon Jun 30, 2008 3:35 pm
by noorman
^w^ing wrote:I was referring to numbers 1 & 2 in the list of known SMP bugs and issues ( viewtopic.php?f=8&t=50 )
1) The core needs some time (often as much as 4 minutes) between work units to finalize work and move on. It may also need some time AFTER you stop the client. Do not restart the client without checking if the copies of FahCore_a1 have stopped.

2) There is a brief pause (15-20 seconds) at the end of each WU. This is so we can make sure all the threads sync up. This is not a bug, as much as a limitation of SMP needing to synchronize the threads before moving on to the next WU.
Possibly the same as #1
The point is, even if the client downloaded the next WU before or right after a WU finished, it couldn't start processing it right away, because the WU would eventually fail. The pause was prolonged on purpose; there's a reason why the SMP client doesn't do anything for a few minutes. For other clients, downloading before or whilst uploading would make sense because IT WOULD (probably) WORK. For the current SMP client, that's not true.

edit: also, the current assignment system wouldn't let you download a new WU before handing over your previous one. It would just send you the same WU again, just like when you have an SMP EUE without sending back partial results.
.


I have to contradict this; I had repaired the queue.dat with Qfix, after a WU had (once again) finished 100%, but not uploaded !
Then I started the client again; it found the complete results and uploaded them whilst it downloaded a new WU.
That WU didn't crash and fail, but went on to finish 100% too.
This time it got sent automatically.

I've had more than 4 of these events now, so I know what I'm talking about (I'm running the SMP Linux v6 Client, by the way).
In all 4 instances, I had a completely finished WU (100%) without errors that didn't get uploaded because of the faulty queue.dat entry (Status was wrong).
After fixing that, it was sent immediately whilst a WU was downloaded and started (also immediately).
After the fixes, the next WU went to 100% and got uploaded (as normal).

So, no link between an upload and a download without pause (before it) and a crashing WU as a consequence !

To demonstrate it, I 'll put in a log copy of such an event: (from 1 of the Linux machines / not this one)

# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/noorman/Folding@Home
Executable: ./fah6
Arguments: -smp -delete 02 

[09:37:02] - Ask before connecting: No
[09:37:02] - User name: noorman (Team 734)
[09:37:02] - User ID: 48B83D25538777D9
[09:37:02] - Machine ID: 1
[09:37:02] 
[09:37:03] Loaded queue successfully.
[09:37:03] Deleting work unit #2 from work queue...
[09:41:24] - Failed to delete the requested work unit

Folding@Home Client Shutdown.


--- Opening Log file [June 24 09:42:18] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/noorman/Folding@Home
Executable: ./fah6
Arguments: -smp -verbosity 9 

[09:42:18] - Ask before connecting: No
[09:42:18] - User name: noorman (Team 734)
[09:42:18] - User ID: 48B83D25538777D9
[09:42:18] - Machine ID: 1
[09:42:18] 
[09:42:18] Loaded queue successfully.
[09:42:18] - Autosending finished units...
[09:42:18] Trying to send all finished work units


[09:42:18] + Attempting to send results
[09:42:18] - Reading file work/wuresults_02.dat from core
[09:42:18]   (Read 5530530 bytes from disk)
[09:42:18] Connecting to http://171.64.65.56:8080/
[09:42:18] - Preparing to get new work unit...
[09:42:18] + Attempting to get work packet
[09:42:18] - Will indicate memory of 2014 MB
[09:42:18] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 3, Stepping: 2
[09:42:18] - Connecting to assignment server
[09:42:18] Connecting to http://assign.stanford.edu:8080/
[09:42:19] Posted data.
[09:42:19] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[09:42:19] + News From Folding@Home: Welcome to Folding@Home
[09:42:19] Loaded queue successfully.
[09:42:19] Connecting to http://171.64.65.56:8080/
[09:42:23] Posted data.
[09:42:23] Initial: 0000; - Receiving payload (expected size: 2444530)
[09:42:32] - Downloaded at ~265 kB/s
[09:42:32] - Averaged speed for that direction ~485 kB/s
[09:42:32] + Received work.
[09:42:32] + Closed connections
[09:42:32] 
[09:42:32] + Processing work unit
[09:42:32] Core required: FahCore_a1.exe
[09:42:32] Core found.
[09:42:32] Working on Unit 03 [June 24 09:42:32]
[09:42:32] + Working ...
[09:42:32] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 03 -checkpoint 3 -verbose -lifeline 14561 -version 602'

[09:42:32] 
[09:42:32] *------------------------------*
[09:42:32] Folding@Home Gromacs SMP Core
[09:42:32] Version 1.74 (November 27, 2006)
[09:42:32] 
[09:42:32] Preparing to commence simulation
[09:42:32] - Ensuring status. Please wait.
[09:42:49] - Looking at optimizations...
[09:42:49] - Working with standard loops on this execution.
[09:42:49] - Previous termination of core was improper.
[09:42:49] - Going to use standard loops.
[09:42:49] - Files status OK
[09:42:50] - Expanded 2444018 -> 1290766- Starting from initial work packet
[09:42:50] 
[09:42:50] Project: 2605 (Run 12, Clone 127, Gen 65)
[09:42:50] 
[09:42:50] Entering M.D.
[09:42:50] ne 127, Gen 65)
[09:42:50] 
[09:42:50] Entering M.D.
[09:42:57] les
[09:42:57] cal files
[09:42:57] in in POPC
[09:42:57] Writing local files
[09:42:57] Extra SSE boost OK.
[09:42:58] 0000 steps  (0 percent)
[09:43:51] Posted data.
[09:43:51] Initial: 0000; - Uploaded at ~57 kB/s
[09:43:52] - Averaged speed for that direction ~57 kB/s
[09:43:52] + Results successfully sent
[09:43:52] Thank you for your contribution to Folding@Home.
[09:43:52] + Number of Units Completed: 2

[09:43:53] + Sent 1 of 1 completed units to the server
[09:43:53] - Autosend completed
[09:45:59] Timered checkpoint triggered.
[09:48:58] Timered checkpoint triggered.
[09:51:58] Timered checkpoint triggered.
[09:54:58] Timered checkpoint triggered.
[09:57:58] Timered checkpoint triggered.
[10:00:58] Timered checkpoint triggered.
[10:03:59] Timered checkpoint triggered.
[10:06:31] Writing local files
[10:06:31] Completed 5000 out of 500000 steps  (1 percent)
[10:09:31] Timered checkpoint triggered.
[10:12:31] Timered checkpoint triggered.
[10:15:31] Timered checkpoint triggered.
[10:18:31] Timered checkpoint triggered.
[10:21:31] Timered checkpoint triggered.
[10:24:31] Timered checkpoint triggered.
[10:27:31] Timered checkpoint triggered.
[10:30:06] Writing local files
[10:30:06] Completed 10000 out of 500000 steps  (2 percent)
[10:33:07] Timered checkpoint triggered.
[10:36:07] Timered checkpoint triggered.
[10:39:06] Timered checkpoint triggered.
[10:42:07] Timered checkpoint triggered.
[10:45:07] Timered checkpoint triggered.
[10:48:06] Timered checkpoint triggered.
[10:51:07] Timered checkpoint triggered.
[10:53:42] Writing local files

snip


quote tags changed to code tags, and redundant log content snipped out. 7im
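
The timing in the log above can be checked with a small sketch. It only uses three timestamps copied from the log; the helper name `seconds_between` and the variable names are made up for illustration, and the script assumes all three events fall on the same day.

```python
from datetime import datetime

def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds between two HH:MM:SS log timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Timestamps copied from the log above.
upload_started = "09:42:18"   # "+ Attempting to send results"
work_received = "09:42:32"    # "+ Received work."
upload_finished = "09:43:52"  # "+ Results successfully sent"

download_time = seconds_between(upload_started, work_received)
upload_time = seconds_between(upload_started, upload_finished)
# Time the new unit was already running while the upload was still in progress.
overlap = seconds_between(work_received, upload_finished)

print(f"new work received after {download_time} s")
print(f"upload finished after {upload_time} s")
print(f"upload and folding overlapped for {overlap} s")
```

In this run the new unit was received well before the upload of the old one had finished, which is the overlap described in the post.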

Re: Lost Time

Posted: Mon Jun 30, 2008 4:04 pm
by ^w^ing
Well, if you had to repair the queue.dat, you had to shut down the client, and that also means shutting down the cores. So when you started the client, it was past the procedure for which the pause is needed, and the client really could download a new WU & upload results. And I didn't say there is a link between uploading and downloading a new WU without pause; the link is between the 'old' cores shutting down and the 'new' cores starting up without pause.

I'll have to step out though, because I don't know which of the WinSMP issues apply to Linux SMP, and I thought you were running the Win one. I still think that what I said applies even to the Linux SMP clients.

Re: Lost Time

Posted: Mon Jun 30, 2008 5:31 pm
by noorman
^w^ing wrote:Well, if you had to repair the queue.dat, you had to shut down the client, and that also means shutting down the cores. So when you started the client, it was past the procedure for which the pause is needed, and the client really could download a new WU & upload results. And I didn't say there is a link between uploading and downloading a new WU without pause; the link is between the 'old' cores shutting down and the 'new' cores starting up without pause.

I'll have to step out though, because I don't know which of the WinSMP issues apply to Linux SMP, and I thought you were running the Win one. I still think that what I said applies even to the Linux SMP clients.
.


:?: :?: :?:

If I kill the client before it realises the WU has been finished (faulty queue.dat) and nothing has been sent, how can the time needed for whatever it is (I still don't know what) have passed, when it hasn't even been triggered yet ?
If I then restart the Client and it uploads the Results whilst connecting to the work server for a new WU, what has been done, and where ?
There hasn't been time to process anything ...

I compare it to the situation where the upload can't happen for some reason or another and the results are kept in queue; then when later on, the server comes back up and the result can be sent, where then has there been a process that links these Results to the currently running WU ???
Impossible !

There is no need for the Results to be processed before another WU can be launched because there are already WU's waiting to be assigned and the WU that gets its new data/parameters from my uploaded Results can be assigned to the next computer that logs on !
So, why should I be waiting for any processing that needs to be done on my Results ???
No need for that at all.

I'm still telling you that there is a link between the 'partially broken' -delete function (with a 4 minute delay/pause) and the pause we see after each WU finishes !


.

Re: Lost Time

Posted: Mon Jun 30, 2008 6:15 pm
by ^w^ing
noorman wrote:If I kill the client before it realises the WU has been finished (faulty queue.dat) and nothing has been sent, how can the time needed for whatever it is (I still don't know what) have passed, when it hasn't even been triggered yet ?
If I then restart the Client and it uploads the Results whilst connecting to the work server for a new WU, what has been done, and where ?
There hasn't been time to process anything ...
Let's call it magic. When the WU is finished, magic happens. When you kill the client after completion of the WU, that's the same magic. Then you use qfix and start up the client again. The magic has been done already, so there's no need for it anymore; your client sends your results and downloads a new WU. Now it doesn't wait until the work is sent, it downloads the new WU immediately.
noorman wrote:I compare it to the situation where the upload can't happen for some reason or another and the results are kept in queue; then when later on, the server comes back up and the result can be sent, where then has there been a process that links these Results to the currently running WU ???
Impossible !

There is no need for the Results to be processed before another WU can be launched because there are already WU's waiting to be assigned and the WU that gets its new data/parameters from my uploaded Results can be assigned to the next computer that logs on !
So, why should I be waiting for any processing that needs to be done on my Results ???
No need for that at all.
No, I was talking about what 7im said in the last post on the first page of this topic. The assignment server doesn't need your finished WU for assigning you a new one, it just wants your finished WU before sending you the new one. This is how it works BEFORE you shut down the client.

Re: Lost Time

Posted: Mon Jun 30, 2008 6:18 pm
by bruce
@Noorman

Please stay on a single topic. So far you've complained about at least three different issues which potentially have three different solutions.
1) Download of new work should precede upload or happen simultaneously whenever possible.
2) The SMP cores take anywhere from 0 to 4 minutes to shut down after the last message you see.
3) Certain errors do delete WUs which can be recovered by qfix (most notably the SMP client)

This is turning into your own personal gripe list about everything that's wrong with FAH and that's not constructive -- mainly because discussions about one topic can easily get overlooked when they're embedded in a thread about something else.

Nobody is saying that the issues don't need attention -- just that none of them are on an equal priority with other issues that are keeping development busy. People are also suggesting that you (and the rest of us, too) are not able to consider all of the pros and cons because we don't always see the "big picture" of FAH and all the possible interactions of a single change.

When the guys in development get "free time" (ya, sure) they generally reconsider everything that's on the lower priority list and work on whatever will make the best scientific improvements.

As far as fixing SMP for Windows is concerned, it's one of several high priority issues, but there's a prerequisite. The Pande Group has to have access to a version of MPI that's both dependable and fast. Until that happens, the emphasis will be on other high priority tasks.

Re: Lost Time

Posted: Tue Jul 01, 2008 10:44 pm
by leexgx
1) Download of new work should precede upload or happen simultaneously whenever possible.
2 and 3 are not related to this; those two things come down to the fact that all 4 cores must shut down before the client can start sending.

Best to download only if the work unit completes successfully (only when FINISHED_UNIT / CoreStatus = 64 (100) comes up, as once that comes up the core is closed, SMP or not); then do the download, send the work back to the server, and start working on the next one while it is still sending back the last one.
The reason you will not get the same work unit back is that the client has an 8-slot queue; it would use the next queue slot for the work it has just downloaded, which would be a new work unit. You will only get the same work unit if you manually delete the work folder before it completes, or use the same machine ID and open a second copy in a different working folder.

If the completed work unit is big (say more than 2-3 MB), FAH is wasting time uploading when it could be working on the next work unit while it uploads.

And 4 minutes is about the average for how long it can take to send a work unit (don't quote that as a fixed figure, it is just that it can take 4 minutes) when the send fails or the unit is simply big. A lot of users' ISPs give less than 128 kbit/s (16 KB/s) upload but 2 Mbit/s (256 KB/s) download, and on big work units it can easily take 15 minutes to send the work unit (some of them can be up to 15 MB in size).

Doing the above is the same as restarting the client (it auto-downloads and sends on startup if a work unit is awaiting sending and there is no work unit to work on).
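
As a rough check of the upload figures in the post above, here is a minimal sketch. It assumes a steady 16 KB/s upstream with no protocol overhead or retries, and takes 1 MB as 1024 KB; the function name `upload_minutes` is made up for illustration.

```python
def upload_minutes(size_mb: float, rate_kb_per_s: float) -> float:
    """Estimate upload time in minutes for a result file of size_mb megabytes."""
    size_kb = size_mb * 1024  # assumption: 1 MB = 1024 KB
    return size_kb / rate_kb_per_s / 60

# Sizes and rate taken from the post: 2-3 MB "big" units, 15 MB worst case, 16 KB/s up.
for size_mb in (3, 15):
    minutes = upload_minutes(size_mb, 16)
    print(f"{size_mb} MB at 16 KB/s: about {minutes:.1f} minutes")
```

Under these assumptions a 15 MB result needs roughly a quarter of an hour on such a line, which is time the cores could spend on the next unit if the download were started first.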

Re: Lost Time

Posted: Wed Jul 02, 2008 1:20 am
by noorman
bruce wrote:@Noorman

Please stay on a single topic. So far you've complained about at least three different issues which potentially have three different solutions.
1) Download of new work should precede upload or happen simultaneously whenever possible.
I just pointed out 2 possible solutions to the 4 minute pause ...
The topic I talk about is nothing but the 'Lost Time' !
2) The SMP cores take anywhere from 0 to 4 minutes to shut down after the last message you see.
The same as the previous point; the FahCores shut down within seconds of the WU getting to 100%; it's after that that the 4 minute idling takes place, not 0-4 mins !
3) Certain errors do delete WUs which can be recovered by qfix (most notably the SMP client)
I was only using facts - found during the fixing of a faulty queue.dat (Status) entry - to demonstrate my findings concerning the pause that was not in effect during that repair sequence I performed; I demonstrated that an upload and download can run simultaneously, thus proving there was no need for the 4 minute delay/pause.
After that event, there was also NO crash or EUE of the following WU, proving that the simultaneous up- and down- communication had no ill effects on the aftermath !
This is turning into your own personal gripe list about everything that's wrong with FAH and that's not constructive -- mainly because discussions about one topic can easily get overlooked when they're embedded in a thread about something else.

Nobody is saying that the issues don't need attention -- just that none of them are on an equal priority with other issues that are keeping development busy. People are also suggesting that you (and the rest of us, too) are not able to consider all of the pros and cons because we don't always see the "big picture" of FAH and all the possible interactions of a single change.

When the guys in development get "free time" (ya, sure) they generally reconsider everything that's on the lower priority list and work on whatever will make the best scientific improvements.

As far as fixing SMP for Windows is concerned, it's one of several high priority issues, but there's a prerequisite. The Pande Group has to have access to a version of MPI that's both dependable and fast. Until that happens, the emphasis will be on other high priority tasks.
There is nothing personal in this; I 'm only trying to get the necessary attention to the problem, which is every SMP Folder's problem & also that of the project.

Since the SMP WU's finish that much more quickly than the old single-core WU's (or they expire), the proportion of the 'pause' to the total WU run-time has become much bigger, and therefore more important; compare 4 mins out of a few weeks/months with 4 mins out of less than a day.
Another point is that Folders/volunteers who invest a lot in FaH, with dedicated Folding, checking for errors, repairing errors, checking on News & Forums, reporting on FaH, etc. should be able to get something like this 'upped' in priority; certainly now that also Energy prices are rising and rising, as do other costs, there is a good case to prioritize this time-wasting problem !

It 's a beta, it 's a bug, it needs a fix and that fix could have been implemented before the release of the v6 Client ...
Another point is the fact that the GPU client does NOT have this fault/bug (I verified that with a copy of the log of another Team member who runs GPU2), demonstrating that a turnaround of 3-4 seconds is possible.

And about 'the Science'; all those lost days (see calculation in earlier post) are also lost Science for the Project, not only losses for the dedicated Folder volunteers !
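
The proportion argument in this post can be put into numbers with a short sketch. The run-times used here (24 hours for an SMP unit, 3 weeks for an old single-core unit) are illustrative assumptions, not measured values, and `idle_fraction` is a made-up helper name.

```python
def idle_fraction(pause_minutes: float, wu_runtime_hours: float) -> float:
    """Fraction of wall-clock time spent in the pause between work units."""
    pause_hours = pause_minutes / 60
    return pause_hours / (wu_runtime_hours + pause_hours)

# Assumed run-times for illustration only.
smp = idle_fraction(4, 24)               # SMP unit finishing in about a day
single_core = idle_fraction(4, 21 * 24)  # single-core unit running about 3 weeks

print(f"SMP:         {smp:.4%} of the time idle")
print(f"single-core: {single_core:.4%} of the time idle")
print(f"ratio:       {smp / single_core:.1f}x")
```

With these assumed run-times the same 4 minute pause weighs roughly twenty times heavier on the SMP client; faster units make the ratio larger still.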


.

Re: Lost Time

Posted: Wed Jul 02, 2008 2:45 am
by leexgx
This is a problem for all clients, not just SMP.

Most of the time it does it in 3 secs (for my single console clients), but that's if it's a small work unit (my GPU2 takes 1 min to send before it gets the next work unit, which starts within 5 secs). It's just that when it's doing a big send, it could be downloading the next work unit and working on it while it is sending the last one.

Re: Lost Time

Posted: Wed Jul 02, 2008 5:37 am
by bruce
leexgx wrote:This is a problem for all clients, not just SMP.
True.

This same topic came up back in the days of V4 and the Pande Group has not responded to the enhancement requests that were made then. (Technically, it's not a bug, it's a request for an enhancement. -- Bugs are things that do not work. In this case the client works as designed.)

The client has always been able to upload and download at the same time. It just doesn't do it during the first upload attempt.

One of the V4 bugs that was fixed long ago:
Suppose the upload fails. The client leaves the WU in the upload queue to be retried later and moves on to download a new WU. Suppose that in the few seconds between the failed upload attempt and the new download request, the server gets fixed. You used to get the same WU reassigned. New logic was added so that the same WU was NOT reassigned as long as the WU is in the upload queue.

They'll need to make sure that the same logic applies if the upload and download happen simultaneously.
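
The rule described above (do not hand out a WU that is still sitting in the upload queue) can be sketched as follows. This is only an illustration of the idea, not the actual client or server code: the function `pick_assignment`, the tuple layout and the example WU numbers are all hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical identifier: (project, run, clone, gen)
WorkUnitId = Tuple[int, int, int, int]

def pick_assignment(candidates: Iterable[WorkUnitId],
                    pending_uploads: Set[WorkUnitId]) -> Optional[WorkUnitId]:
    """Return the first candidate WU that is not waiting in the upload queue."""
    for wu in candidates:
        if wu not in pending_uploads:
            return wu
    return None  # nothing suitable; the caller would retry later

# Example: the finished unit is still queued for upload, so it must be skipped.
finished = (1000, 1, 2, 3)    # made-up numbers
next_unit = (1000, 1, 3, 3)   # made-up numbers
print(pick_assignment([finished, next_unit], {finished}))
```

The same check would have to hold whether the upload has already failed once or is still in progress while the download request is made.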

Re: Lost Time

Posted: Wed Jul 02, 2008 10:12 am
by noorman
bruce wrote:
leexgx wrote:This is a problem for all clients, not just SMP.
True.

This same topic came up back in the days of V4 and the Pande Group has not responded to the enhancement requests that were made then. (Technically, it's not a bug, it's a request for an enhancement. -- Bugs are things that do not work. In this case the client works as designed.)

The client has always been able to upload and download at the same time. It just doesn't do it during the first upload attempt.

One of the V4 bugs that was fixed long ago:
Suppose the upload fails. The client leaves the WU in the upload queue to be retried later and moves on to download a new WU. Suppose that in the few seconds between the failed upload attempt and the new download request, the server gets fixed. You used to get the same WU reassigned. New logic was added so that the same WU was NOT reassigned as long as the WU is in the upload queue.

They'll need to make sure that the same logic applies if the upload and download happen simultaneously.
.

1) leexgx is off the ball; I never talked about the upload time - being a problem - and I know that the upload time will be different for everyone and for every project, depending on connection speed and/or project size !
I 'm talking about minutes of idling when effectively nothing is happening or nothing that couldn't be done in seconds rather than (the same old) 4 minutes ...

2) I don't see why, if reassignment is needed , a WU couldn't be reassigned to any other computer that 's running the correct/needed Client and why (in this case) my or any SMP-Client should be waiting on an acceptance protocol or something else ...

Just give the machine new work ! Dedicated machines are there to work ! (and I 'm not alone in thinking that / I read the reports in my forum when something disables normal traffic between Stanford and the Team's Folding systems)
WHY ?
If the previous WU was finished completely and without errors, its results should get back to Stanford (from the queue if need be), hopefully 'in time'.
If the Client/WU has crashed or the WU can't be sent to Stanford in time, the work will be lost anyway. This will become obvious when the preferred deadline passes.
At that time, the WU can (and will) be assigned to someone else; it would even be preferable to give it to somebody else, since there seems to be at least 1 problem with the system (or its connection) that passed the deadline ...

I would understand the 'acceptance protocol' if there only were 2 or 3 systems running FaH (SMP), but when there are (at least) thousands, what does it matter who does what, even for a 2nd or 3rd time ?
If a system gets a new WU immediately, no time is lost.
If that system is unstable, almost any WU will crash on it; if a 2nd try is done on another system, chances of a complete finish are better ...
This same topic came up back in the days of V4 and the Pande Group has not responded to the enhancement requests that were made then.
The problem is - as I said before (here) - that SMP isn't like the v4 or v5 client stuff; it has a much higher work speed, enforced by the very short deadlines !
Therefore, the need to fix this for SMP is much greater; the time proportions are very different with SMP, even more so with GPU(2), but in the latter case there seems to be no problem with assigning and sending a WU within seconds of the return of the previous WU results.


.

Re: Lost Time

Posted: Wed Jul 02, 2008 3:00 pm
by leexgx
But that's a known problem in the SMP client (checking the first post, I understood this topic was about the time wasted when it starts sending the work unit when it could be downloading a new project).

----------------------------

The only times SMP was talked about were in the second post and later on pages 2 and 3; I guess I was not looking at each post closely enough. I am talking about when it gets to the FINISHED_UNIT stage (no error).

Bruce and I seem to be along the same lines: it's about when it's sending the work unit.

You're still thinking about SMP; that's more a limitation of the MPI that they use. To me this topic is about that delay when it starts sending the unit, but I see it has turned more into the MPI SMP problem.

I still do not see why it's so hard to add a FINISHED_UNIT trigger that starts a download at the same time as the upload if the unit completed with no error, as it already does this if the client is restarted.
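
The trigger suggested above could look roughly like the sketch below. It is not how the FAH client is implemented: `upload_results`, `download_work` and `on_finished_unit` are hypothetical stand-ins, and the network transfers are replaced by stubs so that the example runs on its own.

```python
import threading
import time

def upload_results(events: list) -> None:
    # Hypothetical stub for the slow upload of the finished unit.
    time.sleep(0.3)
    events.append("upload done")

def download_work(events: list) -> None:
    # Hypothetical stub for fetching the next work unit.
    events.append("download done")

def on_finished_unit() -> list:
    """On FINISHED_UNIT, upload in the background and fetch new work at once."""
    events: list = []
    uploader = threading.Thread(target=upload_results, args=(events,))
    uploader.start()                    # upload runs in the background
    download_work(events)               # fetch the next unit without waiting
    events.append("next unit started")  # cores could start on it here
    uploader.join()                     # make sure the upload completes
    return events

events = on_finished_unit()
print(events)
```

The design choice is simply that the upload no longer blocks the download; the reassignment rule bruce describes would still have to hold while both transfers are in flight.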