Re: Lost Time
Posted: Mon Jun 30, 2008 10:13 am
.
Thanks!
I see that there's improvement!
I hope that SMP gets fixed ...
.
.^w^ing wrote:Actually I think that only SMP has that prolonged delay between WUs, because all the cores need to shut down and other things need to be cleaned up. There's no such long delay in the regular CPU clients either.
.^w^ing wrote:I was referring to numbers 1 & 2 in the list of known SMP bugs and issues ( viewtopic.php?f=8&t=50 ):
1) The core needs some time (often as much as 4 minutes) between work units to finalize work and move on. It may also need some time AFTER you stop the client. Do not restart the client without checking if the copies of FahCore_a1 have stopped.
2) There is a brief pause (15-20 seconds) at the end of each WU. This is so we can make sure all the threads sync up. This is not a bug, as much as a limitation of SMP needing to synchronize the threads before moving on to the next WU.
Possibly the same as #1
The point is, even if the client downloaded the next WU before or right after a WU is finished, it couldn't start processing it right away, because the WU would eventually fail. The pause was prolonged on purpose; there's a reason why the SMP client doesn't do anything for a few minutes. For other clients, downloading before or whilst uploading would make sense because IT WOULD (probably) WORK. For the current SMP client, that's not true.
edit: also, the current assignment system wouldn't let you download a new WU before handing over your previous one. It would just send you the same WU again, just like when having an SMP EUE without sending back partial results.
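For anyone who wants to follow the advice in item 1 without eyeballing a process list, here is a rough Python sketch. It is not part of the FAH client; the function names are made up for illustration, and it assumes a Linux box where `ps -e -o pid= -o comm=` is available and the core shows up under a command name starting with FahCore_a1 (as in the log further down).

```python
import subprocess


def core_pids(ps_output, core_name="FahCore_a1"):
    """Return the PIDs in `ps` output whose command name starts with core_name.

    Expects one process per line in the form "<pid> <command>".
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        if parts[1].strip().startswith(core_name):
            pids.append(int(parts[0]))
    return pids


def cores_still_running(core_name="FahCore_a1"):
    """Ask ps for all processes and return PIDs of any remaining cores.

    Assumption: a ps that accepts -e and -o with empty headers (procps does).
    """
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return core_pids(result.stdout, core_name)
```

If `cores_still_running()` returns an empty list, no copies of the core are left and it should be safe to restart the client; otherwise wait and check again.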
Code:
# SMP Client ##################################################################
###############################################################################
Folding@Home Client Version 6.02beta
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /home/noorman/Folding@Home
Executable: ./fah6
Arguments: -smp -delete 02
[09:37:02] - Ask before connecting: No
[09:37:02] - User name: noorman (Team 734)
[09:37:02] - User ID: 48B83D25538777D9
[09:37:02] - Machine ID: 1
[09:37:02]
[09:37:03] Loaded queue successfully.
[09:37:03] Deleting work unit #2 from work queue...
[09:41:24] - Failed to delete the requested work unit
Folding@Home Client Shutdown.
--- Opening Log file [June 24 09:42:18]
# SMP Client ##################################################################
###############################################################################
Folding@Home Client Version 6.02beta
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /home/noorman/Folding@Home
Executable: ./fah6
Arguments: -smp -verbosity 9
[09:42:18] - Ask before connecting: No
[09:42:18] - User name: noorman (Team 734)
[09:42:18] - User ID: 48B83D25538777D9
[09:42:18] - Machine ID: 1
[09:42:18]
[09:42:18] Loaded queue successfully.
[09:42:18] - Autosending finished units...
[09:42:18] Trying to send all finished work units
[09:42:18] + Attempting to send results
[09:42:18] - Reading file work/wuresults_02.dat from core
[09:42:18] (Read 5530530 bytes from disk)
[09:42:18] Connecting to http://171.64.65.56:8080/
[09:42:18] - Preparing to get new work unit...
[09:42:18] + Attempting to get work packet
[09:42:18] - Will indicate memory of 2014 MB
[09:42:18] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 3, Stepping: 2
[09:42:18] - Connecting to assignment server
[09:42:18] Connecting to http://assign.stanford.edu:8080/
[09:42:19] Posted data.
[09:42:19] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[09:42:19] + News From Folding@Home: Welcome to Folding@Home
[09:42:19] Loaded queue successfully.
[09:42:19] Connecting to http://171.64.65.56:8080/
[09:42:23] Posted data.
[09:42:23] Initial: 0000; - Receiving payload (expected size: 2444530)
[09:42:32] - Downloaded at ~265 kB/s
[09:42:32] - Averaged speed for that direction ~485 kB/s
[09:42:32] + Received work.
[09:42:32] + Closed connections
[09:42:32]
[09:42:32] + Processing work unit
[09:42:32] Core required: FahCore_a1.exe
[09:42:32] Core found.
[09:42:32] Working on Unit 03 [June 24 09:42:32]
[09:42:32] + Working ...
[09:42:32] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 03 -checkpoint 3 -verbose -lifeline 14561 -version 602'
[09:42:32]
[09:42:32] *------------------------------*
[09:42:32] Folding@Home Gromacs SMP Core
[09:42:32] Version 1.74 (November 27, 2006)
[09:42:32]
[09:42:32] Preparing to commence simulation
[09:42:32] - Ensuring status. Please wait.
[09:42:49] - Looking at optimizations...
[09:42:49] - Working with standard loops on this execution.
[09:42:49] - Previous termination of core was improper.
[09:42:49] - Going to use standard loops.
[09:42:49] - Files status OK
[09:42:50] - Expanded 2444018 -> 1290766- Starting from initial work packet
[09:42:50]
[09:42:50] Project: 2605 (Run 12, Clone 127, Gen 65)
[09:42:50]
[09:42:50] Entering M.D.
[09:42:50] ne 127, Gen 65)
[09:42:50]
[09:42:50] Entering M.D.
[09:42:57] les
[09:42:57] cal files
[09:42:57] in in POPC
[09:42:57] Writing local files
[09:42:57] Extra SSE boost OK.
[09:42:58] 0000 steps (0 percent)
[09:43:51] Posted data.
[09:43:51] Initial: 0000; - Uploaded at ~57 kB/s
[09:43:52] - Averaged speed for that direction ~57 kB/s
[09:43:52] + Results successfully sent
[09:43:52] Thank you for your contribution to Folding@Home.
[09:43:52] + Number of Units Completed: 2
[09:43:53] + Sent 1 of 1 completed units to the server
[09:43:53] - Autosend completed
[09:45:59] Timered checkpoint triggered.
[09:48:58] Timered checkpoint triggered.
[09:51:58] Timered checkpoint triggered.
[09:54:58] Timered checkpoint triggered.
[09:57:58] Timered checkpoint triggered.
[10:00:58] Timered checkpoint triggered.
[10:03:59] Timered checkpoint triggered.
[10:06:31] Writing local files
[10:06:31] Completed 5000 out of 500000 steps (1 percent)
[10:09:31] Timered checkpoint triggered.
[10:12:31] Timered checkpoint triggered.
[10:15:31] Timered checkpoint triggered.
[10:18:31] Timered checkpoint triggered.
[10:21:31] Timered checkpoint triggered.
[10:24:31] Timered checkpoint triggered.
[10:27:31] Timered checkpoint triggered.
[10:30:06] Writing local files
[10:30:06] Completed 10000 out of 500000 steps (2 percent)
[10:33:07] Timered checkpoint triggered.
[10:36:07] Timered checkpoint triggered.
[10:39:06] Timered checkpoint triggered.
[10:42:07] Timered checkpoint triggered.
[10:45:07] Timered checkpoint triggered.
[10:48:06] Timered checkpoint triggered.
[10:51:07] Timered checkpoint triggered.
[10:53:42] Writing local files
snip
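To put numbers on the lost time instead of arguing from memory, a small Python sketch like the following can list the long silent stretches in a log such as the one above. This is only an illustration, not an official tool: `find_gaps` and the 200 second threshold are my own choices, and it assumes every relevant line starts with a `[HH:MM:SS]` stamp as in the posted log.

```python
import re

# Matches the "[HH:MM:SS]" prefix used by the client log shown above.
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")


def find_gaps(log_text, min_seconds=200):
    """Return (seconds, previous_line, next_line) for each quiet stretch.

    Lines without a timestamp are ignored. The log carries no date, so a
    gap is taken modulo 24 hours; a restart more than a day later would
    be under-reported.
    """
    gaps = []
    prev = None  # (seconds since midnight, line text) of the last stamped line
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if not match:
            continue
        hours, minutes, seconds = map(int, match.groups())
        now = hours * 3600 + minutes * 60 + seconds
        if prev is not None:
            delta = (now - prev[0]) % 86400
            if delta >= min_seconds:
                gaps.append((delta, prev[1], line))
        prev = (now, line)
    return gaps
```

The threshold sits just above the 3 minute checkpoint interval visible in the log, so regular checkpoints are not reported. Fed the first session above, it flags the stretch between "Deleting work unit #2 from work queue..." at 09:37:03 and "Failed to delete the requested work unit" at 09:41:24.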
.^w^ing wrote:Well, if you had to repair the queue.dat, you had to shut down the client, which also means shutting down the cores. So when you started the client, it was past the procedure for which the pause is needed, and the client really could download a new WU & upload results. And I didn't say there is a link between uploading and downloading a new WU without a pause; the link is between the 'old' cores shutting down and the 'new' cores starting up without a pause.
I'll have to step out though, because I don't know which of the WinSMP issues apply to Linux SMP, and I thought you were running the Win one. I still think that what I said applies even to the Linux SMP clients.
noorman wrote:If I kill the client before it realises the WU has been finished (faulty queue.dat) and nothing has been sent, how then can the time pass that is needed for I don't know what (still) to happen, when it hasn't been instigated yet?
If I then restart the Client and it does an upload of Results whilst connecting to the work server for a new WU, where has what been done?
There hasn't been time to process anything ...
Let's call it magic. When the WU is finished, magic happens. When you kill the client after completion of the WU, that's the same magic. Then you use qfix and start up the client again. The magic has been done already, so there's no need for it anymore; your client sends your results and downloads a new WU. Now it doesn't wait until the work is sent, it downloads the new WU immediately.
noorman wrote:I compare it to the situation where the upload can't happen for some reason or another and the results are kept in queue; then when later on the server comes back up and the result can be sent, where has there been a process that links these Results to the currently running WU?
Impossible!
There is no need for the Results to be processed before another WU can be launched, because there are already WUs waiting to be assigned, and the WU that gets its new data/parameters from my uploaded Results can be assigned to the next computer that logs on!
So, why should I be waiting for any processing that needs to be done on my Results?
No need for that at all.
No, I was talking about what 7im said in the last post on the first page of this topic. The assignment server doesn't need your finished WU for assigning you a new one; it just wants your finished WU before sending you the new one. This is how it works BEFORE you shut down the client.
bruce wrote:1) Download of new work should precede upload or happen simultaneously whenever possible.
2 and 3 are not related to this; those two things come down to the fact that all 4 cores must shut down before the client can start sending.
bruce wrote:@Noorman
Please stay on a single topic. So far you've complained about at least three different issues which potentially have three different solutions.
1) Download of new work should precede upload or happen simultaneously whenever possible.
I just pointed out 2 possible solutions to the 4 minute pause ...
bruce wrote:2) The SMP cores take anywhere from 0 to 4 minutes to shut down after the last message you see.
The same as the previous point; the FahCores are shutting down within seconds of the WU getting to 100%; it's after that that the 4 minute idling takes place, not 0-4 mins!
bruce wrote:3) Certain errors do delete WUs which can be recovered by qfix (most notably the SMP client)
I was only using facts, found while fixing a faulty queue.dat (Status) entry, to demonstrate my findings concerning the pause, which was not in effect during the repair sequence I performed. I demonstrated that an upload and a download can run simultaneously, thus proving there was no need for the 4 minute delay/pause.
bruce wrote:This is turning into your own personal gripe list about everything that's wrong with FAH and that's not constructive -- mainly because discussions about one topic can easily get overlooked when they're embedded in a thread about something else.
Nobody is saying that the issues don't need attention -- just that none of them are on an equal priority with other issues that are keeping development busy. People are also suggesting that you (and the rest of us, too) are not able to consider all of the pros and cons because we don't always see the "big picture" of FAH and all the possible interactions of a single change.
When the guys in development get "free time" (ya, sure) they generally reconsider everything that's on the lower priority list and work on whatever will make the best scientific improvements.
As far as fixing SMP for Windows is concerned, it's one of several high priority issues, but there's a prerequisite. The Pande Group has to have access to a version of MPI that's both dependable and fast. Until that happens, the emphasis will be on other high priority tasks.
There is nothing personal in this; I'm only trying to get the necessary attention to the problem, which is every SMP Folder's problem & also that of the project.
leexgx wrote:this is a problem for all clients, not just SMP
bruce wrote:True.
This same topic came up back in the days of V4 and the Pande Group has not responded to the enhancement requests that were made then. (Technically, it's not a bug, it's a request for an enhancement. -- Bugs are things that do not work. In this case the client works as designed.)
The client has always been able to upload and download at the same time. It just doesn't do it during the first upload attempt.
One of the V4 bugs that was fixed long ago:
Suppose the upload fails. The client leaves the WU in the upload queue to be retried later and moves on to download a new WU. Suppose that in the few seconds between the failed upload attempt and the new download request, the server gets fixed. You used to get the same WU reassigned. New logic was added so that the same WU was NOT reassigned as long as the WU is in the upload queue.
They'll need to make sure that the same logic applies if the upload and download happen simultaneously.
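The rule described above (never hand back a WU that is still sitting in the client's upload queue) can be sketched in a few lines of Python. This is only a guess at the shape of the logic, not the Pande Group's server code; `pick_assignment` and the example WU other than the one in the log above are hypothetical.

```python
def pick_assignment(candidates, upload_queue):
    """Return the first candidate WU that is not still waiting in the
    client's upload queue, or None if every candidate is excluded."""
    pending = set(upload_queue)
    for wu in candidates:
        if wu not in pending:
            return wu
    return None


# WUs identified as (project, run, clone, gen), as in the client log.
finished = (2605, 12, 127, 65)   # result still waiting to be uploaded
other = (2605, 3, 40, 12)        # hypothetical WU that is free to assign

# The finished WU is skipped because the client still holds its result.
print(pick_assignment([finished, other], upload_queue=[finished]))
```

The same check would have to run whether the download request arrives after a failed upload or at the same moment as the upload, which is the point made above.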
bruce wrote:This same topic came up back in the days of V4 and the Pande Group has not responded to the enhancement requests that were made then.
The problem is, as I said before (here), that SMP isn't like the v4 or v5 client stuff; it has a much higher work speed, enforced by the very short deadlines!