Re: general questions about WUs
Posted: Wed Feb 25, 2009 6:34 pm
by ^w^ing
alpha754293 wrote:
Well, see, I don't know which slots are good and which ones are bad. Not until/unless the core rolls over them.
I was just thinking of deleting pretty much the entire client directory, and then bringing in the fah6 and mpiexec back in, so that to the system, it'll look like it's starting fresh.
Heh, that would certainly fix it as well, it just looks unnecessarily drastic to me.
Anyway, what Bruce said. The client tells you which slot the currently running WU is in, so unless you have any unsent WUs, you can just delete every file in the work folder whose name contains a number other than the current WU's slot.
Re: general questions about WUs
Posted: Wed Feb 25, 2009 6:47 pm
by alpha754293
Yea...that requires too much effort on my part. lol. Simple management is better, quick.
Besides, doing it the other way would also purge the queue files in the (main) client directory, so if there are any problems with those, removing the entire client directory would solve them too and avoid further complications down the line.
On the other hand, if there's a site that explains what all the different files are and which ones I need to zap to clean up the system, I suppose I could do that manually.
(Maybe there's a much easier/quicker way of parsing through the FAHlog to see what's valid and what's not. Any ideas?)
Re: general questions about WUs
Posted: Wed Feb 25, 2009 7:31 pm
by ^w^ing
I don't see how deleting the queue file too would help with this issue. The queue file is fine even when this occurs, as the failed WU is deleted properly from the queue; only the work files are not being deleted, and that's what causes these problems.
I don't know if there's any page in the fahwiki that explains which files are what, but you certainly don't need such a page for this. Look at the numbers in the file names, not at their extensions. Look in your logfile to see which slot the currently running WU is in; it is printed between when the WU is downloaded and when it is started, for example (taken from one of your log files):
[16:28:45] + Received work.
[16:28:45] Trying to send all finished work units
[16:28:45] + No unsent completed units remaining.
[16:28:45] + Closed connections
[16:28:45]
[16:28:45] + Processing work unit
[16:28:45] Core required: FahCore_a2.exe
[16:28:45] Core found.
[16:28:45] Working on queue slot 06 [February 24 16:28:45 UTC]
[16:28:45] + Working ...
In this case, you would delete all wudata_0x.* files, where x is any number other than 6, thus preserving all wudata_06.* files, which are the files of your currently running WU. It takes about 10 seconds, maybe less.
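For anyone who would rather not read the log by eye, here is a minimal Python sketch of pulling the slot number out of it. It assumes the line always looks like the "Working on queue slot" line in the excerpt above; `current_slot` is a made-up helper name, not something the client provides.

```python
import re

# Matches lines such as:
# [16:28:45] Working on queue slot 06 [February 24 16:28:45 UTC]
SLOT_LINE = re.compile(r"Working on queue slot (\d{2})")

def current_slot(log_text):
    """Return the slot from the last 'Working on queue slot' line, or None."""
    matches = SLOT_LINE.findall(log_text)
    # The last match is the most recently started WU.
    return int(matches[-1]) if matches else None

sample = """\
[16:28:45] + Processing work unit
[16:28:45] Core required: FahCore_a2.exe
[16:28:45] Core found.
[16:28:45] Working on queue slot 06 [February 24 16:28:45 UTC]
[16:28:45] + Working ...
"""
print(current_slot(sample))
```

Taking the last match rather than the first matters because a log usually covers several WUs, and only the most recent slot is the one to leave alone.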
Re: general questions about WUs
Posted: Wed Feb 25, 2009 7:51 pm
by MtM
Or use -oneunit, and when the client exits (do check that the results were sent; -oneunit tries 3 times iirc before exiting, even without uploading results, so this is important), just delete queue.dat and the work folder completely. As Bruce said, there might be other WUs not sent yet, so use -queueinfo to find that out; if there are unsent units left, try -sendall or -send x, where x is the slot number. It would be a waste to delete completed but not uploaded entries.
Bruce wrote:All slots are good or bad depending entirely on which files were not cleaned up after some earlier EUE. There is always one active slot and when that WU is running, don't mess with it. There MAY also be completed WUs that have not yet been uploaded . . . containing a file known as "wuresults_*.dat" Any files with other numbers after the underscore are unnecessary.
I'm not entirely sure I get your point. You're saying files with a number different from the current slot are unnecessary, but while you mentioned unsent units you didn't really couple those together. Wouldn't it be better phrased as 'Any files with a number different from the current active slot and from those of WUs not yet uploaded are unnecessary'? Semantics, sorry to nitpick, but it seemed a pretty important one.
Re: general questions about WUs
Posted: Wed Feb 25, 2009 11:18 pm
by bruce
MtM wrote:I'm not entirely sure I get your point. You're saying files with a number different from the current slot are unnecessary, but while you mentioned unsent units you didn't really couple those together. Wouldn't it be better phrased as 'Any files with a number different from the current active slot and from those of WUs not yet uploaded are unnecessary'? Semantics, sorry to nitpick, but it seemed a pretty important one.
Well, here's an example that has two *_05* files. The example should be clearer than any semantics either of us would use.
KEY to colors:
- Current WU
- Leftover -- can be deleted
- Ready to upload. DO NOT DELETE
- Don't worry about these.
core78.sta
current.xyz
logfile_00.txt
logfile_01.txt
logfile_02.txt
logfile_03.txt
logfile_04.txt
logfile_05.txt
logfile_06.txt
logfile_09.txt
wudata_03.chk
wudata_03.goe
wudata_03.pdo
wudata_06.arc
wudata_06.bed
wudata_06.bxv
wudata_06.chk
wudata_06.dat
wudata_06.goe
wudata_06.pdo
wudata_06.sas
wudata_06.xtc
wudata_06.xvg
wudata_06.xyz
wudata_060.log
wudata_061.log
wudata_062.log
wudata_063.log
wudata_06CP.arc
wudata_06CP.arc.b
wuinfo_06.dat
wuresults_05.dat
In this example, the only sign of an incomplete cleanup after an EUE is slot 03.
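The same sorting can be sketched in Python. This is only an illustration under stated assumptions: that per-slot files are named `wudata_NN...`, `wuinfo_NN...` or `wuresults_NN...` as in the listing above, and that any slot holding a `wuresults_*` file is waiting to upload. The `classify` function and its group names are hypothetical, not part of the client.

```python
import re
from collections import defaultdict

# Assumption: per-slot work files look like <prefix>_<two-digit slot><rest>,
# e.g. wudata_06.chk, wudata_060.log, wudata_06CP.arc, wuresults_05.dat.
SLOT_FILE = re.compile(r"^(wudata|wuinfo|wuresults)_(\d{2})")

def classify(filenames, current):
    """Group file names into current / upload / leftover / other."""
    by_slot = defaultdict(list)
    groups = {"current": [], "upload": [], "leftover": [], "other": []}
    for name in filenames:
        m = SLOT_FILE.match(name)
        if m:
            by_slot[int(m.group(2))].append(name)
        else:
            groups["other"].append(name)  # logfile_*.txt, core78.sta, ...
    for slot, names in by_slot.items():
        if slot == current:
            groups["current"].extend(names)   # active WU, leave alone
        elif any(n.startswith("wuresults_") for n in names):
            groups["upload"].extend(names)    # finished WU, do not delete
        else:
            groups["leftover"].extend(names)  # debris from an earlier EUE
    return groups

files = ["core78.sta", "logfile_05.txt", "wudata_03.chk", "wudata_03.goe",
         "wudata_03.pdo", "wudata_06.chk", "wudata_060.log",
         "wudata_06CP.arc", "wuinfo_06.dat", "wuresults_05.dat"]
result = classify(files, current=6)
print(result["leftover"])
```

The sketch only reports what it would consider leftover; it deliberately deletes nothing, so the list can be checked against the queue before anything is removed.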
Re: general questions about WUs
Posted: Wed Feb 25, 2009 11:40 pm
by MtM
Yup that's the best way to show it.
Slot 3 has a leftover checkpoint file, indicating an EUE that was not cleaned up - gotcha!
All .txt files not related to the current slot are useless (and will be overwritten when that slot is reused) - gotcha
wuresults_0x.dat is a results file, and should be listed in the queue as finished and waiting to upload - gotcha
The only thing left to consider is corruption in the queue file itself: what if the queue.dat status codes don't match the content of the work folder? E.g. you've got a wuresults_05.dat but the queue status is empty; would the unit run to 100% and then fail because it wouldn't overwrite an existing results file, for instance?
Getting a bit OT maybe; I think this is something that is mentioned in the 'how to use qfix' thread (not sure, just assuming it is).
Re: general questions about WUs
Posted: Thu Feb 26, 2009 1:30 am
by bruce
You've listed the only case that I know of where the queue file can be considered corrupt -- when a WU finished but the core hung before the queue could be updated. Your only alternative to recover the wuresults_* is to run qfix. Deleting queue.dat and doing nothing will have the same result . . . the "corruption" has already been fixed, just not necessarily the way you want it to be fixed.
I cannot think of a situation where deleting queue.dat fixed anything.
Re: general questions about WUs
Posted: Thu Feb 26, 2009 8:20 am
by MtM
Yes, that's what I thought. What I was wondering was: if the user doesn't notice, will slot 5 in the above example just be unusable, or will a new WU, when finished, overwrite the old results file? Or will it cause the client to dismiss the new WU? Or does it check more than the status of the slot, e.g. will it avoid using slot 5 even when the status is empty because there is already a results file?
Re: general questions about WUs
Posted: Thu Feb 26, 2009 5:09 pm
by bruce
Slot 5 will be skipped, if necessary. The client will continue to try to upload the WU in slot 5 until it is successful or it expires. If it's still there when the next WU in slot 4 completes, the next WU will download to slot 6.
Although it's not likely to happen, if ten WUs are completed and none can be uploaded, the oldest one will be deleted so that a new WU can be downloaded.
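As a toy model only, the skipping behaviour described here could be sketched like this in Python. It is not the client's actual code: the ten-slot count is inferred from the "ten WUs" remark, and `next_slot` and its arguments are invented for illustration.

```python
NUM_SLOTS = 10  # assumption: slots 00 through 09, per the "ten WUs" remark

def next_slot(last_slot, pending):
    """Pick the slot for the next download.

    last_slot: slot of the WU that just completed.
    pending:   slots still waiting to upload, oldest first.
    Returns (slot, dropped), where dropped is the slot whose unsent
    result had to be discarded to make room, or None.
    """
    for step in range(1, NUM_SLOTS + 1):
        candidate = (last_slot + step) % NUM_SLOTS
        if candidate not in pending:
            return candidate, None
    # Every slot holds an unsent result: give up the oldest one.
    oldest = pending[0]
    return oldest, oldest

print(next_slot(4, [5]))
```

With slot 4 just finished and slot 5 still waiting to upload, the model lands on slot 6, matching the case above.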
Re: general questions about WUs
Posted: Thu Feb 26, 2009 5:57 pm
by MtM
Miscommunication, I think. I meant the above example, but with the corrupted queue.dat indicating to the client that slot 5 is empty (without having uploaded the wuresults_05.dat file). So the client will not continue to try to upload, as it thinks the slot is empty.
Would the client just use the slot, since the queue indicates the slot is empty? I would assume so. But what happens when that WU is done: will the new results file overwrite the old one, or will it error out and dismiss the just-folded WU because it doesn't want to overwrite an existing results file?
Re: general questions about WUs
Posted: Thu Feb 26, 2009 8:56 pm
by bruce
If the client thinks the slot is empty, technically, the wuresults file is already lost. We already talked about that. You can choose to use qfix but if you don't use it a new WU will eventually be downloaded because the slot is empty and queue.dat is not corrupt.
Re: general questions about WUs
Posted: Fri Feb 27, 2009 9:33 am
by MtM
OK, I was just wondering if it would lead to trouble if there is an existing results file which the client doesn't know about. It would be a shame if it started on that slot, proceeded to 100% and then errored out because for some reason it would not overwrite the existing results file.
Sorry for the confusion.