Blasphemous Cannibal wrote:Kris,
Cool, I'm pleased. New version installed, cheers
Can you explain in layman terms what's happening?
One of Langouste's goals is to kick in only when necessary.
Once the client is done processing a WU, we want to decouple the upload and download processes.
However, there's no need for Langouste to kick in when the client attempts to return a WU
from the "autosend" context, i.e. when it's already folding some other WU.
There's a function that attempts to determine the number of FahCores running
"under" the client that's contacting Langouste.
If the client is running one or more FahCores, Langouste acts as a pass-through;
if it's not running any -- Langouste rejects the connection, creates a copy
of the client and attempts to return results with the copy (the usual thing).
Common operating systems let you look up the parent process ID of a
given process, but there's no straightforward way to enumerate its child processes.
Langouste does it backwards. It creates a list of FahCore_* processes and,
for every FahCore_ process, walks over the parent processes towards the "root"
of the process tree. Note that there might be FahCore_ processes that belong to
another client -- we don't want to count those.
Having said that -- if, during the walk, Langouste encounters the client that's
contacting Langouste then the client is returning a WU from "autosend" context
and there's no need to invoke Langouste's main feature. If Langouste reaches
the root then the FahCore belongs to another client (hence the message:
Code:
[1282933431.640625] FahCore: FahCore_a3.exe (PID 2084) appears to belong to another client, continuing.
even though it's the GPU client that was returning a WU at the time).
The walk is implemented as a loop that exits the moment the matching client
is found or the root is reached.
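The walk described above can be sketched roughly as follows. This is only an illustration in Python, not Langouste's actual code: the PIDs, the `PARENT_OF` snapshot and the function names are all made up, and a real implementation would query the operating system for parent PIDs instead of reading a dictionary.

```python
# Hypothetical snapshot: child PID -> parent PID (illustrative values only).
PARENT_OF = {
    100: 1,    # client A, started from the root process (PID 1 in this sketch)
    110: 100,  # wrapper process (e.g. mpiexec) started by client A
    120: 110,  # FahCore started by the wrapper
    200: 1,    # client B
    210: 200,  # FahCore started by client B
}
FAHCORE_PIDS = [120, 210]


def belongs_to(fahcore_pid, client_pid, parent_of):
    """Walk from a FahCore towards the root; True if client_pid is an ancestor."""
    pid = parent_of.get(fahcore_pid)
    while pid is not None:      # None: no known parent, i.e. we walked past the root
        if pid == client_pid:
            return True         # matching client found, stop walking
        pid = parent_of.get(pid)
    return False                # root reached: the FahCore belongs to another client


def count_fahcores(client_pid, fahcore_pids, parent_of):
    """Number of FahCores running "under" the given client."""
    return sum(belongs_to(pid, client_pid, parent_of) for pid in fahcore_pids)
```

With this snapshot, `count_fahcores(100, FAHCORE_PIDS, PARENT_OF)` gives 1 (pass-through), while a client with no FahCores underneath gives 0 (the usual decoupling). Like the loop it illustrates, this sketch assumes that the parent links never form a cycle.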
The problem is that some processes on your machine appear to (occasionally*)
have circular parent-child dependencies (you never reach the root == infinite loop).
*) that's what's really boggling my mind -- from your last log:
Good:
Code:
[1282934045.500000] NtQueryInformationProcess() for pid 1548 failed: 3221225480 (0xc0000008). This _may_ be an error.
[1282934045.500000] Backlog:
[1282934045.500000] 1548
[1282934045.500000] 1632
[1282934045.500000] 3676
[1282934045.500000] 2084
[1282934045.500000] ================
[1282934045.500000] FahCore: FahCore_a3.exe (PID 2084) appears to belong to another client, continuing.
Notes:
1. Error 0xC0000008 (STATUS_INVALID_HANDLE) means that process 1548 is no longer there.
2. PID 2084 is FahCore_a3.exe.
Bad:
Code:
(...)
[1282934049.031250] 3260
[1282934049.031250] 1548
[1282934049.031250] 1632
[1282934049.031250] 2720
[1282934049.031250] 4032
[1282934049.031250] 3260
[1282934049.031250] 1548
[1282934049.031250] 1632
[1282934049.031250] 2720
[1282934049.031250] 4032
[1282934049.031250] 3260
[1282934049.031250] 1548
[1282934049.031250] 1632
[1282934049.031250] 3676
[1282934049.031250] 2084
[1282934049.031250] ================
[1282934049.031250] FahCore: FahCore_a3.exe (PID 2084) appears to belong to another client, continuing.
About four seconds later, process 1548 is suddenly alive and has a parent (3260), which has a parent (4032), which has a parent (2720), which has a parent (1632), whose parent is 1548... whoops.
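To make the loop easier to see, here is a small Python sketch that replays the parent links read off the "Bad" backlog and stops as soon as a PID shows up twice. The function name `walk_to_root` and the "visited set" guard are my illustration of one way to detect the cycle, not a description of Langouste's code.

```python
# Parent links as read off the "Bad" backlog above (child PID -> parent PID).
PARENT_OF = {
    2084: 3676,  # FahCore_a3.exe
    3676: 1632,
    1632: 1548,
    1548: 3260,
    3260: 4032,
    4032: 2720,
    2720: 1632,  # closes the loop
}


def walk_to_root(start_pid, parent_of):
    """Follow parent links from start_pid; return (path, cycle_pid_or_None)."""
    seen = set()
    path = []
    pid = start_pid
    while pid is not None:
        if pid in seen:
            return path, pid   # already visited: the chain is circular
        seen.add(pid)
        path.append(pid)
        pid = parent_of.get(pid)
    return path, None          # ran out of parents: root reached


path, cycle = walk_to_root(2084, PARENT_OF)
print(path, cycle)
```

For the links above the walk visits 2084, 3676, 1632, 1548, 3260, 4032, 2720 and then reports 1632 as the PID where the chain bites its own tail, instead of spinning forever.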
Now, Windows differs from Linux in two ways that are relevant to the issue at hand:
1. PIDs are allocated using a non-obvious algorithm (Windows) rather than sequentially (Linux); for instance, sequential creation of 10000 processes on an (otherwise idle) Windows 7 64-bit system results in the use of only ~260 unique PIDs (on Linux it's 10000).
2. There's no "mother-of-all-processes"; in other words, a parent PID may point to a long-dead process**; that is not possible on Linux -- if your parent dies, "init" (PID 1) becomes your parent.
**) which (the PID) may (in the future) be allocated to a totally unrelated process
Combined, those differences contribute to the probability of a PID collision that would result in the loop you've witnessed.
I have no explanation for the high reproducibility of the issue on _your_ systems though.
I'll try to recreate the collision manually, just to see how hard it is...
On a side note -- one could argue that there's no need to walk the complete process tree and that checking the immediate parent (of a FahCore)
should be enough. That might be true now, but when Langouste came to life (mpiexec et al.) it was not... Anyway, I would rather work
around the issue and improve the code than strip it. It might come in handy in the future.
Whew. Hope this helps.
Blasphemous Cannibal wrote:bc3 has fallen over on the XP rig again & the Win 7 machine are you interested in these logs still?
BC3 already served its purpose. But thank you.
Blasphemous Cannibal wrote:Let me know how you get on with the GTX 260, I'll consider providing remote access if necessary. Oh & enjoy Starcraft 2
Will do. And will do
Thanks,
Kris