GPU WU stuck in assign wait prevents downloading of CPU WU

Moderators: Site Moderators, FAHC Science Team

arisu
Posts: 248
Joined: Mon Feb 24, 2025 11:11 pm

GPU WU stuck in assign wait prevents downloading of CPU WU

Post by arisu »

Has anyone else experienced this behavior? Is this a known issue?

When a CPU WU is running and a GPU WU is stuck waiting for an assignment, the client asks for a WU that fits a system with 0 CPUs and one GPU. If there's nothing available for that GPU, the request just repeats with exponential back-off. That is fine. But when the CPU WU finishes, the next assignment request should tell the server that the CPUs and a GPU are available. It doesn't: it still says there are 0 CPU cores free. So the client gets stuck indefinitely, asking for a GPU WU over and over and never for a CPU WU. The only way out is to pause and unpause.
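
For reference, the stuck unit's state and retry counter can also be watched over the client's local websocket API. A quick sketch (assuming the default web-control port 7396):

Code: Select all

#!/usr/bin/env python3
# Quick sketch: print each unit's state and retry count. The first message on
# the client's websocket is a full state snapshot containing a "units" list.
import json
import websocket

ws = websocket.create_connection('ws://localhost:7396/api/websocket')
snapshot = json.loads(ws.recv())
for unit in snapshot.get('units', []):
    print(unit.get('state'), unit.get('retries'))
ws.close()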

Here are the logs at log level 3 with debugging enabled. The log-domain levels:

Code: Select all

Group:d:1,App:d:0,SSLContext:d:5,Thread:i:5,RWLock:d:5,TarFileReader:d:5,Subprocess:d:0,SubprocessPool:d:5

Code: Select all

10:48:09:I1:TailFileToLog:WU330:Completed 245000 out of 250000 steps (98%)
10:48:58:I1:TailFileToLog:WU330:Completed 247500 out of 250000 steps (99%)
10:49:48:I1:TailFileToLog:WU330:Completed 250000 out of 250000 steps (100%)
10:49:49:I1:TailFileToLog:WU330:Saving result file ../logfile_01.txt
10:49:49:I1:TailFileToLog:WU330:Saving result file frame210.gro
10:49:49:I1:TailFileToLog:WU330:Saving result file frame210.xtc
10:49:49:I1:TailFileToLog:WU330:Saving result file md.log
10:49:49:I1:TailFileToLog:WU330:Saving result file science.log
10:49:49:I1:TailFileToLog:WU330:Saving result file state.cpt
10:49:49:I1:TailFileToLog:WU330:Folding@home Core Shutdown: FINISHED_UNIT
10:49:50:I1:Unit:WU330:Core returned FINISHED_UNIT (100)
10:49:50:D1:Group:Default:Remaining CPUs: 16, Remaining GPUs: 0, Active WUs: 1
10:49:50:I1:Unit:WU330:Uploading WU results
10:49:51:I1:Request:OUT29:> CONNECT highland1.seas.upenn.edu:443 HTTP/1.1
10:49:51:I1:Unit:WU330:UPLOAD 100% 85B of 85B
10:49:51:I1:Request:OUT29:> POST https://highland1.seas.upenn.edu/api/results HTTP/1.1
10:49:57:I1:Unit:WU330:UPLOAD 100% 18.80MiB of 18.80MiB
10:50:27:I1:Request:OUT29:< HTTP/1.1 200 HTTP_OK
10:50:27:I1:Unit:WU330:Credited
10:50:27:D3:Unit:WU330:Cleaning WU
10:50:27:D1:Group:Default:Remaining CPUs: 16, Remaining GPUs: 0, Active WUs: 1
10:53:12:I1:Unit:WU331:Requesting WU assignment for user Grimoire_of_Lolice team 230362
10:53:12:D3:Unit:WU331:{
10:53:12:D3:Unit:WU331:  "time": "2025-04-11T10:53:12Z",
10:53:12:D3:Unit:WU331:  "wu": 331,
10:53:12:D3:Unit:WU331:  "version": "8.4.9",
10:53:12:D3:Unit:WU331:  "id": "[removed]",
10:53:12:D3:Unit:WU331:  "user": "Grimoire_of_Lolice",
10:53:12:D3:Unit:WU331:  "team": 230362,
10:53:12:D3:Unit:WU331:  "passkey": "[removed]",
10:53:12:D3:Unit:WU331:  "os": {"version": "6.1", "type": "linux", "memory": 18570428416},
10:53:12:D3:Unit:WU331:  "project": {"cause": "parkinsons", "beta": true},
10:53:12:D3:Unit:WU331:  "resources": {
10:53:12:D3:Unit:WU331:    "cpu": {"cpu": "amd64", "cpus": 0, "vendor": "AuthenticAMD", "signature": 10948417, "family": 25, "model": 116, "stepping": 1, "features": 9149117666224241663, "extended": 18119809652725673, "80000001": 8485406217078045695},
10:53:12:D3:Unit:WU331:    "gpu:03:00:00": {
10:53:12:D3:Unit:WU331:      "gpu": "amd",
10:53:12:D3:Unit:WU331:      "vendor": 4098,
10:53:12:D3:Unit:WU331:      "device": 5567,
10:53:12:D3:Unit:WU331:      "opencl": {"platform": 0, "device": 0, "compute": "2.0", "driver": "3625.0"}
10:53:12:D3:Unit:WU331:    }
10:53:12:D3:Unit:WU331:  }
10:53:12:D3:Unit:WU331:}
10:53:12:I1:Request:OUT36:> CONNECT assign1.foldingathome.org:443 HTTP/1.1
10:53:13:I1:Request:OUT36:> POST https://assign1.foldingathome.org/api/assign HTTP/1.1
10:53:14:I1:Request:OUT36:< HTTP/1.1 503 HTTP_SERVICE_UNAVAILABLE
10:53:14:E :Request:OUT36:HTTP_SERVICE_UNAVAILABLE: {"error":"No appropriate assignment"}
10:53:14:I1:Unit:WU331:Retry #18 in 8 mins 32 secs
10:53:28:D3:Remote:msg: {"state": "pause", "cmd": "state", "time": "2025-04-11T10:53:28Z"}
10:53:28:I1:Groups:Machine state pause
10:53:28:D3:Unit:WU331:Cleaning WU
10:53:35:D3:Remote:msg: {"state": "fold", "cmd": "state", "time": "2025-04-11T10:53:35Z"}
10:53:35:I1:Groups:Machine state fold
10:53:35:D1:Group:Default:Remaining CPUs: 16, Remaining GPUs: 1, Active WUs: 0
10:53:35:I1:Group:Default:Added new work unit: cpus:16 gpus:gpu:03:00:00
10:53:35:D1:Group:Default:Remaining CPUs: 0, Remaining GPUs: 0, Active WUs: 1
10:53:35:I1:Unit:WU332:Requesting WU assignment for user Grimoire_of_Lolice team 230362
10:53:35:D3:Unit:WU332:{
10:53:35:D3:Unit:WU332:  "time": "2025-04-11T10:53:35Z",
10:53:35:D3:Unit:WU332:  "wu": 332,
10:53:35:D3:Unit:WU332:  "version": "8.4.9",
10:53:35:D3:Unit:WU332:  "id": "[removed]",
10:53:35:D3:Unit:WU332:  "user": "Grimoire_of_Lolice",
10:53:35:D3:Unit:WU332:  "team": 230362,
10:53:35:D3:Unit:WU332:  "passkey": "[removed]",
10:53:35:D3:Unit:WU332:  "os": {"version": "6.1", "type": "linux", "memory": 18399158272},
10:53:35:D3:Unit:WU332:  "project": {"cause": "parkinsons", "beta": true},
10:53:35:D3:Unit:WU332:  "resources": {
10:53:35:D3:Unit:WU332:    "cpu": {"cpu": "amd64", "cpus": 16, "vendor": "AuthenticAMD", "signature": 10948417, "family": 25, "model": 116, "stepping": 1, "features": 9149117666224241663, "extended": 18119809652725673, "80000001": 8485406217078045695},
10:53:35:D3:Unit:WU332:    "gpu:03:00:00": {
10:53:35:D3:Unit:WU332:      "gpu": "amd",
10:53:35:D3:Unit:WU332:      "vendor": 4098,
10:53:35:D3:Unit:WU332:      "device": 5567,
10:53:35:D3:Unit:WU332:      "opencl": {"platform": 0, "device": 0, "compute": "2.0", "driver": "3625.0"}
10:53:35:D3:Unit:WU332:    }
10:53:35:D3:Unit:WU332:  }
10:53:35:D3:Unit:WU332:}
10:53:35:I1:Request:OUT39:> CONNECT assign2.foldingathome.org:443 HTTP/1.1
10:53:35:I1:Request:OUT39:> POST https://assign2.foldingathome.org/api/assign HTTP/1.1
10:53:37:I1:Request:OUT39:< HTTP/1.1 200 HTTP_OK
10:53:37:I1:Unit:WU332:Received WU assignment rdgI9KPYZX_I5r9yCw9J-VdPg6LlkYE6IdLHQZbbZE4
10:53:37:D3:Unit:WU332:Received assignment for 16 cpus and 1 gpus
10:53:37:I1:Unit:WU332:Downloading WU
10:53:37:D1:Group:Default:Remaining CPUs: 0, Remaining GPUs: 0, Active WUs: 1
10:53:37:I1:Request:OUT40:> CONNECT vav21.fah.temple.edu:443 HTTP/1.1
10:53:38:I1:Request:OUT40:> POST https://vav21.fah.temple.edu/api/assign HTTP/1.1
10:53:40:I1:Unit:WU332:DOWNLOAD 20% 143.16KiB of 733.30KiB
10:53:41:I1:Unit:WU332:DOWNLOAD 68% 501.05KiB of 733.30KiB
10:53:42:I1:Request:OUT40:< HTTP/1.1 200 HTTP_OK
10:53:42:I1:Unit:WU332:Received WU P18496 R50 C3 G149

See how it says "Remaining CPUs: 16, Remaining GPUs: 0, Active WUs: 1" (even though there are no active WUs), and yet the request to the assignment server says "cpus": 0? But after pausing and unpausing, it changes to "Remaining CPUs: 16, Remaining GPUs: 1, Active WUs: 0", finally sends "cpus": 16 to the server, and picks up a CPU WU.

I'm trying to find out whether anyone else has been bitten by this before or whether there might be something odd about my setup. I lost about 14 hours of work because of it: once the CPU WU finished and uploaded, the client asked for a GPU WU over and over (always being denied, which is fine) and never asked for a CPU WU until I noticed and paused/unpaused.
arisu
Posts: 248
Joined: Mon Feb 24, 2025 11:11 pm

Re: GPU WU stuck in assign wait prevents downloading of CPU WU

Post by arisu »

Just in case anyone else is bitten by this too and it isn't just me, here is the temporary workaround I am using. It's a script that checks whether the client has only one WU, stuck in assign wait, and sends a pause and unpause command if so. It only works with one resource group and assumes that the WU can never genuinely be stuck in assign wait, so it's not a long-term solution.

Code: Select all

#!/usr/bin/env python3
# Pause and unpause the local client when its only WU is stuck in assign wait,
# forcing a fresh assignment request that reports the correct CPU count.
# Requires the websocket-client package and Python 3.11+ (for add_note).

import json
import datetime as dt

import websocket

host = 'localhost'
port = 7396


def new_state(state: str) -> str:
    """Build a JSON 'state' command ('pause' or 'fold') for the client."""
    return json.dumps({
        'state': state,
        'cmd': 'state',
        'time': dt.datetime.now(dt.timezone.utc).isoformat().replace('+00:00', 'Z')
    })


def restart_client(ws: websocket.WebSocket) -> None:
    """Pause and immediately resume folding."""
    ws.send(new_state('pause'))
    ws.send(new_state('fold'))


def iteration(ws: websocket.WebSocket) -> None:
    """Handle one update message from the client."""
    msg = json.loads(ws.recv())

    # Unit updates arrive as lists like ['units', <index>, <key>, <value>].
    if isinstance(msg, list) and msg[0] == 'units':
        if msg[2] == 'start_time':
            # A WU actually started running, so the restart worked.
            iteration.restarts = 0
        elif msg[1] == 0 and msg[2] == 'retries' and msg[3] >= 5:
            # The first (and only) unit keeps retrying its assignment request.
            if iteration.restarts >= 10:
                raise Exception('too many unsuccessful restart attempts')

            restart_client(ws)
            iteration.restarts += 1


try:
    try:
        ws = websocket.create_connection(f'ws://{host}:{port}/api/websocket')
    except (ConnectionRefusedError, websocket.WebSocketException) as e:
        e.add_note('client may not be running')
        raise

    # The first message is a full state snapshot.
    groups, units = map(json.loads(ws.recv()).get, ['groups', 'units'])

    if len(groups) != 1 or '' not in groups:
        raise Exception('multiple groups not supported')
    if groups['']['config']['cpus'] <= 2:
        raise Exception('more than two cpu threads must be configured')

    # Recover immediately if the client is already stuck when we connect.
    if len(units) == 1 and units[0]['state'] == 'ASSIGN' and units[0]['retries'] >= 5:
        restart_client(ws)

    iteration.restarts = 0
    while True:
        iteration(ws)
except KeyboardInterrupt:
    if 'ws' in locals() and ws.connected:
        ws.close(websocket.STATUS_NORMAL)
    exit(2)
except websocket.WebSocketConnectionClosedException:
    print('connection lost')
    exit(1)

If this is a known issue and it's not just me, then I'll write a bug report on GitHub.
muziqaz
Posts: 1531
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: GPU WU stuck in assign wait prevents downloading of CPU WU

Post by muziqaz »

Ehm, word of advice: please use the normal fahclient when posting questions here. Anything posted from a debug build of fahclient should go to GitHub, as the debug output is not meant for volunteers to decipher; it is there to help the devs debug.
It is hard enough to go through v8 logs as it is, and this log reads like a completely alien language with all the extra stuff in it.
When posting this on GitHub, please provide your client setup: whether you are using resource groups, etc.
FAH Omega tester
arisu
Posts: 248
Joined: Mon Feb 24, 2025 11:11 pm

Re: GPU WU stuck in assign wait prevents downloading of CPU WU

Post by arisu »

I enabled debug mode because it shows information that is otherwise hidden and is useful here. I can turn it off, it would just mean less information. I guess without the color coding it does make the logs a lot harder to skim. :lol:

I was hoping someone here could say whether they had run into the same issue before I write anything on GitHub. No resource groups other than the default, btw. All 16 threads enabled, GPU enabled.

Non-debug version:

Code: Select all

10:48:09:I1:TailFileToLog:WU330:Completed 245000 out of 250000 steps (98%)
10:48:58:I1:TailFileToLog:WU330:Completed 247500 out of 250000 steps (99%)
10:49:48:I1:TailFileToLog:WU330:Completed 250000 out of 250000 steps (100%)
10:49:49:I1:TailFileToLog:WU330:Saving result file ../logfile_01.txt
10:49:49:I1:TailFileToLog:WU330:Saving result file frame210.gro
10:49:49:I1:TailFileToLog:WU330:Saving result file frame210.xtc
10:49:49:I1:TailFileToLog:WU330:Saving result file md.log
10:49:49:I1:TailFileToLog:WU330:Saving result file science.log
10:49:49:I1:TailFileToLog:WU330:Saving result file state.cpt
10:49:49:I1:TailFileToLog:WU330:Folding@home Core Shutdown: FINISHED_UNIT
10:49:50:I1:Unit:WU330:Core returned FINISHED_UNIT (100)
10:49:50:I1:Unit:WU330:Uploading WU results
10:49:51:I1:Request:OUT29:> CONNECT highland1.seas.upenn.edu:443 HTTP/1.1
10:49:51:I1:Unit:WU330:UPLOAD 100% 85B of 85B
10:49:51:I1:Request:OUT29:> POST https://highland1.seas.upenn.edu/api/results HTTP/1.1
10:49:57:I1:Unit:WU330:UPLOAD 100% 18.80MiB of 18.80MiB
10:50:27:I1:Request:OUT29:< HTTP/1.1 200 HTTP_OK
10:50:27:I1:Unit:WU330:Credited
10:53:12:I1:Unit:WU331:Requesting WU assignment for user Grimoire_of_Lolice team 230362
10:53:12:I1:Request:OUT36:> CONNECT assign1.foldingathome.org:443 HTTP/1.1
10:53:13:I1:Request:OUT36:> POST https://assign1.foldingathome.org/api/assign HTTP/1.1
10:53:14:I1:Request:OUT36:< HTTP/1.1 503 HTTP_SERVICE_UNAVAILABLE
10:53:14:E :Request:OUT36:HTTP_SERVICE_UNAVAILABLE: {"error":"No appropriate assignment"}
10:53:14:I1:Unit:WU331:Retry #18 in 8 mins 32 secs
10:53:28:I1:Groups:Machine state pause
10:53:35:I1:Groups:Machine state fold
10:53:35:I1:Group:Default:Added new work unit: cpus:16 gpus:gpu:03:00:00
10:53:35:I1:Unit:WU332:Requesting WU assignment for user Grimoire_of_Lolice team 230362
10:53:35:I1:Request:OUT39:> CONNECT assign2.foldingathome.org:443 HTTP/1.1
10:53:35:I1:Request:OUT39:> POST https://assign2.foldingathome.org/api/assign HTTP/1.1
10:53:37:I1:Request:OUT39:< HTTP/1.1 200 HTTP_OK
10:53:37:I1:Unit:WU332:Received WU assignment rdgI9KPYZX_I5r9yCw9J-VdPg6LlkYE6IdLHQZbbZE4
10:53:37:I1:Unit:WU332:Downloading WU
10:53:37:I1:Request:OUT40:> CONNECT vav21.fah.temple.edu:443 HTTP/1.1
10:53:38:I1:Request:OUT40:> POST https://vav21.fah.temple.edu/api/assign HTTP/1.1
10:53:40:I1:Unit:WU332:DOWNLOAD 20% 143.16KiB of 733.30KiB
10:53:41:I1:Unit:WU332:DOWNLOAD 68% 501.05KiB of 733.30KiB
10:53:42:I1:Request:OUT40:< HTTP/1.1 200 HTTP_OK
10:53:42:I1:Unit:WU332:Received WU P18496 R50 C3 G149

The log-domain prefix on each line (like "TailFileToLog") isn't debug output, btw.
muziqaz
Posts: 1531
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: GPU WU stuck in assign wait prevents downloading of CPU WU

Post by muziqaz »

Internally we have not run into this issue, no.

Your log shows that you pause and resume almost immediately. Have you given the CPU a bit more time to actually get itself together, clean up, shave, have a shower, have a cup of coffee, and then contact the server? Sometimes it takes all of these steps for the client to start doing work again.
Is this within a resource group or in simple mode?

I think I can create an environment like this with one of my PCs, stand by.
sh*t I can't without resource groups
:(
FAH Omega tester
arisu
Posts: 248
Joined: Mon Feb 24, 2025 11:11 pm

Re: GPU WU stuck in assign wait prevents downloading of CPU WU

Post by arisu »

The fast pause and resume is intentional (actually it's from a script). It temporarily resolves the issue. But without the pause/resume, it continuously looks for a 0 CPU, 1 GPU WU.

Maybe a way to reproduce it is to spoof a GPU that will get no work. My guess is that this would do it:

- Spoof (or use) a GPU that won't get work
- Start a CPU WU
- Wait for CPU WU to finish (while the client keeps trying in vain to get a GPU project)

After the CPU WU finishes, the client keeps waiting 8 minutes 32 seconds (512 seconds, the maximum exponential back-off) between requests, and it only asks the server for GPU WUs, reporting the number of available CPUs as 0 when it should be reporting the maximum. It SHOULD be asking for any WU that fits into the "cpu:16 gpu:1" resources. It's as if it thinks the CPU WU is still running even though it's not.
muziqaz wrote: Sat Apr 12, 2025 9:30 am I think I can create environment like this with one of my PCs, stand by.
sh*t I can't without resource groups
:(
Create resource groups and then delete them so the WUs get moved back to the default group? But I don't think resource groups are necessary to replicate this. If anything, resource groups might be a workaround, just not a good one for me: I would have to either always fold with one fewer thread than I have, even if there's no GPU WU for weeks, or write a script that checks whether a GPU WU is running and, if so, reduces the CPU WU thread count by one (see the sketch below).
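
To be concrete about what that helper would look like, here is a rough sketch. The {"cmd": "config", ...} message shape is only my assumption, modeled on the "state" command my restart script already uses, and so is reading each unit's assigned GPUs from a "gpus" key; I haven't verified either.

Code: Select all

#!/usr/bin/env python3
# Rough sketch: fold with one fewer CPU thread while a GPU WU is running.
# ASSUMPTIONS (unverified): the client accepts {"cmd": "config", ...} messages
# shaped like the {"cmd": "state", ...} one, and each unit entry in the state
# snapshot lists its assigned GPUs under a "gpus" key.
import json
import datetime as dt
import websocket

MAX_CPUS = 16  # threads to fold with when no GPU WU is running

def set_cpus(ws: websocket.WebSocket, cpus: int) -> None:
    ws.send(json.dumps({
        'cmd': 'config',           # assumed command name
        'config': {'cpus': cpus},  # assumed payload shape
        'time': dt.datetime.now(dt.timezone.utc).isoformat().replace('+00:00', 'Z')
    }))

ws = websocket.create_connection('ws://localhost:7396/api/websocket')
snapshot = json.loads(ws.recv())  # first message is a full state snapshot

# Treat any unit that has a GPU assigned to it as a running GPU WU.
gpu_busy = any(unit.get('gpus') for unit in snapshot.get('units', []))
set_cpus(ws, MAX_CPUS - 1 if gpu_busy else MAX_CPUS)
ws.close()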
muziqaz
Posts: 1531
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: GPU WU stuck in assign wait prevents downloading of CPU WU

Post by muziqaz »

You know that one thread plus or minus changes absolutely nothing in the final PPD output. Reducing threads allows the CPU to boost higher on the remaining threads, which offsets the lost threads. Yes, you might lose some PPD if you cut the thread count in half, but 1-2 threads here or there make no difference.
Now, if you don't mind, please disable the script and let it run to generate a logfile which you can show to the dev on GitHub.
Telling our dev that things don't work without proof will not get him to take it seriously :)

To be fair, posting this issue on GitHub would be your next step anyway, as you will not get anywhere with it here, regardless of whether anyone comes in and confirms they have had this issue before or not :)
FAH Omega tester
arisu
Posts: 248
Joined: Mon Feb 24, 2025 11:11 pm

Re: GPU WU stuck in assign wait prevents downloading of CPU WU

Post by arisu »

I've got plenty of logs that show it, more than 14 hours' worth in fact. What I posted would otherwise have just kept repeating every 8:32. I'll show that on GitHub, but first I'm going to see if I can fix the bug myself so I can get rid of the hacky script.