WU logged as failed but credited?

Moderators: Site Moderators, FAHC Science Team

Post Reply
arisu
Posts: 159
Joined: Mon Feb 24, 2025 11:11 pm

WU logged as failed but credited?

Post by arisu »

Project 18240 (Run 350, Clone 2, Gen 179) was finished and credited (including in the credits directory), but is logged as failed in my client database:

Code: Select all

sqlite> SELECT value FROM wu_log WHERE name IS '4zO4HWMjIQSk-IXkVNUU9eReI68maRhA8n53XWljCcE';
{
  "group": "",
  "number": 239,
  "cpus": 7,
  "gpus": [],
  "state": "UPLOAD",
  "paused": false,
  "id": "4zO4HWMjIQSk-IXkVNUU9eReI68maRhA8n53XWljCcE",
  "progress": 1,
  "assignment": {
    "time": "2025-03-31T13:11:17Z",
    "ws": "highland1.seas.upenn.edu",
    "port": 443,
    "project": 18240,
    "deadline": 432000,
    "timeout": 172800,
    "credit": 2733,
    "cpus": 7,
    "min_cpus": 1,
    "max_cpus": 64,
    "core": {"type": "0xa8", "url": "https://cores2.foldingathome.org/fahcore-a8-lin-64bit-avx2_256-0.0.12.tar.bz2", "sha256": "5d81e2df84fb0857e9804ce4c263e412d632b0e7c65bada542b608f1f8a2f204"}
  },
  "wu": {
    "time": "2025-03-31T13:11:17Z",
    "cs": ["158.130.118.25", "vav24.fah.temple.edu"],
    "server": 2659350039,
    "run": 350,
    "clone": 2,
    "gen": 179,
    "sha256": "bE9vlXiEon3lue9KkHtaaAKYkhZr/x88jkTvUFMuEso="
  },
  "wu_progress": 1,
  "eta": "0s",
  "ppd": 167520,
  "frames": 101,
  "run_time": 8628,
  "end_time": "2025-03-31T22:24:48Z",
  "result": "failed"
}
But it is logged by the server as having been credited correctly:

Code: Select all

$ curl https://api.foldingathome.org/project/18240/run/350/clone/2/gen/179 | python3 -m json.tool
[
    {
        "user": "Grimoire_of_Lolice",
        "team": 230362,
        "cpuid": "456F3C0C9570FC3A",
        "credit": 8589.82,
        "credit_time": "2025-03-31 22:21:22",
        "log_time": "2025-03-31 22:17:52",
        "assign_time": "2025-03-31 13:11:17",
        "days": 0.3796,
        "code": "Ok"
    }
]
According to the log, it tried to upload, but never got the message "credited" (messages from other WUs stripped):

Code: Select all

15:37:47:I1:WU239:Completed 242500 out of 250000 steps (97%)
15:39:20:I1:WU239:Completed 245000 out of 250000 steps (98%)
15:40:50:I1:WU239:Completed 247500 out of 250000 steps (99%)
15:42:19:I1:WU239:Completed 250000 out of 250000 steps (100%)
15:42:23:I1:WU239:Saving result file ../logfile_01.txt
15:42:23:I1:WU239:Saving result file frame179.gro
15:42:23:I1:WU239:Saving result file frame179.xtc
15:42:23:I1:WU239:Saving result file md.log
15:42:23:I1:WU239:Saving result file science.log
15:42:23:I1:WU239:Saving result file state.cpt
15:42:23:I1:WU239:Folding@home Core Shutdown: FINISHED_UNIT
15:42:23:I1:WU239:Core returned FINISHED_UNIT (100)
15:42:24:I1:WU239:Uploading WU results
15:42:24:I1:WU241:Requesting WU assignment for user Grimoire_of_Lolice team 230362
15:42:24:I1:WU239:UPLOAD 100% 85B of 85B
15:42:25:I2:OUT143:> POST /api/results HTTP/1.1
There are no more messages for that WU or connection in that log file. No success or error. No timeouts. It's like the connection just hung. I checked the client state with lufah and saw it stuck in an uploading state, so I restarted the client. The next log (messages from other WUs stripped):

Code: Select all

22:23:51:I1:WU239:Uploading WU results
22:23:52:I2:OUT1:> CONNECT highland1.seas.upenn.edu:443 HTTP/1.1
22:23:52:I1:WU239:UPLOAD 100% 85B of 85B
22:24:13:I2:OUT1:> POST /api/results HTTP/1.1
22:24:14:I1:WU239:UPLOAD 44% 8.25MiB of 18.82MiB
22:24:15:I1:WU239:UPLOAD 74% 13.88MiB of 18.82MiB
22:24:15:I1:WU239:UPLOAD 100% 18.81MiB of 18.82MiB
22:24:48:I2:OUT1:< HTTP/1.1 200 HTTP_OK
22:24:48:I1:WU239:Credited
Local log time says it was credited at 22:24 and server time says it was credited at 22:21. Close enough (my clock is probably fast). That's 7 hours of being stuck in state "uploading" and 7 hours of delay for a WU that took 2 hours to complete. Two questions:

1. Is this a client bug, where a connection can hang indefinitely with no timeout, and subsequent restart causes a false "failed"?
2. Because it clearly is credited, is it safe for me to edit the database to change the state? Like this (the query is correct):

Code: Select all

UPDATE wu_log SET value = REPLACE(value, "failed", "ok") WHERE name IS '4zO4HWMjIQSk-IXkVNUU9eReI68maRhA8n53XWljCcE';
Client version is 8.4.10 but there are no problematic changes to the networking code between that and 8.4.9.
muziqaz
Posts: 1408
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: WU logged as failed but credited?

Post by muziqaz »

GitHub.com ;)
FAH Omega tester
Image
arisu
Posts: 159
Joined: Mon Feb 24, 2025 11:11 pm

Re: WU logged as failed but credited?

Post by arisu »

I guess that answers question #1! I really should sign up to Github... Been avoiding it for philosophical reasons, but it's best I put that behind me.

I assume question #2 can be answered in the affirmative. It looks pretty clear-cut but I just wanted to make sure.
Joe_H
Site Admin
Posts: 8073
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: WU logged as failed but credited?

Post by Joe_H »

There has been a long standing problem with ACKs using HTTP or HTTPS and the clients over the years. The basic problem is that many ISPs consider them less important and may drop them going through their routers, especially under high network loads. In a browser you just hit refresh if something doesn't fully load on a web page. But the client has to watch for dropped ACKs. Most of the time it works, but then every so often there is a failure to detect this by either the client or the server and the connection may hang. Best I can say is that it happens to hang less often than it used to.
Image
arisu
Posts: 159
Joined: Mon Feb 24, 2025 11:11 pm

Re: WU logged as failed but credited?

Post by arisu »

Joe_H wrote: Tue Apr 01, 2025 6:36 am There has been a long standing problem with ACKs using HTTP or HTTPS and the clients over the years. The basic problem is that many ISPs consider them less important and may drop them going through their routers, especially under high network loads. In a browser you just hit refresh if something doesn't fully load on a web page. But the client has to watch for dropped ACKs. Most of the time it works, but then every so often there is a failure to detect this by either the client or the server and the connection may hang. Best I can say is that it happens to hang less often than it used to.
Something similar has happened with downloads, though that never caused a database bug like what might be happening here. By disconnecting and reconnecting to the network, the TCP connection breaks and the client receives EOF and retries.

I bet cbang is what handles the networking, so the solution would probably be to add to net/Socket.cpp something like:

Code: Select all

unsigned timeout = 30000; // 30 seconds
setsockopt(socket, SOL_TCP, TCP_USER_TIMEOUT, (void *)&timeout, sizeof(timeout));
The code already sets (or at least supports setting) SO_SNDTIMEO but maybe that's not enough? I just skimmed.

But yeah, github. :)
Post Reply