Weird "exception: bad allocation" behaviour

Posted: Wed Dec 17, 2025 6:52 am
by Schrödinger's cat
My GPU has had at least a couple of these problems:

Code:

04:35:08:WU00:FS00:0x27:Completed 0 out of 250000 steps (0%)
04:35:13:WU00:FS00:0x27:Checkpoint completed at step 0
04:35:24:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:35:48:WU00:FS00:0x27:Completed 2500 out of 250000 steps (1%)
04:36:08:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:36:26:WU00:FS00:0x27:Completed 5000 out of 250000 steps (2%)
04:36:46:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:37:01:WU00:FS00:0x27:Completed 7500 out of 250000 steps (3%)
04:37:35:WU00:FS00:0x27:Completed 10000 out of 250000 steps (4%)
04:37:56:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:38:09:WU00:FS00:0x27:Completed 12500 out of 250000 steps (5%)
04:38:16:WU00:FS00:0x27:Checkpoint completed at step 12500
04:38:50:WU00:FS00:0x27:Completed 15000 out of 250000 steps (6%)
04:39:24:WU00:FS00:0x27:Completed 17500 out of 250000 steps (7%)
04:39:40:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:39:58:WU00:FS00:0x27:Completed 20000 out of 250000 steps (8%)
04:40:14:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:40:32:WU00:FS00:0x27:Completed 22500 out of 250000 steps (9%)
04:40:48:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:41:06:WU00:FS00:0x27:Completed 25000 out of 250000 steps (10%)
04:41:14:WU00:FS00:0x27:Checkpoint completed at step 25000
04:41:34:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:41:48:WU00:FS00:0x27:Completed 27500 out of 250000 steps (11%)
04:42:10:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:42:23:WU00:FS00:0x27:Completed 30000 out of 250000 steps (12%)
04:42:57:WU00:FS00:0x27:Completed 32500 out of 250000 steps (13%)
04:43:32:WU00:FS00:0x27:Completed 35000 out of 250000 steps (14%)
04:44:06:WU00:FS00:0x27:Completed 37500 out of 250000 steps (15%)
04:44:13:WU00:FS00:0x27:Checkpoint completed at step 37500
04:44:48:WU00:FS00:0x27:Completed 40000 out of 250000 steps (16%)
04:45:22:WU00:FS00:0x27:Completed 42500 out of 250000 steps (17%)
04:45:39:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:45:57:WU00:FS00:0x27:Completed 45000 out of 250000 steps (18%)
04:46:14:ERROR:std::exception: bad allocation
04:46:21:ERROR:WU00:FS00:std::exception: bad allocation
04:46:23:ERROR:Receive error: 10053: An established connection was aborted by the software in your host machine.
04:46:30:WU00:FS00:0x27:Completed 47500 out of 250000 steps (19%)
Note that the problem just disappeared after 19%.

What I find confusing is that folding wasn't restarted after the bad allocation exception. Does anyone understand what's going on? Could it be related to RAM, since my Folding@home data is saved to a RAM disk?

Re: Weird "exception: bad allocation" behaviour

Posted: Wed Dec 17, 2025 8:33 am
by muziqaz
RAM disk?
Please use conventional storage devices for FAH.

Re: Weird "exception: bad allocation" behaviour

Posted: Wed Dec 17, 2025 1:17 pm
by Schrödinger's cat
muziqaz wrote: Wed Dec 17, 2025 8:33 am RAM disk?
Please use conventional storage devices for FAH.
I've been using a RAM disk for years without any issues.

Only two things have changed. One is my GPU, which can produce viewerFrame.json files that are bigger than 32MB, so I increased the size of the RAM disk. The other is Dynamic Memory Allocation, which uses only as much RAM as the data on the disk requires at any time. The RAM disk data is saved to my SSD before rebooting or shutting down.

I have only discovered the error twice, and it appears to be random.

Re: Weird "exception: bad allocation" behaviour

Posted: Wed Dec 17, 2025 4:49 pm
by muziqaz
Schrödinger's cat wrote: Wed Dec 17, 2025 1:17 pm
muziqaz wrote: Wed Dec 17, 2025 8:33 am RAM disk?
Please use conventional storage devices for FAH.
I've been using a RAM disk for years without any issues.

The only thing that has changed is my GPU, which can produce viewerFrame.json files that are bigger than 32MB, for which I increased the size of the RAM disk, and Dynamic Memory Allocation, which only uses as much RAM as the data on the disk requires, at any time. The RAM disk data is saved to my SSD before rebooting or shutting down.

I have only discovered the error twice, and it appears to be random.
Years or decades, it doesn't matter. You are still using a volatile medium as storage for a scientific application that requires total stability. The fact that you had to increase the RAM disk size because of FAH visualisation tells me you are running it at the bleeding edge of "just about OK".

Re: Weird "exception: bad allocation" behaviour

Posted: Thu Dec 18, 2025 4:53 am
by Schrödinger's cat
muziqaz wrote: Wed Dec 17, 2025 4:49 pm The fact that you had to increase RAM disk size because of fah visualisation tells me you are running it at bleeding edge, of "just about ok".
Hardly. I've seen viewerFrame.json files of 11 and 32 KB too.

Re: Weird "exception: bad allocation" behaviour

Posted: Thu Dec 18, 2025 9:53 am
by PaulTV
I see the bad allocation messages frequently, but only on jobs for project 18261. Which may or may not be related. To be sure I double checked my SSD health, and it is 100% fine.

Re: Weird "exception: bad allocation" behaviour

Posted: Thu Dec 18, 2025 10:59 am
by muziqaz
PaulTV wrote: Thu Dec 18, 2025 9:53 am I see the bad allocation messages frequently, but only on jobs for project 18261. Which may or may not be related. To be sure I double checked my SSD health, and it is 100% fine.
Project 18261 requires 10GB of free RAM before the client will be assigned a work unit from it. Once the work unit is received, if free RAM dips below that requirement, you might get issues. This also reminds me that page file allocation is important too. Anyone who has the page file restricted to a specific size might run into issues.

Re: Weird "exception: bad allocation" behaviour

Posted: Thu Dec 18, 2025 11:14 am
by PaulTV
muziqaz wrote: Thu Dec 18, 2025 10:59 am
PaulTV wrote: Thu Dec 18, 2025 9:53 am I see the bad allocation messages frequently, but only on jobs for project 18261. Which may or may not be related. To be sure I double checked my SSD health, and it is 100% fine.
18261 requires 10GB of free RAM to get a request for it. Once received, if free RAM dips below required, you might get issues. This also reminds me, page file allocation is also important. Anyone who has it restricted to some specific value might run into issues
I (also) see those messages at times when I'm peacefully asleep and nothing else is running on the machine. Windows 11, 32 GB, 4 TB hard disk with plenty of free space and no special swap config. As only jobs for this project give those messages, and I don't see other strange behaviour, I don't expect it to be a hardware issue (like bad memory). It might be a glitch somewhere, and bad allocation messages are on my to-ignore list... But given this thread was started because of those bad allocation messages, I chipped in :)

Re: Weird "exception: bad allocation" behaviour

Posted: Thu Dec 18, 2025 12:16 pm
by muziqaz
Oh, this is the v7 client. I think you will see more and more random things stop behaving normally, since that client is no longer actively developed or supported. The infrastructure is being fine-tuned for the v8 client. If things work on v7, they work, but if they don't, no effort is going to be put into fixing them.

Re: Weird "exception: bad allocation" behaviour

Posted: Sat Dec 20, 2025 3:48 pm
by Schrödinger's cat
I can confirm that only 18261 work units cause this, and do so once every work unit:

Code:

ERROR:std::exception: bad allocation
ERROR:WU00:FS00:std::exception: bad allocation
Project: 18261 (Run 155, Clone 2, Gen 22) is currently running and creating 31.7MB viewerFrame.json files, and using 11GB of RAM.

I've also seen a work unit use more than 15GB of RAM when first starting. I'm unsure if that was an 18261 work unit.
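
For anyone who wants to check the "once every work unit" observation against a saved log, here is a rough sketch. It assumes the slot-level error always looks like the second log line quoted above (an ERROR:WU prefix followed by the bad allocation text). The function name is made up for illustration and is not part of any FAH tool.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Count slot-level "bad allocation" errors in a block of log text.
// Each incident in the quoted log produces two lines: one without a WU prefix
// and one with it. Only the WU-prefixed line is counted here, so the result
// is the number of incidents rather than the number of error lines.
std::size_t count_slot_bad_allocations(const std::string& log_text) {
    const std::string wu_error = ":ERROR:WU";                       // assumed prefix
    const std::string message  = "std::exception: bad allocation";  // text from the log
    std::istringstream lines(log_text);
    std::string line;
    std::size_t count = 0;
    while (std::getline(lines, line)) {
        if (line.find(wu_error) != std::string::npos &&
            line.find(message) != std::string::npos) {
            ++count;
        }
    }
    return count;
}
```

Feeding it the text of one work unit's log section should return 1 if the error really occurs once per work unit.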

Re: Weird "exception: bad allocation" behaviour

Posted: Sat Dec 20, 2025 4:30 pm
by muziqaz
Schrödinger's cat wrote: Sat Dec 20, 2025 3:48 pm I can confirm that only 18261 work units cause this, and do so every time:

Code:

ERROR:std::exception: bad allocation
ERROR:WU00:FS00:std::exception: bad allocation
Project: 18261 (Run 155, Clone 2, Gen 22) is currently running and creating 31.7MB viewerFrame.json files, and using 11GB of RAM.

I've also seen a work unit use more than 15GB of RAM when first starting. I'm unsure if that was an 18261 work unit.
Yes, RAM usage peaks when WUs are starting. Then, when finishing a WU, it comes close to that peak again.

v8.5.5 curbs viewerFrame.json file sizes, too. This is to prevent a memory leak in the client.

Re: Weird "exception: bad allocation" behaviour

Posted: Mon Dec 22, 2025 6:27 pm
by toTOW

Code:

04:46:14:ERROR:std::exception: bad allocation
04:46:21:ERROR:WU00:FS00:std::exception: bad allocation
This usually means there's not enough system RAM and/or virtual memory available.
It could also be caused by memory leaks in the client, which affect the biggest projects like 18261.

Those leaks are fixed in v8.5 clients.
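
For readers unfamiliar with the message: "bad allocation" is the text that Microsoft's C++ standard library returns from std::bad_alloc::what(), the exception thrown when operator new cannot obtain memory, so the log line most likely comes from a failed allocation of that kind. The sketch below is not taken from the FAH client. It only shows how such an exception arises and is caught, and try_allocate is a hypothetical helper name.

```cpp
#include <cstddef>
#include <limits>
#include <new>
#include <string>

// Try to allocate `bytes` bytes and free them again.
// Returns an empty string on success, or the exception's message on failure.
std::string try_allocate(std::size_t bytes) {
    try {
        void* block = ::operator new(bytes);  // throws std::bad_alloc if memory is unavailable
        ::operator delete(block);
        return "";
    } catch (const std::bad_alloc& e) {
        // The message is implementation-defined: MSVC reports "bad allocation",
        // while libstdc++ (GCC) reports "std::bad_alloc".
        return e.what();
    }
}

// A request so large that no machine can satisfy it, used to force the failure.
std::size_t impossible_request() {
    return std::numeric_limits<std::size_t>::max() / 2;
}
```

Calling try_allocate(1024) should return an empty string, while try_allocate(impossible_request()) should return the library's message. A program that catches the exception like this can carry on afterwards, which may be why folding continued after the error instead of restarting.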