overheating
Posted: Sun Oct 12, 2025 12:08 pm
by jsm
Now that I have added the 32-core Threadripper (Ubuntu) to the account (viz. the entry re removing polkit), it fell over overnight. I had to power cycle it to get it back up. Suspecting overheating, I dialled the CPUs back to 32 threads, so half of those available, and started running watch sensors every 2 seconds.
Even at this reduced cpu count the temps are in the 90C range.
How do I control the operation so that the system does not trip, please?
Re: overheating
Posted: Sun Oct 12, 2025 1:43 pm
by toTOW
Improve your system cooling.
Re: overheating
Posted: Sun Oct 12, 2025 9:16 pm
by jsm
I posted a reply straight away but it must have disappeared, so I am reposting.
Noted re cooling: however this system has 5 clean fans and a large pumped loop with three radiators. Moreover it ran 60 threads continuously for Rosetta at Home with no heating problem.
AI suggested adding an option of cpu-usage=50 to config.xml, which I have done, and I restarted the client, but the temperature is still in the 90s.
So I am running with 30 threads, or half of those available, plus the option. Any offers?
Re: overheating
Posted: Mon Oct 13, 2025 12:32 am
by Joe_H
Rosetta at Home is not a comparable computation load.
AI suggested an option that only applied to v7, and only was meant for an early CPU core that hasn't been used in at least a decade.
It goes back to improving your cooling. Having the pumped loop is not a solution if you have the wrong hardware installed that does not meet the cooling needs of your Threadripper. Depending on which exact processor you have installed, the hardware would need to be rated for a TDP of 300 W or more.
Re: overheating
Posted: Mon Oct 13, 2025 12:22 pm
by jsm
Noting the suggestion that the hardware may not be 'up to scratch' may I comment that this threadripper 32 core was constructed for me by a specialist pc company with the specific instruction to incorporate cooling to permit safe operation of all 32 cores at 100% 24/7.
Thus it is in an EATX case to give room, has five large fans in addition to a very large rigid line pump system exhausting into three radiators. There is only one minimal gpu, one ssd, one hd, one optical drive.
I have opened up and ensured no dust, reservoir topped up, all connections firm. re-cleaned all fans.
I don't know what else I can do but am open to suggestions. Surely if FAH wants users with multi cpu machines to contribute there should be a way to handle temperature throttling?
jsm
Re: overheating
Posted: Mon Oct 13, 2025 2:39 pm
by Joe_H
That still doesn't mean your machine is up to running F@h 24/7, and I would suggest checking with the assembler. I have heard of more than one "specialist pc company" that has assembled mismatched components. Then you went with liquid cooling which has its own maintenance requirements to flush and clean the cooling loops and blocks to maintain thermal transfer. There may also need to be BIOS tweaks to the related fan and pump controls.
F@h has left temperature controls to the hardware and the user. Programming a common interface to deal with it for Windows, Linux, and macOS would be beyond what they want to encode into the client. You as the user can define how many CPU threads are used. v7 also had a published API by which external controls could be added. They plan to post a similar API for v8 as it gets finalized.
Re: overheating
Posted: Mon Oct 13, 2025 5:29 pm
by jsm
So it's up to users to protect against overheating. OK, I have constructed a script to run under tcsh with limits of 95 down to 80 and a 100-second anti-flap time, if anybody using Ubuntu or similar wants to try it:
#!/bin/tcsh
# Needs lm-sensors and bc. The systemctl calls go through sudo, so run this
# as root or with passwordless sudo.
# === CONFIGURATION ===
set max_temp = 95
set resume_temp = 80
set check_interval = 60
set logfile = "$HOME/fah_thermal_guard.log"
set min_pause_duration = 100 # Seconds FAH must stay paused before resume
# === STATE TRACKING ===
set last_pause_time = 0
# === MAIN LOOP ===
set timestamp = `date +"%F %T"`
echo "$timestamp FAH Thermal Guard started" >> $logfile
while (1)
    set overheat = 0
    set cooled = 1
    # Get CPU Package temp. sed keeps only the first number on the line, so
    # the high/crit limits that sensors prints on some CPUs are not picked up.
    set pkg_temp = `sensors | grep -Ei 'Package id 0|Tctl|Tdie' | head -1 | sed -nE 's/^[^+]*\+([0-9]+\.[0-9]).*/\1/p'`
    set timestamp = `date +"%F %T"`
    echo "$timestamp CPU Package Temp: ${pkg_temp}°C" >> $logfile
    # Get relevant thermal readings (one value per matching sensor line)
    set temps = (`sensors | grep -Ei 'Package id 0|Tctl|Tdie' | sed -nE 's/^[^+]*\+([0-9]+\.[0-9]).*/\1/p'`)
    set timestamp = `date +"%F %T"`
    echo "$timestamp Extracted thermal temps: $temps" >> $logfile
    foreach t ($temps)
        if (`echo "$t > $max_temp" | bc` == 1) then
            set overheat = 1
        endif
        if (`echo "$t > $resume_temp" | bc` == 1) then
            set cooled = 0
        endif
    end
    # Pause if overheating
    if ($overheat) then
        set timestamp = `date +"%F %T"`
        echo "$timestamp Temp exceeded ${max_temp}°C: pausing FAHClient" >> $logfile
        sudo systemctl stop fah-client
        set last_pause_time = `date +%s`
    else
        # Check FAH status and resume if cooled and pause duration met
        set fahstatus = `sudo systemctl is-active fah-client`
        set now = `date +%s`
        set elapsed = `echo "$now - $last_pause_time" | bc`
        set timestamp = `date +"%F %T"`
        echo "$timestamp FAH status: '$fahstatus', cooled: $cooled, paused for: $elapsed sec" >> $logfile
        if ($cooled == 1 && "$fahstatus" != "active" && $elapsed >= $min_pause_duration) then
            echo "$timestamp Temps below ${resume_temp}°C and pause duration met: resuming FAHClient" >> $logfile
            sudo systemctl start fah-client
            set fahstatus_post = `sudo systemctl is-active fah-client`
            set timestamp = `date +"%F %T"`
            echo "$timestamp FAH status after resume: '$fahstatus_post'" >> $logfile
        endif
    endif
    sleep $check_interval
end
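The pause/resume decision in the script above is a hysteresis band plus a minimum pause time. As a minimal sketch, here is that decision on its own as one function, written in bash rather than tcsh so it can be tried without touching sensors or systemctl. The function name guard_action and its argument order are made up for this illustration, and temperatures are whole degrees to keep the arithmetic in the shell.

```shell
#!/bin/bash
# Hypothetical helper, not part of FAH: decide what a thermal guard should do.
# Usage: guard_action TEMP MAX RESUME FAH_ACTIVE PAUSED_SECS MIN_PAUSE
#   TEMP         current CPU temperature, whole degrees C
#   MAX          stop folding above this temperature
#   RESUME       allow a restart only at or below this temperature
#   FAH_ACTIVE   1 if fah-client is running, 0 if not
#   PAUSED_SECS  seconds since folding was last stopped
#   MIN_PAUSE    anti-flap time in seconds
# Prints one of: pause, resume, none
guard_action() {
    local temp=$1 max=$2 resume=$3 active=$4 paused=$5 min_pause=$6
    if [ "$temp" -gt "$max" ]; then
        echo pause
    elif [ "$active" -eq 0 ] && [ "$temp" -le "$resume" ] && [ "$paused" -ge "$min_pause" ]; then
        echo resume
    else
        echo none
    fi
}
```

With the limits from the script, guard_action 96 95 80 1 0 100 prints pause, and guard_action 78 95 80 0 120 100 prints resume. Anything between 80 and 95 prints none, which is what stops the client from flapping on and off around a single threshold.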
Re: overheating
Posted: Mon Oct 13, 2025 6:29 pm
by Joe_H
It is up to the users also to have a capable, stable system. The fact yours crashed from overheating means it was not as advertised by your builder, capable of running any process at full usage of the cores 24/7. So you are responsible for running within the limits of your hardware.
To give you an idea of the load F@h CPU folding is capable of, it has been used as part of benchmarks and by more than a few operations to "burn in" new server installations. The heavy floating point usage can push CPUs to well over 90% of their capacity.
Recent Threadripper chips have been rated at 380 W, a previous generation was 280 W. The highest end chips can briefly draw more than 400 W. F@h will use all that it is allowed by the hardware. Your cooling system is either up to handling that or it is not.
As for your solution, essentially you are using the figurative hammer. It will end up continually cycling your system from hot to cold, and doing a hard interrupt instead of cleanly pausing the folding process. The actual CPU usage will be in a core process spawned by fah-client. It should exit when the fah-client process is stopped, but problems have been reported in the past.
There is the lufah utility which can be used to control folding such as pausing and starting the processing of WUs. There are other ways of handling this, ask about that instead of demanding F@h include such controls that almost no one needs. Or get your system fixed so it can actually do what you apparently paid for and did not get.
Re: overheating
Posted: Mon Oct 13, 2025 9:02 pm
by calxalot
Besides lufah, there is fahctl that is installed with the client on Linux and macOS.
It needs python 3.6+ and python module websocket-client. I think the apt pkg is python3-websocket
Re: overheating
Posted: Mon Oct 13, 2025 9:04 pm
by calxalot
Some other people are using cgroups on Linux to limit the cpu time. Please search this forum.
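One way to do that without writing cgroup files by hand is the systemd CPUQuota= property, which caps the CPU time of a whole service, child processes included. A rough sketch, assuming the client runs as the fah-client systemd service as in the script earlier in this thread; the helper name cpu_quota_for is made up here.

```shell
#!/bin/bash
# Hypothetical helper: turn "N logical CPUs at P percent" into a CPUQuota value.
# systemd counts 100% as one fully used CPU, so 64 CPUs at 50% is 3200%.
cpu_quota_for() {
    local cpus=$1 percent=$2
    echo "$(( cpus * percent ))%"
}

# Example (commented out so that sourcing this file changes nothing):
#   sudo systemctl set-property --runtime fah-client.service \
#        CPUQuota="$(cpu_quota_for 64 50)"
# --runtime keeps the limit only until the next reboot.
```

A quota throttles CPU time rather than removing threads, so whether it cools a given machine enough is something only a test run will show.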
Re: overheating
Posted: Mon Oct 13, 2025 9:36 pm
by muziqaz
Forget 3rd party tools and scripts. Contact your system assembler and tell them that their system is poorly designed and assembled.
Re: overheating
Posted: Mon Oct 13, 2025 11:24 pm
by calxalot
Maybe just bad thermal paste job?
Re: overheating
Posted: Tue Oct 14, 2025 9:58 pm
by jeffmr4
Hi,
I have an AMD 7950X that usually runs at 95 degrees under load. AMD's spec page says that this is normal and that the chip is meant to run that way 24/7. I wasn't totally comfortable with that, so I initially installed liquidctl (Linux software that controls external devices like fans, controller hubs and AIO coolers). I also installed coolercontrol. This includes an app you need to install to incorporate liquidctl into it. These will let you create dependencies for cooling such that as the temperature of your CPU increases, so does your fan speed, AIO pump speed, etc.
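A dependency of this kind is usually a fan curve: a few temperature/duty points with straight lines between them. A minimal sketch of that interpolation in bash, with made-up curve points and a made-up function name fan_duty. It only illustrates the idea and is not the API of coolercontrol or liquidctl.

```shell
#!/bin/bash
# Hypothetical fan curve: "temperature_C:duty_percent" pairs, ascending.
CURVE="40:30 60:50 80:100"

# Print the fan duty (percent) for a whole-degree temperature,
# interpolating linearly between neighbouring curve points.
fan_duty() {
    local temp=$1 prev_t="" prev_d="" t d point
    for point in $CURVE; do
        t=${point%%:*}
        d=${point##*:}
        if [ "$temp" -le "$t" ]; then
            if [ -z "$prev_t" ]; then
                echo "$d"          # at or below the first point: minimum duty
            else
                echo $(( prev_d + (d - prev_d) * (temp - prev_t) / (t - prev_t) ))
            fi
            return
        fi
        prev_t=$t
        prev_d=$d
    done
    echo "$prev_d"                 # above the last point: maximum duty
}
```

With these points, fan_duty 50 prints 40 and fan_duty 70 prints 75; anything above 80 stays at 100.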
When I initially installed fah on my PC in linux the fans wouldn't spin up on their own when the system came under load. That is why I used the above software. I'm not sure how well linux controls cooling from one distribution to another.
I also undervolted my CPU in the BIOS, kind of like what you are doing with your script. In the BIOS you can undervolt by wattage or temperature. I did it by wattage, following a video I saw online. If you would like a link to that, I can send it to you. This brought my temps down by 10 to 20 degrees C. Now my system maxes out at about 85 degrees instead of 95, which might be better for keeping other components in the system from overheating.
I'd note that the coolercontrol software can also control the fans on your gpu.
Let us know what you decide. Good luck!
Re: overheating
Posted: Tue Oct 14, 2025 10:01 pm
by muziqaz
AMD chips will always max out at the max temp you either set, or default (whichever AMD sets). For 7000 series 95C is the max safe temp, and it will stay at that temp even with water cooling.
For the 9000 series the max temp allowed is 89-90C, I believe. I have set mine to 89C. And even with the best watercooling kits they are sitting at 88C when folding.
Re: overheating
Posted: Thu Oct 16, 2025 2:53 pm
by Albuquerquefx
Despite the focus being on overheating and controlling the heat, it's also necessary to point out: if the system overheated so badly as to actually lock up, to the point of needing to be hard-powered-down? That's a builder problem, full stop. There should never be a situation where the physical system achieves a point of thermal overload (no matter what program may be running) where it should fully lock up.
Given the available information, I find it unlikely the "specialist" builder properly burn-in tested the completed system. A 32-core system is rather boutique, and so too should be the cooling. Keeping a workstation-class 300W of CPU power, along with (I have to assume) a minimum of six memory channels and probably twelve DIMMs all cool is going to require a lot of surface area and air movement. Water doesn't mean better cooling by itself; water simply means you're transporting the heat to (what you hope is) a larger heat exchanger than could be reasonably stuffed into a heatsink directly mounted to the CPU. That's it. At the end of the day, the heat needs to be moved out of the CPU to the air; water (at the enterprise scale, ethylene glycol) just helps you move that air transition to another location...
Finally, I thought it might be useful for someone who isn't a member of the forum staff or the F@H team to comment (e.g. me, since I'm not a member of the forum staff nor an employee of or vendor/supplier to F@H), since otherwise the replies might be seen as "defensive". It isn't the F@H team being defensive; properly assembled machines should not permit themselves to overheat. If an application is capable of "making" a machine overheat, it's because the machine is built incorrectly, not because the application is too hard.