
Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Wed Sep 30, 2020 1:24 pm
by emf
Hey folks,

Ever since Core 22 0.0.13 dropped on my machine, some WUs (mostly project 13426) have hit some kind of weird bug. The GPUs (GTX 1060 6GB) go off into la-la land and the entire machine freezes. It _usually_ comes back with a message similar to:

Code:

kernel: [3036338.798684] watchdog: BUG: soft lockup - CPU#3 stuck for 92s! [FahCore_22:11492]
and a corresponding message in the fahclient log like:

Code:

WARNING:WU02:FS00:Detected clock skew (25 mins 00 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
Sometimes, if it's a short stall, it only prints a

Code:

ERROR:Receive error: 110: Connection timed out
from losing the fahcontrol app socket from a different system.

Dunno what's going on with these WU's, but it's a mess. This machine doesn't have any power management crap and _was_ working pretty well until the 26th.
Is there anything I can do here to debug? (also, the machine is not running any CPU slots; just the two GPU slots, as it's an otherwise wimpy machine and it's all it can do to keep the GPU's fed.)
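One way to start debugging is to tally the kernel's soft-lockup messages by CPU number. If nearly all of them name the same CPU, that may point at the CPU rather than at the GPU work. The following is a minimal sketch, assuming the kernel log lines look like the one quoted above; the function names and the second sample line are made up for illustration.

```python
import re
from collections import Counter

# Matches the watchdog line format quoted in this thread, e.g.
# "watchdog: BUG: soft lockup - CPU#3 stuck for 92s! [FahCore_22:11492]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_lockups(lines):
    """Return a (cpu, seconds, process, pid) tuple for each soft-lockup line."""
    events = []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            events.append(
                (int(m["cpu"]), int(m["secs"]), m["proc"], int(m["pid"]))
            )
    return events


def lockups_per_cpu(lines):
    """Count how many soft lockups were reported for each CPU number."""
    return Counter(cpu for cpu, _, _, _ in parse_lockups(lines))


# First line is the one from this post; the second is a hypothetical
# unrelated line to show that non-matching lines are ignored.
sample = [
    "kernel: [3036338.798684] watchdog: BUG: soft lockup - CPU#3 stuck for 92s! [FahCore_22:11492]",
    "kernel: [3036400.000000] some unrelated message",
]
events = parse_lockups(sample)
per_cpu = lockups_per_cpu(sample)
print(per_cpu)
```

In practice the lines would come from the kernel log (for example the output of `dmesg` or `journalctl -k`) rather than a hard-coded list.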

System info:

Code:

12:37:45:****************************** FAHClient ******************************
12:37:45:        Version: 7.6.9
12:37:45:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
12:37:45:      Copyright: 2020 foldingathome.org
12:37:45:       Homepage: https://foldingathome.org/
12:37:45:           Date: Apr 17 2020
12:37:45:           Time: 18:11:26
12:37:45:       Revision: 398c2b17fa535e0cc6c9d10856b2154c32771646
12:37:45:         Branch: master
12:37:45:       Compiler: GNU 8.3.0
12:37:45:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
12:37:45:                 -funroll-loops -fno-pie
12:37:45:       Platform: linux2 4.19.0-5-amd64
12:37:45:           Bits: 64
12:37:45:           Mode: Release
12:37:45:           Args: --child /etc/fahclient/config.xml --run-as fahclient
12:37:45:                 --pid-file=/var/run/fahclient.pid --daemon
12:37:45:         Config: /etc/fahclient/config.xml
12:37:45:******************************** CBang ********************************
12:37:45:           Date: Apr 17 2020
12:37:45:           Time: 18:10:13
12:37:45:       Revision: 2fb0be7809c5e45287a122ca5fbc15b5ae859a3b
12:37:45:         Branch: master
12:37:45:       Compiler: GNU 8.3.0
12:37:45:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
12:37:45:                 -funroll-loops -fno-pie -fPIC
12:37:45:       Platform: linux2 4.19.0-5-amd64
12:37:45:           Bits: 64
12:37:45:           Mode: Release
12:37:45:******************************* System ********************************
12:37:45:            CPU: Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
12:37:45:         CPU ID: GenuineIntel Family 6 Model 23 Stepping 6
12:37:45:           CPUs: 4
12:37:45:         Memory: 31.41GiB
12:37:45:    Free Memory: 30.70GiB
12:37:45:        Threads: POSIX_THREADS
12:37:45:     OS Version: 4.15
12:37:45:    Has Battery: false
12:37:45:     On Battery: false
12:37:45:     UTC Offset: 0
12:37:45:            PID: 1551
12:37:45:            CWD: /var/lib/fahclient
12:37:45:             OS: Linux 4.15.0-109-generic x86_64
12:37:45:        OS Arch: AMD64
12:37:45:           GPUs: 2
12:37:45:          GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:7 GP106 [GeForce GTX 1060 6GB] 4372
12:37:45:          GPU 1: Bus:5 Slot:0 Func:0 NVIDIA:7 GP106 [GeForce GTX 1060 6GB] 4372
12:37:45:  CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:6.1 Driver:10.2
12:37:45:  CUDA Device 1: Platform:0 Device:1 Bus:5 Slot:0 Compute:6.1 Driver:10.2
12:37:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:440.100
12:37:45:OpenCL Device 1: Platform:0 Device:1 Bus:5 Slot:0 Compute:1.2 Driver:440.100
12:37:45:******************************* libFAH ********************************
12:37:45:           Date: Apr 15 2020
12:37:45:           Time: 21:43:24
12:37:45:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
12:37:45:         Branch: master
12:37:45:       Compiler: GNU 8.3.0
12:37:45:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
12:37:45:                 -funroll-loops -fno-pie
12:37:45:       Platform: linux2 4.19.0-5-amd64
12:37:45:           Bits: 64
12:37:45:           Mode: Release
12:37:45:***********************************************************************

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Wed Sep 30, 2020 1:54 pm
by Neil-B
Are the drivers the latest from the vendor? (I'm not sure of the version numbers for Linux drivers.) If not, it might be worth updating just to rule that out ... it might be related to an odd CUDA issue where the core tries to start CUDA folding and doesn't properly switch to OpenCL if CUDA isn't available, iirc ... two identical GPUs might be causing the issue, but I'll let those who are GPU specialists give you a better diagnosis ... or maybe the new core is pushing thermals/power draw just to the edge of stability?

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Wed Sep 30, 2020 2:06 pm
by psaam0001
I think the latest *OFFICIAL NVidia* driver to support his card is 450.66.

I know that there are newer drivers out there, but the supported hardware notes suggest that they are specifically for the RTX 30xx series.

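From Terminal (as root/super user), he can enter a "dnf update" command (w/o the quote marks) to see if newer supported drivers are found. On Ubuntu, which his log suggests he is running, the equivalent is "apt update" followed by "apt upgrade".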

Paul

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Wed Sep 30, 2020 2:08 pm
by gunnarre
Try updating your FAHClient to version 7.6.13 and update the Nvidia drivers to the newest version (450 is the newest one). I'm successfully running dual GPUs on the Nvidia Server drivers, version 450.51.06-Ubuntu0.18.04.2. 450.66 is the newest regular driver.

Run the command "nvidia-smi" in the command line. You'll get something looking like this:

Code:

nvidia-smi
Wed Sep 30 15:58:06 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 52%   68C    P0    N/A /  75W |    177MiB /  4040MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:1C:00.0 Off |                  N/A |
| 55%   71C    P2   180W / 200W |    237MiB /  8118MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11013      C   ...13/Core_22.fah/FahCore_22      175MiB |
|    1   N/A  N/A      5337      C   ...13/Core_22.fah/FahCore_22      235MiB |
+-----------------------------------------------------------------------------+
You can use the bus IDs in the list to control the power target of each GPU individually, and reduce power to e.g. 100 watts (replace with the IDs from your listing):

Code:

sudo nvidia-smi -i 00000000:1B:00.0 -pl 100
sudo nvidia-smi -i 00000000:1C:00.0 -pl 100
In case your stability was marginal before, CUDA folding might have pushed you over the edge into instability.
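With more than one card, the per-GPU commands can be generated instead of typed by hand. This is only a sketch: it assumes you have captured the output of `nvidia-smi --query-gpu=index,pci.bus_id,power.limit --format=csv,noheader,nounits`, and the sample text below is hypothetical, not taken from a real system. It builds the command lines but does not run them.

```python
import csv
import io


def parse_gpu_csv(text):
    """Parse 'index, pci.bus_id, power.limit' rows from nvidia-smi CSV output."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, bus_id, limit = (field.strip() for field in row)
        try:
            limit_w = float(limit)
        except ValueError:
            limit_w = None  # some cards report no adjustable limit
        gpus.append({"index": int(index), "bus_id": bus_id, "power_limit_w": limit_w})
    return gpus


def power_limit_commands(gpus, watts):
    """Build 'nvidia-smi -i <bus id> -pl <watts>' for GPUs currently above the target."""
    return [
        ["sudo", "nvidia-smi", "-i", gpu["bus_id"], "-pl", str(watts)]
        for gpu in gpus
        if gpu["power_limit_w"] is not None and gpu["power_limit_w"] > watts
    ]


# Hypothetical captured output for a machine with two 120 W cards.
sample_output = "0, 00000000:01:00.0, 120.00\n1, 00000000:05:00.0, 120.00\n"
gpus = parse_gpu_csv(sample_output)
commands = power_limit_commands(gpus, 100)
for cmd in commands:
    print(" ".join(cmd))
```

Printing the commands first and running them by hand keeps the sketch from changing any hardware settings on its own.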

If that doesn't help, it might be a good idea to run a memory test on your system RAM, or check if the kernel needs updating.

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Wed Sep 30, 2020 3:56 pm
by emf
Trying the GPU power limit idea now. Dropped to 100W from 120W max.

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Wed Sep 30, 2020 6:23 pm
by bruce
Some debugging code has been updated in FAHCore_22 and this is one that Development will want to look at. I don't see the PRCG numbers in your post. They can probably grep for it but having the numbers can't hurt.

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Thu Oct 01, 2020 2:47 am
by emf
So far with the power limit adjustment it hasn't griped, but I had a good >24h window between stalls yesterday, so I'm not 100% convinced. It hasn't hurt performance in any meaningful way, so that's good.

bruce: Due to having two cards in the system, it's hard to nail down which WU might be the one at fault, but I can take a guess at the most recent one this morning that caused me to hard powercycle the system.

First off, I have three stall warnings from the 9:00 UTC hour:

Code:

09:25:58:WARNING:WU01:FS00:Detected clock skew (1 mins 05 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:49:37:WARNING:WU01:FS00:Detected clock skew (1 mins 16 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:49:37:WARNING:WU00:FS01:Detected clock skew (1 mins 16 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
At 9:25, the system was:
uploading project:17400 run:0 clone:1087 gen:3,
just starting project:13426 run:6316 clone:4 gen:1, (had not even completed the 0% checkpoint)
and was at ~84% on project:13426 run:6041 clone:16 gen:3.

At 9:50 it was at 91% on 6041 and 7% on 6316, and the last message in any log is at 10:00, when it froze hard. The system was powercycled at 12:37 UTC, came back up, and finished both without further issue, restarting from the 90% and 5% checkpoints.

So, my guess would be that project:13426 run:6041 clone:16 gen:3 was the culprit in this event; assuming that it is actually a code problem and not a hardware stability problem as suggested above.

(I can track down other PRCGs in the same manner, or I can provide the whole fahclient log corpus if it helps.)
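The by-hand correlation above could be sketched in code roughly as follows. It assumes the warning format seen in the log excerpt in this post (a `HH:MM:SS` prefix and a skew given as "N mins NN secs"); other skew formats are not handled, and the function names are made up for illustration.

```python
import re

# Matches lines such as:
# 09:25:58:WARNING:WU01:FS00:Detected clock skew (1 mins 05 secs), ...
SKEW_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"Detected clock skew \((?P<mins>\d+) mins (?P<secs>\d+) secs\)"
)


def parse_skews(lines):
    """Extract time, work unit, slot and skew (in seconds) from each warning."""
    skews = []
    for line in lines:
        m = SKEW_RE.match(line)
        if m:
            skews.append({
                "time": m["time"],
                "wu": m["wu"],
                "slot": m["slot"],
                "skew_secs": int(m["mins"]) * 60 + int(m["secs"]),
            })
    return skews


def slots_by_time(skews):
    """Group slots by timestamp; several slots at once suggests a machine-wide stall."""
    grouped = {}
    for skew in skews:
        grouped.setdefault(skew["time"], []).append(skew["slot"])
    return grouped


log_lines = [
    "09:25:58:WARNING:WU01:FS00:Detected clock skew (1 mins 05 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates",
    "09:49:37:WARNING:WU01:FS00:Detected clock skew (1 mins 16 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates",
    "09:49:37:WARNING:WU00:FS01:Detected clock skew (1 mins 16 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates",
]
skews = parse_skews(log_lines)
grouped = slots_by_time(skews)
print(grouped)
```

A warning that hits both slots at the same timestamp, as at 09:49:37, looks more like the whole machine stalling than a single WU misbehaving.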

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Thu Oct 01, 2020 7:07 am
by PantherX
I think that the 440 driver base is old so do upgrade to the latest ones to be sure. GeForce GTX 1060 is a Pascal GPU so should work without major issues on your system.

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Thu Oct 01, 2020 12:31 pm
by ipkh
The clock skew is a known issue with laptop chips and power states. I thought it was confined to sleep/hibernation but maybe there's an edge case causing this.
I'd make sure you have the latest Ubuntu updates.

But it's also possible you have a power/heat issue due to the CUDA efficiency gains. So maybe dust out the vents and whatnot.

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Thu Oct 01, 2020 6:28 pm
by gunnarre
You'll get a clock skew warning if the core froze for some reason. I've seen this happen on CPUs, and it would make sense that the same would happen if the whole machine freezes but wakes up again without crashing completely.
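The general idea can be illustrated with a small sketch: a process that expects to tick at a regular interval can notice a freeze by comparing the actual gap between ticks with the expected one. This is not FAHClient's actual implementation, just an assumed illustration of the principle; the function name, interval and tolerance are hypothetical.

```python
def find_stalls(tick_times, expected_interval, tolerance):
    """Return (tick index, extra seconds) for each gap that overran the expected interval.

    tick_times: timestamps in seconds at which a periodic task actually ran.
    """
    stalls = []
    for i in range(1, len(tick_times)):
        gap = tick_times[i] - tick_times[i - 1]
        extra = gap - expected_interval
        if extra > tolerance:
            stalls.append((i, extra))
    return stalls


# A task meant to run every 10 s; the machine froze between the third
# and fourth tick, so that gap is 75 s instead of 10 s.
ticks = [0, 10, 20, 95, 105]
stalls = find_stalls(ticks, expected_interval=10, tolerance=5)
print(stalls)
```

Here the reported overrun is 65 seconds, which would read as "1 mins 05 secs" in the style of the warnings quoted earlier in the thread.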

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Fri Oct 02, 2020 3:35 pm
by bruce
...and the hardware can disable itself if it's getting too hot or it could be defective hardware, of course. Start by underclocking the candidate GPU and see if that stops the hangs.

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Mon Oct 05, 2020 8:21 pm
by emf
tl;dr - one of the CPU cores failed. Didn't have anything to do with the WU's or the GPU's or the drivers at all.

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Mon Oct 05, 2020 8:29 pm
by bruce
So "Defective Hardware."

Re: Linux CPU stall with corresponding "Clock Skew" warning.

Posted: Tue Oct 06, 2020 7:08 am
by PantherX
emf wrote:...one of the CPU cores failed...
Out of curiosity, how did you figure that out?