repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Thu Dec 24, 2020 2:57 pm
by Knish
This is the cloud GCP NVIDIA T4 again, but now neither pausing/unpausing nor rebooting clears the 17322. I saw someone else had a fault with this WU; could it be the WU? If my next WU completes and uploads OK, I may have to just dump the 17322.

Code:

*********************** Log Started 2020-12-24T14:24:39Z ***************
14:24:39:Trying to access database...
14:24:40:Successfully acquired database lock
14:24:40:Read GPUs.txt
14:24:43:Enabled folding slot 01: READY gpu:0:TU104GL [Tesla T4] 8141
14:24:43:****************************** FAHClient ***********************
14:24:43:        Version: 7.6.13
14:24:43:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:24:43:      Copyright: 2020 foldingathome.org
14:24:43:       Homepage: https://foldingathome.org/
14:24:43:           Date: Apr 28 2020
14:24:43:           Time: 04:20:16
14:24:43:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
14:24:43:         Branch: master
14:24:43:       Compiler: GNU 8.3.0
14:24:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
14:24:43:                 -funroll-loops -fno-pie
14:24:43:       Platform: linux2 4.19.0-5-amd64
14:24:43:           Bits: 64
14:24:43:           Mode: Release
14:24:43:           Args: --child /etc/fahclient/config.xml --run-as fahclient
14:24:43:                 --pid-file=/var/run/fahclient.pid --daemon
14:24:43:         Config: /etc/fahclient/config.xml
14:24:43:******************************** CBang ***************************
14:24:43:           Date: Apr 25 2020
14:24:43:           Time: 00:07:53
14:24:43:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
14:24:43:         Branch: master
14:24:43:       Compiler: GNU 8.3.0
14:24:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
14:24:43:                 -funroll-loops -fno-pie -fPIC
14:24:43:       Platform: linux2 4.19.0-5-amd64
14:24:43:           Bits: 64
14:24:43:           Mode: Release
14:24:43:******************************* System *************************
14:24:43:            CPU: Intel(R) Xeon(R) CPU @ 2.30GHz
14:24:43:         CPU ID: GenuineIntel Family 6 Model 63 Stepping 0
14:24:43:           CPUs: 1
14:24:43:         Memory: 1.70GiB
14:24:43:    Free Memory: 1.45GiB
14:24:43:        Threads: POSIX_THREADS
14:24:43:     OS Version: 4.19
14:24:43:    Has Battery: false
14:24:43:     On Battery: false
14:24:43:     UTC Offset: 0
14:24:43:            PID: 444
14:24:43:            CWD: /var/lib/fahclient
14:24:43:             OS: Linux 4.19.0-13-cloud-amd64 x86_64
14:24:43:        OS Arch: AMD64
14:24:43:           GPUs: 1
14:24:43:          GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:6 TU104GL [Tesla T4] 8141
14:24:43:  CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:7.5 Driver:11.0
14:24:43:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:1.2 Driver:450.51
14:24:43:******************************* libFAH **************************
14:24:43:           Date: Apr 15 2020
14:24:43:           Time: 21:43:24
14:24:43:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
14:24:43:         Branch: master
14:24:43:       Compiler: GNU 8.3.0
14:24:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
14:24:43:                 -funroll-loops -fno-pie
14:24:43:       Platform: linux2 4.19.0-5-amd64
14:24:43:           Bits: 64
14:24:43:           Mode: Release
14:24:43:****************************************************************
14:24:43:<config>
14:24:43:  <!-- Client Control -->
14:24:43:  <fold-anon v='true'/>

14:24:43:  <!-- User Information -->

14:24:43:  <!-- Folding Slots -->
14:24:43:  <slot id='1' type='GPU'/>
14:24:43:</config>
14:24:43:WU01:FS01:Starting
14:24:43:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 444 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
14:24:43:WU01:FS01:Started FahCore on PID 551
14:24:43:WU01:FS01:Core PID:555
14:24:43:WU01:FS01:FahCore 0x22 started
14:24:44:WU01:FS01:0x22:*********************** Log Started 2020-12-24T14:24:43Z ***********
14:24:44:WU01:FS01:0x22:*************************** Core22 Folding@home Core ************
14:24:44:WU01:FS01:0x22:       Core: Core22
14:24:44:WU01:FS01:0x22:       Type: 0x22
14:24:44:WU01:FS01:0x22:    Version: 0.0.13
14:24:44:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:24:44:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
14:24:44:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
14:24:44:WU01:FS01:0x22:       Date: Sep 19 2020
14:24:44:WU01:FS01:0x22:       Time: 01:10:35
14:24:44:WU01:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
14:24:44:WU01:FS01:0x22:     Branch: core22-0.0.13
14:24:44:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
14:24:44:WU01:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
14:24:44:WU01:FS01:0x22:             -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
14:24:44:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
14:24:44:WU01:FS01:0x22:       Bits: 64
14:24:44:WU01:FS01:0x22:       Mode: Release
14:24:44:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
14:24:44:WU01:FS01:0x22:             <peastman@stanford.edu>
14:24:44:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 551 -checkpoint 15
14:24:44:WU01:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
14:24:44:WU01:FS01:0x22:             0 -gpu 0
14:24:44:WU01:FS01:0x22:************************************ libFAH *******************
14:24:44:WU01:FS01:0x22:       Date: Sep 15 2020
14:24:44:WU01:FS01:0x22:       Time: 05:14:43
14:24:44:WU01:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
14:24:44:WU01:FS01:0x22:     Branch: HEAD
14:24:44:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
14:24:44:WU01:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
14:24:44:WU01:FS01:0x22:             -funroll-loops
14:24:44:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
14:24:44:WU01:FS01:0x22:       Bits: 64
14:24:44:WU01:FS01:0x22:       Mode: Release
14:24:44:WU01:FS01:0x22:************************************ CBang *******************
14:24:44:WU01:FS01:0x22:       Date: Sep 15 2020
14:24:44:WU01:FS01:0x22:       Time: 05:11:04
14:24:44:WU01:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
14:24:44:WU01:FS01:0x22:     Branch: HEAD
14:24:44:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
14:24:44:WU01:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
14:24:44:WU01:FS01:0x22:             -funroll-loops -fPIC
14:24:44:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
14:24:44:WU01:FS01:0x22:       Bits: 64
14:24:44:WU01:FS01:0x22:       Mode: Release
14:24:44:WU01:FS01:0x22:************************************ System ********************
14:24:44:WU01:FS01:0x22:        CPU: Intel(R) Xeon(R) CPU @ 2.30GHz
14:24:44:WU01:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 0
14:24:44:WU01:FS01:0x22:       CPUs: 1
14:24:44:WU01:FS01:0x22:     Memory: 1.70GiB
14:24:44:WU01:FS01:0x22:Free Memory: 1.31GiB
14:24:44:WU01:FS01:0x22:    Threads: POSIX_THREADS
14:24:44:WU01:FS01:0x22: OS Version: 4.19
14:24:44:WU01:FS01:0x22:Has Battery: false
14:24:44:WU01:FS01:0x22: On Battery: false
14:24:44:WU01:FS01:0x22: UTC Offset: 0
14:24:44:WU01:FS01:0x22:        PID: 555
14:24:44:WU01:FS01:0x22:        CWD: /var/lib/fahclient/work
14:24:44:WU01:FS01:0x22:************************************ OpenMM *********************
14:24:44:WU01:FS01:0x22:   Revision: 189320d0
14:24:44:WU01:FS01:0x22:****************************************************************
14:24:44:WU01:FS01:0x22:Project: 17322 (Run 0, Clone 993, Gen 21)
14:24:44:WU01:FS01:0x22:Unit: 0x00000000000000000000000000000000
14:24:44:WU01:FS01:0x22:Digital signatures verified
14:24:44:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
14:24:44:WU01:FS01:0x22:Version 0.0.13
14:24:44:WU01:FS01:0x22:  Checkpoint write interval: 15000 steps (2%) [50 total]
14:24:44:WU01:FS01:0x22:  JSON viewer frame write interval: 7500 steps (1%) [100 total]
14:24:44:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (33%) [3 total]
14:24:44:WU01:FS01:0x22:  Global context and integrator variables write interval: disabled
14:24:45:WU01:FS01:0x22:There are 4 platforms available.
14:24:45:WU01:FS01:0x22:Platform 0: Reference
14:24:45:WU01:FS01:0x22:Platform 1: CPU
14:24:45:WU01:FS01:0x22:Platform 2: OpenCL
14:24:45:WU01:FS01:0x22:  opencl-device 0 specified
14:24:45:WU01:FS01:0x22:Platform 3: CUDA
14:24:45:WU01:FS01:0x22:  cuda-device 0 specified
14:25:04:WU01:FS01:0x22:Attempting to create CUDA context:
14:25:05:WU01:FS01:0x22:  Configuring platform CUDA
14:25:23:WU01:FS01:0x22:  Using CUDA and gpu 0
14:25:24:WU01:FS01:0x22:Completed 750000 out of 750000 steps (100%)
14:25:29:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
14:25:29:WU00:FS01:Starting
14:25:29:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 444 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
14:25:29:WU00:FS01:Started FahCore on PID 567
14:25:29:WU00:FS01:Core PID:571
14:25:29:WU00:FS01:FahCore 0x22 started
14:25:30:WU00:FS01:0x22:*********************** Log Started 2020-12-24T14:25:30Z *********
14:25:30:WU00:FS01:0x22:*************************** Core22 Folding@home Core *************
14:25:30:WU00:FS01:0x22:       Core: Core22
14:25:30:WU00:FS01:0x22:       Type: 0x22
14:25:30:WU00:FS01:0x22:    Version: 0.0.13
14:25:30:WU00:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:25:30:WU00:FS01:0x22:  Copyright: 2020 foldingathome.org
14:25:30:WU00:FS01:0x22:   Homepage: https://foldingathome.org/
14:25:30:WU00:FS01:0x22:       Date: Sep 19 2020
14:25:30:WU00:FS01:0x22:       Time: 01:10:35
14:25:30:WU00:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
14:25:30:WU00:FS01:0x22:     Branch: core22-0.0.13
14:25:30:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
14:25:30:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
14:25:30:WU00:FS01:0x22:             -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
14:25:30:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
14:25:30:WU00:FS01:0x22:       Bits: 64
14:25:30:WU00:FS01:0x22:       Mode: Release
14:25:30:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
14:25:30:WU00:FS01:0x22:             <peastman@stanford.edu>
14:25:30:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 567 -checkpoint 15
14:25:30:WU00:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
14:25:30:WU00:FS01:0x22:             0 -gpu 0
14:25:30:WU00:FS01:0x22:************************************ libFAH ******************
14:25:30:WU00:FS01:0x22:       Date: Sep 15 2020
14:25:30:WU00:FS01:0x22:       Time: 05:14:43
14:25:30:WU00:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
14:25:30:WU00:FS01:0x22:     Branch: HEAD
14:25:30:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
14:25:30:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
14:25:30:WU00:FS01:0x22:             -funroll-loops
14:25:30:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
14:25:30:WU00:FS01:0x22:       Bits: 64
14:25:30:WU00:FS01:0x22:       Mode: Release
14:25:30:WU00:FS01:0x22:************************************ CBang **********************
14:25:30:WU00:FS01:0x22:       Date: Sep 15 2020
14:25:30:WU00:FS01:0x22:       Time: 05:11:04
14:25:30:WU00:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
14:25:30:WU00:FS01:0x22:     Branch: HEAD
14:25:30:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
14:25:30:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
14:25:30:WU00:FS01:0x22:             -funroll-loops -fPIC
14:25:30:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
14:25:30:WU00:FS01:0x22:       Bits: 64
14:25:30:WU00:FS01:0x22:       Mode: Release
14:25:30:WU00:FS01:0x22:************************************ System ********************
14:25:30:WU00:FS01:0x22:        CPU: Intel(R) Xeon(R) CPU @ 2.30GHz
14:25:30:WU00:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 0
14:25:30:WU00:FS01:0x22:       CPUs: 1
14:25:30:WU00:FS01:0x22:     Memory: 1.70GiB
14:25:30:WU00:FS01:0x22:Free Memory: 1.42GiB
14:25:30:WU00:FS01:0x22:    Threads: POSIX_THREADS
14:25:30:WU00:FS01:0x22: OS Version: 4.19
14:25:30:WU00:FS01:0x22:Has Battery: false
14:25:30:WU00:FS01:0x22: On Battery: false
14:25:30:WU00:FS01:0x22: UTC Offset: 0
14:25:30:WU00:FS01:0x22:        PID: 571
14:25:30:WU00:FS01:0x22:        CWD: /var/lib/fahclient/work
14:25:30:WU00:FS01:0x22:************************************ OpenMM *********************
14:25:30:WU00:FS01:0x22:   Revision: 189320d0
14:25:30:WU00:FS01:0x22:******************************************************************
14:25:30:WU00:FS01:0x22:Project: 17424 (Run 0, Clone 1149, Gen 17)
14:25:30:WU00:FS01:0x22:Unit: 0x00000000000000000000000000000000
14:25:30:WU00:FS01:0x22:Digital signatures verified
14:25:30:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
14:25:30:WU00:FS01:0x22:Version 0.0.13
14:25:30:WU00:FS01:0x22:  Checkpoint write interval: 25000 steps (2%) [50 total]
14:25:30:WU00:FS01:0x22:  JSON viewer frame write interval: 12500 steps (1%) [100 total]
14:25:30:WU00:FS01:0x22:  XTC frame write interval: 10000 steps (0.8%) [125 total]
14:25:30:WU00:FS01:0x22:  Global context and integrator variables write interval: disabled
14:25:32:WU00:FS01:0x22:There are 4 platforms available.
14:25:32:WU00:FS01:0x22:Platform 0: Reference
14:25:32:WU00:FS01:0x22:Platform 1: CPU
14:25:32:WU00:FS01:0x22:Platform 2: OpenCL
14:25:32:WU00:FS01:0x22:  opencl-device 0 specified
14:25:32:WU00:FS01:0x22:Platform 3: CUDA
14:25:32:WU00:FS01:0x22:  cuda-device 0 specified
14:25:40:WU00:FS01:0x22:Attempting to create CUDA context:
14:25:40:WU00:FS01:0x22:  Configuring platform CUDA
14:25:51:WU00:FS01:0x22:  Using CUDA and gpu 0
14:25:52:WU00:FS01:0x22:Completed 200000 out of 1250000 steps (16%)
14:27:19:WU00:FS01:0x22:Completed 212500 out of 1250000 steps (17%)
14:28:06:FS01:Paused
14:28:06:FS01:Shutting core down
14:28:06:WU00:FS01:0x22:Caught signal SIGINT(2) on PID 571
14:28:06:WU00:FS01:0x22:Exiting, please wait. . .

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Fri Dec 25, 2020 12:17 am
by PantherX
It seems that you might have ended up with 2 WUs (each with progress) on a single Slot. See what happens once the current WU finishes as I think the Slot will see that you have another WU and then package it before sending it.

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Fri Dec 25, 2020 8:39 am
by Knish
Endless looping between 100% and INTERRUPTED now. I think either this GPU or this WU is hosed, but the GPU can still complete other WUs.
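A minimal sketch for spotting this loop in a saved log, assuming the line format shown in the log posted above (FAHClient 7.6.13). The function name `interrupted_after_full` and both regular expressions are illustrative, not part of any FAH tool.

```python
import re

# Patterns modelled on the log lines quoted in this thread; other client
# versions may format these lines differently (assumption).
COMPLETED = re.compile(
    r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:0x[0-9a-fA-F]+:Completed (\d+) out of (\d+) steps"
)
RETURNED = re.compile(r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:FahCore returned: (\w+)")


def interrupted_after_full(lines):
    """Return one WU id for every time a WU reported all steps done
    and the core then returned INTERRUPTED."""
    finished = set()  # WUs whose latest progress line was N out of N
    hits = []
    for line in lines:
        m = COMPLETED.match(line)
        if m:
            wu, done, total = m.group(1), int(m.group(2)), int(m.group(3))
            if done == total:
                finished.add(wu)
            else:
                finished.discard(wu)
            continue
        m = RETURNED.match(line)
        if m:
            wu, status = m.groups()
            if status == "INTERRUPTED" and wu in finished:
                hits.append(wu)
            finished.discard(wu)  # the next core start begins a new attempt
    return hits
```

Feeding it the lines of the client log (for example via `open(path).read().splitlines()`, with `path` pointing at wherever your client writes its log) gives a list whose length is the number of times the loop repeated.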

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Fri Dec 25, 2020 9:09 am
by Knish
...wait, maybe something weird with GCP itself. I had spun up a new single-GPU instance, which had been running all day until processing suddenly dropped in the monitoring, with a corresponding halt in FAH logging. Everything was on, so it wasn't preempted, but there was no logging or calculating going on. I rebooted and P14911 continued where it left off. Examining the logs reveals a mysterious gap between 0804Z, the last entry at 22% progress, and 0842Z, where it starts back up from 20%.
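A gap like this can be found mechanically. The sketch below assumes every log line of interest starts with an `HH:MM:SS:` stamp, as in the log posted earlier in this thread; the function name `find_gaps` and the 10-minute default threshold are illustrative choices, not anything FAH defines.

```python
import re
from datetime import timedelta

STAMP = re.compile(r"^(\d\d):(\d\d):(\d\d):")


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (previous_line, next_line, gap) for each pair of consecutive
    timestamped lines that are more than `threshold` apart."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # banner lines such as "Log Started" carry no stamp
        h, mi, s = map(int, m.groups())
        now = timedelta(hours=h, minutes=mi, seconds=s)
        if prev_time is not None:
            delta = now - prev_time
            if delta < timedelta(0):
                # Stamps have no date; assume a single midnight rollover.
                delta += timedelta(days=1)
            if delta > threshold:
                gaps.append((prev_line.rstrip(), line.rstrip(), delta))
        prev_time, prev_line = now, line
    return gaps
```

Because the stamps carry no date, a silence longer than 24 hours would be under-reported; for a stall of this kind that limitation should not matter.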

... and as I type, remote monitoring of this GPU in my FAHControl dropped off like the other one, saying "Updating", but I got another clue: my ssh session using PuTTY was still barely connected and things were really sluggish, with many seconds of delay when typing. I tried to open the last log again in an editor and got this error:

Code:

-bash: fork: Cannot allocate memory
and then I lost the ssh connection.

The VM is still running, and the CPU that feeds the GPU has dropped to 0% usage.

Maybe Santa wants to machine learn our predictive behavior so he doesn't have to keep lists by hand anymore.

Merry Christmas, Happy holidays, and here's to a happy new year

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Fri Dec 25, 2020 3:23 pm
by Knish
Coincidence? I noticed after the fact that around 1100Z WS 140.**.200 restarted. My 17322 then uploaded! BUT I had also increased the RAM for that VM from 1.75 GB to 2 GB.

So I switched to the other VM, which only had 1.5 GB of RAM, and it started hitting the same issues with 14911. I increased that to 2 GB of RAM as well, and it then completed with no more problems.

Running top on the 1.5 GB system, I frequently saw 70 MiB of RAM still free, and saw it get as low as 50, so I thought I was still OK.
After increasing to 2 GB, though, I sometimes see it dip from 100 to 70 MiB free, so I'm guessing this was all a RAM issue? It's a shame I forgot to note the buffer/cache size (if that was important).
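On the buffer/cache question: the "free" figure in top excludes memory the kernel is using for buffers and page cache, much of which can be reclaimed, so the kernel's own `MemAvailable` estimate in `/proc/meminfo` is usually the more useful number to watch. A minimal sketch for recording both follows; the helper names are illustrative, and the field names are the standard Linux ones.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields):
    """Return the figures discussed here, converted to whole MiB."""
    def to_mib(key):
        return fields.get(key, 0) // 1024

    return {
        "free": to_mib("MemFree"),
        "buffers_cache": to_mib("Buffers") + to_mib("Cached"),
        "available": to_mib("MemAvailable"),
        "swap_total": to_mib("SwapTotal"),
    }


def read_summary(path="/proc/meminfo"):
    """Read the live values on a Linux host."""
    with open(path) as fh:
        return memory_summary(parse_meminfo(fh.read()))
```

Calling `read_summary()` every few seconds while a WU finishes and logging the result would capture the lowest point, including whether any swap is configured (`swap_total` of 0 means none).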

So I guess I stumbled upon a minimum for folding with a GPU on Linux. After nearly 2 billion points in 8 months using (among others) a T4 with 1.75 GB RAM, it looks like you'll see fewer issues with 2 GB.

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Fri Dec 25, 2020 3:38 pm
by Joe_H
The increased memory usage probably comes from the size of the system being simulated in Project 17322. At over 430,000 atoms it is currently the largest by atom count being distributed.

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Fri Dec 25, 2020 8:33 pm
by PantherX
This is a tricky one as I am not sure what component (FAHClient/FahCore) should throw/write an error for being unable to package up the WU to send it. Nonetheless, I have asked around so let's see what happens :)

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Fri Dec 25, 2020 9:52 pm
by Knish
Final bit of trivia: now that I'm running with 2 GB of RAM, I stared at top as P17319 finished. It started at 177 MiB free, and as it went through all the "saving result file..." steps I saw it drop to 67 MiB free, so that seems like a perfect explanation for my logs from 17322 earlier.

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Fri Dec 25, 2020 10:36 pm
by toTOW
This is what happens when you don't allocate enough memory to the VM and don't use swap ...

I wouldn't try to fold with less than 4 GB of RAM ...

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Sat Jan 02, 2021 10:43 am
by Gnomuz
I started a GCP instance two days ago, following the guide by Knish, and it works great. But I also got two failed units on project 17322, with exactly the same symptoms as described by Knish, even though I created the instance with 2 GB of RAM. I've just edited the instance to 3 GB of RAM and will check the next time I get a WU from this project.
The RAM costs are reported as 0.03 or 0.04€ per day, so that shouldn't have a big impact on the overall cost of the VM!

Re: repeated INTERRUPTED after 100%- 17322 (0, 993,21)

Posted: Sat Jan 02, 2021 8:37 pm
by Knish
Oh, thanks for this update!