looks like failed upload, but pause/unpause fixes it
Posted: Mon Dec 21, 2020 9:19 pm
Version 7.6.13, folding on one GPU slot. I can see it's chugging away on one work unit at 26%, but there's still another work unit showing 100%, so it seems like an upload problem, right?
Even though my project 17426 WU had reached 100%, the key log entry after that was:
WARNING:WU00:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
The client would normally restart the core, but I think that because the WU had already passed 99%, the next WU was already downloaded and ready to go, so it started instead and my 17426 got superseded. After pausing, waiting a bit, and then unpausing, the core picked up the older WU at 98%, finished it, and uploaded it, then continued with the other one from where it had left off at 25%.
logs:
19:55:23:WU00:FS01:0x22:Completed 1237500 out of 1250000 steps (99%)
19:55:24:WU01:FS01:Connecting to assign1.foldingathome.org:80
19:55:24:WU01:FS01:Assigned to work server 18.188.125.154
19:55:24:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104GL [Tesla T4] 8141 from 18.188.125.154
19:55:24:WU01:FS01:Connecting to 18.188.125.154:8080
19:55:26:WU01:FS01:Downloading 7.49MiB
19:55:26:WU01:FS01:Download complete
19:55:26:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13438 run:9275 clone:12 gen:1 core:0x22 unit:0x0000000c000000010000347e0000243b
19:56:52:WU00:FS01:0x22:Completed 1250000 out of 1250000 steps (100%)
19:56:52:WU00:FS01:0x22:Average performance: 19.5254 ns/day
19:56:53:WU00:FS01:0x22:An exception occurred at step 1250000: Force RMSE error of 7.84704 with threshold of 5
19:56:53:WU00:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
19:56:53:WU00:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
19:56:54:WARNING:WU00:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
19:56:54:WU01:FS01:Starting
19:56:54:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 448 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
19:56:54:WU01:FS01:Started FahCore on PID 2347
19:56:54:WU01:FS01:Core PID:2351
19:56:54:WU01:FS01:FahCore 0x22 started
19:56:54:WU01:FS01:0x22:*********************** Log Started 2020-12-21T19:56:54Z ***********************
19:56:54:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
19:56:54:WU01:FS01:0x22: Core: Core22
19:56:54:WU01:FS01:0x22: Type: 0x22
19:56:54:WU01:FS01:0x22: Version: 0.0.13
19:56:54:WU01:FS01:0x22: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:56:54:WU01:FS01:0x22: Copyright: 2020 foldingathome.org
19:56:54:WU01:FS01:0x22: Homepage: https://foldingathome.org/
19:56:54:WU01:FS01:0x22: Date: Sep 19 2020
19:56:54:WU01:FS01:0x22: Time: 01:10:35
19:56:54:WU01:FS01:0x22: Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
19:56:54:WU01:FS01:0x22: Branch: core22-0.0.13
19:56:54:WU01:FS01:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
19:56:54:WU01:FS01:0x22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
19:56:54:WU01:FS01:0x22: -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
19:56:54:WU01:FS01:0x22: Platform: linux2 4.19.76-linuxkit
19:56:54:WU01:FS01:0x22: Bits: 64
19:56:54:WU01:FS01:0x22: Mode: Release
19:56:54:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
19:56:54:WU01:FS01:0x22: <peastman@stanford.edu>
19:56:54:WU01:FS01:0x22: Args: -dir 01 -suffix 01 -version 706 -lifeline 2347 -checkpoint 15
19:56:54:WU01:FS01:0x22: -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
19:56:54:WU01:FS01:0x22: 0 -gpu 0
19:56:54:WU01:FS01:0x22:************************************ libFAH ************************************
19:56:54:WU01:FS01:0x22: Date: Sep 15 2020
19:56:54:WU01:FS01:0x22: Time: 05:14:43
19:56:54:WU01:FS01:0x22: Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
19:56:54:WU01:FS01:0x22: Branch: HEAD
19:56:54:WU01:FS01:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
19:56:54:WU01:FS01:0x22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
19:56:54:WU01:FS01:0x22: -funroll-loops
19:56:54:WU01:FS01:0x22: Platform: linux2 4.19.76-linuxkit
19:56:54:WU01:FS01:0x22: Bits: 64
19:56:54:WU01:FS01:0x22: Mode: Release
19:56:54:WU01:FS01:0x22:************************************ CBang *************************************
19:56:54:WU01:FS01:0x22: Date: Sep 15 2020
19:56:54:WU01:FS01:0x22: Time: 05:11:04
19:56:54:WU01:FS01:0x22: Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
19:56:54:WU01:FS01:0x22: Branch: HEAD
19:56:54:WU01:FS01:0x22: Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
19:56:54:WU01:FS01:0x22: Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
19:56:54:WU01:FS01:0x22: -funroll-loops -fPIC
19:56:54:WU01:FS01:0x22: Platform: linux2 4.19.76-linuxkit
19:56:54:WU01:FS01:0x22: Bits: 64
19:56:54:WU01:FS01:0x22: Mode: Release
19:56:54:WU01:FS01:0x22:************************************ System ************************************
19:56:54:WU01:FS01:0x22: CPU: Intel(R) Xeon(R) CPU @ 2.30GHz
19:56:54:WU01:FS01:0x22: CPU ID: GenuineIntel Family 6 Model 63 Stepping 0
19:56:54:WU01:FS01:0x22: CPUs: 1
19:56:54:WU01:FS01:0x22: Memory: 1.70GiB
19:56:54:WU01:FS01:0x22:Free Memory: 1.13GiB
19:56:54:WU01:FS01:0x22: Threads: POSIX_THREADS
19:56:54:WU01:FS01:0x22: OS Version: 4.19
19:56:54:WU01:FS01:0x22:Has Battery: false
19:56:54:WU01:FS01:0x22: On Battery: false
19:56:54:WU01:FS01:0x22: UTC Offset: 0
19:56:54:WU01:FS01:0x22: PID: 2351
19:56:54:WU01:FS01:0x22: CWD: /var/lib/fahclient/work
19:56:54:WU01:FS01:0x22:************************************ OpenMM ************************************
19:56:54:WU01:FS01:0x22: Revision: 189320d0
19:56:54:WU01:FS01:0x22:********************************************************************************
19:56:54:WU01:FS01:0x22:Project: 13438 (Run 9275, Clone 12, Gen 1)
19:56:54:WU01:FS01:0x22:Unit: 0x00000000000000000000000000000000
19:56:54:WU01:FS01:0x22:Reading tar file core.xml
19:56:54:WU01:FS01:0x22:Reading tar file integrator.xml.bz2
19:56:54:WU01:FS01:0x22:Reading tar file state.xml.bz2
19:56:54:WU01:FS01:0x22:Reading tar file system.xml.bz2
19:56:54:WU01:FS01:0x22:Digital signatures verified
19:56:54:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
19:56:54:WU01:FS01:0x22:Version 0.0.13
19:56:54:WU01:FS01:0x22: Checkpoint write interval: 50000 steps (5%) [20 total]
19:56:54:WU01:FS01:0x22: JSON viewer frame write interval: 10000 steps (1%) [100 total]
19:56:54:WU01:FS01:0x22: XTC frame write interval: 250000 steps (25%) [4 total]
19:56:54:WU01:FS01:0x22: Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
19:56:54:WU01:FS01:0x22:There are 4 platforms available.
19:56:54:WU01:FS01:0x22:Platform 0: Reference
19:56:54:WU01:FS01:0x22:Platform 1: CPU
19:56:54:WU01:FS01:0x22:Platform 2: OpenCL
19:56:54:WU01:FS01:0x22: opencl-device 0 specified
19:56:54:WU01:FS01:0x22:Platform 3: CUDA
19:56:54:WU01:FS01:0x22: cuda-device 0 specified
19:57:00:WU01:FS01:0x22:Attempting to create CUDA context:
19:57:00:WU01:FS01:0x22: Configuring platform CUDA
19:57:09:WU01:FS01:0x22: Using CUDA and gpu 0
19:57:09:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)
19:57:09:WU01:FS01:0x22:Checkpoint completed at step 0
19:59:02:WU01:FS01:0x22:Completed 10000 out of 1000000 steps (1%)
20:00:54:WU01:FS01:0x22:Completed 20000 out of 1000000 steps (2%)
20:02:46:WU01:FS01:0x22:Completed 30000 out of 1000000 steps (3%)
20:04:37:WU01:FS01:0x22:Completed 40000 out of 1000000 steps (4%)
20:06:29:WU01:FS01:0x22:Completed 50000 out of 1000000 steps (5%)
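In case anyone else wants to check their own log for the same pattern (a WU that reports 99% or more and then gets a CORE_RESTART), here is a rough Python sketch. The regular expressions are guessed purely from the log lines pasted above, and `find_late_restarts` is just a name I made up, not anything from the FAH client, so treat it as a starting point rather than a proper tool.

```python
import re

# Both patterns are assumptions based only on the log lines in this post.
PROGRESS = re.compile(
    r"^(\d\d:\d\d:\d\d):(WU\d+):(FS\d+):0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)
RESTART = re.compile(
    r"^(\d\d:\d\d:\d\d):WARNING:(WU\d+):(FS\d+):"
    r"FahCore returned: CORE_RESTART"
)

def find_late_restarts(lines, threshold=99):
    """Return (time, wu, slot, percent) for every CORE_RESTART that hit a WU
    whose last reported progress was at or above `threshold` percent."""
    last_percent = {}  # (wu, slot) -> last reported integer percent
    hits = []
    for line in lines:
        m = PROGRESS.match(line)
        if m:
            _, wu, slot, done, total = m.groups()
            last_percent[(wu, slot)] = 100 * int(done) // int(total)
            continue
        m = RESTART.match(line)
        if m:
            when, wu, slot = m.groups()
            pct = last_percent.get((wu, slot))
            if pct is not None and pct >= threshold:
                hits.append((when, wu, slot, pct))
    return hits

# A few lines copied from the log above.
sample = [
    "19:55:23:WU00:FS01:0x22:Completed 1237500 out of 1250000 steps (99%)",
    "19:56:52:WU00:FS01:0x22:Completed 1250000 out of 1250000 steps (100%)",
    "19:56:54:WARNING:WU00:FS01:FahCore returned: CORE_RESTART (98 = 0x62)",
    "19:57:09:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)",
]
print(find_late_restarts(sample))
```

To run it against a real log, pass the open file instead of `sample`, e.g. `find_late_restarts(open("log.txt"))`, with the path being wherever your client writes its log.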