
looks like failed upload, but pause/unpause fixes it

Posted: Mon Dec 21, 2020 9:19 pm
by Knish
Version 7.6.13, folding on one GPU slot. I see it's chugging away at one WU at 26%, but there's still another work unit showing at 100%, so it seems like an upload problem, right?

Even though my project 17426 WU had reached 100%, the key log entry after that was:
WARNING:WU00:FS01:FahCore returned: CORE_RESTART (98 = 0x62)

The client would normally restart the core, but I think that since the WU had already passed 99%, the next WU was already downloaded and ready to go, so my 17426 WU got superseded. After pausing, waiting a bit, then unpausing, the core picked up the older WU at 98%, finished it, and uploaded it, then continued with the other one from where it left off at 25%.

logs:


19:55:23:WU00:FS01:0x22:Completed 1237500 out of 1250000 steps (99%)
19:55:24:WU01:FS01:Connecting to assign1.foldingathome.org:80
19:55:24:WU01:FS01:Assigned to work server 18.188.125.154
19:55:24:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104GL [Tesla T4] 8141 from 18.188.125.154
19:55:24:WU01:FS01:Connecting to 18.188.125.154:8080
19:55:26:WU01:FS01:Downloading 7.49MiB
19:55:26:WU01:FS01:Download complete
19:55:26:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13438 run:9275 clone:12 gen:1 core:0x22 unit:0x0000000c000000010000347e0000243b
19:56:52:WU00:FS01:0x22:Completed 1250000 out of 1250000 steps (100%)
19:56:52:WU00:FS01:0x22:Average performance: 19.5254 ns/day
19:56:53:WU00:FS01:0x22:An exception occurred at step 1250000: Force RMSE error of 7.84704 with threshold of 5
19:56:53:WU00:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
19:56:53:WU00:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
19:56:54:WARNING:WU00:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
19:56:54:WU01:FS01:Starting
19:56:54:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 448 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
19:56:54:WU01:FS01:Started FahCore on PID 2347
19:56:54:WU01:FS01:Core PID:2351
19:56:54:WU01:FS01:FahCore 0x22 started
19:56:54:WU01:FS01:0x22:*********************** Log Started 2020-12-21T19:56:54Z ***********************
19:56:54:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
19:56:54:WU01:FS01:0x22:       Core: Core22
19:56:54:WU01:FS01:0x22:       Type: 0x22
19:56:54:WU01:FS01:0x22:    Version: 0.0.13
19:56:54:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:56:54:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
19:56:54:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
19:56:54:WU01:FS01:0x22:       Date: Sep 19 2020
19:56:54:WU01:FS01:0x22:       Time: 01:10:35
19:56:54:WU01:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
19:56:54:WU01:FS01:0x22:     Branch: core22-0.0.13
19:56:54:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
19:56:54:WU01:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
19:56:54:WU01:FS01:0x22:             -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
19:56:54:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
19:56:54:WU01:FS01:0x22:       Bits: 64
19:56:54:WU01:FS01:0x22:       Mode: Release
19:56:54:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
19:56:54:WU01:FS01:0x22:             <peastman@stanford.edu>
19:56:54:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 2347 -checkpoint 15
19:56:54:WU01:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
19:56:54:WU01:FS01:0x22:             0 -gpu 0
19:56:54:WU01:FS01:0x22:************************************ libFAH ************************************
19:56:54:WU01:FS01:0x22:       Date: Sep 15 2020
19:56:54:WU01:FS01:0x22:       Time: 05:14:43
19:56:54:WU01:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
19:56:54:WU01:FS01:0x22:     Branch: HEAD
19:56:54:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
19:56:54:WU01:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
19:56:54:WU01:FS01:0x22:             -funroll-loops
19:56:54:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
19:56:54:WU01:FS01:0x22:       Bits: 64
19:56:54:WU01:FS01:0x22:       Mode: Release
19:56:54:WU01:FS01:0x22:************************************ CBang *************************************
19:56:54:WU01:FS01:0x22:       Date: Sep 15 2020
19:56:54:WU01:FS01:0x22:       Time: 05:11:04
19:56:54:WU01:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
19:56:54:WU01:FS01:0x22:     Branch: HEAD
19:56:54:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
19:56:54:WU01:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
19:56:54:WU01:FS01:0x22:             -funroll-loops -fPIC
19:56:54:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
19:56:54:WU01:FS01:0x22:       Bits: 64
19:56:54:WU01:FS01:0x22:       Mode: Release
19:56:54:WU01:FS01:0x22:************************************ System ************************************
19:56:54:WU01:FS01:0x22:        CPU: Intel(R) Xeon(R) CPU @ 2.30GHz
19:56:54:WU01:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 0
19:56:54:WU01:FS01:0x22:       CPUs: 1
19:56:54:WU01:FS01:0x22:     Memory: 1.70GiB
19:56:54:WU01:FS01:0x22:Free Memory: 1.13GiB
19:56:54:WU01:FS01:0x22:    Threads: POSIX_THREADS
19:56:54:WU01:FS01:0x22: OS Version: 4.19
19:56:54:WU01:FS01:0x22:Has Battery: false
19:56:54:WU01:FS01:0x22: On Battery: false
19:56:54:WU01:FS01:0x22: UTC Offset: 0
19:56:54:WU01:FS01:0x22:        PID: 2351
19:56:54:WU01:FS01:0x22:        CWD: /var/lib/fahclient/work
19:56:54:WU01:FS01:0x22:************************************ OpenMM ************************************
19:56:54:WU01:FS01:0x22:   Revision: 189320d0
19:56:54:WU01:FS01:0x22:********************************************************************************
19:56:54:WU01:FS01:0x22:Project: 13438 (Run 9275, Clone 12, Gen 1)
19:56:54:WU01:FS01:0x22:Unit: 0x00000000000000000000000000000000
19:56:54:WU01:FS01:0x22:Reading tar file core.xml
19:56:54:WU01:FS01:0x22:Reading tar file integrator.xml.bz2
19:56:54:WU01:FS01:0x22:Reading tar file state.xml.bz2
19:56:54:WU01:FS01:0x22:Reading tar file system.xml.bz2
19:56:54:WU01:FS01:0x22:Digital signatures verified
19:56:54:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
19:56:54:WU01:FS01:0x22:Version 0.0.13
19:56:54:WU01:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
19:56:54:WU01:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
19:56:54:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
19:56:54:WU01:FS01:0x22:  Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
19:56:54:WU01:FS01:0x22:There are 4 platforms available.
19:56:54:WU01:FS01:0x22:Platform 0: Reference
19:56:54:WU01:FS01:0x22:Platform 1: CPU
19:56:54:WU01:FS01:0x22:Platform 2: OpenCL
19:56:54:WU01:FS01:0x22:  opencl-device 0 specified
19:56:54:WU01:FS01:0x22:Platform 3: CUDA
19:56:54:WU01:FS01:0x22:  cuda-device 0 specified
19:57:00:WU01:FS01:0x22:Attempting to create CUDA context:
19:57:00:WU01:FS01:0x22:  Configuring platform CUDA
19:57:09:WU01:FS01:0x22:  Using CUDA and gpu 0
19:57:09:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)
19:57:09:WU01:FS01:0x22:Checkpoint completed at step 0
19:59:02:WU01:FS01:0x22:Completed 10000 out of 1000000 steps (1%)
20:00:54:WU01:FS01:0x22:Completed 20000 out of 1000000 steps (2%)
20:02:46:WU01:FS01:0x22:Completed 30000 out of 1000000 steps (3%)
20:04:37:WU01:FS01:0x22:Completed 40000 out of 1000000 steps (4%)
20:06:29:WU01:FS01:0x22:Completed 50000 out of 1000000 steps (5%)
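
The pattern in the log above (one WU returns CORE_RESTART and a different WU then starts on the same slot) can be spotted mechanically. The following is only a rough sketch, not anything FAHClient provides: the function name `find_superseded_units` and the regular expression are made up for illustration, and the line format is assumed to match the excerpt above.

```python
import re

# Assumed line layout, taken from the log excerpt above:
#   HH:MM:SS:[WARNING:]WUnn:FSnn:message
LINE_RE = re.compile(r"^(\d\d:\d\d:\d\d):(?:WARNING:)?(WU\d+):(FS\d+):(.*)$")


def find_superseded_units(log_lines):
    """Return (slot, restarted_wu, started_wu) tuples for every case where a
    WU returned CORE_RESTART and a different WU was started next on that slot."""
    pending = {}  # slot -> WU that most recently returned CORE_RESTART
    hits = []
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        _time, wu, slot, msg = match.groups()
        if "FahCore returned: CORE_RESTART" in msg:
            pending[slot] = wu
        elif msg == "Starting" and slot in pending:
            # A different WU starting here means the restarted one was skipped.
            if pending[slot] != wu:
                hits.append((slot, pending[slot], wu))
            del pending[slot]
    return hits


sample = [
    "19:56:52:WU00:FS01:0x22:Completed 1250000 out of 1250000 steps (100%)",
    "19:56:54:WARNING:WU00:FS01:FahCore returned: CORE_RESTART (98 = 0x62)",
    "19:56:54:WU01:FS01:Starting",
]
print(find_superseded_units(sample))  # [('FS01', 'WU00', 'WU01')]
```

Run against a full log file, any tuple it reports points to a WU that may be sitting unfinished in the queue, like WU00 here.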

Re: looks like failed upload, but pause/unpause fixes it

Posted: Tue Dec 22, 2020 8:41 am
by PantherX
I think I have read about this somewhere, but I'm not sure whether it was reported. You can have a look here (https://github.com/FoldingAtHome/fah-issues/issues) and, if it hasn't been reported, you can create an issue by following the template :)

Re: looks like failed upload, but pause/unpause fixes it

Posted: Thu Dec 24, 2020 9:59 pm
by bruce
If the WU gets restarted after a second WU has been downloaded, I'm not sure how FAHClient decides which one to run. (This issue has been discussed earlier.) FAH's basic design does attempt to minimize the percentage of time when two WUs are enqueued on the same slot.

I set next-unit-percentage to 100 which changes the overlap period to 0% but still overlaps the download with the final preparation of the active WU for upload.
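
For anyone who prefers to script the change, here is a minimal sketch of setting the option in a config file. It rests on assumptions: that the v7 client keeps options in config.xml as elements of the form `<next-unit-percentage v='100'/>` directly under `<config>`, and that the valid range is 90 to 100 (the figure given for V7.6.13 elsewhere in this topic). The helper name `set_next_unit_percentage` is made up; check your own config.xml before relying on it, and stop the client before editing the real file.

```python
import xml.etree.ElementTree as ET


def set_next_unit_percentage(config_xml, value):
    """Return config_xml with next-unit-percentage set to value.

    Assumes options are stored as <name v='...'/> children of the root
    element, and that 90..100 is the accepted range.
    """
    if not 90 <= value <= 100:
        raise ValueError("next-unit-percentage must be between 90 and 100")
    root = ET.fromstring(config_xml)
    node = root.find("next-unit-percentage")
    if node is None:
        # Option not present yet, so add it under the root element.
        node = ET.SubElement(root, "next-unit-percentage")
    node.set("v", str(value))
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal config used only for this example.
example = "<config><slot id='1' type='GPU'/></config>"
print(set_next_unit_percentage(example, 100))
```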

Re: looks like failed upload, but pause/unpause fixes it

Posted: Fri Dec 25, 2020 12:18 am
by psaam0001
bruce wrote:If the WU gets restarted after a second WU has been downloaded, I'm not sure how FAHClient decides which one to run. (This issue has been discussed earlier.) FAH's basic design does attempt to minimize the percentage of time when two WUs are enqueued on the same slot.

I set next-unit-percentage to 100 which changes the overlap period to 0% but still overlaps the download with the final preparation of the active WU for upload.
I may experiment with that myself to avoid the wasted time between sending one completed set of WU results and getting the next WU to process.

Paul

Re: looks like failed upload, but pause/unpause fixes it

Posted: Fri Dec 25, 2020 12:26 am
by PantherX
The default is 99 (the original default might have been 95, later changed to 99 as internet speeds improved), and it won't be changed since it's the lowest common denominator for donors on various internet plans across the globe. The idea is that F@H would like to make full use of your system resources, so it "queues" the next WU by downloading it when the current one reaches 99%. That head start can vary from a few seconds to several minutes, depending on your system and the WU's TPF. The loss of points is minimal and was deemed acceptable, since uploading and downloading WUs can be an issue on slow internet connections.
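
To put a number on that head start, here is a back-of-the-envelope sketch. It assumes one frame is 1% of the WU and that the download is triggered exactly when the configured percentage is reached; the function names are invented for this example, and the two timestamps are the 99% and 100% lines for WU00 from the log in the first post.

```python
from datetime import datetime


def seconds_between(start, end):
    """Seconds between two HH:MM:SS log timestamps on the same day."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def overlap_seconds(next_unit_percentage, tpf_seconds):
    """Approximate time the current WU keeps computing after the next WU's
    download is triggered, assuming one frame equals 1% of the WU."""
    return (100 - next_unit_percentage) * tpf_seconds


# WU00 logged 99% at 19:55:23 and 100% at 19:56:52.
tpf = seconds_between("19:55:23", "19:56:52")
print(tpf)                        # 89.0
print(overlap_seconds(99, tpf))   # 89.0
print(overlap_seconds(100, tpf))  # 0.0
```

On that Tesla T4 the default of 99 gives roughly a minute and a half of overlap for this WU; a slower GPU or a larger WU stretches it accordingly.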

Re: looks like failed upload, but pause/unpause fixes it

Posted: Fri Dec 25, 2020 1:22 am
by psaam0001
Point taken. I think the valid range for percentage value (using that flag) was between 95 and 99.

Folding onward!!!

Paul

Re: looks like failed upload, but pause/unpause fixes it

Posted: Fri Dec 25, 2020 2:15 am
by PantherX
In V7.6.13, the valid range is 90 to 100, because we need to account for areas with slow internet connections and WU sizes have grown a bit over time. I don't expect that range to change in V7.6.21.

Re: looks like failed upload, but pause/unpause fixes it

Posted: Fri Dec 25, 2020 2:43 am
by psaam0001
Oh well... I'm on a good cable Internet connection, so it does not matter to me (except when I lose the connection due to issues I can't fix as a customer).

Paul

Re: looks like failed upload, but pause/unpause fixes it

Posted: Fri Dec 25, 2020 4:56 pm
by bruce
When internet speeds were slow, folks complained that their system wasn't folding while a new WU was downloading, so FAHClient was enhanced to allow the new WU to be downloaded while the previous WU was still computing. For me, it becomes a choice between 99% and 100%. There is a mandatory non-compute period after the WU gets to 100%, while the data is being compressed and prepared for upload, when neither WU is actually doing floating-point calculations, and that's pretty close to the time it takes me to download the new WU. Thus the 100% setting achieves higher average compute time than 99% on my system. That depends on several factors, so I watched the log and tried to guess the better choice.

We're not talking about huge changes in total throughput, though, either way. YMMV.