
How to abort a way to big WU?

Posted: Thu Apr 02, 2020 1:06 pm
by asdhjsdjdj
Dear people,

I have a WU running on a GPU which is simply too big. The ETA is 7 days and the expiration date is in 7 days as well. I am not able to keep my computer running without breaks at full power for a week because I need to work on it. On top of that, when I pause the process it actually loses some of its progress, which is probably a technical issue I should post somewhere else.
To summarize, I know that I will not complete this WU before it expires and I really want to simply abort it.
Can anyone tell me how to do it?

Thanks!

Re: How to abort a way to big WU?

Posted: Thu Apr 02, 2020 4:28 pm
by Jorgeminator
What GPU do you have? Seven days sounds like a long time for any GPU... When you pause the WU it falls back to the latest checkpoint, that's normal.

Re: How to abort a way to big WU?

Posted: Thu Apr 02, 2020 4:46 pm
by Neil-B
Would be worth posting a log including both the log header and the start of the WU so that someone can check if it is malformed and report it … They will then give you the best way to dump it.

Re: How to abort a way to big WU?

Posted: Thu Apr 02, 2020 10:13 pm
by uyaem
asdhjsdjdj wrote:[...] when I pause the process it actually loses some of its progress, which is probably a technical issue I should post somewhere else.
This may be down to the following setting and not actually a bug:
Image

Re: How to abort a way to big WU?

Posted: Thu Apr 02, 2020 10:25 pm
by Neil-B
You need to post a log - viewtopic.php?f=24&t=26036 - there are a few malformed WUs around, or it may be a configuration issue … but without it any suggestions are simply guesswork.

Re: How to abort a way to big WU?

Posted: Thu Apr 02, 2020 10:34 pm
by jrweiss
Also, the time estimates for WUs are not accurate early in the process. You might give it a day...

Re: How to abort a way to big WU?

Posted: Fri Apr 03, 2020 1:32 am
by Joe_H
The checkpoint frequency set in FAHControl only applies to CPU folding. For GPU folding the checkpoint occurs at an interval set by the researcher when they set up the project. Typically the interval is between 2 and 5%.

As you have not posted any part of your log, we have no idea which project you have, or what kind of GPU it is being processed on.

Re: How to abort a way to big WU?

Posted: Fri Apr 03, 2020 1:48 am
by bruce
Suppose you receive a new (to you) WU assignment for your GPU. No estimate can be made until we know the actual time between starting to fold it and reaching the first checkpoint. Suppose the researcher has set the first checkpoint at 2% and it takes your machine 2 hrs to get to 2%. The estimated duration of the WU is therefore 100 hrs for 100%, or just over 4 days to complete the WU.
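The arithmetic above is a plain linear extrapolation. Here is a minimal Python sketch of it; the function name is made up for illustration and this is not how FAHClient itself computes its ETA.

```python
def estimate_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Scale the time spent so far up to 100% (simple linear extrapolation)."""
    if percent_done <= 0:
        # Before the first checkpoint there is no data to extrapolate from.
        raise ValueError("no progress yet, nothing to extrapolate from")
    return elapsed_hours * 100.0 / percent_done


# The example from the post: 2 hrs to reach the 2% checkpoint.
total_hours = estimate_total_hours(2.0, 2.0)
print(f"{total_hours:.0f} hrs, about {total_hours / 24:.1f} days")
```

The `ValueError` branch is the situation described in the next paragraph: with 0% done there is nothing to scale, so some other guess has to be used.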

But what estimate should FAH post BEFORE completing that first 2 hrs of work? You have completed an unknown amount of progress in an unknown amount of time. FAHClient needs to make a wild guess. All you know is that somebody set a deadline of 7 days and your machine has started working on it. Eureka: How about guessing that it will be completed by the deadline? It's just an estimate anyway. Nobody will assume it's accurate and it will be refined in a couple of hours anyway when there is some ACTUAL data on which to make a better estimate.

All NEW WUs start out estimating you'll finish by the deadline. If you eventually get another assignment for the same project some time in the future, the initial estimated rate of progress will be based on this first WU and it will be reasonably accurate even though there is no actual data for that second WU.

If you receive a new WU for your CPU, however, the first checkpoint will be after 15 minutes of processing, so estimated durations of CPU assignments converge to actual numbers much more quickly.

Re: How to abort a way to big WU?

Posted: Fri Apr 03, 2020 10:46 pm
by STFC9F22
Hi,
I have no authoritative knowledge of the workings of the software; however, I do have some years' experience of running the Advanced Control in Windows.

Regarding the pause process causing work to be lost, I believe the replies above to be incorrect (at least for my environment, Windows 10 using the Advanced Control). For me work is not actually lost on Pause although the Advanced Control software might give that impression. When pause is pressed the Progress figure typically reverts to the whole percentage figure below the current progress e.g. 89.76% changes to read 89.00%. My assumption is that this corresponds to the previous checkpoint (for GPUs as set by the project, overriding the checkpoint frequency value in the configuration settings as Bruce explains above). However, it seems to me that Pause closes down gracefully by either saving the current state or, more likely, the incremental change from the checkpoint. When monitoring the Progress figure on resuming from pause, although the Progress does initially report the lower whole percentage figure this rapidly (within seconds) reverts to the full progress figure (or slightly higher) shown at the point of Pause.

My understanding is that progress only fully reverts to the previous checkpoint in an uncontrolled shutdown and as such circumstances for me are extremely rare I set my configuration to the maximum 30 minutes (albeit as Bruce states this configuration setting only applies to CPU Work Units).

EDIT: *** Apologies***

In the light of Joe’s post below I have checked and agree that what I have posted applies only to CPU Work Units and that work is in fact lost by pausing GPU Work Units.

In the case of CPU Work Units pressing pause causes a fresh checkpoint to be taken when the processing stops. Although the displayed Progress percentage is initially truncated to a whole number, on resumption of processing the process resumes from the fractional percentage at which it was paused. In Windows this can be seen in the ‘FAHClient -> Work -> <Work Queue ID> -> 01’ folder where on pause the file timestamps indicate that the ‘state.cpt’ file is copied to the ‘state_prev.cpt’ file and the ‘state.cpt’ file is updated.

In the case of GPU Work Units a different scheme applies, where pause does not trigger a fresh checkpoint. Again the displayed Progress is initially truncated to a whole number, but if that does not correspond to the last checkpoint, processing will resume from a lower percentage (as Joe advises, a multiple of 5%). For GPU Work Units in Windows the timestamp on the ‘checkpointState.xml’ file in the ‘FAHClient -> Work -> <Work Queue ID> -> 01’ folder appears to show the time of the most recent checkpoint, giving an indication of how much processing time would be lost by pressing pause.
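The timestamp check can be scripted. A minimal Python sketch, assuming the file's modification time really does track the most recent checkpoint (the post only says it "appears to"); the function name and the example path are illustrative and will differ per install and work queue ID.

```python
import os
import time
from typing import Optional


def seconds_since_checkpoint(path: str, now: Optional[float] = None) -> float:
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


# Hypothetical location; adjust to your own work folder.
checkpoint = "/var/lib/fahclient/work/01/checkpointState.xml"
if os.path.exists(checkpoint):
    minutes = seconds_since_checkpoint(checkpoint) / 60
    print(f"Pausing now would discard roughly {minutes:.0f} minutes of GPU work")
```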

Re: How to abort a way to big WU?

Posted: Sat Apr 04, 2020 4:30 am
by Joe_H
Your understanding only applies to CPU WUs. As I explained above, for GPU WUs the checkpoint frequency is set by the researcher. If it is set for every 5% and you pause at any point, the GPU WU will restart at the last 5% checkpoint.
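As a rough illustration of that rule, here is a small Python sketch that rounds the current progress down to the last checkpoint. The function name is invented and the interval is whatever the project uses; this only mirrors the behaviour described here, not the core's actual code.

```python
import math


def resume_percent(progress: float, interval: float) -> float:
    """Round `progress` down to the nearest multiple of the checkpoint interval."""
    if interval <= 0:
        raise ValueError("checkpoint interval must be positive")
    return math.floor(progress / interval) * interval


# Pausing a GPU WU at 89.76% with checkpoints every 5%:
print(resume_percent(89.76, 5))  # 85
# The same pause with checkpoints every 2%:
print(resume_percent(89.76, 2))  # 88
```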

If you want to verify this, look at the actual figures entered into the log file.

Re: How to abort a way to big WU?

Posted: Wed Apr 15, 2020 6:03 pm
by asdhjsdjdj
Thank you for your replies and sorry I didn't post the logs earlier.

So I allowed the Client to finish the unit and it did indeed take a week to do so.
I'll paste the last log of the GPU WU and try to remove as much as possible of the log about the CPU WUs that were processed at the same time.


*********************** Log Started 2020-04-07T06:00:59Z ***********************
06:00:59:************************* Folding@home Client *************************
06:00:59:      Website: https://foldingathome.org/
06:00:59:    Copyright: (c) 2009-2018 foldingathome.org
06:00:59:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
06:00:59:         Args: --child --lifeline 1812 /etc/fahclient/config.xml --run-as
06:00:59:               fahclient --pid-file=/var/run/fahclient.pid --daemon
06:00:59:       Config: /etc/fahclient/config.xml
06:00:59:******************************** Build ********************************
06:00:59:      Version: 7.5.1
06:00:59:         Date: May 11 2018
06:00:59:         Time: 19:59:04
06:00:59:   Repository: Git
06:00:59:     Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
06:00:59:       Branch: master
06:00:59:     Compiler: GNU 6.3.0 20170516
06:00:59:      Options: -std=gnu++98 -O3 -funroll-loops
06:00:59:     Platform: linux2 4.14.0-3-amd64
06:00:59:         Bits: 64
06:00:59:         Mode: Release
06:00:59:******************************* System ********************************
06:00:59:          CPU: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
06:00:59:       CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
06:00:59:         CPUs: 4
06:00:59:       Memory: 7.62GiB
06:00:59:  Free Memory: 6.81GiB
06:00:59:      Threads: POSIX_THREADS
06:00:59:   OS Version: 4.15
06:00:59:  Has Battery: false
06:00:59:   On Battery: false
06:00:59:   UTC Offset: 2
06:00:59:          PID: 1814
06:00:59:          CWD: /var/lib/fahclient
06:00:59:           OS: Linux 4.15.0-91-generic x86_64
06:00:59:      OS Arch: AMD64
06:00:59:         GPUs: 1
06:00:59:        GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:2 GF119 [GeForce GT 610]
06:00:59:CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:2.1 Driver:9.0
06:00:59:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
06:00:59:               libOpenCL.so: cannot open shared object file: No such file or
06:00:59:               directory
06:00:59:***********************************************************************
06:00:59:<config>
06:00:59:  <!-- Client Control -->
06:00:59:  <fold-anon v='true'/>
06:00:59:
06:00:59:  <!-- Network -->
06:00:59:  <proxy v=':8080'/>
06:00:59:
06:00:59:  <!-- Slot Control -->
06:00:59:  <power v='full'/>
06:00:59:
06:00:59:  <!-- User Information -->
06:00:59:  <passkey v='********************************'/>
06:00:59:  <team v='256476'/>
06:00:59:  <user v='equaldividing'/>
06:00:59:
06:00:59:  <!-- Folding Slots -->
06:00:59:  <slot id='0' type='CPU'/>
06:00:59:  <slot id='1' type='GPU'>
06:00:59:    <gpu-index v='0'/>
06:00:59:    <opencl-index v='0'/>
06:00:59:  </slot>
06:00:59:</config>
06:00:59:Switching to user fahclient
06:00:59:Trying to access database...
06:00:59:Successfully acquired database lock
06:00:59:Enabled folding slot 00: READY cpu:3
06:00:59:Enabled folding slot 01: READY gpu:0:GF119 [GeForce GT 610]
06:00:59:WU01:FS01:Starting
06:00:59:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 705 -lifeline 1814 -checkpoint 15 -gpu-vendor nvidia -opencl-device 0 -cuda-device 0 -gpu 0
06:00:59:WU01:FS01:Started FahCore on PID 1824
06:00:59:WU01:FS01:Core PID:1828
06:00:59:WU01:FS01:FahCore 0x22 started
06:01:00:WU01:FS01:0x22:*********************** Log Started 2020-04-07T06:00:59Z ***********************
06:01:00:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
06:01:00:WU01:FS01:0x22:       Type: 0x22
06:01:00:WU01:FS01:0x22:       Core: Core22
06:01:00:WU01:FS01:0x22:    Website: https://foldingathome.org/
06:01:00:WU01:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
06:01:00:WU01:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
06:01:00:WU01:FS01:0x22:             <rafal.wiewiora@choderalab.org>
06:01:00:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 705 -lifeline 1824 -checkpoint 15
06:01:00:WU01:FS01:0x22:             -gpu-vendor nvidia -opencl-device 0 -cuda-device 0 -gpu 0
06:01:00:WU01:FS01:0x22:     Config: <none>
06:01:00:WU01:FS01:0x22:************************************ Build *************************************
06:01:00:WU01:FS01:0x22:    Version: 0.0.2
06:01:00:WU01:FS01:0x22:       Date: Dec 6 2019
06:01:00:WU01:FS01:0x22:       Time: 21:20:17
06:01:00:WU01:FS01:0x22: Repository: Git
06:01:00:WU01:FS01:0x22:   Revision: f87d92b58abdf7e6bf2e173cfbc4dc3e837c7042
06:01:00:WU01:FS01:0x22:     Branch: core22
06:01:00:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
06:01:00:WU01:FS01:0x22:    Options: -std=gnu++98 -O3 -funroll-loops
06:01:00:WU01:FS01:0x22:   Platform: linux2 4.9.87-linuxkit-aufs
06:01:00:WU01:FS01:0x22:       Bits: 64
06:01:00:WU01:FS01:0x22:       Mode: Release
06:01:00:WU01:FS01:0x22:************************************ System ************************************
06:01:00:WU01:FS01:0x22:        CPU: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
06:01:00:WU01:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
06:01:00:WU01:FS01:0x22:       CPUs: 4
06:01:00:WU01:FS01:0x22:     Memory: 7.62GiB
06:01:00:WU01:FS01:0x22:Free Memory: 6.80GiB
06:01:00:WU01:FS01:0x22:    Threads: POSIX_THREADS
06:01:00:WU01:FS01:0x22: OS Version: 4.15
06:01:00:WU01:FS01:0x22:Has Battery: false
06:01:00:WU01:FS01:0x22: On Battery: false
06:01:00:WU01:FS01:0x22: UTC Offset: 2
06:01:00:WU01:FS01:0x22:        PID: 1828
06:01:00:WU01:FS01:0x22:        CWD: /var/lib/fahclient/work
06:01:00:WU01:FS01:0x22:         OS: Linux 4.15.0-91-generic x86_64
06:01:00:WU01:FS01:0x22:    OS Arch: AMD64
06:01:00:WU01:FS01:0x22:********************************************************************************
06:01:00:WU01:FS01:0x22:Project: 11779 (Run 0, Clone 4590, Gen 16)
06:01:00:WU01:FS01:0x22:Unit: 0x0000001b0d5a98395e73c5994165200f
06:01:00:WU01:FS01:0x22:Digital signatures verified
06:01:00:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
06:01:00:WU01:FS01:0x22:Version 0.0.2
06:01:00:WU01:FS01:0x22:  Found a checkpoint file
06:01:15:WU01:FS01:0x22:Completed 600000 out of 1000000 steps (60%)
06:01:15:WU01:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
08:10:34:WU01:FS01:0x22:Completed 620000 out of 1000000 steps (62%)
09:17:17:WU01:FS01:0x22:Completed 630000 out of 1000000 steps (63%)
10:23:09:WU01:FS01:0x22:Completed 640000 out of 1000000 steps (64%)
11:28:17:WU01:FS01:0x22:Completed 650000 out of 1000000 steps (65%)
******************************* Date: 2020-04-07 *******************************
12:33:16:WU01:FS01:0x22:Completed 660000 out of 1000000 steps (66%)
13:38:00:WU01:FS01:0x22:Completed 670000 out of 1000000 steps (67%)
14:42:44:WU01:FS01:0x22:Completed 680000 out of 1000000 steps (68%)
17:56:50:WU01:FS01:0x22:Completed 710000 out of 1000000 steps (71%)
******************************* Date: 2020-04-07 *******************************
19:01:25:WU01:FS01:0x22:Completed 720000 out of 1000000 steps (72%)
21:10:50:WU01:FS01:0x22:Completed 740000 out of 1000000 steps (74%)
22:15:34:WU01:FS01:0x22:Completed 750000 out of 1000000 steps (75%)
23:20:25:WU01:FS01:0x22:Completed 760000 out of 1000000 steps (76%)
00:25:09:WU01:FS01:0x22:Completed 770000 out of 1000000 steps (77%)
******************************* Date: 2020-04-08 *******************************
01:29:57:WU01:FS01:0x22:Completed 780000 out of 1000000 steps (78%)
02:34:43:WU01:FS01:0x22:Completed 790000 out of 1000000 steps (79%)
03:39:31:WU01:FS01:0x22:Completed 800000 out of 1000000 steps (80%)
04:44:21:WU01:FS01:0x22:Completed 810000 out of 1000000 steps (81%)
05:49:04:WU01:FS01:0x22:Completed 820000 out of 1000000 steps (82%)
06:53:46:WU01:FS01:0x22:Completed 830000 out of 1000000 steps (83%)
******************************* Date: 2020-04-08 *******************************
07:58:27:WU01:FS01:0x22:Completed 840000 out of 1000000 steps (84%)
09:03:08:WU01:FS01:0x22:Completed 850000 out of 1000000 steps (85%)
10:07:54:WU01:FS01:0x22:Completed 860000 out of 1000000 steps (86%)
11:12:42:WU01:FS01:0x22:Completed 870000 out of 1000000 steps (87%)
12:17:28:WU01:FS01:0x22:Completed 880000 out of 1000000 steps (88%)
******************************* Date: 2020-04-08 *******************************
13:22:14:WU01:FS01:0x22:Completed 890000 out of 1000000 steps (89%)
14:27:02:WU01:FS01:0x22:Completed 900000 out of 1000000 steps (90%)
15:31:51:WU01:FS01:0x22:Completed 910000 out of 1000000 steps (91%)
16:36:37:WU01:FS01:0x22:Completed 920000 out of 1000000 steps (92%)
17:42:09:WU01:FS01:0x22:Completed 930000 out of 1000000 steps (93%)
18:46:44:WU01:FS01:0x22:Completed 940000 out of 1000000 steps (94%)
******************************* Date: 2020-04-08 *******************************
19:51:22:WU01:FS01:0x22:Completed 950000 out of 1000000 steps (95%)
20:56:02:WU01:FS01:0x22:Completed 960000 out of 1000000 steps (96%)
22:00:39:WU01:FS01:0x22:Completed 970000 out of 1000000 steps (97%)
22:23:51:FS01:Finishing
23:06:08:WU01:FS01:0x22:Completed 980000 out of 1000000 steps (98%)
00:11:48:WU01:FS01:0x22:Completed 990000 out of 1000000 steps (99%)
******************************* Date: 2020-04-09 *******************************
01:18:33:WU01:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
01:18:36:WU01:FS01:0x22:Saving result file ../logfile_01.txt
01:18:36:WU01:FS01:0x22:Saving result file checkpointState.xml
01:18:36:WU01:FS01:0x22:Saving result file checkpt.crc
01:18:36:WU01:FS01:0x22:Saving result file positions.xtc
01:18:36:WU01:FS01:0x22:Saving result file science.log
01:18:36:WU01:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
01:18:37:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
01:18:37:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:11779 run:0 clone:4590 gen:16 core:0x22 unit:0x0000001b0d5a98395e73c5994165200f
01:18:37:WU01:FS01:Uploading 33.04MiB to 13.90.152.57
01:18:37:WU01:FS01:Connecting to 13.90.152.57:8080
01:18:57:WU01:FS01:Upload 0.76%
01:19:03:WU01:FS01:Upload 3.40%
01:19:09:WU01:FS01:Upload 6.05%
01:19:16:WU01:FS01:Upload 8.32%
01:19:23:WU01:FS01:Upload 10.21%
01:19:29:WU01:FS01:Upload 12.86%
01:19:35:WU01:FS01:Upload 14.94%
01:19:41:WU01:FS01:Upload 17.02%
01:19:47:WU01:FS01:Upload 19.29%
01:19:53:WU01:FS01:Upload 21.37%
01:19:59:WU01:FS01:Upload 23.64%
01:20:05:WU01:FS01:Upload 25.53%
01:20:11:WU01:FS01:Upload 27.42%
01:20:17:WU01:FS01:Upload 29.69%
01:20:23:WU01:FS01:Upload 31.96%
01:20:29:WU01:FS01:Upload 33.48%
01:20:35:WU01:FS01:Upload 35.18%
01:20:41:WU01:FS01:Upload 37.83%
01:20:47:WU01:FS01:Upload 39.91%
01:20:53:WU01:FS01:Upload 41.80%
01:20:59:WU01:FS01:Upload 42.18%
01:21:05:WU01:FS01:Upload 44.83%
01:21:11:WU01:FS01:Upload 47.28%
01:21:17:WU01:FS01:Upload 48.80%
01:21:23:WU01:FS01:Upload 51.45%
01:21:29:WU01:FS01:Upload 53.71%
01:21:35:WU01:FS01:Upload 55.98%
01:21:42:WU01:FS01:Upload 58.25%
01:21:48:WU01:FS01:Upload 59.77%
01:21:54:WU01:FS01:Upload 61.85%
01:22:00:WU01:FS01:Upload 64.87%
01:22:06:WU01:FS01:Upload 67.33%
01:22:12:WU01:FS01:Upload 69.79%
01:22:18:WU01:FS01:Upload 72.06%
01:22:24:WU01:FS01:Upload 74.52%
01:22:30:WU01:FS01:Upload 76.79%
01:22:36:WU01:FS01:Upload 78.87%
01:22:42:WU01:FS01:Upload 81.90%
01:22:48:WU01:FS01:Upload 84.73%
01:22:54:WU01:FS01:Upload 87.57%
01:23:00:WU01:FS01:Upload 90.41%
01:23:06:WU01:FS01:Upload 92.30%
01:23:13:WU01:FS01:Upload 94.57%
01:23:19:WU01:FS01:Upload 96.84%
01:23:26:WU01:FS01:Upload 98.35%
01:23:33:WU01:FS01:Upload complete
01:23:33:WU01:FS01:Server responded WORK_ACK (400)
01:23:33:WU01:FS01:Final credit estimate, 7622.00 points
01:23:33:WU01:FS01:Cleaning up
01:30:59:Lost lifeline PID 1812, exiting
01:31:00:Clean exit
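For anyone who wants to check the pace from a log like this, here is a minimal Python sketch that takes two "Completed ... steps" lines and extrapolates to the whole WU. The regular expression is written against the line format shown above and the helper names are made up. It works from step counts, so gaps from removed lines do not matter, but it assumes the two lines are less than 24 hours apart and that folding was not paused in between.

```python
import re

# Matches Core22 progress lines in the format shown in the log above.
PROGRESS = re.compile(
    r"^(\d\d):(\d\d):(\d\d):WU\d+:FS\d+:0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)


def parse_progress(line):
    """Return (seconds since midnight, steps done, total steps) for one line."""
    m = PROGRESS.match(line)
    if m is None:
        raise ValueError(f"not a progress line: {line!r}")
    h, mi, s, done, total = (int(g) for g in m.groups())
    return h * 3600 + mi * 60 + s, done, total


def seconds_per_step(earlier, later):
    """Average seconds per simulation step between two progress lines."""
    t1, done1, _ = parse_progress(earlier)
    t2, done2, _ = parse_progress(later)
    if done2 <= done1:
        raise ValueError("the later line must show more completed steps")
    elapsed = (t2 - t1) % 86400  # copes with one wrap past midnight
    return elapsed / (done2 - done1)


def hours_for_full_wu(earlier, later):
    """Extrapolate the pace between two lines to the total step count."""
    _, _, total = parse_progress(later)
    return seconds_per_step(earlier, later) * total / 3600


a = "09:17:17:WU01:FS01:0x22:Completed 630000 out of 1000000 steps (63%)"
b = "10:23:09:WU01:FS01:0x22:Completed 640000 out of 1000000 steps (64%)"
print(f"{hours_for_full_wu(a, b):.1f} hours for the whole WU at this pace")
```

With those two lines the pace is 3952 seconds per 1%, which works out to roughly 110 hours of uninterrupted folding for all 1,000,000 steps.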

Re: How to abort a way to big WU?

Posted: Wed Apr 15, 2020 6:10 pm
by HaloJones
Hate to say it but a week's folding to produce 7622 points doesn't seem like it's worth it. That's around 1000ppd. For reference, if you dropped $200 on a used 1070 you could get 800000ppd so 800x as much.
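The rough numbers can be checked in a few lines of Python. This is only a sketch: the 7 days is the duration reported in this thread, the 800000 PPD figure is the estimate quoted for a 1070, and the function name is made up.

```python
def points_per_day(points: float, days: float) -> float:
    """Average points per day (PPD) over the time a WU took."""
    if days <= 0:
        raise ValueError("days must be positive")
    return points / days


gt610_ppd = points_per_day(7622, 7)  # credit from the log, one week of folding
ratio = 800_000 / gt610_ppd          # versus the quoted estimate for a 1070
print(f"about {gt610_ppd:.0f} PPD, a factor of roughly {ratio:.0f}")
```

Without rounding 1089 PPD down to 1000, the factor comes out a little under 800, which does not change the conclusion.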

Re: How to abort a way to big WU?

Posted: Wed Apr 15, 2020 6:18 pm
by JimboPalmer
I would just remove the GPU slot.

It does not support OpenCL 1.2 and so should not be getting work. It is about 7 times slower than the weakest 'new' graphics card you can buy, the GT 1030.

Removing the GPU slot may free up one more CPU to fold.

Re: How to abort a way to big WU?

Posted: Wed Apr 15, 2020 6:27 pm
by bruce
Adding the "he asked to pause, but I'm going to make a checkpoint first" function sounds like a great idea but it seems to be causing corrupt checkpoint files.

When you tell Windows to shut down, it notifies the running processes and many shut down immediately. The ones that are left may continue to process for a few moments, OR the program may issue a popup like "You need to save your work".

Windows waits a predetermined time and then kills the remaining processes. Unfortunately sometimes FAH takes too long syncing parts of the FAHCore internal processes before starting to write the checkpoint. Killing the FAHCore during the internal processing, the closing of the files and/or syncing the cache to disk at that point in time is very dangerous.

Re: How to abort a way to big WU?

Posted: Wed Apr 15, 2020 9:39 pm
by asdhjsdjdj
I just removed the GPU slot as suggested and that also aborted the current WU. ^^
Seems like the GPU is older than I remembered it to be xD

Thanks for your help!