99.99% Bug

Posted: Wed Jun 27, 2018 5:04 am
by Aurum
The 99.99% bug is still not being handled.

Code:

03:53:36:WU00:FS00:0x21:*********************** Log Started 2018-06-26T03:53:35Z ***********************
03:53:36:WU00:FS00:0x21:Project: 11713 (Run 18, Clone 226, Gen 91)
03:53:36:WU00:FS00:0x21:Unit: 0x0000007b8ca304e75adf7a96e85d66f4
03:53:36:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
03:53:36:WU00:FS00:0x21:Machine: 0
03:53:36:WU00:FS00:0x21:Reading tar file core.xml
03:53:36:WU00:FS00:0x21:Reading tar file integrator.xml
03:53:36:WU00:FS00:0x21:Reading tar file state.xml
03:53:36:WU00:FS00:0x21:Reading tar file system.xml
03:53:36:WU00:FS00:0x21:Digital signatures verified
03:53:36:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
03:53:36:WU00:FS00:0x21:Version 0.0.18
03:53:48:WU00:FS00:0x21:Completed 0 out of 7500000 steps (0%)
03:53:48:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
03:55:26:WU00:FS00:0x21:Completed 75000 out of 7500000 steps (1%)
03:57:08:WU00:FS00:0x21:Completed 150000 out of 7500000 steps (2%)
03:58:52:WU00:FS00:0x21:Completed 225000 out of 7500000 steps (3%)
04:00:40:WU00:FS00:0x21:Completed 300000 out of 7500000 steps (4%)
04:02:26:WU00:FS00:0x21:Completed 375000 out of 7500000 steps (5%)
04:04:12:WU00:FS00:0x21:Completed 450000 out of 7500000 steps (6%)
04:06:00:WU00:FS00:0x21:Completed 525000 out of 7500000 steps (7%)
04:07:46:WU00:FS00:0x21:Completed 600000 out of 7500000 steps (8%)
04:09:31:WU00:FS00:0x21:Completed 675000 out of 7500000 steps (9%)
04:11:18:WU00:FS00:0x21:Completed 750000 out of 7500000 steps (10%)
04:13:06:WU00:FS00:0x21:Completed 825000 out of 7500000 steps (11%)
04:14:53:WU00:FS00:0x21:Completed 900000 out of 7500000 steps (12%)
04:16:39:WU00:FS00:0x21:Completed 975000 out of 7500000 steps (13%)
04:18:27:WU00:FS00:0x21:Completed 1050000 out of 7500000 steps (14%)
04:20:13:WU00:FS00:0x21:Completed 1125000 out of 7500000 steps (15%)
04:21:59:WU00:FS00:0x21:Completed 1200000 out of 7500000 steps (16%)
04:23:46:WU00:FS00:0x21:Completed 1275000 out of 7500000 steps (17%)
04:25:33:WU00:FS00:0x21:Completed 1350000 out of 7500000 steps (18%)
04:27:19:WU00:FS00:0x21:Completed 1425000 out of 7500000 steps (19%)
04:29:04:WU00:FS00:0x21:Completed 1500000 out of 7500000 steps (20%)
04:30:54:WU00:FS00:0x21:Completed 1575000 out of 7500000 steps (21%)
04:32:39:WU00:FS00:0x21:Completed 1650000 out of 7500000 steps (22%)
04:34:24:WU00:FS00:0x21:Completed 1725000 out of 7500000 steps (23%)
04:36:11:WU00:FS00:0x21:Completed 1800000 out of 7500000 steps (24%)
04:37:56:WU00:FS00:0x21:Completed 1875000 out of 7500000 steps (25%)
04:39:42:WU00:FS00:0x21:Completed 1950000 out of 7500000 steps (26%)
04:41:29:WU00:FS00:0x21:Completed 2025000 out of 7500000 steps (27%)
04:43:14:WU00:FS00:0x21:Completed 2100000 out of 7500000 steps (28%)
04:45:00:WU00:FS00:0x21:Completed 2175000 out of 7500000 steps (29%)
04:46:46:WU00:FS00:0x21:Completed 2250000 out of 7500000 steps (30%)
04:48:34:WU00:FS00:0x21:Completed 2325000 out of 7500000 steps (31%)
******************************* Date: 2018-06-26 *******************************
03:53:48:WU00:FS00:0x21:Completed 0 out of 7500000 steps (0%)
03:53:48:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
03:55:26:WU00:FS00:0x21:Completed 75000 out of 7500000 steps (1%)
03:57:08:WU00:FS00:0x21:Completed 150000 out of 7500000 steps (2%)
03:58:52:WU00:FS00:0x21:Completed 225000 out of 7500000 steps (3%)
04:00:40:WU00:FS00:0x21:Completed 300000 out of 7500000 steps (4%)
04:02:26:WU00:FS00:0x21:Completed 375000 out of 7500000 steps (5%)
04:04:12:WU00:FS00:0x21:Completed 450000 out of 7500000 steps (6%)
04:06:00:WU00:FS00:0x21:Completed 525000 out of 7500000 steps (7%)
04:07:46:WU00:FS00:0x21:Completed 600000 out of 7500000 steps (8%)
04:09:31:WU00:FS00:0x21:Completed 675000 out of 7500000 steps (9%)
04:11:18:WU00:FS00:0x21:Completed 750000 out of 7500000 steps (10%)
04:13:06:WU00:FS00:0x21:Completed 825000 out of 7500000 steps (11%)
04:14:53:WU00:FS00:0x21:Completed 900000 out of 7500000 steps (12%)
04:16:39:WU00:FS00:0x21:Completed 975000 out of 7500000 steps (13%)
04:18:27:WU00:FS00:0x21:Completed 1050000 out of 7500000 steps (14%)
04:20:13:WU00:FS00:0x21:Completed 1125000 out of 7500000 steps (15%)
04:21:59:WU00:FS00:0x21:Completed 1200000 out of 7500000 steps (16%)
04:23:46:WU00:FS00:0x21:Completed 1275000 out of 7500000 steps (17%)
04:25:33:WU00:FS00:0x21:Completed 1350000 out of 7500000 steps (18%)
04:27:19:WU00:FS00:0x21:Completed 1425000 out of 7500000 steps (19%)
04:29:04:WU00:FS00:0x21:Completed 1500000 out of 7500000 steps (20%)
04:30:54:WU00:FS00:0x21:Completed 1575000 out of 7500000 steps (21%)
04:32:39:WU00:FS00:0x21:Completed 1650000 out of 7500000 steps (22%)
04:34:24:WU00:FS00:0x21:Completed 1725000 out of 7500000 steps (23%)
04:36:11:WU00:FS00:0x21:Completed 1800000 out of 7500000 steps (24%)
04:37:56:WU00:FS00:0x21:Completed 1875000 out of 7500000 steps (25%)
04:39:42:WU00:FS00:0x21:Completed 1950000 out of 7500000 steps (26%)
04:41:29:WU00:FS00:0x21:Completed 2025000 out of 7500000 steps (27%)
04:43:14:WU00:FS00:0x21:Completed 2100000 out of 7500000 steps (28%)
04:45:00:WU00:FS00:0x21:Completed 2175000 out of 7500000 steps (29%)
04:46:46:WU00:FS00:0x21:Completed 2250000 out of 7500000 steps (30%)
04:48:34:WU00:FS00:0x21:Completed 2325000 out of 7500000 steps (31%)
******************************* Date: 2018-06-26 *******************************
04:04:12:WU00:FS00:0x21:Completed 450000 out of 7500000 steps (6%)
04:06:00:WU00:FS00:0x21:Completed 525000 out of 7500000 steps (7%)
04:07:46:WU00:FS00:0x21:Completed 600000 out of 7500000 steps (8%)
04:09:31:WU00:FS00:0x21:Completed 675000 out of 7500000 steps (9%)
04:11:18:WU00:FS00:0x21:Completed 750000 out of 7500000 steps (10%)
04:13:06:WU00:FS00:0x21:Completed 825000 out of 7500000 steps (11%)
04:14:53:WU00:FS00:0x21:Completed 900000 out of 7500000 steps (12%)
04:16:39:WU00:FS00:0x21:Completed 975000 out of 7500000 steps (13%)
04:18:27:WU00:FS00:0x21:Completed 1050000 out of 7500000 steps (14%)
04:20:13:WU00:FS00:0x21:Completed 1125000 out of 7500000 steps (15%)
04:21:59:WU00:FS00:0x21:Completed 1200000 out of 7500000 steps (16%)
04:23:46:WU00:FS00:0x21:Completed 1275000 out of 7500000 steps (17%)
04:25:33:WU00:FS00:0x21:Completed 1350000 out of 7500000 steps (18%)
04:27:19:WU00:FS00:0x21:Completed 1425000 out of 7500000 steps (19%)
04:29:04:WU00:FS00:0x21:Completed 1500000 out of 7500000 steps (20%)
04:30:54:WU00:FS00:0x21:Completed 1575000 out of 7500000 steps (21%)
04:32:39:WU00:FS00:0x21:Completed 1650000 out of 7500000 steps (22%)
04:34:24:WU00:FS00:0x21:Completed 1725000 out of 7500000 steps (23%)
04:36:11:WU00:FS00:0x21:Completed 1800000 out of 7500000 steps (24%)
04:37:56:WU00:FS00:0x21:Completed 1875000 out of 7500000 steps (25%)
04:39:42:WU00:FS00:0x21:Completed 1950000 out of 7500000 steps (26%)
04:41:29:WU00:FS00:0x21:Completed 2025000 out of 7500000 steps (27%)
04:43:14:WU00:FS00:0x21:Completed 2100000 out of 7500000 steps (28%)
04:45:00:WU00:FS00:0x21:Completed 2175000 out of 7500000 steps (29%)
04:46:46:WU00:FS00:0x21:Completed 2250000 out of 7500000 steps (30%)
04:48:34:WU00:FS00:0x21:Completed 2325000 out of 7500000 steps (31%)
******************************* Date: 2018-06-26 *******************************
04:39:42:WU00:FS00:0x21:Completed 1950000 out of 7500000 steps (26%)
04:41:29:WU00:FS00:0x21:Completed 2025000 out of 7500000 steps (27%)
04:43:14:WU00:FS00:0x21:Completed 2100000 out of 7500000 steps (28%)
04:45:00:WU00:FS00:0x21:Completed 2175000 out of 7500000 steps (29%)
04:46:46:WU00:FS00:0x21:Completed 2250000 out of 7500000 steps (30%)
04:48:34:WU00:FS00:0x21:Completed 2325000 out of 7500000 steps (31%)
******************************* Date: 2018-06-26 *******************************
******************************* Date: 2018-06-26 *******************************
04:39:42:WU00:FS00:0x21:Completed 1950000 out of 7500000 steps (26%)
04:41:29:WU00:FS00:0x21:Completed 2025000 out of 7500000 steps (27%)
04:43:14:WU00:FS00:0x21:Completed 2100000 out of 7500000 steps (28%)
04:45:00:WU00:FS00:0x21:Completed 2175000 out of 7500000 steps (29%)
04:46:46:WU00:FS00:0x21:Completed 2250000 out of 7500000 steps (30%)
04:48:34:WU00:FS00:0x21:Completed 2325000 out of 7500000 steps (31%)
******************************* Date: 2018-06-26 *******************************
******************************* Date: 2018-06-26 *******************************

Re: 99.99% Bug

Posted: Wed Jun 27, 2018 11:40 am
by SteveWillis
On Linux, my reboot.sh script handles that (among a good number of other problems, to keep all the folding slots folding) by automatically executing a client restart. It can be found at https://drive.google.com/drive/folders/ ... sp=sharing

Re: 99.99% Bug

Posted: Wed Jun 27, 2018 2:08 pm
by Joe_H
The most common cause of this problem is your video driver and GPU resetting, which stops the calculations on the GPU. If you are seeing this often, your system is not stable for folding, and you should check for overheating or reduce any overclock set on the GPU. You may only need to reduce the GPU memory clock; it matters less to folding speed, and lowering it cuts the card's power consumption and heat output.
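
For the overheating check on an Nvidia card, something along these lines might help. It is only a rough sketch: the 80 C default and the `flag_hot_gpus` helper are illustrative assumptions, not anything FAH provides, and a sensible limit depends on your card.

```shell
#!/bin/sh
# Hypothetical threshold in degrees C; choose one suited to your card.
MAX_TEMP=${MAX_TEMP:-80}

# Reads "index, temperature" CSV lines on stdin and prints a warning
# for every GPU whose temperature exceeds MAX_TEMP.
flag_hot_gpus() {
    while IFS=', ' read -r idx temp; do
        [ -n "$temp" ] || continue
        if [ "$temp" -gt "$MAX_TEMP" ]; then
            echo "GPU $idx is at ${temp}C (limit ${MAX_TEMP}C)"
        fi
    done
}

# Only query the driver when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu \
               --format=csv,noheader,nounits | flag_hot_gpus
fi
```

Running it every few minutes while folding (from cron, for example) would show whether a card is running hot before the work unit hangs.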

Re: 99.99% Bug

Posted: Wed Jun 27, 2018 3:54 pm
by bruce
FAH cannot do anything to make an unstable GPU into a stable one. You have to figure that out.

The best FAH can do is figure out how to dump the WU that's hung (which probably isn't what you'd like to happen, since the work you've done on that WU is probably recoverable). As Joe_H suggests, you need to reduce the overclocking or do a better job of managing the heat.

I think this is probably the best we can do: https://github.com/FoldingAtHome/fah-issues/issues/1240

Re: 99.99% Bug

Posted: Wed Jun 27, 2018 4:32 pm
by SteveWillis
When this happens all that is generally required is a client restart (or reboot if you don't know how).

My Linux client restart script. Must be run as root.

Code:

#!/bin/ksh
# Restart the FAH client, retrying up to 5 times.
# Requires ksh: sudo apt-get install ksh
if [ "$(id -u)" -ne 0 ]; then
    echo "This script must be run as root (sudo path/restartclient.sh)"
    exit 1
fi
set -x    # trace each command as it runs
for i in {1..5}
do
    systemctl stop FAHClient || true
    sleep 5
    # Force-kill any core or client process that survived the stop.
    [[ $(pgrep -c FahCore) -gt 0 ]] && pkill -e -9 FahCore
    [[ $(pgrep -c FAHClient) -gt 0 ]] && pkill -e -9 FAHClient
    sleep 5
    systemctl restart FAHClient || true
    sleep 5
    running=$(/etc/init.d/FAHClient status | grep -c "fahclient is running")
    if [ "$running" -eq 1 ]    # success
    then
        break
    fi
    sleep 10
done

exit
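
To decide when a restart is needed, one option is to watch for the "Completed ... steps" lines drying up. The following is a minimal sketch, not the detection logic in reboot.sh: the helper names, the 30 minute limit, and the log path /var/lib/fahclient/log.txt are all assumptions to adjust for your install. Log timestamps are taken to be UTC, as in the log posted above.

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the timestamp of the last "Completed ... steps" line read from stdin.
last_progress_time() {
    grep 'Completed .* out of .* steps' | tail -n 1 | cut -c1-8
}

# Seconds elapsed between two HH:MM:SS times, allowing for a midnight rollover.
elapsed_seconds() {
    then_s=$(to_seconds "$1")
    now_s=$(to_seconds "$2")
    echo $(( (now_s - then_s + 86400) % 86400 ))
}

LOG=${LOG:-/var/lib/fahclient/log.txt}   # assumed log location
LIMIT=${LIMIT:-1800}                     # hypothetical stall limit in seconds

if [ -r "$LOG" ]; then
    last=$(last_progress_time < "$LOG")
    if [ -n "$last" ]; then
        idle=$(elapsed_seconds "$last" "$(date -u +%H:%M:%S)")
        if [ "$idle" -gt "$LIMIT" ]; then
            echo "No progress for ${idle}s; the client may need a restart"
        fi
    fi
fi
```

Because the log lines carry no date, this cannot tell a stall of 30 minutes from one of 24 hours and 30 minutes, so it is only useful if it runs regularly. A slot that is paused or waiting for work would also look stalled.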

Re: 99.99% Bug

Posted: Sat Jun 30, 2018 6:07 pm
by Aurum
I never overclock or change the memory clock.
Might be a hot card, summer's here.
I'm going to use Steve's reboot script when and if I start folding again.
CURE + FLDC used to cover my $1,000-a-month electric bill, but no longer. I'm retired and can't pay that. Anyone want some sweet folding rigs at a reasonable price???

Re: 99.99% Bug

Posted: Sat Jun 30, 2018 6:28 pm
by Joe_H
Are any of your cards factory overclocked? That can be an issue at times, especially if the cooling is marginal for the ambient conditions. The card makers overclock the chips compared to the reference designs from nVidia or AMD, and they are just looking for them to be stable providing video for games and other such software. They are not usually testing for stability doing GPU number crunching.

Re: 99.99% Bug

Posted: Sat Jun 30, 2018 7:41 pm
by SteveWillis
Aurum, I wouldn't mind picking up a couple of 1080TIs at the right price. PM me if interested.