Core 22 is not ready to be on Advanced

If you think it might be a driver problem, see viewforum.php?f=79


Mstenholm
Posts: 84
Joined: Fri Oct 22, 2010 10:17 pm
Hardware configuration: 4 x GTX 970. Win 7.

Re: Core 22 on Advanced

Post by Mstenholm »

QuintLeo wrote:Core 22 is giving nothing but "bad work unit" errors on my machines that use a GTX 1050 or GTX 1050 Ti, and ONLY on those cards.
I have tried deleting the entire Core_22.fah directory (it redownloaded and started working correctly on the various GTX 1070/1080/1080 Ti cards in the same systems), with no change.

It would seem this core is NOT COMPATIBLE with the GTX 1050 or 1050 Ti for some reason.

(Update)
It may be an issue with project 11733, as I'm now also seeing some "bad work unit" errors on 1070/1080/1080 Ti cards on that project with Core 22.
I went as far as deleting ALL core directories AND work units on one Linux machine with a pair of GTX 1070s and a GTX 1080, and now the GTX 1080 card is giving constant "bad work unit" errors, and they are ALL project 11733 so far.
The other two cards in that machine are currently crunching 11733 work units though - and they SEEM to be working.

(More Update)
The card that was giving the constant "bad work units" has now gone to "failed", and a search through the log shows that all of the bad work units were complaining of "Bad State detected" and "Particle coordinate is nan". I'm STILL getting those messages added to the log for that card even though it's in "failed" mode.


For reference - this machine HAD been working reliably before this core 22 showed up, is NOT and NEVER HAS BEEN overclocked, and the cards normally run around 55-60 C when actively working on a WU.
Same for the other machines I'm having issues with - out of all of my machines (which are ALL 3-card setups that have been STABLE for many months) I only have ONE that is still running all 3 cards reliably.
Same problem here with a 2060: no OC, 80% power limit, 50 C max. I get no errors with a 40 MHz OC and 100% power limit on any other WUs. Three bad states, and the WU ended at 28%.
11:24:39:WU00:FS00:Downloading core from http://cores.foldingathome.org/Win32/AM ... ore_22.fah
11:24:39:WU00:FS00:Connecting to cores.foldingathome.org:80
11:24:39:WU00:FS00:FahCore 22: Downloading 3.97MiB
11:24:43:WU00:FS00:FahCore 22: Download complete
11:24:43:WU00:FS00:Valid core signature
11:24:43:WU00:FS00:Unpacked 13.22MiB to cores/cores.foldingathome.org/Win32/AMD64/NVIDIA/Fermi/Core_22.fah/FahCore_22.exe
11:24:43:WU00:FS00:Starting
.
.
11:51:08:WU00:FS00:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
11:51:08:WU00:FS00:0x22:Following exception occured: Particle coordinate is nan
11:52:07:WU00:FS00:0x22:Completed 1050000 out of 5000000 steps (21%)
11:53:07:WU00:FS00:0x22:Completed 1100000 out of 5000000 steps (22%)
11:54:06:WU00:FS00:0x22:Completed 1150000 out of 5000000 steps (23%)
11:55:06:WU00:FS00:0x22:Completed 1200000 out of 5000000 steps (24%)
11:56:06:WU00:FS00:0x22:Completed 1250000 out of 5000000 steps (25%)
11:57:05:WU00:FS00:0x22:Completed 1300000 out of 5000000 steps (26%)
11:58:02:WU00:FS00:0x22:Completed 1350000 out of 5000000 steps (27%)
11:59:00:WU00:FS00:0x22:Completed 1400000 out of 5000000 steps (28%)
11:59:23:WU00:FS00:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
11:59:23:WU00:FS00:0x22:Following exception occured: Particle coordinate is nan
11:59:23:WU00:FS00:0x22:ERROR:114: Max Retries Reached
11:59:23:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
11:59:23:WU00:FS00:0x22:Saving result file badstate-0.xml
11:59:23:WU00:FS00:0x22:Saving result file badstate-1.xml
11:59:23:WU00:FS00:0x22:Saving result file badstate-2.xml
11:59:23:WU00:FS00:0x22:Saving result file checkpointState.xml
11:59:23:WU00:FS00:0x22:Saving result file checkpt.crc
11:59:23:WU00:FS00:0x22:Saving result file positions.xtc
11:59:23:WU00:FS00:0x22:Saving result file science.log
11:59:23:WU00:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
11:59:23:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
11:59:23:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:11733 run:0 clone:672 gen:2 core:0x22 unit:0x000000028ca304f15c87e9f1d8e1e86b
11:59:23:WU00:FS00:Uploading 13.16MiB to 140.163.4.241.
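QuintLeo found the failing units by searching the log for the bad-state and NaN messages. A minimal Python sketch of that search follows, assuming the FAHClient log line layout seen in the excerpts in this thread (HH:MM:SS:[WARNING:]WUnn:FSnn:message). The helper name scan_log is hypothetical, not part of any Folding@home tool; it simply counts "Bad State detected" and "Particle coordinate is nan" lines per slot/WU and lists units returned with error:FAULTY.

```python
import re
from collections import Counter

# Assumed line layout, taken from the log excerpts in this thread:
#   HH:MM:SS:[WARNING:]WUnn:FSnn:message
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}:(?:WARNING:)?(WU\d+):(FS\d+):(.*)$")
PROJECT_RE = re.compile(r"project:(\d+)")


def scan_log(lines):
    """Hypothetical helper: summarise Core 22 failures per (slot, WU).

    Returns (bad_states, nan_errors, faulty): two Counters keyed by
    (slot, wu), plus a list of (slot, wu, project) for units that were
    sent back with error:FAULTY.
    """
    bad_states = Counter()
    nan_errors = Counter()
    faulty = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip elided lines ("."), blank lines, etc.
        wu, slot, msg = m.groups()
        key = (slot, wu)
        if "Bad State detected" in msg:
            bad_states[key] += 1
        if "Particle coordinate is nan" in msg:
            nan_errors[key] += 1
        if "error:FAULTY" in msg:
            p = PROJECT_RE.search(msg)
            faulty.append((slot, wu, p.group(1) if p else None))
    return bad_states, nan_errors, faulty
```

On the visible part of the excerpt above, this should report two bad-state events and two NaN exceptions for FS00/WU00, and one faulty unit from project 11733 (the third bad state Mstenholm mentions falls in the elided lines).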
Foxbat
Posts: 94
Joined: Wed Dec 05, 2007 10:23 pm
Hardware configuration: Apple Mac Pro 1,1 2x2.66 GHz Dual-Core Xeon w/10 GB RAM | EVGA GTX 960, Zotac GTX 750 Ti | Ubuntu 14.04 LTS
Dell Precision T7400 2x3.0 GHz Quad-Core Xeon w/16 GB RAM | Zotac GTX 970 | Ubuntu 14.04 LTS
Apple iMac Retina 5K 4.00 GHz Core i7 w/8 GB RAM | OS X 10.11.3 (El Capitan)
Location: Michiana, USA

Re: Core 22 on Advanced

Post by Foxbat »

I am getting a lot of "Particle coordinate is nan" exceptions on my RTX 2060 with stock GPU & Memory clock speeds:

12:47:19:WU00:FS01:Connecting to 65.254.110.245:8080
12:47:21:WU00:FS01:Assigned to work server 140.163.4.241
12:47:21:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:TU106 [Geforce RTX 2060] from 140.163.4.241
12:47:21:WU00:FS01:Connecting to 140.163.4.241:8080
12:47:22:WU00:FS01:Downloading 11.66MiB
12:47:26:WU00:FS01:Download complete
12:47:26:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:11733 run:0 clone:885 gen:4 core:0x22 unit:0x000000058ca304f15c8acee13eb8b439
12:47:26:WU00:FS01:Starting
12:47:26:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/Linux/AMD64/NVIDIA/Fermi/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 705 -lifeline 1114 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
12:47:26:WU00:FS01:Started FahCore on PID 30632
12:47:26:WU00:FS01:Core PID:30636
12:47:26:WU00:FS01:FahCore 0x22 started
12:47:27:WU00:FS01:0x22:*********************** Log Started 2019-03-15T12:47:26Z ***********************
12:47:27:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
12:47:27:WU00:FS01:0x22:       Type: 0x22
12:47:27:WU00:FS01:0x22:       Core: Core22
12:47:27:WU00:FS01:0x22:    Website: https://foldingathome.org/
12:47:27:WU00:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
12:47:27:WU00:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
12:47:27:WU00:FS01:0x22:             <rafal.wiewiora@choderalab.org>
12:47:27:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 705 -lifeline 30632 -checkpoint 15
12:47:27:WU00:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
12:47:27:WU00:FS01:0x22:             0 -gpu 0
12:47:27:WU00:FS01:0x22:     Config: <none>
12:47:27:WU00:FS01:0x22:************************************ Build *************************************
12:47:27:WU00:FS01:0x22:    Version: 0.0.1
12:47:27:WU00:FS01:0x22:       Date: Feb 26 2019
12:47:27:WU00:FS01:0x22:       Time: 22:07:27
12:47:27:WU00:FS01:0x22: Repository: Git
12:47:27:WU00:FS01:0x22:   Revision: 8b95458cf6279abbca591632f7746401fb061abd
12:47:27:WU00:FS01:0x22:     Branch: core22
12:47:27:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
12:47:27:WU00:FS01:0x22:    Options: -std=gnu++98 -O3 -funroll-loops
12:47:27:WU00:FS01:0x22:   Platform: linux2 4.9.87-linuxkit-aufs
12:47:27:WU00:FS01:0x22:       Bits: 64
12:47:27:WU00:FS01:0x22:       Mode: Release
12:47:27:WU00:FS01:0x22:************************************ System ************************************
12:47:27:WU00:FS01:0x22:        CPU: Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
12:47:27:WU00:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 23 Stepping 6
12:47:27:WU00:FS01:0x22:       CPUs: 8
12:47:27:WU00:FS01:0x22:     Memory: 15.66GiB
12:47:27:WU00:FS01:0x22:Free Memory: 11.13GiB
12:47:27:WU00:FS01:0x22:    Threads: POSIX_THREADS
12:47:27:WU00:FS01:0x22: OS Version: 4.18
12:47:27:WU00:FS01:0x22:Has Battery: false
12:47:27:WU00:FS01:0x22: On Battery: false
12:47:27:WU00:FS01:0x22: UTC Offset: -4
12:47:27:WU00:FS01:0x22:        PID: 30636
12:47:27:WU00:FS01:0x22:        CWD: /var/lib/fahclient/work
12:47:27:WU00:FS01:0x22:         OS: Linux 4.18.0-15-generic x86_64
12:47:27:WU00:FS01:0x22:    OS Arch: AMD64
12:47:27:WU00:FS01:0x22:********************************************************************************
12:47:27:WU00:FS01:0x22:Project: 11733 (Run 0, Clone 885, Gen 4)
12:47:27:WU00:FS01:0x22:Unit: 0x000000058ca304f15c8acee13eb8b439
12:47:27:WU00:FS01:0x22:Reading tar file core.xml
12:47:27:WU00:FS01:0x22:Reading tar file integrator.xml
12:47:27:WU00:FS01:0x22:Reading tar file state.xml
12:47:27:WU00:FS01:0x22:Reading tar file system.xml
12:47:27:WU00:FS01:0x22:Digital signatures verified
12:47:27:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
12:47:27:WU00:FS01:0x22:Version 0.0.1
12:47:30:WU00:FS01:0x22:Completed 0 out of 5000000 steps (0%)
12:47:30:WU00:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
12:48:24:WU00:FS01:0x22:Completed 50000 out of 5000000 steps (1%)
12:49:18:WU00:FS01:0x22:Completed 100000 out of 5000000 steps (2%)
12:50:13:WU00:FS01:0x22:Completed 150000 out of 5000000 steps (3%)
12:51:07:WU00:FS01:0x22:Completed 200000 out of 5000000 steps (4%)
12:52:01:WU00:FS01:0x22:Completed 250000 out of 5000000 steps (5%)
12:52:56:WU00:FS01:0x22:Completed 300000 out of 5000000 steps (6%)
12:53:50:WU00:FS01:0x22:Completed 350000 out of 5000000 steps (7%)
12:54:44:WU00:FS01:0x22:Completed 400000 out of 5000000 steps (8%)
12:55:38:WU00:FS01:0x22:Completed 450000 out of 5000000 steps (9%)
12:56:32:WU00:FS01:0x22:Completed 500000 out of 5000000 steps (10%)
12:57:28:WU00:FS01:0x22:Completed 550000 out of 5000000 steps (11%)
12:58:22:WU00:FS01:0x22:Completed 600000 out of 5000000 steps (12%)
12:59:16:WU00:FS01:0x22:Completed 650000 out of 5000000 steps (13%)
13:00:10:WU00:FS01:0x22:Completed 700000 out of 5000000 steps (14%)
13:01:04:WU00:FS01:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
13:01:04:WU00:FS01:0x22:Following exception occured: Particle coordinate is nan
I am seeing similar issues with a GTX 1060 3GB card in my other Ubuntu 18.04 LTS box.
JimF
Posts: 651
Joined: Thu Jan 21, 2010 2:03 pm

Re: Core 22 on Advanced

Post by JimF »

JimF wrote:I can confirm the AMD variable loads problem on my RX 570 (Win7 64-bit, 18.9.3 drivers), driven by two free cores of an i7-4771.
This card is undervolted to 1.000 volts (default 1.150 volts) for reduced power and heat, and it is overclocked to 1344 MHz (default 1244 MHz).
On Core 21, it usually operates at 100% GPU load and 270 to 330 k PPD, and about 95 watts power at 72 to 74 C (as measured by GPU-Z).

But with Core 22 and P11733, it is getting around 200 k PPD, the lowest I have ever seen, with a 67% average GPU load (highly variable), 85 watts power at 70 C.
It needs to work harder.
It finally failed, probably due to the overclock. It never fails on Core 21, but with the load so variable on Core 22, it very likely spiked up to the point of failure.
Everyone needs to be (and soon will be) aware of this.
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Core 22 on Advanced

Post by foldy »

I guess it is a little too early for Core 22 or p11733 to be on Advanced.
JimF
Posts: 651
Joined: Thu Jan 21, 2010 2:03 pm

Re: Core 22 on Advanced

Post by JimF »

It is a great way to find out quickly what is wrong. But I will wait for the next version before doing more.
rafwiewiora
Scientist
Posts: 165
Joined: Mon Aug 03, 2015 8:23 pm
Location: New York

Re: Core 22 on Advanced

Post by rafwiewiora »

With the explosion of problems between beta and advanced (which is only a 2.5x increase, from ~400 to ~1000 GPUs), this unfortunately might be the only way to learn; I worry we might not have a diverse enough representation of systems in beta right now. Anyway, it's off advanced now. Thank you all for your sacrifices. :) I now get nice science logs with system configurations for all the failures, so we'll go from there. See you with the next version!
bcavnaugh
Posts: 147
Joined: Tue Apr 30, 2013 1:39 pm

Re: Core 22 on Advanced

Post by bcavnaugh »

That is OK with me; I can test under Advanced, since we cannot become beta testers. :mrgreen:
Foxbat
Posts: 94
Joined: Wed Dec 05, 2007 10:23 pm
Hardware configuration: Apple Mac Pro 1,1 2x2.66 GHz Dual-Core Xeon w/10 GB RAM | EVGA GTX 960, Zotac GTX 750 Ti | Ubuntu 14.04 LTS
Dell Precision T7400 2x3.0 GHz Quad-Core Xeon w/16 GB RAM | Zotac GTX 970 | Ubuntu 14.04 LTS
Apple iMac Retina 5K 4.00 GHz Core i7 w/8 GB RAM | OS X 10.11.3 (El Capitan)
Location: Michiana, USA

Re: Core 22 on Advanced

Post by Foxbat »

That's why we set the client-type to advanced… Glad we could help!
ra_alfaomega
Posts: 16
Joined: Tue Aug 23, 2011 9:00 am
Hardware configuration: Intel 3930k@4,5Ghz
Kingston HyperX 8Gb 1866@2000Mhz
Nvidia gtx460 Hawk

Re: Core 22 on Advanced

Post by ra_alfaomega »

On Core 21 my 1070 is 100% stable at 2088 MHz; running Core 22, I have to drop to 1961 MHz at the same voltage, and I'm not sure it's 100% stable even then. The temperature is also 7-8 C higher.
ra_alfaomega
Posts: 16
Joined: Tue Aug 23, 2011 9:00 am
Hardware configuration: Intel 3930k@4,5Ghz
Kingston HyperX 8Gb 1866@2000Mhz
Nvidia gtx460 Hawk

Re: Core 22 on Advanced

Post by ra_alfaomega »

Maybe Core 22 is not 100% stable, because there are people with no overclocking who have errors. Or this core is so GPU-intensive that it is testing the limits of our cards. The right answer will be revealed soon.
JimF
Posts: 651
Joined: Thu Jan 21, 2010 2:03 pm

Re: Core 22 on Advanced

Post by JimF »

ra_alfaomega wrote:Maybe this core 22 is not 100% stable, because there are people with no overclocking that have errors.
A lot of cards are factory overclocked for the gamers. That is still overclocked insofar as FAH is concerned.

You may need to reduce the clock; check the Nvidia specs to see what the specified clock is.
For the GTX 1070, the base clock is specified at 1506 MHz, but in operation it will be boosted beyond that. Mine typically run from about 1800 to 1936 MHz.
https://www.geforce.com/hardware/deskto ... ifications
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Core 22 on Advanced

Post by foldy »

Core_22 beta and project 11733 are not stable yet. They will release an updated version soon. I currently expect that a GPU which is stable on Core_21 will also run a fixed Core_22 version stably (unless your power supply cannot handle it, in which case your PC freezes or shuts down).
Joe_H
Site Admin
Posts: 7927
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Core 22 on Advanced

Post by Joe_H »

Internal and Beta testing of Core 22 has shown that it utilizes a GPU more fully than Core 21. So a setup that is stable on current Core 21 projects may not be stable with this core. As mentioned, this brief test on Advanced across a wider range of cards may reveal additional issues.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bcavnaugh
Posts: 147
Joined: Tue Apr 30, 2013 1:39 pm

Re: Core 22 on Advanced

Post by bcavnaugh »

So can we now get Core 22 WU?
I would be happy to run it on my GTX 1080 Ti and RTX 2080 Ti cards even if I get a few bad work units, if only to see where I need to set my graphics cards to support the new Core 22 work units.
bcavnaugh
Posts: 147
Joined: Tue Apr 30, 2013 1:39 pm

Re: Core 22 on Advanced

Post by bcavnaugh »

JimF wrote:
ra_alfaomega wrote:Maybe this core 22 is not 100% stable, because there are people with no overclocking that have errors.
A lot of cards are factory overclocked for the gamers. That is still overclocked insofar as FAH is concerned.

You may need to reduce the clock; check the Nvidia specs to see what the specified clock is.
For the GTX 1070, the base clock is specified at 1506 MHz, but in operation it will be boosted beyond that. Mine typically run from about 1800 to 1936 MHz.
https://www.geforce.com/hardware/deskto ... ifications
"That is still overclocked insofar as FAH is concerned."
This tells me that FAH no longer wants gamers to fold, and I cannot believe that for one second.