Any idea how many Core 16's are left?


Dr.G
Posts: 20
Joined: Sat Apr 26, 2014 7:11 am

Re: Any idea how many Core 16's are left?

Post by Dr.G »

See similar conversation at foldingforum thread:
viewtopic.php?f=18&t=26807&start=15
We're not alone!

@ EXT64 Thanks for the Linux tip! Can you suggest a distro, please?
I understand that 'folding on AMD GPUs with Linux is problematic' as per the Stanford guide. Is that correct?
https://folding.stanford.edu/home/guide ... all-guide/

@ ChasingTheDream
Hang in there :e( ... it's always been a 'rocky' path, but I bet that, just like the rest of us, it's the challenge that excites you.
jrweiss
Posts: 704
Joined: Tue Dec 04, 2007 6:56 am
Hardware configuration: Ryzen 7 5700G, 22.40.46 VGA driver; 32GB G-Skill Trident DDR4-3200; Samsung 860EVO 1TB Boot SSD; VelociRaptor 1TB; MSI GTX 1050ti, 551.23 studio driver; BeQuiet FM 550 PSU; Lian Li PC-9F; Win11Pro-64, F@H 8.3.5.

[Suspended] Ryzen 7 3700X, MSI X570MPG, 32GB G-Skill Trident Z DDR4-3600; Corsair MP600 M.2 PCIe Gen4 Boot, Samsung 840EVO-250 SSDs; VelociRaptor 1TB, Raptor 150; MSI GTX 1050ti, 526.98 driver; Kingwin Stryker 500 PSU; Lian Li PC-K7B. Win10Pro-64, F@H 8.3.5.
Location: @Home

Re: Any idea how many Core 16's are left?

Post by jrweiss »

I'm getting a mix of Core 16 and Core 17 WUs on my 2 AMD 7750 GPUs (2 different computers; both Win7-64, Client 7.4.4, Cat 14.4). Currently one is running a Core 17; the other just successfully completed a Core 17 and is now running a Core 16.

No complaint; just a data point or 2...
Ryzen 7 5700G, 22.40.46 VGA driver; MSI GTX 1050ti, 551.23 studio driver
Ryzen 7 3700X; MSI GTX 1050ti, 551.23 studio driver [Suspended]
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Any idea how many Core 16's are left?

Post by bruce »

jrweiss wrote:I'm getting a mix of Core 16 and Core 17 WUs on my 2 AMD 7750 GPUs (2 different computers; both Win7-64, Client 7.4.4, Cat 14.4). Currently one is running a Core 17; the other just successfully completed a Core 17 and is now running a Core 16.

No complaint; just a data point or 2...
Please list several of the project numbers that you've been getting on your 7750s. Do they use the same client-type?
jcoffland
Site Admin
Posts: 1018
Joined: Fri Oct 10, 2008 6:42 pm
Location: Helsinki, Finland

Re: Any idea how many Core 16's are left?

Post by jcoffland »

We've made some configuration changes. You should find it easier to get 0x17 WUs now. Thanks for the reports.
Cauldron Development LLC
http://cauldrondevelopment.com/
LonePalm
Posts: 98
Joined: Thu Feb 26, 2009 7:27 pm
Location: Saint Marys, Georgia

Re: Any idea how many Core 16's are left?

Post by LonePalm »

7im wrote:
LonePalm wrote:Any chance of getting the Core 17 bonuses applied to the Core 16 units?
No, sorry.
OK, fair enough.

Is there any way to NOT get core 16 units?
jrweiss
Posts: 704
Joined: Tue Dec 04, 2007 6:56 am
Hardware configuration: Ryzen 7 5700G, 22.40.46 VGA driver; 32GB G-Skill Trident DDR4-3200; Samsung 860EVO 1TB Boot SSD; VelociRaptor 1TB; MSI GTX 1050ti, 551.23 studio driver; BeQuiet FM 550 PSU; Lian Li PC-9F; Win11Pro-64, F@H 8.3.5.

[Suspended] Ryzen 7 3700X, MSI X570MPG, 32GB G-Skill Trident Z DDR4-3600; Corsair MP600 M.2 PCIe Gen4 Boot, Samsung 840EVO-250 SSDs; VelociRaptor 1TB, Raptor 150; MSI GTX 1050ti, 526.98 driver; Kingwin Stryker 500 PSU; Lian Li PC-K7B. Win10Pro-64, F@H 8.3.5.
Location: @Home

Re: Any idea how many Core 16's are left?

Post by jrweiss »

bruce wrote:
jrweiss wrote:I'm getting a mix of Core 16 and Core 17 WUs on my 2 AMD 7750 GPUs (2 different computers; both Win7-64, Client 7.4.4, Cat 14.4). Currently one is running a Core 17; the other just successfully completed a Core 17 and is now running a Core 16.

No complaint; just a data point or 2...
Please list several of the project numbers that you've been getting on your 7750s. Do they use the same client-type?
Core 17: 13001 (many), 13000, 9201, 9406 (a couple)
Core 16: All 11293

Different brand cards, otherwise configured identically.
Ryzen 7 5700G, 22.40.46 VGA driver; MSI GTX 1050ti, 551.23 studio driver
Ryzen 7 3700X; MSI GTX 1050ti, 551.23 studio driver [Suspended]
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Any idea how many Core 16's are left?

Post by bruce »

LonePalm wrote:Is there any way to NOT get core 16 units?
No, but that should happen automatically going forward.

See viewtopic.php?f=24&t=26794&p=269293#p269293
LonePalm
Posts: 98
Joined: Thu Feb 26, 2009 7:27 pm
Location: Saint Marys, Georgia

Re: Any idea how many Core 16's are left?

Post by LonePalm »

bruce wrote:
LonePalm wrote:Is there any way to NOT get core 16 units?
No, but that should happen automatically going forward.

See viewtopic.php?f=24&t=26794&p=269293#p269293
Thank you very much.
PS3EdOlkkola
Posts: 177
Joined: Tue Aug 26, 2014 9:48 pm
Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500
Location: Dallas, TX

Re: Any idea how many Core 16's are left?

Post by PS3EdOlkkola »

@ChasingTheDream: I agree that your CPU temps are well within reason and shouldn't be causing any system reset issues. I have a few water-cooled rigs and found that with SMP folding I also need to install a fan blowing straight down on the water block in order to cool accessory components positioned around the CPU (voltage regulators, capacitors, etc). I burned out both an Asus and a Gigabyte board before figuring out that those components need accessory cooling as well. As they were starting to fail, the systems exhibited the same behavior you describe -- runs for 5 to 10 minutes, then poof -- BSOD and reset.

Two other areas that caused me a few issues (BSODs and system freezes) were not enough amperage on the rail feeding the GPUs and a very marginal DDR3 memory module. Since the Core 16 work units draw more power than Core 17 units, the Core 16's could be outstripping your power supply's capabilities, or causing a marginal power supply to fail with the extra load. If you have an extra power supply lying around, you might try swapping it into a rig with the AMD GPUs and see if that makes a difference. BTW, I've learned my lesson on using less-than-brand-name power supplies and always stick with the top-tier manufacturers. The only two I use are Seasonic and Corsair, and I over-spec the power supply by a minimum of 300 watts over and above the combined, fully utilized expected power draw; that tends to make the PS run cooler and have more reserve amperage on the GPU rail. I hate to admit I made this mistake, but you may want to check that all the cables running between the power supply and the GPUs are firmly seated, particularly with fully modular power supplies, since both ends of the cable need to be secured.
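
For anyone who wants to sanity-check their own rig against that 300-watt rule of thumb, here is a minimal sketch of the arithmetic. The component names and wattages are hypothetical placeholders, not measurements from any system in this thread; substitute the full-load figures for your own parts.

```python
# Rough PSU headroom check based on the "300 W over full load" rule of thumb.
# All component wattages below are hypothetical placeholders, not measurements.
HEADROOM_WATTS = 300  # minimum margin suggested in the post above


def recommended_psu_watts(component_draw_watts, headroom_watts=HEADROOM_WATTS):
    """Return the smallest PSU rating that covers full load plus headroom."""
    return sum(component_draw_watts.values()) + headroom_watts


def has_enough_headroom(psu_rating_watts, component_draw_watts,
                        headroom_watts=HEADROOM_WATTS):
    """True if the PSU rating meets or exceeds full load plus headroom."""
    return psu_rating_watts >= recommended_psu_watts(
        component_draw_watts, headroom_watts
    )


# Hypothetical two-GPU rig; replace with your own full-load numbers.
example_rig = {"cpu": 100, "gpu_a": 250, "gpu_b": 250, "board_ram_drives": 100}

print(recommended_psu_watts(example_rig))
print(has_enough_headroom(850, example_rig))
print(has_enough_headroom(1000, example_rig))
```

This only covers total wattage. It says nothing about how amperage is split across rails, which has to be checked separately.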

To check out any potential issues with a DDR memory module, consider running Prime95 for 24 hours, or even the Microsoft memtest app to see if you might have a marginal system memory module. The additional heat generated by Core 16's (unless you have terrific case cooling), could be just enough to push the module over the edge to failure mode. Since Core 16's also use more CPU time, they tend to use marginally more system memory and I/O, which could be just enough to hit that one address line on one chip on one module that happens to be the guilty party.

The reason I'm thinking there is something else wrong in your system that doesn't have to do with the FAH software or AMD drivers is because I'm running two systems with 6 AMD GPUs in each system and through the spate of Core 16's, not one work unit failed to be processed, nor did any of my systems fail or reset (I'm using Corsair AX1500i power supplies in both those systems). However, it pained me to see systems that each produce about 1 million points a day running Core 17's drop down to about 20,000 PPD running Core 16s, but that's the way it goes sometimes.

I completely understand your frustration with having systems that don't run reliably. It makes participating in FAH "challenging" at times, but hang in there -- diagnose problems one step at a time and you'll narrow it down to the one component (or two) that's driving you crazy.
Hardware config viewtopic.php?f=66&t=17997&p=277235#p277235
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: Any idea how many Core 16's are left?

Post by ChasingTheDream »

@ Dr.G
I read the thread. Thanks for the link.

@ PS3EdOlkkola

Thanks for the detailed post; it makes perfect sense. These machines have been through some lengthy troubleshooting battles before, so I'll give a quick list of what was checked back when I was running into the same issues with Core 17 WU's, which a driver update eventually took care of.

The components in each machine are identical. Same manufacturer as well, so I mean literally identical.

1) under clock the GPU's
2) under clock the RAM.
3) disable CPU folding.
4) change GPU's from machine to machine.
5) change GPU slots in the machines.
6) tried different power plugs on the Corsair PSU's to rule out bad plugs. The PSU's in all three machines are Corsair AX1200i (single rail).
7) change the BIOS so PCIE2 is used instead of PCIE3.
8) updated motherboard BIOS to latest version.
9) ran stress tests on the RAM (no errors found).
10) swap memory from machine to machine.
11) reinstalled Windows.
12) after a fresh Windows install went straight to 14.4 (at the time). I just did a clean Windows install a couple days ago and went to 14.7RC3 and then to 14.9.
13) tried over-volting the GPU's and motherboard.

After all these things every machine behaved exactly the same. Nothing made any difference at all and the memory tests showed nothing wrong. I think I even ran some CPU stress tests. I really don't want to run through all that again. At the time, since no real solution could be found, I pulled GPU's. When 14.7RC3 was released I had a little free time and on a whim decided to check whether the machines could handle more GPU's, and suddenly they could. Core 17's ran without incident for days. They wouldn't run 10 minutes in a multi-GPU system prior to the 14.7RC3 release though. I need to be careful when I say they wouldn't run for X minutes, though, because sometimes they would run for an hour or more and then die five times in a row immediately. I guess I should say 10 minutes on average. The machines are behaving the same way with the Core 16's now.

So I'm not sure what about Core 16 would be different enough to cause the same issues that wouldn't have been found in the first round of tests. I could disable CPU testing on a machine to see if it makes any difference with the surrounding components just to be sure that isn't it.

Regarding airflow, each motherboard is in a Corsair Carbide Series Air 540 ATX Cube Case with both sides off. They get a lot of airflow.

The frustration level is high because there are always going to be issues with various cores on some hardware and at the moment we have no way to deal with them. If I had my way, I would dump my GPU's and buy some GTX 970's, but the wife has informed me in no uncertain terms that if I do that I'll lose a body part I'm rather fond of. :D So I'm going to have to make do with what I have.
PS3EdOlkkola
Posts: 177
Joined: Tue Aug 26, 2014 9:48 pm
Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500
Location: Dallas, TX

Re: Any idea how many Core 16's are left?

Post by PS3EdOlkkola »

One more suggestion: My AX1500i power supplies came configured with power split on individual rails, not just one strong rail. I had to use the USB cable that comes with the PS to connect the PS to an internal USB header, then download the CorsairLink software and reconfigure the PS for all power applied to a single rail. The system will exhibit the same characteristics as you describe --- system resets and BSOD's if the GPUs are powered by individual rails instead of a single rail. The GPU's you're using are very power hungry and might just get under the threshold on a Core 17, but pushed harder with Core 16 will exceed the limit of the PS configured to split amperage between GPU power lines.

You've certainly done a lot of primary system testing. I'd suggest zip-tying a 30+ CFM fan inside your case and pointing it directly at the heat sinks surrounding the CPU.

[EDIT]: I use a free utility that has become indispensable for looking at hardware settings and temperatures inside my systems. It provides a complete system audit and real-time statistics on processor load and on CPU, GPU and motherboard temps. It will also give you real-time power supply voltages and DRAM configurations (current timings, plus all XMP and JEDEC modes supported). I've found it very useful to help track down issues; this was the app that helped me determine I needed a fan blowing directly on the heat sinks surrounding the CPU. Here's the link to the utility: https://www.piriform.com/speccy
Hardware config viewtopic.php?f=66&t=17997&p=277235#p277235
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: Any idea how many Core 16's are left?

Post by ChasingTheDream »

@ PS3EdOlkkola

Thanks for the additional information! I'll look into them over the weekend.

On an odd note: when jcoffland posted yesterday I started up my multi-GPU (three GPU's) computers again with all fresh WU's, and there have been no lockups or BSODs since I restarted them about 24 hours ago. I made no changes to them at all and I have seen Core 16 WU's on each machine. In fact, one of the machines had two Core 16's running at the same time and I figured it was dead. It had no issues at all so something changed but I can't imagine they did anything to the WU's themselves. Maybe the gremlins decided to leave my computers for a while.

Edit: Well as usual when you say something like I did above like "oh it's working" I had an immediate system lock up again and again and again. I couldn't even get that particular Core 16 WU to 1%. In any event the quest for stability continues...
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Any idea how many Core 16's are left?

Post by bruce »

ChasingTheDream wrote:It had no issues at all so something changed but I can't imagine they did anything to the WU's themselves.
I would make the same assumption.

The only other thing that I'd GUESS about has to do with the percentage of WUs that may turn out to be bad WUs. Some percentage of WUs cannot be completed and are discarded as "bad WUs" after retrying several times. They have to be retried several times because a WU can also fail due to overclocking or bad drivers or other reasons, but they'll finish successfully on somebody else's system. You would only get them once, so as an individual, you don't have access to enough information to be able to diagnose this sort of thing. ("Several" is a server-defined number that might have changed, but the objective is to finish good WUs while weeding out bad WUs with minimal inconvenience to everyone.)
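
To make the retry idea concrete, here is a toy model of it. This is only an illustration of the logic described above, not the actual server code: the function name, the status strings, and the threshold of 5 are assumptions, with the threshold standing in for the server-defined "several".

```python
# Toy model of "retry a failing WU on several donors before calling it bad".
# MAX_FAILURES is a hypothetical stand-in for the server-defined "several".
MAX_FAILURES = 5


def classify_wu(attempt_results, max_failures=MAX_FAILURES):
    """Classify a WU from its attempt history.

    attempt_results: booleans in order, one per donor attempt (True = finished).
    Returns "completed", "bad WU", or "pending" (still being reassigned).
    """
    failures = 0
    for finished in attempt_results:
        if finished:
            # Any success means the earlier failures were donor-side problems
            # (overclocking, drivers, etc.), not a problem with the WU itself.
            return "completed"
        failures += 1
        if failures >= max_failures:
            return "bad WU"
    return "pending"


print(classify_wu([False, False, True]))  # failed on two systems, then finished
print(classify_wu([False] * 5))           # failed everywhere it was tried
print(classify_wu([False, False]))        # not enough attempts to decide yet
```

A single donor only ever sees one entry in that history, which is why one failure on your own machine can't tell you whether the WU or the hardware is at fault.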
ChasingTheDream
Posts: 56
Joined: Mon Jun 02, 2014 10:56 pm

Re: Any idea how many Core 16's are left?

Post by ChasingTheDream »

@ Bruce

Thanks for the information. I did find another interesting anomaly.

I've been grinding away on a Core 16 WU for several days on one of my single GPU systems. The hardware in that machine is the oldest I've got but it still uses the same R9 290X TRI-X GPU. In any event, although this machine was old it was reliable. In fact, it would run the Core 16's without incident, but it is only a single GPU system. After I upgraded to the latest AMD 14.9 drivers I found the system operated at about 1/5th the speed it should be running at. I thought it was the WU it was working on, but I let it finish. When it finally finished it picked up another Core 16 WU and was still only running at 10-20% of GPU utilization. So today I dropped that machine back to the 14.7RC3 driver and it is back to full speed again.

The latest drivers are nice, but there are obviously still hardware combinations they are having problems with.
Dr.G
Posts: 20
Joined: Sat Apr 26, 2014 7:11 am

Re: Any idea how many Core 16's are left?

Post by Dr.G »

@ ChasingTheDream
I agree that when using the newest AMD 14.9 drivers on Core 16 WU the completion speed is very slow.... and I'm still getting Core 16 WU's today on my R9 290X's.

However, that brings us back to your original question: why are we 'suddenly' getting more Core 16 WU's on high end AMD systems, while others get Core 15 WU's on high end NV systems (which the newer systems can't 'handle')?

@ Mods / All
The coincidence of the new AS starting around the same time as stability problems arising seems overwhelming. viewtopic.php?f=24&t=26794&p=269293#p269293

In other words- new AS sends out 'old' WU to systems which have newer hardware components and / or drivers creating stability problems. The image I use is that many newer systems are a bit like thoroughbred racehorses, but that makes them very sensitive too...

See also Dr. Pande's own comments from today, 04 Oct 2014.... viewtopic.php?f=18&t=26807&start=15

I hope I'm not speaking out of turn. I am very dedicated and fond of folding.
Please, this isn't a complaint, I just wish for our system to work efficiently and without crashing- it's very frustrating / time consuming...

Am thinking of trying a Linux system, but will take time... any advice appreciated, thanks!