New: Need confirmation How to run multiple threads on a GPU?

StitchExperimen · Post by **StitchExperimen** » Sun Jul 07, 2013 7:12 pm

HaloJones wrote: (Transposing sentence.) Soon that will change and your 7950 will get ten times more points than you're seeing. And as for the points you're seeing, with the standard production clients, Nvidia has historically had better support for Folding than ATI/AMD.

That makes more sense to me than the other comments; i.e., PrimeGrid is OpenCL and uses all of the Radeon's processing capability on one WU, whereas one Help Conquer Cancer WU didn't tax the Radeon GPU.

Other people's comments are, I think, from the Nvidia side, which I agree will only process one WU at a time.
However, as my previous comments and those below show, the Radeon behaves differently in OpenCL.

However, the comment about using OpenCL with Nvidia is puzzling, because in my hands-on use of GPU cards in the $300 range, Nvidia runs at roughly 1/4 to 1/5 the speed when comparing a Galaxy factory-overclocked 660 Ti to an HD 7950 in OpenCL. I have heard of FahCore 17, but only in passing. Also worth noting for later comments: you can NOT run two Help Conquer Cancer WUs on an Nvidia card and cut the per-WU run time in half, but you can run the two work units at the same time with a slight offset, and both will process and finish. Total finishing times on the Nvidia card were longer; running two at once didn't reduce the completion time the way it did on the Radeon GPU.

The following quote doesn't always hold true:
As you have noticed in my previous comments, I have run 14 WUs in parallel in one program, which brought times down to 30 seconds divided by the total number of work units. The most that was effective was 8 or 9 WUs at the same time. Again, this is with the HD 7950, not an Nvidia card as described above.

"You're misunderstanding the architecture here. GPUs don't have any concept of "threads". GPUs parallelize the work that they are given, and most of the tasks that they are given are embarrassingly parallel. For example, in computer graphics, a GPU can render a polygon-based model at 60 FPS or faster because it has the ability to work with every vertex at the same time, and then every pixel/fragment in between at the same time. With F@h, a request for calculations comes in, is divided up and highly parallelized across all GPU cores, and is thus completed in parallel."

StitchExperimen · Post by **StitchExperimen** » Sun Jul 07, 2013 7:17 pm

7im wrote:
StitchExperimen wrote:
So how do you run multiple threads, so you can prove me wrong? You haven't said you tried it, and the PPD seems to prove it.
I have tried everything at least once in my 10 years of folding. I don't have to prove anything. My word is good here, just as others have confirmed the same thing.

Read the Points FAQ as a primer as to why running more threads is not better for the project.

But feel free to waste your time and ignore good advice from seasoned veterans. The Windows Install Guide shows how to add extra slots. Add another GPU slot and watch your points drop to less than half.

Give me a bit to read. As I understand it, a slot is a "graphics card", not another WU. What I want to try is running two work units at the same time on a single AMD HD 7950. If you can point me in the right direction, I'd appreciate that and give it a whirl.

Bruce G.

kiore · Post by **kiore** » Sun Jul 07, 2013 7:29 pm

StitchExperimen wrote:
Give me a bit to read. As I understand it, a slot is a "graphics card", not another WU. What I want to try is running two work units at the same time on a single AMD HD 7950. If you can point me in the right direction, I'd appreciate that and give it a whirl.

Bruce G.

The slot here is the 'thing that does workunits', if you like. So normally each GPU would have a single slot, as would a multicore CPU. You can create new slots and hope that they will link to your GPU, thus attempting what you say you want.
I think trying to run many workunits on one GPU is a misunderstanding on your part. I don't want to encourage you to do this: you may be able to run 2, but running more than that will likely cause failures.
As pointed out, your experience on other projects is recognized, but the client here is not at all the same.

StitchExperimen · Post by **StitchExperimen** » Sun Jul 07, 2013 7:37 pm

Thanks,

I'll try to get two to run and extrapolate GPU % if it links together and see what happens to beginning ETA time and clock time.

Should keep me happy for now and outta your hair.

The other Bruce mohaha!

bollix47 · Post by **bollix47** » Sun Jul 07, 2013 7:46 pm

FYI

Running two GPU slots on one GPU will be even more of a problem with client-type advanced because you will get core_17 projects. One core_17 work unit will utilize your complete GPU and produce much higher PPD.

P5-133XL · Post by **P5-133XL** » Sun Jul 07, 2013 8:51 pm

Do note, Core_17 WU's are currently being issued using advanced. That may change over time and there is no guarantee that it will continue. All advanced indicates is that it is more stable than beta status and less stable than general release status.

That being said, if you have an AMD GPU, then you definitely want Core_17 WU's. They have none of the drawbacks of earlier cores: they work well with newer AMD drivers, they do not use a full CPU core (for AMD only), and they will maintain a high GPU %usage. Further, if you have a Passkey configured and qualified, they are going to produce far more points.

If you do not wish to run advanced, then the next best solution is to clean out your video drivers using a 3rd-party driver cleaner and install 12.8 (the last known good AMD driver for folding using core_16).

To answer your other questions:

We do not recommend multiple GPU slots on a single GPU. It can be done, but the v7 client is not designed for it, so it is an outright pain to do. The WU's are time sensitive, and splitting up the GPU into multiple WU's slows everything down. Much better is to run 12.8 or earlier drivers so you get a high %GPU usage on a single WU (using conventional Core_16), or run advanced (with a more recent AMD driver) to get Core_17 WU's.

The number of threads is used for CPU WU's and determines the number of CPU cores used. It has little to do with the GPU, except that AMD cards running conventional Core_16 WU's needed a dedicated CPU core. With Core_17, that is no longer an issue with AMD (though it is now an issue with Nvidia, where it was not before). I would recommend keeping the number of CPU threads at -1 (which means the client chooses automatically), or, if you wish, you can manually fix it to the number of cores your CPU has less one (if you are planning to run conventional GPU Core_16 WU's). Whatever you do, do not set it higher than the number of cores your CPU has, because that will definitely cause problems.
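
As a rough illustration of the rule of thumb above, here is a minimal Python sketch. The function name `recommended_cpu_threads` and its parameters are hypothetical and are not part of the FAH client; the sketch only encodes "cores less one when the GPU core needs a dedicated CPU core, and never more than the number of cores".

```python
import os


def recommended_cpu_threads(cpu_cores, gpu_needs_full_core):
    """Suggest a manual CPU thread count (hypothetical helper, not a client API).

    cpu_cores: number of cores the CPU has.
    gpu_needs_full_core: True if the GPU WU ties up a whole CPU core
        (assumed here for AMD with conventional Core_16).
    """
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    reserved = 1 if gpu_needs_full_core else 0
    # Keep at least one thread, and never exceed the number of cores.
    return max(1, min(cpu_cores, cpu_cores - reserved))


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print("Cores reported by the OS:", cores)
    print("GPU needs a full core:   ", recommended_cpu_threads(cores, True))
    print("GPU needs little CPU:    ", recommended_cpu_threads(cores, False))
```

Leaving the setting at -1 avoids the need for any of this; the sketch is only relevant if you decide to fix the value by hand.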

Napoleon · Post by **Napoleon** » Sun Jul 07, 2013 9:29 pm

StitchExperimen wrote:My GPU is a AMD HD 7950 it is getting low PPD and the GPU usage at best is 90%.
On Boinc, WCG, Help Conquer Cancer I would run 8 wu at a time and average 30 seconds time for all 8 wu compared to a single wu that took like 6 minutes. So how do I run multiple wu in your program on a AMD GPU HD 7950.

At present I have written in to the box below client-type, advanced but don't know what it does. Can someone give an explanation?

The other Bruce.

It has been explained elsewhere just recently: viewtopic.php?f=61&t=24519#p245167. EDIT: topics merged.
Since you asked for something to read, looking for a FAQ section is usually a good place to start:
http://folding.stanford.edu/home/faq/faq-points/
http://folding.stanford.edu/home/faq/faq-passkey/
http://folding.stanford.edu/home/faq/faq-best-practices

As pointed out by others already, you really want to make sure that your passkey is set up correctly, client-type is set to advanced in order to get core_17 WUs, and that you meet the rest of the GPU QRB criteria. I'd bet you'll lose interest in "trickery" once the QRB kicks in (if it hasn't already).

Oh, and Welcome To The F@h Forum, StitchExperimen.

Post by **bruce** » Sun Jul 07, 2013 10:28 pm

Efficient analysis software will use all of the resources available. If 2 WUs are forced to share the same resources, they'll compete with each other and run "half" as fast.
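
The arithmetic behind that claim can be sketched in a few lines of Python. This is an idealized model, not a measurement: it assumes a single WU already saturates the GPU, that concurrent WUs split the GPU evenly, and it adds an optional `overhead` fraction for contention. The function name and the 10-hour figure are illustrative assumptions.

```python
def shared_gpu_times(solo_hours, concurrent_wus, overhead=0.0):
    """Idealized model of WUs sharing one fully used GPU.

    solo_hours: time for one WU when it has the GPU to itself.
    concurrent_wus: how many WUs share the GPU at once.
    overhead: assumed extra fraction of time lost to contention (0.0 = none).

    Returns (hours until each WU finishes, WUs completed per day).
    """
    if solo_hours <= 0 or concurrent_wus < 1 or overhead < 0:
        raise ValueError("invalid input")
    # Each WU gets 1/n of the GPU, so it takes n times as long (plus overhead).
    per_wu_hours = solo_hours * concurrent_wus * (1.0 + overhead)
    # n WUs finish every per_wu_hours hours.
    wus_per_day = concurrent_wus * 24.0 / per_wu_hours
    return per_wu_hours, wus_per_day


if __name__ == "__main__":
    for n in (1, 2):
        hours, per_day = shared_gpu_times(10.0, n)
        print(f"{n} WU(s) at once: {hours:.1f} h each, {per_day:.2f} WUs/day")
```

Under these assumptions, the number of WUs finished per day does not improve, while each WU is returned later. Any nonzero overhead makes the shared case strictly worse, which matters because the WUs are time sensitive.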

FAH's goal is to write highly efficient software. They do a very good job of it. Sometimes the hardware limitations or limitations in the drivers interfere with that efficiency, but the advice you've been given is accurate.

If you run a WU on a multi-threaded CPU, you have a choice of allocating all the resources to a single WU or allocating less. If you run a WU on a GPU you will be allocating the whole GPU to that process plus some amount of CPU processing to support data transfers.

Drivers for OpenCL for AMD/ATI and either CUDA or OpenCL for NVidia can be designed to use one full CPU thread or a small fraction of a CPU, depending entirely on how AMD or NVidia choose to construct their drivers. [A lot depends on whether GPU performance is measured in frame-rates for video or in GFLOPS for analysis]. Stanford is not in control of their designs nor do they support 3rd party drivers.

If you choose to run two WUs, one that uses SMP alongside another that uses a GPU, it's important to notice whether the GPU uses a full CPU thread or a small fraction. You do NOT want them to compete for the same CPU resources, so it makes sense to allocate fewer CPU threads to SMP if the GPU uses a lot. The same would be true if you decided to run something like BOINC on the same CPUs. SMP is most efficient if it has (almost) exclusive use of the CPUs that you allocate to it and decidedly inefficient if it keeps getting interrupted for significant periods of time.

Post by **bruce** » Sun Jul 07, 2013 10:46 pm

Two topics on the same subject merged. Please review any posts that you didn't see before.

StitchExperimen · Post by **StitchExperimen** » Mon Jul 08, 2013 1:12 am

Thanks for taking the time to further elaborate.

So far here are the results I have experienced.

I have in the second slot GPU a Core_17 and it is sucking the life out of the GPU.

In the first slot GPU Core_16 the PPD says 1,582. Time to finish has decreased 1 1/2 hours over the past 5 hours.

Time lost in the Core_17 is about nil, figuring proportions of 46.85 to 5 hours 7 minutes left and ~5 hours total processed.

So how two Core_16 WUs would work, I don't know. But one Core_17 makes 60,000 PPD so far and takes ~10 hours on an HD 7950.

So what's the story on Core_16? Will they be phased out?

Bruce G

And yes I plan on turning on only one slot. Will be interested in the story on wu's next... please. Thank you. Or do you have a link to the topic?

StitchExperimen · Post by **StitchExperimen** » Mon Jul 08, 2013 1:39 am

Napoleon wrote:
This web site link doesn't work.
Information

The requested topic does not exist.

It has been explained elsewhere just recently: viewtopic.php?f=61&t=24519#p245167. EDIT: topics merged.
Since you asked for something to read, looking for a FAQ section is usually a good place to start:
http://folding.stanford.edu/home/faq/faq-points/
http://folding.stanford.edu/home/faq/faq-passkey/
http://folding.stanford.edu/home/faq/faq-best-practices

I'd bet you'll lose interest in "trickery" once the QRB kicks in (if it hasn't already).

I'm just interested in production and a worthy cause, pretty much. Stats are nice also, but having hardware is something I relate to. AMD sells you the farm on giving away cores compared to Nvidia's cores. I root for AMD but end up buying Intel. I mix in a little Nvidia GPU since it's been in development longer.

Oh, and Welcome To The F@h Forum, StitchExperimen.

Post by **bruce** » Mon Jul 08, 2013 3:42 am

StitchExperimen wrote: So what's the story on Core_16? Will they be phased out?

As hardware progresses through new generations, the version of the internal analysis code is also upgraded to support more scientific features. This leads to a progression of improvements in the analysis code. Part of the research effort associated with FAH is developing better software tools to analyze protein folding. Over the years, they've developed a series of new FahCores. See http://fahwiki.net/index.php/Cores.

Even if Core_17 is intended to replace any old cores (and that's not a given) they generally continue running projects that were started on the old core (as long as it's still giving good answers, and as long as there are donors with the old hardware).

Some changes are a result of changes in support by the hardware manufacturer. ATI dropped support for CAL, so the ATI version of FahCore_11 was deprecated in favor of FahCore_16. There are still some WUs around for the NV FahCore_11 although the supply of G80 GPUs is dwindling. It's not possible to predict when Core_15/16 will be deprecated. It is likely to be a few years.

In the case of CPU cores, there still is some work for FahCore_78, although that core has been replaced by FahCore_a3/a4/a5 for any CPUs that support SSE2 ... and especially for CPUs that contain more than one CPU core. (Just how many Pentium2/3/4 or Athlon processors are still folding?)

StitchExperimen · Post by **StitchExperimen** » Mon Jul 08, 2013 12:57 pm

Have there been optimizations for the original AVX instruction set? Not AVX2.

CPU affinity locking: what should this be set to? I have one chip with 18 threads. GPU 0 is running the monitor video and GPU 1 is running F@H. With the processes running at present, I'm able to devote a total of 9 threads to CPU F@H.

7im · Post by **7im** » Mon Jul 08, 2013 1:25 pm

AVX is in the works, not in use yet.

CPU affinity usually does better when you let Windows set it, especially when mixing GPU and CPU clients, so don't use it unless you are prepared to do a lot of experimenting to find the best setting.

StitchExperimen · Post by **StitchExperimen** » Mon Jul 08, 2013 3:22 pm

Intel came up with their own tests showing AVX2 at something like 200-300% faster, BUT according to "one" person's analysis in Distributed Computing, AVX2 code was not as viable as people hoped, because the instruction set was not what was needed for DC. That person went on to give examples at the register and code level.
