
What about OpenCL?

Posted: Mon Jul 05, 2010 12:16 am
by theteofscuba
Edit by bruce:
This topic originated here: http://foldingforum.org/viewtopic.php?f ... 15#p150293
bruce wrote:Our hopes for multi-gpu support depend on whether OpenCL lives up to its expectations. From what little I understand about it, at the present time, if one unit of work is presented to a nVidia GPU through CUDA, it takes, say, one unit of time. If the same unit of work is presented to an ATI GPU through Brook/CAL, it takes somewhat longer. If the same unit of work were presented to an arbitrary GPU through OpenCL, it would take much, much, much longer to process. There's no way to know when the OpenCL consortium and the manufacturers of drivers for brand-X GPUs will be able to develop code that is sufficiently optimized to be truly productive, but at present, OpenCL is more of an advertising tool than a productive computing platform.
You don't need to worry if an nVidia GPU is a little faster than an ATI GPU at some things, or vice versa; it's expected that they will perform somewhat differently. What we are seeing is very similar to the way OpenGL turned out: write the source code once, then compile it for multiple processors without changing the application code. OpenCL will likely be around for a long time to come, given that it is already so widely supported.

It does worry me that some OpenCL backends will be lazily implemented, and that users of FAH will then blame PG instead of the hardware vendor's poor implementation. Is the state of things *that* bad?

Re: wrapper of F@H for BOINC?

Posted: Mon Jul 05, 2010 12:27 am
by bruce
theteofscuba wrote:It does worry me that some OpenCL backends will be lazily implemented, and that users of FAH will then blame PG instead of the hardware vendor's poor implementation. Is the state of things *that* bad?
Maybe. You'd have to ask somebody with more detailed information than I have... and I do not speak for the Pande Group; these are my own opinions.

A lot depends on whether a general purpose language like OpenCL can produce optimized code for the wide variety of ways that GPU hardware can be structured. How wide is the data path between CPU and GPU? How much data needs to be moved before/after the GPU can start/finish a program segment? How can you keep the GPU busy while a new block of data is being delivered? Are operations performed by multiple narrow processors or fewer super-wide processors?
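To put a concrete face on the "keep the GPU busy" question, here is a rough sketch of one way to do it in OpenCL: double-buffer the input and use two command queues so the transfer for the next block can overlap the kernel that is still chewing on the previous one. This is purely my own illustration (not FAH or OpenMM code), with a made-up kernel and sizes, and with error checking stripped out.

Code:
/* Sketch only: overlap host->GPU transfers with kernel execution by
 * double-buffering across two in-order command queues.
 * Not FAH code; error checking omitted. Build with: gcc overlap.c -lOpenCL */
#include <CL/cl.h>
#include <stdio.h>

#define N      (1 << 20)   /* elements per data block */
#define BLOCKS 8           /* how many blocks to stream through */

static const char *src =
    "__kernel void scale(__global float *a) {\n"
    "    size_t i = get_global_id(0);\n"
    "    a[i] = a[i] * 2.0f;\n"
    "}\n";

static float host[2][N];   /* two host staging buffers */

int main(void) {
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q[2];                     /* one queue per buffer */
    q[0] = clCreateCommandQueue(ctx, dev, 0, NULL);
    q[1] = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    cl_mem buf[2];                             /* two device buffers */
    buf[0] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(float) * N, NULL, NULL);
    buf[1] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(float) * N, NULL, NULL);

    size_t gsz = N;
    for (int b = 0; b < BLOCKS; b++) {
        int cur = b & 1;                       /* alternate buffers */

        /* Wait for the last use of THIS buffer before refilling it;
         * work queued on the OTHER queue keeps the GPU busy meanwhile. */
        clFinish(q[cur]);
        for (int i = 0; i < N; i++) host[cur][i] = (float)b;

        /* Non-blocking write: queued now, can overlap the other queue's kernel. */
        clEnqueueWriteBuffer(q[cur], buf[cur], CL_FALSE, 0,
                             sizeof(float) * N, host[cur], 0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(q[cur], k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    }
    clFinish(q[0]);
    clFinish(q[1]);
    printf("processed %d blocks\n", BLOCKS);
    return 0;
}

Whether that overlap actually happens on a given card depends on the driver and on whether the hardware can copy and compute at the same time, which is exactly the kind of thing the vendor's OpenCL implementation has to get right.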

Compilers are not very effective at scheduling parallel data and the task is nearly impossible without a detailed knowledge of the hardware. Experienced programmers still do a MUCH better job than compilers.

Read the history of Intel's Hyper-Threading. What's the relationship between a long pipeline and idle cycles? How can instructions be reordered to keep the processor busy? Now multiply that problem by some high-order multiplier: the pipeline to the GPU runs at PCI-e speeds (much slower), and the GPU can handle a huge number of parallel operations (much faster for properly scheduled data, once it gets there).

Re: wrapper of F@H for BOINC?

Posted: Mon Jul 05, 2010 12:32 am
by John Naylor
Advance warning: I'm making a lot of guesses here. I also do not speak for the Pande Group.

I don't think it's a problem of poor implementation so much as that the OpenCL specification itself is not yet fast enough to compete with the optimised platforms that are CUDA and CAL. OpenCL 1.1's specification has performance improvements as a stated aim. With so many partners onboard, it may be some time before OpenCL can compete with native CUDA/CAL for application speed, especially as OpenCL on nVidia is implemented using CUDA anyway. Hopefully it will eventually be like OpenGL (in the sense of only needing one set of code), and hopefully it will be fast enough to be feasible for the project. There are a lot of variables at play here, and the Pande Group will want to use whichever option produces correct results fastest.

Judging by the Pande Group's choice of using OpenMM on CUDA first, at the expense of supporting ATI with OpenMM, OpenCL support (either OpenCL itself or in OpenMM's support of OpenCL) is not yet up to scratch. When it can produce accurate results roughly as fast as the current GPU clients, then we'll see FAH on OpenCL.

Anyway, we're a bit off-topic here. ;)

Re: What about OpenCL?

Posted: Mon Jul 05, 2010 12:53 am
by bruce
John Naylor wrote:Anyway, we're a bit off-topic here. ;)
Topic split.

Re: wrapper of F@H for BOINC?

Posted: Mon Jul 05, 2010 1:37 am
by theteofscuba
John Naylor wrote:Advance warning: I'm making a lot of guesses here. I also do not speak for the Pande Group.

I don't think it's a problem of poor implementation so much as that the OpenCL specification itself is not yet fast enough to compete with the optimised platforms that are CUDA and CAL. OpenCL 1.1's specification has performance improvements as a stated aim. With so many partners onboard, it may be some time before OpenCL can compete with native CUDA/CAL for application speed, especially as OpenCL on nVidia is implemented using CUDA anyway. Hopefully it will eventually be like OpenGL (in the sense of only needing one set of code), and hopefully it will be fast enough to be feasible for the project. There are a lot of variables at play here, and the Pande Group will want to use whichever option produces correct results fastest.

Judging by the Pande Group's choice of using OpenMM on CUDA first, at the expense of supporting ATI with OpenMM, OpenCL support (either OpenCL itself or in OpenMM's support of OpenCL) is not yet up to scratch. When it can produce accurate results roughly as fast as the current GPU clients, then we'll see FAH on OpenCL.

Anyway, we're a bit off-topic here. ;)

CUDA and OpenCL share enough common aspects that code can be ported fairly easily: http://developer.amd.com/documentation/ ... .aspx#four
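For example, here's my own quick sketch of how a trivial kernel lines up between the two (illustration only, not taken from OpenMM or any FAH core):

Code:
/* The same vector-add kernel in OpenCL C, with the CUDA equivalents in comments. */

/* CUDA:  __global__ void vadd(const float *a, const float *b, float *c, int n) */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global       float *c,
                   const int n)
{
    /* CUDA:  int i = blockIdx.x * blockDim.x + threadIdx.x; */
    int i = get_global_id(0);

    if (i < n)
        c[i] = a[i] + b[i];
}

/* Launching it:
 *   CUDA:    vadd<<<num_blocks, threads_per_block>>>(a, b, c, n);
 *   OpenCL:  clSetKernelArg(...) for each argument, then
 *            clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                                   &global_size, &local_size, 0, NULL, NULL);
 */

The kernel body is essentially unchanged; the real porting work is in the host-side setup and in re-tuning work-group sizes for each architecture.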

Correct me if I'm wrong, but I suspect that OpenMM+GPU3 is using CUDA because OpenMM is based on GPU2, which was already implemented with CUDA, so they know they have some reliable code to work with. My guess is that it is CUDA-only at this point because of PG's stated intention to abandon AMD's Brook (CAL?) entirely; it just does not provide good performance for newer ATI cards.

Re: What about OpenCL?

Posted: Mon Jul 05, 2010 7:21 am
by cristipurdel
bruce wrote: If the same unit of work were presented to an arbitrary GPU through OpenCL, it would take much, much, much longer to process.
Not quite. It depends on the WU. I saw the same WU being processed with CAL+Brook+ and OpenCL and there was almost no difference.
The code was used at SETI beta (yes, for simple functions like 2+2 it works).
bruce wrote: OpenCL is more of an advertising tool than a productive computing platform.
That's what they said last year about Android, and I think it turned out OK :)

http://oscarbg.blogspot.com/2010/07/ati ... admap.html
If I'm reading this right, Catalyst 10.12 will have OpenCL built in, so the ATI client should be around by New Year's Eve, hopefully with much-needed improvements.

Re: What about OpenCL?

Posted: Mon Jul 05, 2010 11:09 am
by John Naylor
I think expecting a new client so soon is setting yourself up for disappointment... nVidia already have OpenCL built into their drivers and the OpenMM core isn't using that functionality yet, so there's no reason to assume the ATI OpenCL client will arrive that soon either. Sorry :(

Re: What about OpenCL?

Posted: Mon Jul 05, 2010 1:29 pm
by cristipurdel
John Naylor wrote:I think expecting a new client so soon is setting yourself up for disappointment... nVidia already have OpenCL built into their drivers and the OpenMM core isn't using that functionality yet, so there's no reason to assume the ATI OpenCL client will arrive that soon either. Sorry :(
If you are implying that ATI OpenCL will not be here before Xmas... well... good luck with all that, FAH.

Re: What about OpenCL?

Posted: Mon Jul 05, 2010 1:42 pm
by John Naylor
I mean that the ATI/OpenCL FAH client won't be ready by xmas. I've no reason to doubt that ATI will have included OpenCL in their drivers by then.

Re: What about OpenCL?

Posted: Mon Jul 05, 2010 4:36 pm
by codysluder
cristipurdel wrote:
bruce wrote: If the same unit of work were presented to an arbitrary GPU through OpenCL, it would take much, much, much longer to process.
Not quite. It depends on the WU. I saw the same WU being processed with CAL+Brook+ and OpenCL and there was almost no difference.
The code was used at SETI beta (yes, for simple functions like 2+2 it works).
I don't think anybody disputes that OpenCL can do 2+2. I thought the question was whether it could do FAH. Picking an arbitrary but reasonably common FAH platform, let's assume you have a quad-core CPU with a GPU. If you're running the SMP client, SSE can do four floating-point operations simultaneously, so four FPUs (one per core) can do 16. The Gromacs optimizations do a good job of keeping them all busy. The implicit-solvent version in fahcore11, using CUDA, does a good job of keeping the shaders busy and makes good use of the FLOPS that the GPU can do.
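To illustrate the four-wide part (a toy example of mine, nothing to do with Gromacs' actual inner loops): a single SSE instruction operates on four packed single-precision floats, and the SMP client keeps a stream like that running on every core.

Code:
/* One SSE multiply handles four floats at once; four cores doing this
 * simultaneously gives the "16" in the paragraph above. Toy example only. */
#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);       /* load 4 floats into one register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_mul_ps(va, vb);    /* 4 multiplies in a single instruction */
    _mm_storeu_ps(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}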

Just how effective will OpenCL be? If it can only move data to the GPU fast enough that the average number of simultaneous floating-point ops stays below 16, it's worthless (the quad-core SMP client already sustains that). It's not obvious that the OpenCL code can feed the GPU data fast enough to keep it fully loaded. Since the CUDA code can do it, OpenCL has the potential to do it too, but there's no proof that it can yet.

Re: What about OpenCL?

Posted: Mon Jul 05, 2010 5:27 pm
by VijayPande
For the foreseeable future, we will handle multiple GPUs by running a core on each GPU (since running on multiple GPUs for a single MD trajectory will slow performance). The v7 client will automate this process to make it easier on donors.

I'm hoping we can get an ATI/OpenCL client out soon, although a lot of that rests on optimizations in the OpenCL code. Right now, OpenCL is slower than CUDA on both the ATI and NVIDIA platforms. While we could release an OpenCL client for ATI, the PPD would be poor and there would be a lot of upset donors. We are working with ATI and NVIDIA to speed up our OpenCL code. When we're getting performance on the level of what donors would expect, we'll release a client.

By the way, our OpenMM code is completely open (of course), so you can check it out and see how it's progressing (http://simtk.org/home/openmm).

Re: What about OpenCL?

Posted: Mon Jul 05, 2010 10:53 pm
by theteofscuba
VijayPande wrote:
I'm hoping we can get an ATI/OpenCL client out soon, although a lot of that rests on optimizations in the OpenCL code. Right now, OpenCL is slower than CUDA on both the ATI and NVIDIA platforms. While we could release an OpenCL client for ATI, the PPD would be poor and there would be a lot of upset donors. We are working with ATI and NVIDIA to speed up our OpenCL code. When we're getting performance on the level of what donors would expect, we'll release a client.
Hopefully ATI is working on something really complex with promising benefits. nVidia may have an excuse to push users toward CUDA and make OpenCL less appealing. Have there been any attempts to target AMD and Intel CPUs with OpenCL?


EDIT:

AMD showing off how OpenCL can scale across CPU cores: http://www.youtube.com/watch?v=7PAiCinmP9Y

Re: What about OpenCL?

Posted: Tue Jul 06, 2010 4:07 am
by bruce
theteofscuba wrote:...Have there been any attempts to target AMD and Intel CPUs with OpenCL?


EDIT:

AMD showing off how OpenCL can scale across CPU cores: http://www.youtube.com/watch?v=7PAiCinmP9Y
The same sort of demonstration can be done with FAH's SMP client. For a number of years now, it has been a cutting-edge development project specifically targeted at folding on multiple CPU cores. I can't think of any reason why the Pande Group would target the same hardware with a less-optimized solution using OpenCL. (I have no doubt that SMP does a better job of maximizing FAH's performance on multiple CPUs than OpenCL can.) For developers who have not already optimized and debugged their own solution, the OpenCL approach might be cheaper/faster than starting their own custom development project.

Re: What about OpenCL?

Posted: Tue Jul 06, 2010 6:05 am
by theteofscuba
OpenCL is not inherently slow. The resulting code is only as good as what the compiler can produce, and there is no single compiler; each vendor ships its own.

ATI drivers ship with Microsoft Visual C++ 2008 redistributable. That is a very good compiler.

Why should they bother? The point is to maintain a single code base that runs on all hardware for which there is an OpenCL implementation. I'm not a big fan of fixing something that isn't broken, but basically you can write code once and run it anywhere: on one or more CPUs, on one or more GPUs (see the sketch below). It isn't limited to x86-based CPUs either; there are many hardware vendors (e.g. IBM, Fujitsu/Oracle) that offer OpenCL implementations as well. This could mean that you might see Protomol* run on the GPU without having to maintain two different source trees.


*If an algorithm can't make use of a lot of parallel processors, then try mashing in as many additional, separate instances as you can fit in one go.
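As a quick sketch of what "write once, run anywhere" looks like from the host side (my own toy example, not FAH code): this just lists every OpenCL device the installed runtimes expose; the identical kernel source string could then be handed to clBuildProgram for any of them, CPU or GPU.

Code:
/* List every OpenCL platform/device visible to the installed runtimes.
 * Error checking omitted for brevity. Build with: gcc devices.c -lOpenCL */
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id plats[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, plats, &nplat);

    for (cl_uint p = 0; p < nplat; p++) {
        char pname[128];
        clGetPlatformInfo(plats[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);

        cl_device_id devs[8];
        cl_uint ndev = 0;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 8, devs, &ndev);

        for (cl_uint d = 0; d < ndev; d++) {
            char dname[128];
            cl_device_type type;
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
            printf("%s: %s [%s]\n", pname, dname,
                   (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "CPU/other");
        }
    }
    return 0;
}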