Tag Archives: cuda

Adobe turns to OpenCL rather than NVIDIA CUDA for Mercury Graphics Engine in Creative Suite 6

Adobe has just announced Creative Suite 6. CS 5.5 uses the Mercury Playback Engine in Premiere Pro, which takes advantage of NVIDIA’s CUDA library to accelerate processing when an NVIDIA GPU is present. Just to be clear, this is not just graphics acceleration, but programming the GPU to take advantage of its many processor cores for general-purpose computing.

Premiere Pro CS6 also uses the Mercury Playback Engine, and while CUDA is still recommended there is new support for OpenCL:

The Mercury Playback Engine brings performance gains to all the GPUs supported in Adobe Creative Suite 6 software, but the best performance comes with specific NVIDIA® CUDA™ enabled GPUs, including support for mobile GPUs and NVIDIA Maximus™ dual-GPU configurations. New support for the OpenCL-based AMD Radeon HD 6750M and 6770M cards available with certain Apple MacBook Pro computers running OS X Lion (v10.7x), with a minimum of 1GB VRAM, brings GPU-accelerated mobile workflows to Mac users.

Photoshop CS6 also uses the GPU to accelerate processing, via the new Mercury Graphics Engine. The Mercury Graphics Engine uses the OpenCL framework, which is not specific to any one GPU vendor, rather than CUDA:

The Mercury Graphics Engine (MGE) represents features that use video card, or GPU, acceleration. In Photoshop CS6, this new engine delivers near-instant results when editing with key tools such as Liquify, Warp, Lighting Effects and the Oil Paint filter. The new MGE delivers unprecedented responsiveness for a fluid feel as you work. MGE is new to Photoshop CS6, and uses both the OpenGL and OpenCL frameworks. It does not use the proprietary CUDA framework from nVidia.

It seems to me that this amounts to a shift by Adobe from CUDA to OpenCL, which is a good thing for users of non-NVIDIA GPUs.

This also suggests to me that NVIDIA will need to ensure excellent OpenCL support in its GPU cards, as well as continuing to evolve CUDA, since Creative Suite is a key product for designers using the workstations which form a substantial part of the market for high-end GPUs.

NVIDIA releases CUDA Toolkit 4.1 with LLVM compiler

NVIDIA has released version 4.1 of its CUDA Toolkit for general purpose GPU computing.


There is a lot in this release, including a compiler based on LLVM, which will make it easier to support other programming languages; 1000 new imaging functions; and a re-designed visual profiler.

There is also an update to Parallel Nsight, for debugging and profiling CUDA applications in Visual Studio. This is free, though you have to register as an NVIDIA developer. You need this update to work with the 4.1 toolkit.

You do have to update your graphics card driver, using a new build which NVIDIA has not gotten around to signing.

Still, lots of goodies here and a must-have for developers wishing to put their NVIDIA GPU to work for more than just games.

NVIDIA plans to merge CPU and GPU – eventually

I spoke to Dr Steve Scott, NVIDIA’s CTO for Tesla, at the end of the GPU Technology Conference which has just finished here in Beijing. In the closing session, Scott talked about the future of NVIDIA’s GPU computing chips. NVIDIA releases a new generation of graphics chips every two years:

  • 2008 Tesla
  • 2010 Fermi
  • 2012 Kepler
  • 2014 Maxwell

Yes, it is confusing that the Tesla brand, meaning cards for GPU computing, has persisted even though the Tesla chip architecture itself is now obsolete.

Dr Steve Scott showing off the power efficiency of GPU computing

Scott talked a little about a topic that interests me: the convergence or integration of the GPU and the CPU. The background here is that while the GPU is fast and efficient for parallel number-crunching, it is of course still necessary to have a CPU, and there is a price to pay for the communication between the two. The GPU and the CPU each have their own memory, so data must be copied back and forth, which is an expensive operation.
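To make that copying cost concrete, here is a minimal sketch of my own (not taken from NVIDIA’s materials) that times an explicit host-to-device transfer using CUDA events; the buffer size is arbitrary, and a real application would likely use pinned memory to speed the copy up:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Minimal sketch: time an explicit host-to-device copy, to show the
    // transfer cost that separate CPU and GPU memories impose.
    int main() {
        const size_t n = 64 * 1024 * 1024;              // 64M floats (256 MB), arbitrary
        float *host = (float*)malloc(n * sizeof(float));
        float *device = NULL;
        cudaMalloc((void**)&device, n * sizeof(float)); // allocate on the GPU (the "device")

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);                     // wait until the copy has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds
        printf("Copied %zu MB in %.2f ms\n", (n * sizeof(float)) / (1024 * 1024), ms);

        cudaFree(device);
        free(host);
        return 0;
    }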

One solution is for GPU and CPU to share memory, so that a single pointer is valid on both. I asked CEO Jen-Hsun Huang about this and he did not give it much hope:

We think that today it is far better to have a wonderful CPU with its own dedicated cache and dedicated memory, and a dedicated GPU with a very fast frame buffer, very fast local memory, that combination is a pretty good model, and then we’ll work towards making the programmer’s view and the programmer’s perspective easier and easier.

Scott on the other hand was more forthcoming about future plans. Kepler, which is expected in the first half of 2012, will bring some changes to the CUDA architecture which will “broaden the applicability of GPU programming, tighten the integration of the CPU and GPU, and enhance programmability,” to quote Scott’s slides. This integration will include some limited sharing of memory between GPU and CPU, he said.

What caught my interest though was when he remarked that at some future date NVIDIA will probably build CPU functionality into the GPU. The form that might take, he said, is that the GPU will have a couple of cores that do the CPU functions. This will likely be an implementation of the ARM CPU.

Note that this is not promised for Kepler nor even for Maxwell but was thrown out as a general statement of direction.

There are a couple of further implications. One is that NVIDIA plans to reduce its dependence on Intel. ARM is a better partner, Scott told me, because its designs can be licensed by anyone. It is not surprising then that Intel’s multi-core evangelist James Reinders was dismissive when I asked him about NVIDIA’s claim that the GPU is far more power-efficient than the CPU. Reinders says that the forthcoming MIC (Many Integrated Core) processors codenamed Knights Corner are a better solution, referring to the:

… substantial advantages that the Intel MIC architecture has over GPGPU solutions that will allow it to have the power efficiency we all want for highly parallel workloads, but able to run an enormous volume of code that will never run on GPGPUs (and every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor).

In other words, Intel foresees a future without the need for NVIDIA, at least in terms of general-purpose GPU programming, just as NVIDIA foresees a future without the need for Intel.

Incidentally, Scott told me that he left Cray for NVIDIA because of his belief in the superior power efficiency of GPUs. He also described how the Titan supercomputer operated by the Oak Ridge National Laboratory in the USA will be upgraded from its current CPU-only design to incorporate thousands of NVIDIA GPUs, with the intention of achieving twice the speed of Japan’s K computer, currently the world’s fastest.

This whole debate also has implications for Microsoft and Windows. Huang says he is looking forward to Windows on ARM, which makes sense given NVIDIA’s future plans. That said, the impression I get from Microsoft is that Windows on ARM is not intended to be the same as Windows on x86 save for the change of processor. It looks to me as if Windows on ARM is Microsoft’s iOS, a locked-down operating system that will be safer for users and more profitable for Microsoft as app sales are channelled through its store. That is all very well, but it suggests that we will still need x86 Windows if only to retain open access to the operating system.

Another interesting question is what will happen to Microsoft Office on ARM. It may be that x86 Windows will still be required for the full features of Office.

This means we cannot assume that Windows on ARM will be an instant hit; much is uncertain.

When will Intel’s Many Integrated Core processors be mainstream?

I’m at Intel’s software tools conference in Dubrovnik, which I have attended for the last three years, and as usual the big topic is concurrent programming and how to write code that takes advantage of the multiple cores in today’s computers.

Clearly this remains a critical subject, but in some ways the progress over these last three years has been disappointing when it comes to the PCs that most of us use. Many machines are only dual-core, which is sub-optimal for concurrent programming since the overhead of multi-threaded programming eats into the benefit of having two cores. Quad core is now common too, and more useful, but what about having 50 or 80 or more cores? This enables massively parallel processing of the kind that you can easily do today with general-purpose GPU programming using OpenCL or NVIDIA’s CUDA, but not yet on the CPU unless you have a supercomputer. I realise that GPU cores are not the same as CPU cores; but nevertheless they enable some spectacularly fast parallel processing.

I am interested therefore in Intel’s MIC or Many Integrated Core architecture, which combines 50 or more CPU cores on a single chip. MIC is already in preview, with hardware codenamed Knight’s Corner and a development kit called Knight’s Ferry. But when will MIC hit the mainstream for servers and workstations, and how long is it until we can have 50 cores on a commodity desktop PC? I spoke to Intel’s chief evangelist James Reinders.

Reinders first gave me some background on MIC:

“We’ve made those bold steps to dual core, quad core and we’ve got even ten core now, but if you look inside those microprocessors they have a very simple structure. All the cores are hooked together and share their connection to memory, through a shared cache usually that’s on the chip. It’s a simple computer structure, and we know from experience when you build computers with more and more processors, that eventually you go to more sophisticated connections between the cores. You don’t build a 1000-processor super computer and hook them all together with a bus to one memory.

“It’s inevitable that on a chip we need to design a more sophisticated connection. That’s what MIC’s about, that’s what the Larrabee project has always been about, a belief that we should take a bunch of x86 cores and hook them together with something more sophisticated. In this case it’s a ring, a bi-directional, 512-bit wide high performance ring, with multiple connections to memory off the chip, which gives us more bandwidth.

“That’s how I look at MIC, it’s putting a cluster-type of design on a chip.”

But what about timing?

“The first place you’ll see this is in servers and in workstations, where there’s a lot of demand for a lot of computation. In that case we’ll see that availability sometime by the end of 2012. The Intel product should be out late in that year.

“When will we see it in other devices? I think that’s a ways off. It’s a very high core count part, more than 50, it’s going to consume a fair amount of power. The same part 18 months later will probably consume half the power. So inside a decade we could see this being common on desktops, I don’t know about mobile devices, it might even make it to tablets. A decade’s a long time, it gives a lot of time for people to come up with innovative uses for it in software.

“We’ll see single core disappear everywhere.”

Incidentally, it is hard to judge how much computing power is “enough”. Although having many CPU cores may seem overkill for everyday computing, things like speech recognition or on-the-fly image processing make devices smarter at the expense of intense processing under the covers. From supercomputers to smartphones, if more computing capability is available, history tells us that we will find ways to use it.

Is the triumph of the GPU the failure of the CPU?

I’m at NVIDIA’s GPU tech conference in San Jose. The central theme of the conference is that the capabilities of modern GPUs enable substantial performance gains for general computing, not just for graphics, though most of the examples we have seen involve some element of graphical processing. The reason you should care about this is that the gains are huge.

Take Matlab for example, a popular language and IDE for algorithm development, data analysis and mathematical computation. We were told in the keynote here yesterday that Matlab is offering a parallel computing toolkit based on NVIDIA’s CUDA, with speed-ups from 10 to 40 times. Dramatic performance improvements open up new possibilities in computing.

Why has GPU performance advanced so rapidly, whereas CPU performance has levelled off? The reason is that they use different computing models. CPUs are general-purpose. The focus is on fast serial computation, executing a single thread as rapidly as possible. Since many applications are largely single-threaded, this is what we need, but there are technical barriers to increasing clock speed further. Of course multi-core and multi-processor systems are now standard, so we have dual-core or quad-core machines, with big performance gains for multi-threaded applications.

By contrast, GPUs are designed to be massively parallel. A Tesla C1060 has not 2 or 4 or 8 cores, but 240; the C2050 has 448. These are not the same as CPU cores, but nevertheless do execute in parallel. The clock speed is only 1.3GHz, whereas an Intel Core i7 Extreme is 3.3GHz, but the Intel CPU has a mere 6 cores. An Intel Xeon 7560 runs at 2.266GHz and has 8 cores. The lower clock speed in the GPU is one reason it is more power-efficient.

NVIDIA’s CUDA initiative is about making this capability available to any application. NVIDIA made changes to its hardware to make it more amenable to standard C code, and delivered CUDA C with extensions to support it. In essence it is pretty simple. The extensions let you specify functions to execute on the GPU, allocate memory for pointers on the GPU, and copy memory between the GPU (called the device) and the main memory on the PC (called the host). You can also synchronize threads and use shared memory between threads.
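As a rough illustration (my own sketch, not code from NVIDIA), here is what those extensions look like in practice: a function marked __global__ executes on the device, while the host allocates device memory, copies the inputs across, launches the kernel over many threads, and copies the result back:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // __global__ marks a function that runs on the GPU (the device) and is
    // launched from the CPU (the host). Each thread adds one element.
    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host (CPU) buffers
        float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        // Device (GPU) buffers
        float *da, *db, *dc;
        cudaMalloc((void**)&da, bytes);
        cudaMalloc((void**)&db, bytes);
        cudaMalloc((void**)&dc, bytes);

        // Copy the inputs from host to device
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements
        const int threads = 256, blocks = (n + threads - 1) / threads;
        add<<<blocks, threads>>>(da, db, dc, n);
        cudaDeviceSynchronize();              // wait for the GPU to finish

        // Copy the result back from device to host
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("hc[0] = %f\n", hc[0]);        // expect 3.000000

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }

Thread synchronization within a block is done with __syncthreads(), and __shared__ declares the per-block shared memory mentioned above.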

The reward is great performance, but there are several disadvantages. One is the challenge of concurrent programming and the subtle bugs it can introduce.

Another is the hassle of copying memory between host and device. The device is in effect a computer within a computer. Shifting data between the two is relatively slow.

A third is that CUDA is proprietary to NVIDIA. If you want your code to work with ATI’s equivalent, called Stream, then you should use the OpenCL library, though I’ve noticed that most people here seem to use CUDA; I presume they are able to specify the hardware and would rather avoid the compromises of a cross-GPU library. In the worst case, if you need to support both CUDA and non-CUDA systems, you might need to support different code paths depending on what is detected at runtime.
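A hedged sketch of that runtime detection, assuming a simple CPU fallback (cuda_available is my own helper name, but cudaGetDeviceCount is a real CUDA runtime call):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Returns true only if the CUDA runtime reports at least one CUDA-capable GPU.
    static bool cuda_available() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        return err == cudaSuccess && count > 0;
    }

    int main() {
        if (cuda_available()) {
            printf("CUDA GPU found: taking the CUDA code path\n");
            // ... allocate device memory and launch kernels here ...
        } else {
            printf("No CUDA GPU: falling back to the CPU implementation\n");
            // ... run the plain CPU (or OpenCL) version here ...
        }
        return 0;
    }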

It is all a bit messy, though there are tools and libraries to simplify the task. For example, this morning we heard about GMAC, which makes host and device appear to use a single address space, though I imagine there are performance implications.

NVIDIA says it is democratizing supercomputing, bringing high performance computing within reach for almost anyone. There is something in that; but at the same time as a developer I would rather not think about whether my code will execute on the CPU or the GPU. Viewed at the highest level, I find it disappointing that to get great performance I need to bolster the capabilities of the CPU with a specialist add-on. The triumph of the GPU is in a sense the failure of the CPU. Convergence in some form or other strikes me as inevitable.

NVIDIA talks up GPU computing, presents roadmap

At the NVIDIA GPU Technology Conference in San Jose CEO Jen-Hsun Huang talked up the company’s progress in GPU computing, showed some example applications, and announced a high-level roadmap for future graphics chip architectures. NVIDIA has three areas of focus, he said: the Quadro line for visualisation, Tesla for parallel computing, and GeForce/Tegra for personal computing. Tegra is a system on a chip aimed at mobile devices. Mobile, says Huang, is “a completely disruptive force to all of computing.”

NVIDIA’s current chip architecture is called Fermi. The company is settling on a two-year product cycle and will deliver Kepler in 2011 with 3 to 4 times the performance (expressed as Gigaflops per watt) of Fermi. Maxwell in 2013 will have around 12 times the performance of Fermi. In between these architecture changes, NVIDIA will do “kicker” updates to refresh its products, with one for Fermi due soon.

The focus of the conference though is not on super-fast graphics cards in themselves, but rather on using the GPU for general purpose computing. GPUs are very, very good at doing mathematics fast and in parallel. If you have an application that does intensive calculations, then executing that part of the code on the GPU can offer impressive performance increases. NVIDIA’s CUDA library for C lets you do exactly that. Another option is OpenCL, a standard that works across GPUs from multiple vendors.

Adobe uses CUDA for the Mercury Playback engine in Creative Suite 5, greatly improving performance in After Effects, Premiere Pro and Photoshop, but with the annoyance that you have to use a compatible NVIDIA graphics card.

The performance gain from GPU programming is so great that it is unavoidable for applications in relevant areas, such as simulation or statistical analysis. Huang gave a compelling example during the keynote, bringing heart surgeon Dr Michael Black on stage to talk about his work. Operating on a beating heart is difficult because it presents a moving target. By combining robotic surgery with software that is able to predict the heart’s movement through simulation, he is researching how to operate on a heart almost as if it were stopped and with just a small incision.

Programming the GPU is compelling, but difficult. NVIDIA is keen to see it become part of mainstream programming, for obvious reasons, and there are new libraries and tools which help with this, like Parallel Nsight for Visual Studio 2010. Another interesting development, announced today, is CUDA for x86, being developed by PGI, which will let your CUDA code run even when an NVIDIA GPU is not present. Even if the performance gains are limited, it will mean developers who need to support diverse systems can run the same code, rather than having a different code path when no CUDA GPU is detected.

That said, GPU programming still has all the challenges of concurrent development, prone to race conditions and synchronization problems.

Stuffing a server full of GPUs is a cost-effective route to supercomputing. I took a brief look at the exhibition, which includes a Colfax CXT8000 with 8 Tesla GPUs and three 1200W power supplies. It may cost $25,000, but if you look at the performance you are getting for the price, machines like this are great value.
