Tag Archives: gpu computing

Big GPU news at NVIDIA tech conference including first Tegra with CUDA

NVIDIA CEO Jen-Hsun Huang made a number of announcements at the GPU Technology Conference (GTC) keynote yesterday, including an updated roadmap for both desktop and mobile GPUs.


Although the focus of GTC is on high-performance computing using Tesla GPU accelerator boards, Huang’s announcements were not limited to that area but also covered the company’s progress on mobile and on the desktop. Huang opened by mentioning the recently released GeForce Titan graphics processor, which has 2,688 CUDA cores and starts from under £700, putting it within reach of serious gamers as well as developers who can use it for general-purpose computing. CUDA is NVIDIA’s platform for massively parallel general-purpose computing on the GPU. NVIDIA is having problems keeping up with demand, said Huang.

There are now 430 million CUDA-capable GPUs out there, said Huang; CUDA is used in 50 supercomputers and covered in 640 university courses.


He also mentioned last week’s announcement of the Swiss Piz Daint supercomputer which will include Tesla K20X GPU accelerators and will be operational in early 2014.

But what is coming next? Here is the latest GPU roadmap:


Kepler is the current GPU architecture, which introduced dynamic parallelism, the ability for the GPU to generate work without transitioning back to the CPU.

Coming next is Maxwell, which has unified virtual memory. The GPU can see the CPU memory, and the CPU can see the GPU memory, making programming easier. I am not sure how this impacts performance, but note that it is unified virtual memory, so the task of copying data between host and device still exists under the covers.

After Maxwell comes Volta, which focuses on increasing memory bandwidth and reducing latency. Volta includes a stack of DRAM on the same silicon substrate as the GPU, which Huang said enables 1TB per second of memory bandwidth.

What about mobile? NVIDIA is aware of the growth in devices of all kinds. 2.5bn high definition displays are sold each year, said Huang, and this will double again by 2015. These displays are mostly not for PCs, but on smartphones or embedded devices.

Here is the roadmap for Tegra, NVIDIA’s system-on-a-chip (SoC).


Tegra 4, which I saw in preview at last month’s Mobile World Congress in Barcelona, includes a software-defined modem and a computational camera, able to track moving objects while keeping them in focus.

Next is Tegra Logan. This is the first Tegra to include CUDA cores so you can use it for general-purpose computing. It is based on the Kepler GPU and supports full CUDA 5 computing as well as OpenGL 4.3. Logan will be previewed this year and in production early 2014.

After Logan comes Parker. This will be based on the Maxwell GPU (see above) and NVIDIA’s own Denver (ARM-based) CPU. It will include FinFET multigate transistors.

According to Huang, Tegra performance will increase by 100 times over 5 years. Today’s Surface RT (which runs Tegra 3) may be sluggish, but Windows RT will run fine on these future SoCs. Of course Intel is not standing still either.

Finally, Huang announced the Grid Visual Computing Appliance, which I will be covering shortly in another post.

Exascale computing: you could do it today if you could supply the power says Nvidia

Nvidia’s Bill Dally has posted about the company’s progress towards exascale computing, boosted by a $12.4 million grant from the U.S. Department of Energy. He mentions that it would be possible to build an exascale supercomputer today, if you could supply enough power:

Exascale systems will perform a quintillion floating point calculations per second (that’s a billion billion), making them 1,000 times faster than a one petaflop supercomputer. The world’s fastest computer today is about 16 petaflops.

One of the great challenges in developing such systems is in making them energy efficient. Theoretically, an exascale system could be built with x86 processors today, but it would require as much as 2 gigawatts of power — the entire output of the Hoover Dam. The GPUs in an exascale system built with NVIDIA Kepler K20 processors would consume about 150 megawatts. The DOE’s goal is to facilitate the development of exascale systems that consume less than 20 megawatts by the end of the decade.

If the industry succeeds in driving down supercomputer power consumption to one fortieth of what it is today, I guess it also follows that tablets like the one on which I am typing now will benefit from much greater power efficiency. This stuff matters, and not just in the HPC (High Performance Computing) market.
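The figures in Dally’s post can be checked with a little arithmetic. The sketch below is illustrative only: it uses just the numbers quoted above (2 gigawatts for x86, about 150 megawatts for Kepler K20 GPUs, the 20 megawatt DOE goal and today’s 16 petaflop leader), and the variable names are my own.

```python
# Back-of-envelope check of the exascale figures quoted above.
# All inputs come from the quoted post; variable names are illustrative.

EXAFLOP = 1e18      # floating point operations per second
PETAFLOP = 1e15

fastest_today = 16 * PETAFLOP   # "about 16 petaflops"
power_x86_watts = 2e9           # "as much as 2 gigawatts"
power_k20_watts = 150e6         # "about 150 megawatts" (GPUs only)
power_goal_watts = 20e6         # DOE goal: under 20 megawatts

# How much faster is an exascale system than today's fastest machine?
speedup_needed = EXAFLOP / fastest_today

# How far is each design from the DOE power goal?
x86_over_goal = power_x86_watts / power_goal_watts
k20_over_goal = power_k20_watts / power_goal_watts

print(f"Speed-up over today's fastest system: {speedup_needed:.1f}x")
print(f"x86 design exceeds the power goal by: {x86_over_goal:.0f}x")
print(f"K20 design exceeds the power goal by: {k20_over_goal:.1f}x")
```

On these numbers an exascale machine is 62.5 times faster than today’s leader, and the x86 and K20 designs overshoot the 20 megawatt goal by factors of 100 and 7.5 respectively.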

NVIDIA Nsight comes to Eclipse for Mac, Linux GPU programming

NVIDIA has ported its Nsight development tools, previously a plug-in for Visual Studio, to run within the open source Eclipse IDE for use on Mac and Linux.


The Nsight tools include profiling, refactoring, syntax highlighting and auto-completion, as well as a bunch of code samples.

The Windows version for Visual Studio has also been updated, and now supports local GPU debugging as well as new support for DirectX frame debugging and analysis.

Although Eclipse of course runs on Windows, Nsight users should continue to use the Visual Studio version. NVIDIA is not supporting use of the Eclipse Nsight on Windows.

The tools are in preview and you can sign up to try them here.

Another significant development is the availability of the CUDA LLVM Compiler. NVIDIA has contributed CUDA compiler code to the open source LLVM project. This means that other languages which compile to LLVM intermediate assembly language can be adapted to support parallel processing on NVIDIA GPUs. The CUDA Compiler SDK will be made available this week at the NVIDIA GPU Technology Conference in San Jose.

Multicore processor wars: NVIDIA squares up to Intel

I first became aware of NVIDIA’s propaganda war against Intel at the 2012 GPU Technology Conference in Beijing. CEO Jen-Hsun Huang stated that CPUs are remarkably inefficient for multicore processing:

The CPU is fast and is terrific at single-threaded performance, but because so much of the electronics inside the CPU is dedicated to out of order execution, branch prediction, speculative execution, all of the technology that has gone into sustaining instruction throughput and making the CPU faster at single-threaded applications, the electronics necessary to enable it to do that has grown tremendously. With four cores, in order to execute an operation, a floating point add or a floating point multiply, 50 times more energy is dedicated to the scheduling of that operation than the operation itself. If you look at the silicon of a CPU, the floating point unit is only a few percentage of the overall die, and it is consistent with the usage of the energy to sequence, to schedule the instructions running complicated programs.

That figure of 50 times surprised me, and I asked Intel’s James Reinders for a comment. He was quick to respond, noting that:

50X is ridiculous if it encourages you to believe that there is an alternative which is 50X better.  The argument he makes, for a power-efficient approach for parallel processing, is worth about 2X (give or take a little). The best example of this, it turns out, is the Intel MIC [Many Integrated Core] architecture.

Reinders went on to say:

Knights Corner is superior to any GPGPU type solution for two reasons: (1) we don’t have the extra power-sucking silicon wasted on graphics functionality when all we want to do is compute in a power efficient manner, and (2) we can dedicate our design to being highly programmable because we aren’t a GPU (we’re an x86 core – a Pentium-like core for “in order” power efficiency). These two turn out to be substantial advantages that the Intel MIC architecture has over GPGPU solutions that will allow it to have the power efficiency we all want for highly parallel workloads, but able to run an enormous volume of code that will never run on GPGPUs (and every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor).

So Intel is evangelising its MIC vs GPGPU solutions such as NVIDIA’s Tesla line. Yesterday NVIDIA’s Steve Scott spoke up to put the other case. If Intel’s point is that a Tesla is really a GPU pressed into service for general computing, then Scott’s first point is that the cores in MIC are really CPUs, albeit of an older, simpler design:

They don’t really have the equivalent of a throughput-optimized GPU core, but were able to go back to a 15+ year-old Pentium design to get a simpler processor core, and then marry it with a wide vector unit to get higher flops per watt than can be achieved by Xeon processors.

Scott then takes on Intel’s most compelling claim, compatibility with existing x86 code. It does not matter much, says Scott, since you will have to change your code anyway:

The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code. No future processor or system (from Intel, NVIDIA, or anyone else) is going to relieve today’s programmers from the hard work of preparing their applications for the future.

What is the real story here? It would, of course, be most interesting to compare the performance of MIC vs Tesla, or against the next generation of NVIDIA GPGPUs based on Kepler; and may the fastest and most power-efficient win. That will have to wait though; in the meantime we can see that Intel is not enjoying seeing the world’s supercomputers install NVIDIA GPGPUs – the Oak Ridge National Laboratory Jaguar/Titan (the most powerful supercomputer in the USA) being a high profile example:

In addition, 960 of Jaguar’s 18,688 compute nodes now contain an NVIDIA graphical processing unit (GPU). The GPUs were added to the system in anticipation of a much larger GPU installation later in the year.

Equally, NVIDIA may be rattled by the prospect of Intel offering strong competition for Tesla. It has not had a lot of competition in this space.

There is an ARM factor here too. When I spoke to Scott in Beijing, he hinted that NVIDIA would one day produce GPGPUs with ARM chips embedded for CPU duties, perhaps sharing the same memory.

NVIDIA CEO Jen-Hsun Huang beats the drum for GPU computing

In his keynote at the GPU Technology Conference here in Beijing, NVIDIA CEO Jen-Hsun Huang presented the simple logic of GPU computing. The main constraint on computing is power consumption, he said:

Power is now the limiter of every computing platform, from cellphones to PCs and even datacenters.

CPUs are optimized for single-threaded computing and are relatively inefficient. According to Huang, a CPU spends 50 times as much power scheduling instructions as it does executing them. A GPU, by contrast, is formed of many simple processors and is optimized for parallel processing, making it more efficient when measured in FLOP/s (floating-point operations per second, a way of benchmarking computer performance) per watt. It is therefore inevitable that computers will make use of GPU computing in order to achieve best performance. Note that this does not mean dispensing with the CPU, but rather handing off processing to the GPU when appropriate.

This point is now accepted in the world of supercomputers. The computer at the Chinese National Supercomputing Center in Tianjin has 14,336 Intel CPUs, 7,168 NVIDIA Tesla GPUs, and 2,048 custom-designed 8-core CPUs called Galaxy FT-1000, and can achieve 4.7 petaflop/s for a power consumption of 4.04 megawatts (million watts), as presented this morning by the center’s Vice Director Xiaoquian Zhu. This is currently the 2nd fastest supercomputer in the world.

Huang says that without GPUs the world would wait until 2035 for the first Exascale (1 Exaflop/s) supercomputer, presuming a power constraint of 20MW and current levels of performance improvement year by year, whereas by combining CPUs with GPUs this can be achieved in 2019.

Supercomputing is only half of the GPU computing story. More interesting for most users is the way this technology trickles down to the kind of computers we actually use. For example, today Lenovo announced several workstations which use NVIDIA’s Maximus technology to combine a GPU designed primarily for driving a display (Quadro) with a GPU designed primarily for GPU computing (Tesla). These workstations are aimed at design professionals, for whom the ability to render detailed designs quickly is important. The image below shows a Lenovo S20 on display here. Maybe these are not quite everyday computers, but they are still PCs. Approximate price to follow soon when I have had a chance to ask Lenovo. Update: prices start at around $4500 for an S20, with most of the cost being for the Tesla board.


GPU computing with NVIDIA in Beijing

I’m in Beijing for NVIDIA’s GPU Technology Conference; I attended last year’s event in San Jose and found it fascinating, partly because it has an academic and research flavour with a huge variety of projects on display.

This year the event is in Beijing, reflecting the level of HPC (High Performance Computing) activity in this region.



NVIDIA’s business is graphics processors, though it has expanded into the SoC (System on a chip) business with its ARM-based Tegra chipset. This conference though is focused at the other end of the scale: Tesla GPUs that are primarily designed not for driving a display, but for rapid processing using massively parallel computing.

The Tesla business is relatively small for NVIDIA, less than 5% of its overall revenue, I was told, and the company treats it partly as research and development. That said, GPU computing is coming into the mainstream and the business is expected to grow. NVIDIA’s desktop GPU cards also support GPU computing.

I recently reviewed a video format converter from Cyberlink; the product was unexceptional except that it can take advantage of GPU computing when available to speed up conversion from one video format to another. Since I do have a suitable graphics card (though sadly not a Tesla) this made a substantial difference, converting several times faster than another format converter I tried.

Of course NVIDIA is not the only player; there is an open standard (OpenCL) for GPU computing and other GPU vendors such as AMD implement OpenCL. NVIDIA implements OpenCL but also has its own CUDA architecture, which tends to be the focus of its conference as you would expect.

More reports soon.

GPU programming coming to low-power and mobile devices – from EU Mont Blanc supercomputer to smartphones

Supercomputing and low-power computing are not normally associated; but at the SC11 Supercomputing conference the Barcelona Supercomputing Center (BSC) has announced a new supercomputer, called the Mont-Blanc Project, which will combine the ARM-based NVIDIA Tegra SoC with separate CUDA GPUs. CUDA is NVIDIA’s parallel computing architecture, enabling general purpose computing on the GPU.

The project’s publicity says this enables power saving of 15 to 30 times, versus today’s supercomputers:

The analysis of the performance of HPC systems since 1993 shows exponential improvements at the rate of one order of magnitude every 3 years: One petaflops was achieved in 2008, one exaflops is expected in 2020. Based on a 20 MW power budget, this requires an efficiency of 50 GFLOPS/Watt. However, the current leader in energy efficiency achieves only 1.7 GFLOPS/Watt. Thus, a 30x improvement is required.
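The arithmetic in that passage is easy to reproduce. This is a minimal sketch using only the figures quoted (one exaflops, a 20 MW budget and a current best of 1.7 GFLOPS/Watt); the function name is a made-up helper for illustration, not part of any library.

```python
# Reproduce the efficiency arithmetic from the Mont-Blanc publicity.
# required_gflops_per_watt is a hypothetical helper, not a library function.

def required_gflops_per_watt(target_flops: float, power_budget_watts: float) -> float:
    """Return the efficiency (in GFLOPS per watt) needed to hit a target."""
    flops_per_watt = target_flops / power_budget_watts
    return flops_per_watt / 1e9

target = 1e18         # one exaflops
budget = 20e6         # 20 MW power budget
current_best = 1.7    # GFLOPS/Watt, the quoted efficiency leader

needed = required_gflops_per_watt(target, budget)
improvement = needed / current_best

print(f"Required efficiency: {needed:.0f} GFLOPS/Watt")
print(f"Improvement needed: about {improvement:.1f}x")
```

This gives 50 GFLOPS/Watt and a factor of a little over 29, consistent with the rounded 30x in the quote.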

NVIDIA is also creating a new hardware and software development kit for Tegra + CUDA, to be made available in the first half of 2012.


The combination of fast concurrent processing, low power draw and mobile devices is enticing. Features like speech recognition and smart cameras depend on rapid processing, and the technology has the potential to make smart devices very much smarter.

NVIDIA has competition though. ARM, which designs most of the CPUs in use on smartphones and tablets today, has recently started designing mobile GPUs as well, and its Mali series supports OpenCL, an open alternative to CUDA for general-purpose computing on the GPU. The Mali-T604 has 1 to 4 cores while the recently announced Mali-T658 has 1 to 8 cores. ARM specifically optimises its GPUs to work alongside its CPUs, which must be a concern for GPU specialists such as NVIDIA. However, we have yet to see devices with either T604 or T658: the first T604 devices are likely to appear in 2012, and T658 in 2013.

GPU Programming for .NET: Tidepowerd’s GPU.NET gets some improvements, more needed

When I attended the 2010 GPU programming conference hosted by NVIDIA I encountered Tidepowerd, which has a .NET library called GPU.NET for GPU programming.

GPU programming enables amazing performance improvements for certain types of code. Most GPU programming is done in C/C++, but Tidepowerd lets you run code in .NET, simply marking any methods you want to run on the GPU with a [Kernel] attribute:

[Kernel]
private static void AddGpu(float[] a, float[] b, float[] c)
{
    // Get the thread id and total number of threads
    int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
    int TotalThreads = BlockDimension.X * GridDimension.X;

    // Each thread starts at its own id and strides by the total thread count,
    // so every element is handled once even if the array has more elements than threads
    for (int ElementIndex = ThreadId; ElementIndex < a.Length; ElementIndex += TotalThreads)
    {
        c[ElementIndex] = a[ElementIndex] + b[ElementIndex];
    }
}
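
To see why the loop steps by the total thread count, it can help to emulate the indexing on the CPU. The sketch below is plain Python, not GPU.NET: it walks through every (block, thread) pair in sequence, whereas a GPU would run them concurrently, and the grid and block sizes are arbitrary values chosen for illustration.

```python
# CPU-side emulation of the strided loop used in AddGpu.
# This is illustrative Python, not GPU.NET; the names mirror the C#
# identifiers, and the launch dimensions are arbitrary assumptions.

def add_emulated(a, b, grid_dim=2, block_dim=4):
    """Add two equal-length lists the way the strided kernel would."""
    c = [0.0] * len(a)
    visits = [0] * len(a)   # how many times each element is written
    total_threads = block_dim * grid_dim

    # On a GPU these two loops are the hardware threads running in parallel.
    for block_index in range(grid_dim):
        for thread_index in range(block_dim):
            thread_id = block_dim * block_index + thread_index
            # Each thread handles thread_id, thread_id + total_threads, ...
            for i in range(thread_id, len(a), total_threads):
                c[i] = a[i] + b[i]
                visits[i] += 1
    return c, visits

a = [float(i) for i in range(10)]
b = [10.0 * i for i in range(10)]
c, visits = add_emulated(a, b)

print(c)
print(visits)
```

Every element is written exactly once, even though there are only 8 emulated threads for 10 elements: threads 0 and 1 each pick up a second element on their next stride.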



GPU.NET is now at version 2.0 and includes Visual Studio Error List and IntelliSense support. This is useful, since some C# code will not run on the GPU. Strings, for example, are not supported. Take a look at this article which lists .NET OpCodes that do not work in GPU.NET.

GPU.NET requires an NVIDIA GPU with CUDA support and a CUDA 3.0 driver. It can run on Mac and Linux using Mono, the open source implementation of .NET. In principle, GPU.NET could also work with AMD GPUs or others via a vendor-specific runtime, but the latest FAQ says:

Support for AMD devices is currently under development, and support for other hardware architectures will follow shortly.

Another limitation is support for multiple GPUs. If you want to do serious supercomputing relatively cheaply, stuffing a PC with a bunch of Tesla GPUs is a great way to do it, but currently GPU.NET only uses one GPU per active thread as far as I can tell from this note:

The GPU.NET runtime includes a work-scheduling system which can distribute device method (“kernel”) calls to multiple GPUs in the system; at this time, this only works for applications which call device-based methods from multiple host threads using multiple CPU cores. In a future release, GPU.NET will be able to use multiple GPUs to execute a single method call.

I doubt that GPU.NET or other .NET libraries will ever compete with C/C++ for performance, but ease of use and productivity count for a lot too. Potentially GPU.NET could bring GPU programming to the broad range of .NET developers.

It is also worth checking out hoopoe’s CUDA.NET and OpenCL.NET which are free libraries. I have not done a detailed comparison but would be interested to hear from others who have.

NVIDIA postpones GPU Technology Conference to Spring 2012

NVIDIA is postponing its GPU Technology Conference, which was set for October 2011, to a date yet to be announced in April or May 2012, in the San Francisco Bay Area.

What’s the reason? This is what its email newsletter says:

To better align our flagship North American GTC with our growing number of GTC regional events, as well as other events in the HPC calendar, we will establish GTC as an annual springtime event. We will use the Supercomputing Conference (SC) in the fall as a leading venue for advancing GPU computing, and firmly establish GTC as an annual fixture in the spring.

It seems that the October date was too close to that of the Supercomputing Conference 11, which is set for November 12-18 in Seattle.

The company is promising an expanded series of regional events, to support interest in its CUDA language for general-purpose programming on the GPU.

NVIDIA CUDA 4.0 simplifies GPU programming, aims for mainstream

NVIDIA has announced CUDA 4.0, a major update to its C++ toolkit for general programming on the GPU. The idea is to take advantage of the many cores of NVIDIA’s GPUs for speeding up tasks that may not be graphic-related.

There are three key features:

Unified Virtual Addressing provides a single address space for the main system RAM and the GPU RAM, or even RAM across multiple GPUs if available. This significantly simplifies programming.


GPUDIRECT 2.0 is NVIDIA’s name for peer-to-peer communication between multiple GPUs on the same computer. Instead of copying objects from one GPU, to main memory, and to a second GPU, the data can go directly.

Thrust C++ template libraries: Thrust is a CUDA library of parallel algorithms modelled on the C++ Standard Template Library (STL). NVIDIA claims that typical Thrust routines are 5 to 100 times faster than with STL or Intel’s Threading Building Blocks. Thrust is not really new but is getting pushed to the mainstream of CUDA programming.

Other new features include debugging (cuda-gdb) support on Mac OS X, support for new/delete and virtual functions in C++, and improvement to multi-threading.

The common theme of these features is to make it easier for mortals to move from general C/C++ programming to CUDA programming, and to port existing code. This is how NVIDIA sees CUDA progress:


Certainly I see increasing interest in GPU programming, and not just among super-computer researchers.

A weakness is that CUDA only works on NVIDIA GPUs. You can use OpenCL for generic GPU programming but it is less advanced.

The CUDA 4.0 release candidate will be available from March 4 if you sign up for the CUDA Registered Developer Program.