Tag Archives: gpu

NVIDIA releases CUDA Toolkit 4.1 with LLVM compiler

NVIDIA has released version 4.1 of its CUDA Toolkit for general purpose GPU computing.


There is a lot in this release, including a compiler based on LLVM, which will make it easier to support other programming languages; 1000 new imaging functions; and a re-designed visual profiler.

There is also an update to Parallel Nsight, for debugging and profiling CUDA applications in Visual Studio. This is free, though you have to register as an NVIDIA developer. You need this update to work with the 4.1 toolkit.

You do have to update your graphics card driver, using a new build which NVIDIA has not gotten around to signing.


Still, lots of goodies here and a must-have for developers wishing to put their NVIDIA GPU to work for more than just games.

On Supercomputers, China’s Tianhe-1A in particular, and why you should think twice before going to see one

I am just back from Beijing courtesy of Nvidia; I attended the GPU Technology conference and also got to see not one but two supercomputers: Mole-8.5 in Beijing and Tianhe-1A in Tianjin, a coach ride away.

Mole-8.5 is currently at no. 21 and Tianhe-1A at no. 2 on the top 500 list of the world’s fastest supercomputers.

There was a reason Nvidia took journalists along, of course. Both are powered partly by Nvidia Tesla GPUs, and it is part of the company’s campaign to convince the world that GPUs are essential for supercomputing, because of their greater efficiency than CPUs. Intel says we should wait for its MIC (Many Integrated Core) CPU instead; but Nvidia has a point, and increasing numbers of supercomputers are plugging in thousands of Nvidia GPUs. That does not include the world’s current no. 1, Japan’s K Computer, but it will include the USA’s Titan, currently no. 3, which will add up to 18,000 GPUs in 2012 with plans that may take it to the top spot; we were told that it aims to be twice as fast as the K Computer.

Supercomputers are important. They excel at processing large amounts of data, so typical applications are climate research, biomedical research, simulations of all kinds used for design and engineering, energy modelling, and so on. These efforts are important to the human race, so you will never catch me saying that supercomputers are esoteric and of no interest to most of us.

That said, supercomputers are physically little different from any other datacenter: rows of racks. Here is a bit of Mole-8.5:


and here is a bit of Tianhe-1A:


In some ways Tianhe-1A is more striking from outside.


If you are interested in datacenters, how they are cooled, how they are powered, how they are constructed, then you will enjoy a visit to a supercomputer. Otherwise you may find it disappointing, especially given that you can run an application on a supercomputer without any need to be there physically.

Of course there is still value in going to a supercomputing centre to talk to the people who run it and find out more about how the system is put together. Again, though, I should warn you that physically a supercomputer is repetitive. They achieve their mighty flop/s (floating point operations per second) counts by having lots and lots of processors (whether CPU or GPU) running in parallel. You can make a supercomputer faster by adding another cupboard with another set of racks with more boards with CPUs or GPUs, and provided your design is right you will get more flop/s.
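The arithmetic behind this kind of scaling can be sketched in a few lines. The function names are hypothetical, and the figures (processors per cabinet, clock rate, operations per cycle, efficiency) are made-up illustrative values, not the specification of any machine mentioned here.

```python
def peak_flops(processors, cores_per_processor, clock_hz, flops_per_cycle):
    """Theoretical peak: every core completes flops_per_cycle operations per clock tick."""
    return processors * cores_per_processor * clock_hz * flops_per_cycle


def sustained_flops(peak, efficiency):
    """What you actually get; efficiency (0..1) depends on the network and software."""
    return peak * efficiency


# Illustrative numbers only (assumptions, not real hardware figures).
one_cabinet = peak_flops(processors=128, cores_per_processor=6,
                         clock_hz=2.5e9, flops_per_cycle=4)
two_cabinets = peak_flops(processors=256, cores_per_processor=6,
                          clock_hz=2.5e9, flops_per_cycle=4)

print(f"one cabinet:  {one_cabinet / 1e12:.2f} Tflop/s peak")
print(f"two cabinets: {two_cabinets / 1e12:.2f} Tflop/s peak")
```

Peak doubles when the processor count doubles; whether the sustained figure follows is the "provided your design is right" part.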

Yes there is more to it than that, and points of interest include the speed of the network, which is critical in order to support high performance, as well as the software that manages it. Take a look at the K Computer’s Tofu Interconnect. But the term “supercomputer” is a little misleading: we are talking about a network of nodes rather than a single amazing monolithic machine.

Personally I enjoyed the tours, though the visit to Tianhe-1A was among the more curious visits I have experienced. We visited along with a bunch of Nvidia executives. The execs sat along one side of a conference table, the Chinese hosts along the other side, and they engaged in a diplomatic exercise of being very polite to each other while the journalists milled around the room.


We did get a tour of Tianhe-1A but unfortunately little chance to talk to the people involved, though we did have a short group interview with the project director, Liu Guangming.


He gave us short, guarded but precise answers, speaking through an interpreter. We asked about funding. “The way things work here is different from how it works in the USA,” he said, “The government supports us a lot, the building and infrastructure, all the machines, are all paid for by the government. The government also pays for the operational cost.” Nevertheless, users are charged for their time on Tianhe-1A, but this is to promote efficiency. “If users pay they use the system more efficiently, that is the reason for the charge,” he said. However, the users also get their funding from the government’s research budget.

Downplayed on the slides, but mentioned here, is the fact that the supercomputer was developed by the “National team of defence technology.” Food for thought.

We also asked about the usage of the GPU nodes as opposed to the CPU nodes, having noticed that many of the applications presented in the briefing were CPU-only. “The GPU stage is somewhat experimental,” he said, though he is “seeing increasing use of the GPU, and such a heterogeneous system should be the future of HPC [High Performance Computing].” Some applications do use the GPU and the results have been good. Overall the system has 60-70% sustained utilisation.

Another key topic: might China develop its own GPU? Tianhe-1A already includes 2048 China-designed “Galaxy FT” CPUs, alongside 14336 Intel CPUs and 7168 NVIDIA GPUs.

“We already have the technology,” said Guangming.

From 2005-7 we designed a chip, a stream processor similar to a GPU. But the peak performance was not that good. We tried AMD GPUs, but they do not have ECC [error-correcting code memory], so that is why we went to NVIDIA. China does have the technology to make GPUs. Also the technology is growing, but what we implement is a commercial decision.

Liu Guangming closed with a short speech.

Many of the people from outside China might think that China’s HPC experienced explosive development last year. But China has been involved in HPC for 20 years. Next, the Chinese government is highly committed to HPC. Third, the economy is growing fast and we see the demand for HPC. These factors have produced the explosive growth you witnessed.

The Tianjin Supercomputer is open and you are welcome to visit.

NVIDIA plans to merge CPU and GPU – eventually

I spoke to Dr Steve Scott, NVIDIA’s CTO for Tesla, at the end of the GPU Technology Conference which has just finished here in Beijing. In the closing session, Scott talked about the future of NVIDIA’s GPU computing chips. NVIDIA releases a new generation of graphics chips every two years:

  • 2008 Tesla
  • 2010 Fermi
  • 2012 Kepler
  • 2014 Maxwell

Yes, it is confusing that the Tesla brand, meaning cards for GPU computing, has persisted even though the Tesla family is now obsolete.

Dr Steve Scott showing off the power efficiency of GPU computing

Scott talked a little about a topic that interests me: the convergence or integration of the GPU and the CPU. The background here is that while the GPU is fast and efficient for parallel number-crunching, it is of course still necessary to have a CPU, and there is a price to pay for the communication between the two. The GPU and the CPU each have their own memory, so data must be copied back and forth, which is an expensive operation.
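A rough cost model shows why that copying matters. This is a sketch under stated assumptions: the function names are hypothetical and the bandwidth and timing figures are invented for illustration, not measurements of any real GPU or bus.

```python
def transfer_seconds(num_bytes, bus_bytes_per_second):
    """Time to move data across the bus between CPU memory and GPU memory."""
    return num_bytes / bus_bytes_per_second


def gpu_total_seconds(input_bytes, output_bytes, bus_bytes_per_second,
                      gpu_compute_seconds):
    """Copy the input to the GPU, compute, then copy the result back."""
    return (transfer_seconds(input_bytes, bus_bytes_per_second)
            + gpu_compute_seconds
            + transfer_seconds(output_bytes, bus_bytes_per_second))


def offload_is_faster(cpu_compute_seconds, gpu_total):
    """Offloading only pays if the whole round trip beats staying on the CPU."""
    return gpu_total < cpu_compute_seconds


# Assumed figures: 4 GB in, 4 GB out, over a bus moving 8 GB/s,
# with a GPU kernel that takes 0.2 s.
total = gpu_total_seconds(input_bytes=4e9, output_bytes=4e9,
                          bus_bytes_per_second=8e9, gpu_compute_seconds=0.2)

print(offload_is_faster(cpu_compute_seconds=1.0, gpu_total=total))
print(offload_is_faster(cpu_compute_seconds=10.0, gpu_total=total))
```

With these numbers the GPU kernel is five times faster than a 1-second CPU job, yet the round trip still loses because a full second goes on copying. Only the heavier 10-second job justifies the transfer.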

One solution is for GPU and CPU to share memory, so that a single pointer is valid on both. I asked CEO Jen-Hsun Huang about this and he did not hold out much hope:

We think that today it is far better to have a wonderful CPU with its own dedicated cache and dedicated memory, and a dedicated GPU with a very fast frame buffer, very fast local memory, that combination is a pretty good model, and then we’ll work towards making the programmer’s view and the programmer’s perspective easier and easier.

Scott on the other hand was more forthcoming about future plans. Kepler, which is expected in the first half of 2012, will bring some changes to the CUDA architecture which will “broaden the applicability of GPU programming, tighten the integration of the CPU and GPU, and enhance programmability,” to quote Scott’s slides. This integration will include some limited sharing of memory between GPU and CPU, he said.

What caught my interest though was when he remarked that at some future date NVIDIA will probably build CPU functionality into the GPU. The form that might take, he said, is that the GPU will have a couple of cores that do the CPU functions. This will likely be an implementation of the ARM CPU.

Note that this is not promised for Kepler nor even for Maxwell but was thrown out as a general statement of direction.

There are a couple of further implications. One is that NVIDIA plans to reduce its dependence on Intel. ARM is a better partner, Scott told me, because its designs can be licensed by anyone. It is not surprising then that Intel’s multi-core evangelist James Reinders was dismissive when I asked him about NVIDIA’s claim that the GPU is far more power-efficient than the CPU. Reinders says that the forthcoming MIC (Many Integrated Core) processors codenamed Knights Corner are a better solution, referring to the:

… substantial advantages that the Intel MIC architecture has over GPGPU solutions that will allow it to have the power efficiency we all want for highly parallel workloads, but able to run an enormous volume of code that will never run on GPGPUs (and every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor).

In other words, Intel foresees a future without the need for NVIDIA, at least in terms of general-purpose GPU programming, just as NVIDIA foresees a future without the need for Intel.

Incidentally, Scott told me that he left Cray for NVIDIA because of his belief in the superior power efficiency of GPUs. He also described how the Titan supercomputer operated by the Oak Ridge National Laboratory in the USA will be upgraded from its current CPU-only design to incorporate thousands of NVIDIA GPUs, with the intention of achieving twice the speed of Japan’s K computer, currently the world’s fastest.

This whole debate also has implications for Microsoft and Windows. Huang says he is looking forward to Windows on ARM, which makes sense given NVIDIA’s future plans. That said, the impression I get from Microsoft is that Windows on ARM is not intended to be the same as Windows on x86 save for the change of processor. My impression is that Windows on ARM is Microsoft’s iOS, a locked-down operating system that will be safer for users and more profitable for Microsoft as app sales are channelled through its store. That is all very well, but suggests that we will still need x86 Windows if only to retain open access to the operating system.

Another interesting question is what will happen to Microsoft Office on ARM. It may be that x86 Windows will still be required for the full features of Office.

This means we cannot assume that Windows on ARM will be an instant hit; much is uncertain.

Flash to get 3D acceleration with “Molehill”

One of the demos here at Adobe Max was a 3D racing game, running in Flash with 3D acceleration. It was enabled by a new set of GPU-accelerated APIs codenamed Molehill. Adobe CTO Kevin Lynch remarked that with GPU-accelerated 3D, Flash games could come closer to console games in the experience they offer. Lynch also demonstrated using a game controller with a Flash game.

There are no precise dates for availability, but Adobe expects to offer a public beta in the first half of 2011. The APIs will be available in a future version of the Flash Player. Under the covers, the 3D APIs will use DirectX 9 on Windows and OpenGL 1.3 on MacOS and Linux. If no supported 3D API is found on a particular platform, Flash will fall back to software rendering.

One interesting aspect is that Molehill will also work on mobile devices, where it will use OpenGL ES 2.0. Apparently GPUs will be common on mobile devices because they enable longer battery life than relying on the CPU for all processing. I heard similar remarks at the NVIDIA GPU conference last month.
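The selection logic described above amounts to a lookup with a fallback. The sketch below is my own illustration of that idea, not Flash Player code; the platform keys and the `choose_renderer` function are hypothetical names.

```python
# Preferred 3D API per platform, as described by Adobe.
PREFERRED_API = {
    "windows": "DirectX 9",
    "macos": "OpenGL 1.3",
    "linux": "OpenGL 1.3",
    "mobile": "OpenGL ES 2.0",
}

SOFTWARE_FALLBACK = "software rendering"


def choose_renderer(platform, available_apis):
    """Return the platform's preferred 3D API if present, else fall back to software."""
    preferred = PREFERRED_API.get(platform)
    if preferred is not None and preferred in available_apis:
        return preferred
    return SOFTWARE_FALLBACK


print(choose_renderer("windows", {"DirectX 9"}))
print(choose_renderer("linux", set()))
```

The point of the fallback is that content written against the 3D APIs still runs everywhere, just more slowly where there is no usable GPU.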

This will be a significant development, especially when put in the context of Flash appearing in the living room, built into a TV or on Google TV.

Adobe’s plenoptic lens enables refocus magic

The most eye-opening demonstration at the NVIDIA GPU Technology Conference last week was from Adobe’s David Salesin (Sr. Principal Scientist) and Todor Georgiev (Sr. Research Scientist), who showed their Plenoptic Lens along with software for processing the resulting images.


There was a gasp of amazement from the audience when we saw what the process is capable of. We saw an image refocused after the event.


For anyone who has ever taken an out-of-focus picture – which I guess is everyone – the immediate reaction is to want one NOW. Another appealing idea is to take an image that has several items of interest, but at different depths, and shift the focus from one to another.

So how does it work? It starts with the plenoptic lens, which lets you “capture multiple views of the scene from slightly different viewpoints,” said Salesin:

If you have a high resolution sensor then each one of those images can be fairly high resolution. The neat thing is that with software, with computation, you can put this together into one large high-resolution image.

In a sense you are capturing a whole 4D lightfield. You’ve got two dimensions of the spatial position of the light ray, and also two dimensions of the orientation of the light ray.

With that 4D image, you can then after the fact use computation to take the place of optics. With computation you have a lot more flexibility. You can change the vantage point, the viewpoint a little bit, and you can also change the focus.

To resolve that, to take these individual little pieces of an image and put them together into one large image from any arbitrary view with any arbitrary focus, it turns out that texture mapping hardware is exactly what you need to do that. Using GPU chips we’ve been able to get speedups over the CPU of about 500 times.

Note that the image ends up being constructed in software. It is not just a matter of overlaying the small images in a certain way.
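To get a feel for how several slightly offset views can be refocused in software, here is the simplest textbook approach, usually called shift-and-add. It is much cruder than the rendering Adobe described and is not their algorithm. The data layout (a dict of views keyed by viewpoint offset) and the `refocus` function are my own assumptions for illustration.

```python
def refocus(views, shift_per_view):
    """Average all views after shifting each in proportion to its viewpoint offset.

    views: dict mapping (u, v) integer viewpoint offsets to equally sized
           2D images (lists of rows of numbers).
    shift_per_view: pixels of shift per unit of viewpoint offset; changing
           this value selects which depth ends up in focus.
    """
    first = next(iter(views.values()))
    height, width = len(first), len(first[0])
    total = [[0.0] * width for _ in range(height)]
    for (u, v), image in views.items():
        dx = round(shift_per_view * u)
        dy = round(shift_per_view * v)
        for y in range(height):
            for x in range(width):
                # Clamp at the borders so shifted reads stay inside the image.
                sy = min(max(y + dy, 0), height - 1)
                sx = min(max(x + dx, 0), width - 1)
                total[y][x] += image[sy][sx]
    count = len(views)
    return [[value / count for value in row] for row in total]


def view_with_point(column, width=7):
    """A one-row test image with a single bright pixel."""
    return [[1.0 if x == column else 0.0 for x in range(width)]]


# Synthetic scene: one bright point that appears one pixel further right
# for each step of the viewpoint.
views = {(u, 0): view_with_point(3 + u) for u in (-1, 0, 1)}

blurred = refocus(views, shift_per_view=0)  # views misaligned: point is smeared
focused = refocus(views, shift_per_view=1)  # views aligned: point is sharp
print(blurred[0])
print(focused[0])
```

With no shift the point is smeared across three pixels; with the matching shift the three views line up and it snaps back to a single pixel. Each output pixel is just a handful of lookups into the source views, which is the kind of per-pixel sampling work GPUs are built for.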

There is a good reason NVIDIA showed this at its conference. Suddenly we all want little cameras with GPUs powerful enough to do this on the fly.

I guess this demo is likely to show up again at the Adobe MAX conference next month.

There’s another report on this with diagrams here.