Tag Archives: nvidia

GPU computing with NVIDIA in Beijing

I’m in Beijing for NVIDIA’s GPU Technology Conference; I attended last year’s event in San Jose and found it fascinating, partly because it has an academic and research flavour with a huge variety of projects on display.

This year the event is in Beijing, reflecting the level of HPC (High Performance Computing) activity in this region.



NVIDIA’s business is graphics processors, though it has expanded into the SoC (System on a chip) business with its ARM-based Tegra chipset. This conference though is focused at the other end of the scale: Tesla GPUs that are primarily designed not for driving a display, but for rapid processing using massively parallel computing.

The Tesla business is relatively small for NVIDIA, at less than 5% of its overall revenue, I was told, and the company treats it partly as research and development. That said, GPU computing is coming into the mainstream and the business is expected to grow. NVIDIA’s desktop GPU cards also support GPU computing.

I recently reviewed a video format converter from Cyberlink; the product was unexceptional except that it can take advantage of GPU computing, when available, to speed up conversion from one video format to another. Since I do have a suitable graphics card (though sadly not a Tesla) this made a substantial difference, converting several times faster than another format converter I tried.

Of course NVIDIA is not the only player; there is an open standard (OpenCL) for GPU computing and other GPU vendors such as AMD implement OpenCL. NVIDIA implements OpenCL but also has its own CUDA architecture, which tends to be the focus of its conference as you would expect.

More reports soon.

New OpenACC compiler directives announced for GPU accelerated programming

A new standard for accelerating C/C++ programming with compiler directives has been announced at the SC11 Supercomputing conference in Seattle. The new standard is called OpenACC and has been created by NVIDIA, Cray, PGI (Portland Group) and CAPS enterprise.

OpenACC compiler directives are code annotations that enable the compiler to parallelise code while ensuring thread-safety. The big difference between OpenACC and the existing OpenMP standard is that OpenACC primarily targets the GPU rather than CPU, whereas OpenMP is generally CPU only. That said, OpenACC can also target the CPU so it is flexible; the idea is that it will adapt to the target system.
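To give a flavour of what a directive looks like, here is a minimal sketch of a SAXPY loop annotated with an OpenACC pragma. The function name `saxpy` and the choice of data clauses are my own illustration, not taken from the announcement. A compiler without OpenACC support ignores the pragma and runs the loop serially on the CPU, which is the adaptability described above.

```cpp
#include <cassert>
#include <vector>

// SAXPY: y[i] = a * x[i] + y[i] for every element.
// Illustrative only: the function name and data clauses are assumptions.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();

    // Ask an OpenACC compiler to parallelise the loop, copying x to the
    // accelerator and y in both directions. Other compilers skip this line.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Calling `saxpy(2.0f, x, y)` gives the same result whether or not the compiler understands the directive; only the place where the loop runs changes.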


OpenACC is “defined to be interoperable with OpenMP” according to the FAQ, and the OpenMP CEO hopes for some future integration, though OpenACC seems to have been developed independently, which may cause some tension.

OpenACC is expected to ship during the first half of 2012 on compilers from PGI, Cray and CAPS enterprise. The NVIDIA involvement may make you wonder whether it is GPU-specific; the answer is “maybe”. The FAQ says:

Will OpenACC run on AMD GPUs?

– It could, it requires implementation, there is no reason why it couldn’t

Will OpenACC run on top of OpenCL?

– It could, it requires implementation, there is no reason why it couldn’t

Will AMD/Intel/MS/XX support this?

– As this is just announced we can’t speak to the rate of external adoption or participation.

Will OpenACC run on NVIDIA GPUs with CUDA?

– Yes. Programmers may wish to develop some code using directives, and more sophisticated code using CUDA C, CUDA C++ or CUDA Fortran.

Spot the Yes in the above! Still, you can scarcely blame NVIDIA for supporting its own GPU family; and I have been impressed with how the company works with the scientific and academic community to realise the potential of massively parallel computing.

OpenACC is about democratising parallelism, rather than advancing the state of the art. Best optimisation is obtained by more complex programming, but directives make some remarkable performance improvements easy to achieve.

GPU Programming for .NET: Tidepowerd’s GPU.NET gets some improvements, more needed

When I attended the 2010 GPU programming conference hosted by NVIDIA I encountered Tidepowerd, which has a .NET library called GPU.NET for GPU programming.

GPU programming enables amazing performance improvements for certain types of code. Most GPU programming is done in C/C++, but Tidepowerd lets you run code in .NET, simply marking any methods you want to run on the GPU with a [kernel] attribute:


    [Kernel]
    private static void AddGpu(float[] a, float[] b, float[] c)
    {
        // Get the thread id and total number of threads
        int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
        int TotalThreads = BlockDimension.X * GridDimension.X;

        // Each thread handles every TotalThreads-th element, starting at its own id
        for (int ElementIndex = ThreadId; ElementIndex < a.Length; ElementIndex += TotalThreads)
        {
            c[ElementIndex] = a[ElementIndex] + b[ElementIndex];
        }
    }
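The loop in that kernel is what CUDA programmers call a grid-stride loop: each thread starts at its own index and then jumps forward by the total number of threads. The following is a CPU-only sketch in C++ that simulates the pattern, assuming hypothetical launch parameters `blockDim` and `gridDim`. It is not GPU.NET code; it only shows that every element is visited exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of a grid-stride loop. blockDim and gridDim are
// hypothetical launch parameters chosen by the caller. On a GPU the outer
// loop would not exist: each thread would run the inner loop concurrently
// using its own threadId.
void addGridStride(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, int blockDim, int gridDim)
{
    assert(a.size() == b.size() && a.size() == c.size());
    assert(blockDim > 0 && gridDim > 0);
    const std::size_t totalThreads =
        static_cast<std::size_t>(blockDim) * static_cast<std::size_t>(gridDim);

    for (std::size_t threadId = 0; threadId < totalThreads; ++threadId) {
        // Same stride pattern as the kernel: start at threadId, step by totalThreads.
        for (std::size_t i = threadId; i < a.size(); i += totalThreads) {
            c[i] = a[i] + b[i];
        }
    }
}
```

With seven elements and four simulated threads, thread 0 handles elements 0 and 4, thread 1 handles 1 and 5, and so on, which is why the array length does not need to match the thread count.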



GPU.NET is now at version 2.0 and includes Visual Studio Error List and IntelliSense support. This is useful, since some C# code will not run on the GPU. Strings, for example, are not supported. Take a look at this article which lists .NET OpCodes that do not work in GPU.NET.

GPU.NET requires an NVIDIA GPU with CUDA support and a CUDA 3.0 driver. It can run on Mac and Linux using Mono, the open source implementation of .NET. In principle, GPU.NET could also work with AMD GPUs or others via a vendor-specific runtime, but the latest FAQ says:

Support for AMD devices is currently under development, and support for other hardware architectures will follow shortly.

Another limitation is support for multiple GPUs. If you want to do serious supercomputing relatively cheaply, stuffing a PC with a bunch of Tesla GPUs is a great way to do it, but currently GPU.NET only uses one GPU per active thread as far as I can tell from this note:

The GPU.NET runtime includes a work-scheduling system which can distribute device method (“kernel”) calls to multiple GPUs in the system; at this time, this only works for applications which call device-based methods from multiple host threads using multiple CPU cores. In a future release, GPU.NET will be able to use multiple GPUs to execute a single method call.

I doubt that GPU.NET or other .NET libraries will ever compete with C/C++ for performance, but ease of use and productivity count for a lot too. Potentially GPU.NET could bring GPU programming to the broad range of .NET developers.

It is also worth checking out hoopoe’s CUDA.NET and OpenCL.NET which are free libraries. I have not done a detailed comparison but would be interested to hear from others who have.

NVIDIA postpones GPU Technology Conference to Spring 2012

NVIDIA is postponing its GPU Technology Conference, which was set for October 2011, to a date yet to be announced in April or May 2012, in the San Francisco Bay Area.

What’s the reason? This is what its email newsletter says:

To better align our flagship North American GTC with our growing number of GTC regional events, as well as other events in the HPC calendar, we will establish GTC as an annual springtime event. We will use the Supercomputing Conference (SC) in the fall as a leading venue for advancing GPU computing, and firmly establish GTC as an annual fixture in the spring.

It seems that the October date was too close to that of the Supercomputing Conference 11, which is set for November 12-18 in Seattle.

The company is promising an expanded series of regional events, to support interest in its CUDA language for general-purpose programming on the GPU.

NVIDIA CUDA 4.0 simplifies GPU programming, aims for mainstream

NVIDIA has announced CUDA 4.0, a major update to its C++ toolkit for general programming on the GPU. The idea is to take advantage of the many cores of NVIDIA’s GPUs for speeding up tasks that may not be graphic-related.

There are three key features:

Unified Virtual Addressing: provides a single address space for the main system RAM and the GPU RAM, or even RAM across multiple GPUs if available. This significantly simplifies programming.

GPUDirect 2.0: NVIDIA’s name for peer-to-peer communication between multiple GPUs on the same computer. Instead of copying objects from one GPU to main memory and then to a second GPU, the data can go directly.

Thrust C++ template libraries: Thrust is a CUDA library of parallel algorithms whose interface resembles the C++ Standard Template Library (STL). NVIDIA claims that typical Thrust routines are 5 to 100 times faster than with STL or Intel’s Threading Building Blocks. Thrust is not really new but is getting pushed to the mainstream of CUDA programming.
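To show why the STL resemblance matters for porting, here is a small host-only sketch using standard C++ algorithms. It is not Thrust code and it runs on the CPU; the function name `sumOfPairwiseSums` is my own. Thrust provides `thrust::transform`, `thrust::plus` and `thrust::reduce`, which play the roles of `std::transform`, `std::plus` and `std::accumulate` here, with `thrust::device_vector` standing in for `std::vector` so that the data lives on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Add two vectors element by element, then sum the result.
// Written with the standard library only; the Thrust equivalents named in
// the text above follow the same iterator-based call shape.
float sumOfPairwiseSums(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    std::transform(a.begin(), a.end(), b.begin(), c.begin(), std::plus<float>());
    return std::accumulate(c.begin(), c.end(), 0.0f);
}
```

Code already written in this style is the kind that should be easiest to move across, since the structure of the calls stays much the same.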

Other new features include debugging (cuda-gdb) support on Mac OS X, support for new/delete and virtual functions in C++, and improvement to multi-threading.

The common theme of these features is to make it easier for mortals to move from general C/C++ programming to CUDA programming, and to port existing code. This is how NVIDIA sees CUDA progress:


Certainly I see increasing interest in GPU programming, and not just among super-computer researchers.

A weakness is that CUDA only works on NVIDIA GPUs. You can use OpenCL for generic GPU programming but it is less advanced.

The CUDA 4.0 release candidate will be available from March 4 if you sign up for the CUDA Registered Developer Program.

Qualcomm: optimising for Windows Phone took years not months

I had a chat with Qualcomm’s Raj Talluri here at Mobile World Congress in Barcelona. Of course I asked about the Nokia-Microsoft deal and the implications for Qualcomm. Currently Microsoft specifies Qualcomm’s Snapdragon as the required chipset for Windows Phone 7 devices: good for Qualcomm, not so good for Microsoft since it means competing system-on-a-chip vendors like TI and NVIDIA are putting all their efforts into Android or other mobile operating systems.

“We are extremely pleased and we are very optimistic that it will bring us additional business,” said Talluri about the Nokia-Microsoft alliance. That said, might Nokia in fact choose a competing chipset for its Windows Phone devices?

It might; but the issue here is the work involved in optimising the hardware and drivers for the OS:

If you look at Windows Phone, there’s a lot of custom work we did with Microsoft that makes Windows Phone 7 really shine on Snapdragon … the amount of time we spent in getting those things optimized, it’s been a multi-year effort for us.

If you put this together with Nokia’s announced intention to ship Windows Phone devices this year, it is hard to see how it could use a chipset other than Snapdragon.

That said, those other vendors might not agree that it would take years. When I asked about this, NVIDIA gave me the impression that it could do the work in a few months, if there was a business case for it.

Still, it is not a trivial matter, and adds potential for delay. I think we should expect Nokia’s first Windows phones to run Qualcomm chipsets.

If the Windows Phone ecosystem builds as Nokia hopes, other chipset vendors may get involved. Then again, what are Microsoft’s plans for the Windows Phone OS long-term? Might the underlying Windows CE OS get scrapped in favour of something coming out of the Windows on ARM project? Silverlight and XNA apps should port across easily.

That is a matter for speculation, but the possibility may deter other mobile chipset manufacturers from heavy investment in Windows Phone support.

NVIDIA: first mobile quad-core devices will be this year

Qualcomm was first to announce a quad-core mobile chipset here at Mobile World Congress in Barcelona – the Snapdragon APQ8064 – but NVIDIA says it will be first to market, with its quad-core successor to Tegra 2, code-named Kal-El. NVIDIA expects a Kal-El Android tablet to ship in August 2011, with smartphones to follow in the autumn. Qualcomm on the other hand says that samples of the APQ8064 are anticipated to be available in early 2012, implying that products will come later next year.

Kal-El is said to be 5 times faster than Tegra 2. It also includes a 12-core GPU and supports HD video up to 2560×1600 – amazing for a low-power mobile chipset.


A prototype is running on NVIDIA’s stand here. My snap does not show the quality, so you will have to take my word that the graphics looked excellent.


NVIDIA Tegra 2: amazing mobile power that hints at the future of client computing

Smartphone power has made another jump forward with the announcement at CES in Las Vegas of new devices built on NVIDIA’s new Tegra 2 package – a System on a Chip (SoC) that includes dual-core CPU, GPU, and additional support for HD video encoding and decoding, audio, imaging, USB, PCIe and more.


The CPU is the ARM Cortex-A9, which has a RISC (Reduced Instruction Set Computer) architecture and a 32-bit instruction set. It also supports the Thumb-2 instruction set, which mixes 16-bit and 32-bit instruction encodings. How is a 16-bit instruction an upgrade over a 32-bit one? Well, shorter instructions mean smaller code, even though the processor itself remains 32-bit:

For performance optimised code Thumb-2 technology uses 31 percent less memory to reduce system cost, while providing up to 38 percent higher performance than existing high density code, which can be used to prolong battery-life or to enrich the product feature set.

The GPU is an “ultra low power” (ULP) 8-core GeForce. In essence, the package aims for high performance with low power consumption, exactly what is wanted for mobile computing.

Power is also saved by sophisticated power management features. The package uses a combination of suspending parts of the system, clock gating, screen management, and dynamic adjustment of voltage and frequency. The result is a system which NVIDIA claims is 25-50 times more efficient than a typical PC.
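As a rough illustration of why adjusting voltage and frequency together saves so much, here is a sketch based on the textbook approximation that the dynamic power of a CMOS chip scales with the square of the supply voltage times the clock frequency. That approximation is my assumption and does not come from NVIDIA's briefing; the function name is hypothetical, and static leakage is ignored.

```cpp
#include <cassert>

// Relative dynamic power after scaling voltage and frequency, using the
// textbook approximation P ~ C * V^2 * f (an assumption, not an NVIDIA
// figure). Inputs are scale factors relative to the full-speed state,
// so 1.0 means unchanged.
double relativeDynamicPower(double voltageScale, double frequencyScale)
{
    assert(voltageScale > 0.0 && frequencyScale > 0.0);
    return voltageScale * voltageScale * frequencyScale;
}
```

Under this model, halving the clock while dropping the voltage to 80 percent, `relativeDynamicPower(0.8, 0.5)`, leaves roughly a third of the original dynamic power, which is a bigger saving than slowing the clock alone would give.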

According to NVIDIA, Tegra 2 enables web browsing up to two times faster than competitors such as the Qualcomm Snapdragon 8250 or Texas Instruments OMAP 3630 – though of course these companies also have new SoCs in preparation.

Tegra 2 is optimised for some specific software. One is the OpenGL graphics API. “The job of the GPU is to implement the logical pipeline defined by OpenGL”, I was told at an NVIDIA briefing.


I asked whether this meant that Tegra 2 is sub-optimal for Microsoft’s Direct X API; but NVIDIA says it is sufficiently similar that it makes no difference.

Nevertheless, Tegra 2 has been designed with Android in mind, not Windows. There are a couple of reasons for this. The main one is that Android has all the momentum in the market; but apart from that, Microsoft partnered with Qualcomm for Windows Phone 7, which runs on Snapdragon, shutting out NVIDIA at the initial launch. NVIDIA is a long-term Microsoft partner and the shift from Windows Mobile to Android has apparently cost NVIDIA a lot of time. The shift took place around 18 months ago, when NVIDIA saw how the market was moving. That shift “cost us a year to a year and a half of products to market”, I was told – a delay which must include changes at every level from hardware optimisation, to designing the kind of package that suits the devices Android vendors want to build, to building up knowledge of Android in order to market effectively to hardware vendors.

Despite this focus, Microsoft demonstrated Windows 8 running on Tegra during Steve Ballmer’s keynote, so this should not be taken to mean that Windows or Windows CE will not run. I still found it interesting to hear this example of how deeply the industry has moved away from Microsoft’s mobile platform.

Microsoft should worry. NVIDIA foresees that “all of your computing needs are ultimately going to be surfaced through your mobile device”. Tegra 2 is a step along the way, since HDMI support is built-in, enabling high resolution displays. If you want to do desktop computing, you sit down at your desk, pop your mobile into a dock, and get on with your work or play using a large screen and a keyboard. It seems plausible to me.

During the press conference at CES we were shown an example of simultaneous rich graphic gaming on PC, PlayStation 3, and Tegra 2 Smartphone.


Alongside Android, Tegra 2 is optimised for Adobe Flash. NVIDIA has been given full access to the source of the Flash player in order to deliver hardware acceleration.



Actual devices

What about actual devices? Two that were shown at CES are the LG Optimus 2X and the Motorola Atrix 4G.


Both sport impressive specifications, though the Guardian’s Charles Arthur, who attended a briefing on the Atrix 4G, expresses some scepticism about whether HD video (which needs a large display) and the full desktop version of Firefox are really necessary on a phone. Apparently the claimed battery life is only 8 hours; some of us might be willing to sacrifice a degree of that capability for a longer battery life.

Still, while some manufacturers will get the balance between cost, features, size and battery life wrong, history tells us that we will find good ways to use all this new processing and graphics power, especially if we can get to the point where such a device, combined with cloud computing and a desktop dock, becomes the only client most of us need.

NVIDIA says that over 50 Android/Tegra 2 products are set to be released by mid-2011, in tablet as well as Smartphone form factors. I’m guessing that at least some of these will be winners.

Steve Ballmer at CES: Microsoft pins mobile hopes on Windows 8

Microsoft CEO Steve Ballmer gave the keynote at CES in Las Vegas last night. It was a polished performance and everything worked, but was short on vision or any immediate answer to the twin forces of Apple iPad and Google Android which are squeezing out Microsoft in the mobile world – smartphones and tablets – which currently forms the centre of attention in personal computing.

That said, CES stands for Consumer Electronics Show; and Ballmer did a good job showing off how well Kinect is performing, claiming sales of 8 million already. He showed more examples of controlling Xbox through speech and gesture, and said that Kinect is also boosting sales of the console; clearly Kinect is now taking Xbox beyond the hardcore market of first-person shooters.

We saw some fun new Windows devices, such as Acer’s dual-screen Iconia laptop.


There was also a demonstration of the updated Microsoft Surface which now runs full Windows 7 and does not require hidden cameras, so that it can now be used in more scenarios, such as for interactive digital signage.

All well and good; but what about mobile? We got a Windows Phone 7 demo, but no sales figures, nor any mobile partners on stage; I’m guessing they are too busy promoting their new Android devices. Ballmer did say that the phone is coming on Verizon and Sprint in the first half of this year. Application availability is improving, but how will Microsoft win attention for its smartphone? My local high street is full of mobile phone shops, none of which even stock it as far as I can tell. There is a tie-in with Xbox Live which may help a little.

The problem though is that Microsoft does not seem to be wholeheartedly behind the Windows Phone 7 OS, which is based on Windows CE with a new GUI and Silverlight/XNA runtime for applications. Rather, Microsoft is signalling that full Windows is its future mobile operating system. At CES it announced Windows on ARM, the processor of choice in mobile, and during the keynote we saw the next version of Windows (though with the Windows 7 GUI) running on various ARM devices.

The power available in new System on a Chip packages like NVIDIA’s Tegra 2 leaves me in no doubt that full Windows could technically run on almost any size of device; but that does not make it the sensible choice for all form factors. Note also that while it was not mentioned at CES, NVIDIA has said that Tegra 2 is optimised for Android.

Microsoft could plausibly have released a tablet based on the Windows Phone 7 OS, which is built for touch control, this year. Instead, it will be at least 2012 before we see a Windows 8 tablet, and we are taking it on trust that this will really work nicely with touch and not need a stylus dangling at the side. By then Apple will, I presume, be releasing iPad generation 3.

Putting this in a developer context, what is Microsoft’s mobile development platform? Silverlight and XNA? The full Windows native API? Or HTML 5? Each of these is very different and it seems to me a muddled story.

NVIDIA’s first CPU, Project Denver, aims to bring ARM to desktops and servers

At CES in Las Vegas today NVIDIA’s CEO Jen-Hsun Huang announced the company’s first CPU: Project Denver. This is a partnership with ARM, to create “a full custom processor” targeting “high performance computing – servers, PCs, super-computers, cloud computing.” NVIDIA will still license ARM processors for mobile computing.

Since ARM has in the past focused on the mobile and embedded market, and NVIDIA on GPUs, it is a departure for both companies.

Why? Huang says it is because ARM is “the new standard microprocessor architecture.” Judging by this chart, shown at the press briefing, it is hard to disagree:


In a few years, said Huang, “There will be more ARM processors shipped than all the x86 chips ever shipped.”


NVIDIA’s press release explains that the purpose of Project Denver is to extend the range of ARM systems upwards:

For several years, makers of high-end computing platforms have had no choice about instruction-set architecture. The only option was the x86 instruction set with variable-length instructions, a small register set, and other features that interfered with modern compiler optimizations, required a larger area for instruction decoding, and substantially reduced energy efficiency.

Denver provides a choice. System builders can now choose a high-performance processor based on a RISC instruction set with modern features such as fixed-width instructions, predication, and a large general register file. These features enable advanced compiler techniques and simplify implementation, ultimately leading to higher performance and a more energy-efficient processor.

The other interesting aspect of Project Denver is its integration with the GPU – as you would expect from NVIDIA:

An ARM processor coupled with an NVIDIA GPU represents the computing platform of the future. A high-performance CPU with a standard instruction set will run the serial parts of applications and provide compatibility while a highly-parallel, highly-efficient GPU will run the parallel portions of programs.

While we tend to focus most on power efficiency for mobile devices, because we notice how long our batteries last, it is equally important for larger systems. Power consumption and dealing with heat are key issues for datacentres, while in everyday desktop computing power consumption is a significant proportion of the running cost of an IT system.

Project Denver puts a different spin on Microsoft’s Windows-on-ARM announcement today. The assumption is that Microsoft has in mind a mobile future for Windows; but if Denver takes off it could be important on desktops and servers as well.

Before getting too excited, it is worth recalling how Intel’s Itanium, cruelly dubbed the Itanic, mostly failed in the market. That was partly thanks to design problems, and partly because the industry was too deeply hooked into x86 applications. I also recall Motorola’s doomed attempts to sell Windows NT on PowerPC in the mid Nineties.

Denver could fare better, thanks to the ubiquity of ARM in the mobile world. That said, much will depend on whether a Denver-based system really does offer significant benefits over whatever Intel and/or AMD will have come up with by the time it ships. If it is less than spectacular, Denver will be a hard sell.