Category Archives: gpu computing

China’s Tianhe-2 Supercomputer takes top ranking, a win for Intel vs Nvidia

The International Supercomputing Conference (ISC) is under way in Leipzig, and one of the announcements is that China’s Tianhe-2 is now the world’s fastest supercomputer according to the Top 500 list.

This has some personal interest for me, as I visited its predecessor Tianhe-1A in December 2011, on a press briefing organised by NVidia which was, I guess, on a diplomatic mission to promote Tesla, the GPU accelerator boards used in Tianhe-1A (which was itself the world’s fastest supercomputer for a period).

It appears that the mission failed, insofar as Tianhe-2 uses Intel Xeon Phi accelerator boards rather than Nvidia Tesla.

Tianhe-2 has 16,000 nodes, each with two Intel Xeon IvyBridge processors and three Xeon Phi processors for a combined total of 3,120,000 computing cores.

says the press release. Previously, the world’s fastest was the US Titan, which does use NVidia GPUs.
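The core count in the press release can be reproduced with a little arithmetic. This is only a sketch: the node count and chips per node come from the quote above, but the per-chip core counts (12 for each Ivy Bridge Xeon, 57 for each Xeon Phi) are my assumptions and are not stated in the quoted text.

```python
# Node count and chips per node come from the press release quoted above.
NODES = 16_000
XEONS_PER_NODE = 2
PHIS_PER_NODE = 3

# Assumed per-chip core counts (not stated in the press release).
CORES_PER_XEON = 12
CORES_PER_PHI = 57

cores_per_node = XEONS_PER_NODE * CORES_PER_XEON + PHIS_PER_NODE * CORES_PER_PHI
total_cores = NODES * cores_per_node

print(f"{cores_per_node} cores per node, {total_cores:,} cores in total")
```

With those assumed core counts the total comes out at the 3,120,000 quoted, most of it contributed by the Xeon Phi boards.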

Nvidia has reason to worry. Tesla boards are present in 39 of the Top 500 systems, whereas Xeon Phi is in only 11, but Xeon Phi has not been out for long and is growing fast. A newly published paper shows Xeon Phi besting Tesla on sparse matrix-vector multiplication:

we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on dual Intel® Xeon® Processor E5-2680 and the NVIDIA Tesla K20X architecture.
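For anyone unfamiliar with the operation being benchmarked, here is a minimal sketch of sparse matrix-vector multiplication in plain Python. It assumes the common compressed sparse row (CSR) layout, chosen purely for illustration; the function name and sample data are hypothetical, and this is not the paper's optimised implementation.

```python
def csr_matvec(values, col_indices, row_ptr, x):
    """Multiply a CSR-format sparse matrix by the dense vector x."""
    y = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        # Only the stored (non-zero) entries of this row contribute.
        y.append(sum(values[k] * x[col_indices[k]] for k in range(start, end)))
    return y


# Example 3x3 matrix with four non-zero entries:
# [[4, 0, 0],
#  [0, 0, 2],
#  [1, 0, 3]]
values = [4.0, 2.0, 1.0, 3.0]
col_indices = [0, 2, 0, 2]
row_ptr = [0, 1, 2, 4]

result = csr_matvec(values, col_indices, row_ptr, [1.0, 2.0, 3.0])
print(result)
```

The irregular, index-driven memory access in the inner loop is what makes this operation a demanding test of an accelerator's memory system rather than its raw arithmetic throughput.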

In addition, Intel has just announced the successor to Xeon Phi, codenamed Knights Landing. Knights Landing can function as the host CPU as well as an accelerator board, and has integrated on-package memory to reduce data transfer bottlenecks.

Nvidia does not agree that Xeon Phi is faster:

The Tesla K20X is about 50% faster in Linpack performance, and in terms of real application performance we’re seeing from 2x to 5x faster performance using K20X versus Xeon Phi accelerator.

says the company’s Roy Kim, Tesla product manager. The truth I suspect is that it depends on the type of workload and I would welcome more detail on this.

It is also worth noting that Tianhe-2 does not better Titan on power/performance ratio.

  • Tianhe-2: 3,120,000 cores, 1,024,000 GB Memory, Linpack perf 33,862.7 TFlop/s, Power 17,808 kW.
  • Titan: 560,640 cores, 710,144 GB Memory, Linpack perf 17,590 TFlop/s, Power 8,209 kW.
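The power/performance comparison follows directly from the Linpack and power figures listed above. A quick sketch of the calculation (the dictionary layout and function name are just for illustration):

```python
# Linpack performance and power draw as listed above.
systems = {
    "Tianhe-2": {"linpack_tflops": 33_862.7, "power_kw": 17_808},
    "Titan": {"linpack_tflops": 17_590.0, "power_kw": 8_209},
}


def mflops_per_watt(linpack_tflops, power_kw):
    # 1 TFlop/s = 1e6 Mflop/s and 1 kW = 1e3 W.
    return (linpack_tflops * 1e6) / (power_kw * 1e3)


efficiency = {name: mflops_per_watt(**spec) for name, spec in systems.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:,.0f} Mflop/s per watt")
```

On these numbers Tianhe-2 manages roughly 1,900 Mflop/s per watt against roughly 2,140 for Titan, so Titan is ahead by a little over 10%.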

NVIDIA’s Visual Computing Appliance: high-end virtual graphics power on tap

NVIDIA CEO Jen-Hsun Huang has announced the Grid Visual Computing Appliance (VCA). Install one of these, and users anywhere on the network can run graphically-demanding applications on their Mac, PC or tablet. The Grid VCA is based on remote graphics technology announced at last year’s GPU Technology Conference. This year’s event is currently under way in San Jose.

The Grid VCA is a 4U rack-mounted server.


Inside are up to 2 Xeon CPUs each supporting 16 threads, and up to 8 Grid GPU boards each containing 2 Kepler GPUs each with 4GB GPU memory. There is up to 384GB of system RAM.


There is a built-in hypervisor (I am not sure which hypervisor NVIDIA is using) which supports 16 virtual machines and therefore up to 16 concurrent users.

NVIDIA supplies a Grid client for Mac, Windows or Android (no mention of Apple iOS).

During the announcement, NVIDIA demonstrated a Mac running several simultaneous Grid sessions. The virtual machines were running Windows with applications including Autodesk 3D Studio Max and Adobe Premiere. This looks like a great way to run Windows on a Mac.


The Grid VCA is currently in beta, and when available will cost from $24,900 plus $2,400/yr for software licenses. It looks as if the software licenses are priced at $300 per concurrent user per year, since the price doubles to $4,800/yr for the box which supports 16 concurrent users.
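Here is the arithmetic behind that guess. The prices are the ones quoted; the assumption, which NVIDIA has not confirmed to me, is that the per-user rate is the same on both boxes.

```python
# Annual software license prices quoted above, in US dollars.
BASE_LICENSE_PER_YEAR = 2_400
LARGER_LICENSE_PER_YEAR = 4_800
LARGER_BOX_USERS = 16

per_user = LARGER_LICENSE_PER_YEAR / LARGER_BOX_USERS
# Assumption: the per-user rate is the same for the smaller box.
implied_base_users = BASE_LICENSE_PER_YEAR / per_user

print(f"${per_user:.0f} per concurrent user per year")
print(f"Base box implies {implied_base_users:.0f} concurrent users")
```

If the assumption holds, the entry-level box is licensed for 8 concurrent users.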


Businesses will need to do the arithmetic and see if this makes sense for them. Conceptually it strikes me as excellent, enabling one centralised GPU server to provide high-end graphics to anyone on the network, subject to the concurrent user limitation. It also enables graphically demanding Windows-only applications to run well on Macs.

The Grid VCA is part of the NVIDIA GRID Enterprise Ecosystem, which the company says is supported by partners including Citrix, Dell, Cisco, Microsoft, VMWare, IBM and HP.


Big GPU news at NVIDIA tech conference including first Tegra with CUDA

NVIDIA CEO Jen-Hsun Huang made a number of announcements at the GPU Technology Conference (GTC) keynote yesterday, including an updated roadmap for both desktop and mobile GPUs.


Although the focus of the GTC is on high-performance computing using Tesla GPU accelerator boards, Huang’s announcements were not limited to that area but also covered the company’s progress on mobile and on the desktop. Huang opened by mentioning the recently released GeForce Titan graphics processor which has 2,600 CUDA cores, and which starts from under £700 so is within reach of serious gamers as well as developers who can make use of it for general-purpose computing. CUDA enables use of the GPU for massively parallel general-purpose computing. NVIDIA is having problems keeping up with demand, said Huang.

There are now 430 million CUDA-capable GPUs out there, said Huang, with CUDA in use in 50 supercomputers and covered in 640 university courses.


He also mentioned last week’s announcement of the Swiss Piz Daint supercomputer which will include Tesla K20X GPU accelerators and will be operational in early 2014.

But what is coming next? Here is the latest GPU roadmap:


Kepler is the current GPU architecture, which introduced dynamic parallelism, the ability for the GPU to generate work without transitioning back to the CPU.

Coming next is Maxwell, which has unified virtual memory. The GPU can see the CPU memory, and the CPU can see the GPU memory, making programming easier. I am not sure how this impacts performance, but note that it is unified virtual memory, so the task of copying data between host and device still exists under the covers.

After Maxwell comes Volta, which focuses on increasing memory bandwidth and reducing latency. Volta includes a stack of DRAM on the same silicon substrate as the GPU, which Huang said enables 1TB per second of memory bandwidth.

What about mobile? NVIDIA is aware of the growth in devices of all kinds. 2.5bn high definition displays are sold each year, said Huang, and this will double again by 2015. These displays are mostly not for PCs, but on smartphones or embedded devices.

Here is the roadmap for Tegra, NVIDIA’s system-on-a-chip (SoC).


Tegra 4, which I saw in preview at last month’s Mobile World Congress in Barcelona, includes a software-defined modem and computational camera, able to track moving objects while keeping them in focus.

Next is Tegra Logan. This is the first Tegra to include CUDA cores so you can use it for general-purpose computing. It is based on the Kepler GPU and supports full CUDA 5 computing as well as OpenGL 4.3. Logan will be previewed this year and in production early 2014.

After Logan comes Parker. This will be based on the Maxwell GPU (see above) and NVIDIA’s own Denver (ARM-based) CPU. It will include FinFET multigate transistors.

According to Huang, Tegra performance will increase by 100 times over 5 years. Today’s Surface RT (which runs Tegra 3) may be sluggish, but Windows RT will run fine on these future SoCs. Of course Intel is not standing still either.
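To put the 100-times figure in perspective, here is what it implies per year, assuming (my assumption, not Huang’s) that the improvement compounds evenly across the five years:

```python
TOTAL_GAIN = 100  # claimed overall performance increase
YEARS = 5

# Even compounding: annual_factor ** YEARS == TOTAL_GAIN.
annual_factor = TOTAL_GAIN ** (1 / YEARS)

print(f"About {annual_factor:.2f}x per year")
```

That is roughly two and a half times faster every year, a considerably steeper curve than a doubling every 18 months to two years.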

Finally, Huang announced the Grid Visual Computing Appliance, which I will be covering shortly in another post.

Images of Eurora, the world’s greenest supercomputer

Yesterday I was in Bologna for the press launch of Eurora at Cineca, a non-profit consortium of universities and other public bodies. The claim is that Eurora is the world’s greenest supercomputer.


Eurora is a prototype deployment of Aurora Tigon, made by Eurotech. It is a hybrid supercomputer, with 128 CPUs supplemented by 128 NVidia Kepler K20 GPUs.

What makes it green? Of course, being new is good, as processor efficiency improves with every release, and “green-ness” is measured in floating point operations per watt. Eurora does 3150 Mflop/s per watt.

There is more though. Eurotech is a believer in water cooling, which is more efficient than air. Further, it is easier to do something useful with the hot water you generate than with hot air, such as generating energy.

Other factors include underclocking slightly, and supplying 48 volt DC power in order to avoid power conversion steps.

Eurora is composed of 64 nodes. Each node has a board with 2 Intel Xeon E5-2687W CPUs, an Altera Stratix V FPGA (Field Programmable Gate Array), an SSD drive, and RAM soldered to the board; apparently soldering the RAM is more efficient than using DIMMs.


Here is the FPGA:


and one of the Intel-confidential CPUs:


On top of this board goes a water-cooled metal block. This presses against the CPU and other components for efficient heat exchange. There is no fan.

Then on top of that go the K20 GPU accelerator boards. The design means that these can be changed for Intel Xeon Phi accelerator boards. Eurotech is neutral in the NVidia vs Intel accelerator wars.


Here you can see where the water enters and leaves the heatsink. When you plug a node into the rack, you connect it to the plumbing as well as the electrics.


Here are 8 nodes in a rack.


Under the floor is a whole lot more plumbing. This is inside the Aurora cabinet where pipes and wires rise from the floor.


Here is a look under the floor outside the cabinet.


while at the corner of the room is a sort of pump room that pumps the water, monitors the system, adds chemicals to prevent algae from growing, and no doubt a few other things.


The press was asked NOT to operate this big red switch:


I am not sure whether the switch we were not meant to operate is the upper red button, or the lower red lever. To be on the safe side, I left them both as-is.

So here is a thought. Apparently Eurora is 15 times more energy-efficient than a typical desktop. If the mobile revolution continues and we all use tablets, which also tend to be relatively energy-efficient, could we save substantial energy by using the cloud when we need more grunt (whether processing or video) than a tablet can provide?

Exascale computing: you could do it today if you could supply the power says Nvidia

Nvidia’s Bill Dally has posted about the company’s progress towards exascale computing, boosted by a $12.4 million grant from the U.S. Department of Energy. He mentions that it would be possible to build an exascale supercomputer today, if you could supply enough power:

Exascale systems will perform a quintillion floating point calculations per second (that’s a billion billion), making them 1,000 times faster than a one petaflop supercomputer. The world’s fastest computer today is about 16 petaflops.

One of the great challenges in developing such systems is in making them energy efficient. Theoretically, an exascale system could be built with x86 processors today, but it would require as much as 2 gigawatts of power — the entire output of the Hoover Dam. The GPUs in an exascale system built with NVIDIA Kepler K20 processors would consume about 150 megawatts. The DOE’s goal is to facilitate the development of exascale systems that consume less than 20 megawatts by the end of the decade.

If the industry succeeds in driving down supercomputer power consumption to one fortieth of what it is today, I guess it also follows that tablets like the one on which I am typing now will benefit from much greater power efficiency. This stuff matters, and not just in the HPC (High Performance Computing) market.
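The power figures in Dally’s post can be turned into efficiency numbers. This sketch uses only the three power budgets quoted above; the labels are mine, and note that the 150 megawatt figure covers the GPUs alone rather than a complete system.

```python
EXAFLOP = 1e18  # floating point operations per second

# Power budgets quoted above, in watts.
power_watts = {
    "x86 today": 2e9,
    "Kepler K20 GPUs only": 150e6,
    "DOE target": 20e6,
}

gflops_per_watt = {
    name: EXAFLOP / watts / 1e9 for name, watts in power_watts.items()
}
for name, value in gflops_per_watt.items():
    print(f"{name}: {value:.1f} Gflop/s per watt")
```

The DOE target works out at 50 Gflop/s per watt, against about 6.7 for the K20 GPUs and 0.5 for the x86 scenario.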

NVIDIA Nsight comes to Eclipse for Mac, Linux GPU programming

NVIDIA has ported its Nsight development tools, previously a plug-in for Visual Studio, to run within the open source Eclipse IDE for use on Mac and Linux.


The Nsight tools include profiling, refactoring, syntax highlighting and auto-completion, as well as a bunch of code samples.

The Windows version for Visual Studio has also been updated, and now supports local GPU debugging as well as new support for DirectX frame debugging and analysis.

Although Eclipse of course runs on Windows, Nsight users should continue to use the Visual Studio version. NVIDIA is not supporting use of the Eclipse Nsight on Windows.

The tools are in preview and you can sign up to try them here.

Another significant development is the availability of the CUDA LLVM Compiler. NVIDIA has contributed CUDA compiler code to the open source LLVM project. This means that other languages which compile to LLVM intermediate representation (IR) can be adapted to support parallel processing on NVIDIA GPUs. The CUDA Compiler SDK will be made available this week at the NVIDIA GPU Technology Conference in San Jose.

Adobe turns to OpenCL rather than NVIDIA CUDA for Mercury Graphics Engine in Creative Suite 6

Adobe has just announced Creative Suite 6. CS 5.5 used the Mercury Playback Engine in Premiere Pro, which takes advantage of NVIDIA’s CUDA library in order to accelerate processing when an NVIDIA GPU is present. Just to be clear, this is not just graphics acceleration, but programming the GPU to take advantage of its many processor cores for general-purpose computing.

Premiere Pro CS6 also uses the Mercury Playback Engine, and while CUDA is still recommended there is new support for OpenCL:

The Mercury Playback Engine brings performance gains to all the GPUs supported in Adobe Creative Suite 6 software, but the best performance comes with specific NVIDIA® CUDA™ enabled GPUs, including support for mobile GPUs and NVIDIA Maximus™ dual-GPU configurations. New support for the OpenCL-based AMD Radeon HD 6750M and 6770M cards available with certain Apple MacBook Pro computers running OS X Lion (v10.7x), with a minimum of 1GB VRAM, brings GPU-accelerated mobile workflows to Mac users.

Photoshop CS6 also uses the GPU to accelerate processing, using the new Mercury Graphics Engine. The Mercury Graphics Engine uses the OpenCL framework, which is not specific to any one GPU vendor, rather than CUDA:

The Mercury Graphics Engine (MGE) represents features that use video card, or GPU, acceleration. In Photoshop CS6, this new engine delivers near-instant results when editing with key tools such as Liquify, Warp, Lighting Effects and the Oil Paint filter. The new MGE delivers unprecedented responsiveness for a fluid feel as you work. MGE is new to Photoshop CS6, and uses both the OpenGL and OpenCL frameworks. It does not use the proprietary CUDA framework from nVidia.

It seems to me that this amounts to a shift by Adobe from CUDA to OpenCL, which is a good thing for users of non-NVIDIA GPUs.

This also suggests to me that NVIDIA will need to ensure excellent OpenCL support in its GPU cards, as well as continuing to evolve CUDA, since Creative Suite is a key product for designers using the workstations which form a substantial part of the market for high-end GPUs.