Tag Archives: supercomputing

China’s Tianhe-2 Supercomputer takes top ranking, a win for Intel vs Nvidia

The International Supercomputing Conference (ISC) is under way in Leipzig, and one of the announcements is that China’s Tianhe-2 is now the world’s fastest supercomputer according to the Top 500 list.

This has some personal interest for me, as I visited its predecessor Tianhe-1A in December 2011, on a press briefing organised by Nvidia, which was, I guess, on a diplomatic mission to promote Tesla, the GPU accelerator boards used in Tianhe-1A (itself the world’s fastest supercomputer for a period).

It appears that the mission failed, insofar as Tianhe-2 uses Intel Phi accelerator boards rather than Nvidia Tesla.

Tianhe-2 has 16,000 nodes, each with two Intel Xeon IvyBridge processors and three Xeon Phi processors for a combined total of 3,120,000 computing cores.

says the press release. Previously, the world’s fastest was the US Titan, which does use NVidia GPUs.

Nvidia has reason to worry. Tesla boards are present in 39 of the top 500 systems, whereas Xeon Phi is in only 11; but the Phi has not been out for long and is growing fast. A newly published paper shows Xeon Phi besting Tesla at sparse matrix-vector multiplication:

we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on dual Intel® Xeon® Processor E5-2680 and the NVIDIA Tesla K20X architecture.

In addition, Intel has just announced the successor to Xeon Phi, codenamed Knights Landing. Knights Landing can function as the host CPU as well as an accelerator, and has integrated on-package memory to reduce data transfer bottlenecks.

Nvidia does not agree that Xeon Phi is faster:

The Tesla K20X is about 50% faster in Linpack performance, and in terms of real application performance we’re seeing from 2x to 5x faster performance using K20X versus Xeon Phi accelerator.

says the company’s Roy Kim, Tesla product manager. The truth I suspect is that it depends on the type of workload and I would welcome more detail on this.

It is also worth noting that Tianhe-2 does not better Titan on power/performance ratio.

  • Tianhe-2: 3,120,000 cores, 1,024,000 GB memory, Linpack performance 33,862.7 TFlop/s, power 17,808 kW.
  • Titan: 560,640 cores, 710,144 GB memory, Linpack performance 17,590 TFlop/s, power 8,209 kW.

On Supercomputers, China’s Tianhe-1A in particular, and why you should think twice before going to see one

I am just back from Beijing courtesy of Nvidia; I attended the GPU Technology conference and also got to see not one but two supercomputers:  Mole-8.5 in Beijing and Tianhe-1A in Tianjin, a coach ride away.

Mole-8.5 is currently at no. 21 and Tianhe-1A at no. 2 on the top 500 list of the world’s fastest supercomputers.

There was a reason Nvidia took journalists along, of course. Both are powered partly by Nvidia Tesla GPUs, and it is part of the company’s campaign to convince the world that GPUs are essential for supercomputing, because of their greater efficiency than CPUs. Intel says we should wait for its MIC (Many Integrated Core) processor instead; but Nvidia has a point, and increasing numbers of supercomputers are plugging in thousands of Nvidia GPUs. That does not include the world’s current no. 1, Japan’s K Computer, but it will include the USA’s Titan, currently no. 3, which will add up to 18,000 GPUs in 2012 with plans that may take it to the top spot; we were told that it aims to be twice as fast as the K Computer.

Supercomputers are important. They excel at processing large amounts of data, so typical applications are climate research, biomedical research, simulations of all kinds used for design and engineering, energy modelling, and so on. These efforts are important to the human race, so you will never catch me saying that supercomputers are esoteric and of no interest to most of us.

That said, supercomputers are physically little different from any other datacenter: rows of racks. Here is a bit of Mole-8.5:


and here is a bit of Tianhe-1A:


In some ways Tianhe-1A is more striking from outside.


If you are interested in datacenters, how they are cooled, how they are powered, how they are constructed, then you will enjoy a visit to a supercomputer. Otherwise you may find it disappointing, especially given that you can run an application on a supercomputer without any need to be there physically.

Of course there is still value in going to a supercomputing centre to talk to the people who run it and find out more about how the system is put together. Again though I should warn you that physically a supercomputer is repetitive. They achieve their mighty flop/s (floating-point operations per second) counts by having lots and lots of processors (whether CPUs or GPUs) running in parallel. You can make a supercomputer faster by adding another cupboard with another set of racks with more boards with CPUs


or GPUs


and provided your design is right you will get more flop/s.

Yes there is more to it than that, and points of interest include the speed of the network, which is critical in order to support high performance, as well as the software that manages it. Take a look at the K Computer’s Tofu Interconnect. But the term “supercomputer” is a little misleading: we are talking about a network of nodes rather than a single amazing monolithic machine.

Personally I enjoyed the tours, though the visit to Tianhe-1A was among the more curious visits I have experienced. We visited along with a bunch of Nvidia executives. The execs sat along one side of a conference table, the Chinese hosts along the other side, and they engaged in a diplomatic exercise of being very polite to each other while the journalists milled around the room.


We did get a tour of Tianhe-1A but unfortunately little chance to talk to the people involved, though we did have a short group interview with the project director, Liu Guangming.


He gave us short, guarded but precise answers, speaking through an interpreter. We asked about funding. “The way things work here is different from how it works in the USA,” he said, “The government supports us a lot, the building and infrastructure, all the machines, are all paid for by the government. The government also pays for the operational cost.” Nevertheless, users are charged for their time on Tianhe-1A, but this is to promote efficiency. “If users pay they use the system more efficiently, that is the reason for the charge,” he said. However, the users also get their funding from the government’s research budget.

Downplayed on the slides, but mentioned here, is the fact that the supercomputer was developed by the National University of Defense Technology. Food for thought.

We also asked about the usage of the GPU nodes as opposed to the CPU nodes, having noticed that many of the applications presented in the briefing were CPU-only. “The GPU stage is somewhat experimental,” he said, though he is “seeing increasing use of the GPU, and such a heterogeneous system should be the future of HPC [High Performance Computing].” Some applications do use the GPU and the results have been good. Overall the system has 60-70% sustained utilisation.

Another key topic: might China develop its own GPU? Tianhe-1A already includes 2,048 China-designed “Galaxy FT” CPUs, alongside 14,336 Intel CPUs and 7,168 NVIDIA GPUs.

We already have the technology, said Liu Guangming.

From 2005 to 2007 we designed a chip, a stream processor similar to a GPU. But the peak performance was not that good. We tried AMD GPUs, but they do not have ECC [error-correcting code memory], so that is why we went to NVIDIA. China does have the technology to make GPUs. Also the technology is growing, but what we implement is a commercial decision.

Liu Guangming closed with a short speech.

Many of the people from outside China might think that China’s HPC experienced explosive development last year. But China has been involved in HPC for 20 years. Next, the Chinese government is highly committed to HPC. Third, the economy is growing fast and we see the demand for HPC. These factors have produced the explosive growth you witnessed.

The Tianjin Supercomputer is open and you are welcome to visit.

NVIDIA CEO Jen-Hsun Huang beats the drum for GPU computing

In his keynote at the GPU Technology Conference here in Beijing, NVIDIA CEO Jen-Hsun Huang presented the simple logic of GPU computing. The main constraint on computing is power consumption, he said:

Power is now the limiter of every computing platform, from cellphones to PCs and even datacenters.

CPUs are optimized for single-threaded computing and are relatively inefficient. According to Huang, a CPU spends 50 times as much power scheduling instructions as it does executing them. A GPU, by contrast, is made up of many simple processors and is optimized for parallel processing, making it more efficient when measured in FLOP/s (floating-point operations per second), a standard way of benchmarking computer performance. Therefore, he argued, it is inevitable that computers will make use of GPU computing to achieve the best performance. Note that this does not mean dispensing with the CPU, but rather handing off processing to the GPU when appropriate.

This point is now accepted in the world of supercomputers. The computer at the Chinese National Supercomputing Center in Tianjin has 14,336 Intel CPUs, 7,168 Nvidia Tesla GPUs, and 2,048 custom-designed 8-core CPUs called Galaxy FT-1000, and can achieve 4.7 Petaflop/s for a power consumption of 4.04 megawatts, as presented this morning by the center’s Vice Director Xiaoquian Zhu. It is currently the second fastest supercomputer in the world.

Huang says that without GPUs the world would wait until 2035 for the first Exascale (1 Exaflop/s) supercomputer, presuming a power constraint of 20MW and current levels of performance improvement year by year, whereas by combining CPUs with GPUs this can be achieved in 2019.

Supercomputing is only half of the GPU computing story. More interesting for most users is the way this technology trickles down to the kind of computers we actually use. For example, today Lenovo announced several workstations which use NVIDIA’s Maximus technology to combine a GPU designed primarily for driving a display (Quadro) with a GPU designed primarily for GPU computing (Tesla). These workstations are aimed at design professionals, for whom the ability to render detailed designs quickly is important. The image below shows a Lenovo S20 on display here. Maybe these are not quite everyday computers, but they are still PCs. Approximate price to follow soon when I have had a chance to ask Lenovo. Update: prices start at around $4500 for an S20, with most of the cost being for the Tesla board.


GPU programming coming to low-power and mobile devices – from EU Mont Blanc supercomputer to smartphones

Supercomputing and low-power computing are not normally associated; but at the SC11 Supercomputing conference the Barcelona Supercomputing Center (BSC) has announced a new supercomputer, called the Mont-Blanc Project, which will combine the ARM-based NVIDIA Tegra SoC with separate CUDA GPUs. CUDA is NVIDIA’s parallel computing architecture, enabling general purpose computing on the GPU.

The project’s publicity says this enables power saving of 15 to 30 times, versus today’s supercomputers:

The analysis of the performance of HPC systems since 1993 shows exponential improvements at the rate of one order of magnitude every 3 years: One petaflops was achieved in 2008, one exaflops is expected in 2020. Based on a 20 MW power budget, this requires an efficiency of 50 GFLOPS/Watt. However, the current leader in energy efficiency achieves only 1.7 GFLOPS/Watt. Thus, a 30x improvement is required.

NVIDIA is also creating a new hardware and software development kit for Tegra + CUDA, to be made available in the first half of 2012.


The combination of fast concurrent processing, low power draw and mobile devices is enticing. Features like speech recognition and smart cameras depend on rapid processing, and the technology has the potential to make smart devices very much smarter.

NVIDIA has competition though. ARM, which designs most of the CPUs in use on smartphones and tablets today, has recently started designing mobile GPUs as well, and its Mali series supports OpenCL, an open alternative to CUDA for general-purpose computing on the GPU. The Mali-T604 has 1 to 4 cores while the recently announced Mali-T658 has 1 to 8 cores. ARM specifically optimises its GPUs to work alongside its CPUs, which must be a concern for GPU specialists such as NVIDIA. However, we have yet to see devices with either T604 or T658: the first T604 devices are likely to appear in 2012, and T658 in 2013.

New OpenACC compiler directives announced for GPU accelerated programming

A new standard for accelerating C/C++ programming with compiler directives has been announced at the SC11 Supercomputing conference in Seattle. The new standard is called OpenACC  and has been created by NVIDIA, Cray, PGI (Portland Group) and CAPS enterprise.

OpenACC compiler directives are code annotations that enable the compiler to parallelise code while ensuring thread-safety. The big difference between OpenACC and the existing OpenMP standard is that OpenACC primarily targets the GPU rather than CPU, whereas OpenMP is generally CPU only. That said, OpenACC can also target the CPU so it is flexible; the idea is that it will adapt to the target system.


OpenACC is “defined to be interoperable with OpenMP” according to the FAQ, and the OpenACC group hopes for some future integration, though the standard seems to have been developed independently, which may cause some tension.

OpenACC is expected to ship during the first half of 2012 on compilers from PGI, Cray and CAPS Enterprise. The NVIDIA involvement may make you wonder whether it is GPU-specific; the answer is “maybe”. The FAQ says:

Will OpenACC run on AMD GPUs?

– It could, it requires implementation, there is no reason why it couldn’t

Will OpenACC run on top of OpenCL?

– It could, it requires implementation, there is no reason why it couldn’t

Will AMD/Intel/MS/XX support this?

– As this is just announced we can’t speak to the rate of external adoption or participation.

Will OpenACC run on NVIDIA GPUs with CUDA?

– Yes. Programmers may wish to develop some code using directives, and more sophisticated code using CUDA C, CUDA C++ or CUDA Fortran

Spot the Yes in the above! Still, you can scarcely blame NVIDIA for supporting its own GPU family; and I have been impressed with how the company works with the scientific and academic community to realise the potential of massively parallel computing.

OpenACC is about democratising parallelism, rather than advancing the state of the art. Best optimisation is obtained by more complex programming, but directives make some remarkable performance improvements easy to achieve.