
Intel Xeon Phi shines vs NVidia GPU accelerators in Ohio State University tests

Which is better for massively parallel computing, a GPU accelerator board from NVidia, or Intel’s new Xeon Phi? On the eve of NVidia’s GPU Technology Conference comes a paper which Intel will enjoy. Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek from the Ohio State University have issued a paper with performance comparisons between the Xeon Phi, the NVIDIA Tesla C2050 and the NVIDIA Tesla K20. The K20 has 2,496 CUDA cores, versus a mere 61 processor cores on the Xeon Phi, yet on the particular calculations under test the researchers got generally better performance from the Xeon Phi.

In the case of sparse-matrix vector multiplication (SpMV):

For GPU architectures, the K20 card is typically faster than the C2050 card. It performs better for 18 of the 22 instances. It obtains between 4.9 and 13.2GFlop/s and the highest performance on 9 of the instances. Xeon Phi reaches the highest performance on 12 of the instances and it is the only architecture which can obtain more than 15GFlop/s.

and in the case of sparse-matrix matrix multiplication (SpMM):

The K20 GPU is often more than twice faster than C2050, which is much better compared with their relative performances in SpMV. The Xeon Phi coprocessor gets the best performance in 14 instances where this number is 5 and 3 for the CPU and GPU configurations, respectively. Intel Xeon Phi is the only architecture which achieves more than 100GFlop/s.

Note that this is a limited test, and the authors themselves acknowledge that SpMV computation is known to be a difficult case for GPU computing:

the irregularity and sparsity of SpMV-like kernels create several problems for these architectures.

They also note that memory latency is the biggest factor slowing performance:

At last, for most instances, the SpMV kernel appears to be memory latency bound rather than memory bandwidth bound

It is difficult to compare like with like. The Xeon Phi implementation uses OpenMP, whereas the GPU implementation uses NVIDIA’s cuSPARSE library. I would also be interested to know whether as much effort was made to optimise for the GPU as for the Xeon Phi.
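For readers unfamiliar with the kernel under test, here is a minimal sketch in C with OpenMP of SpMV over the common CSR (compressed sparse row) storage format. This is purely illustrative of the kind of kernel being benchmarked; it is not the authors’ code, and the names and the scheduling choice are my own assumptions:

```c
#include <omp.h>

/* Minimal CSR SpMV sketch: y = A * x.
   row_ptr[i]..row_ptr[i+1] indexes the nonzeros of row i;
   col_idx and val hold their column positions and values.
   Illustrative only, not the paper's implementation. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];  /* irregular, scattered reads of x */
        y[i] = sum;
    }
}
```

The indirect read x[col_idx[j]] is exactly the irregularity the authors refer to: its scattered accesses defeat caching and coalescing, which is why the kernel tends to be latency bound rather than bandwidth bound.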

Still, this is a real-world test that, if nothing else, demonstrates that in the right circumstances the smaller number of cores in a Xeon Phi does not prevent it from comparing favourably against a GPU accelerator:

When compared with cutting-edge processors and accelerators, its SpMV, and especially SpMM, performance are superior thanks to its wide registers and vectorization capabilities. We believe that Xeon Phi will gain more interest in HPC community in the near future.

Images of Eurora, the world’s greenest supercomputer

Yesterday I was in Bologna for the press launch of Eurora at Cineca, a non-profit consortium of universities and other public bodies. The claim is that Eurora is the world’s greenest supercomputer.

[image]

Eurora is a prototype deployment of Aurora Tigon, made by Eurotech. It is a hybrid supercomputer, with 128 CPUs supplemented by 128 NVidia Kepler K20 GPUs.

What makes it green? Of course, being new helps, as processor efficiency improves with every release, and “green-ness” is measured in floating point operations per watt. Eurora does 3150 MFlop/s per watt; in other words, every watt of power buys a little over 3 GFlop/s of computation.

There is more though. Eurotech is a believer in water cooling, which is more efficient than air cooling. Further, it is easier to do something useful with the hot water you generate than with hot air, such as using it to generate energy.

Other factors include underclocking slightly, and supplying 48 volt DC power in order to avoid power conversion steps.

Eurora is composed of 64 nodes. Each node has a board with 2 Intel Xeon E5-2687W CPUs, an Altera Stratix V FPGA (Field Programmable Gate Array), an SSD drive, and RAM soldered to the board; apparently soldering the RAM is more efficient than using DIMMs.

[image]

Here is the FPGA:

[image]

and one of the Intel-confidential CPUs:

[image]

On top of this board goes a water-cooled metal block. This presses against the CPU and other components for efficient heat exchange. There is no fan.

Then on top of that go the K20 GPU accelerator boards. The design means that these can be changed for Intel Xeon Phi accelerator boards. Eurotech is neutral in the NVidia vs Intel accelerator wars.

[image]

Here you can see where the water enters and leaves the heatsink. When you plug a node into the rack, you connect it to the plumbing as well as the electrics.

[image]

Here are 8 nodes in a rack.

[image]

Under the floor is a whole lot more plumbing. This is inside the Aurora cabinet where pipes and wires rise from the floor.

[image]

Here is a look under the floor outside the cabinet.

[image]

At the corner of the room is a sort of pump room, which pumps the water, monitors the system, adds chemicals to prevent algae from growing, and no doubt does a few other things.

[image]

The press was asked NOT to operate this big red switch:

[image]

I am not sure whether the switch we were not meant to operate is the upper red button, or the lower red lever. To be on the safe side, I left them both as-is.

So here is a thought. Apparently Eurora is 15 times more energy-efficient than a typical desktop. If the mobile revolution continues and we all use tablets, which also tend to be relatively energy-efficient, could we save substantial energy by using the cloud when we need more grunt (whether processing or video) than a tablet can provide?

Is the triumph of the GPU the failure of the CPU?

I’m at NVIDIA’s GPU tech conference in San Jose. The central theme of the conference is that the capabilities of modern GPUs enable substantial performance gains for general computing, not just for graphics, though most of the examples we have seen involve some element of graphical processing. The reason you should care about this is that the gains are huge.

Take Matlab for example, a popular language and IDE for algorithm development, data analysis and mathematical computation. We were told in the keynote here yesterday that Matlab is offering a parallel computing toolbox based on NVIDIA’s CUDA, with speed-ups from 10 to 40 times. Dramatic performance improvements open up new possibilities in computing.

Why has GPU performance advanced so rapidly, whereas CPU performance has levelled off? The reason is that they use different computing models. CPUs are general-purpose. The focus is on fast serial computation, executing a single thread as rapidly as possible. Since many applications are largely single-threaded, this is what we need, but there are technical barriers to increasing clock speed further. Of course multi-core and multi-processor systems are now standard, so we have dual-core or quad-core machines, with big performance gains for multi-threaded applications.

By contrast, GPUs are designed to be massively parallel. A Tesla C1060 has not 2 or 4 or 8 cores, but 240; the C2050 has 448. These are not the same as CPU cores, but nevertheless do execute in parallel. The clock speed is only 1.3GHz, whereas an Intel Core i7 Extreme runs at 3.3GHz but has a mere 6 cores, and an Intel Xeon 7560 runs at 2.266GHz and has 8 cores. The lower clock speed in the GPU is one reason it is more power-efficient.

NVIDIA’s CUDA initiative is about making this capability available to any application. NVIDIA made changes to its hardware to make it more amenable to standard C code, and delivered CUDA C with extensions to support it. In essence it is pretty simple. The extensions let you specify functions to execute on the GPU, allocate memory for pointers on the GPU, and copy memory between the GPU (called the device) and the main memory on the PC (called the host). You can also synchronize threads and use shared memory between threads.
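To make that concrete, here is a minimal sketch of the pattern in CUDA C. It is my own illustration rather than NVIDIA sample code: a __global__ function runs on the device, cudaMalloc and cudaMemcpy manage device memory, and the <<<blocks, threads>>> syntax launches the kernel:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernel: runs on the GPU ("device"); each thread handles one element. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a = malloc(bytes), *b = malloc(bytes), *c = malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Allocate device memory and copy inputs from host to device. */
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    /* Launch enough 256-thread blocks to cover all n elements. */
    vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    /* Copy the result back; this copy also waits for the kernel to finish. */
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);  /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}
```

Those two cudaMemcpy calls are the host-to-device copies whose cost I come back to below.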

The reward is great performance, but there are several disadvantages. One is the challenge of concurrent programming and the subtle bugs it can introduce.

Another is the hassle of copying memory between host and device. The device is in effect a computer within a computer. Shifting data between the two is relatively slow.

A third is that CUDA is proprietary to NVIDIA. If you want your code to work with ATI’s equivalent, called Stream, then you should use the OpenCL library, though I’ve noticed that most people here seem to use CUDA; I presume they are able to specify the hardware and would rather avoid the compromises of a cross-GPU library. In the worst case, if you need to support both CUDA and non-CUDA systems, you might need to support different code paths depending on what is detected at runtime.
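A sketch of what that runtime detection might look like with the CUDA runtime API; cudaGetDeviceCount is the real probing call, while run_on_gpu and run_on_cpu are hypothetical stand-ins for an application’s two code paths:

```cuda
#include <cuda_runtime.h>

/* Hypothetical stand-ins for the two implementations. */
void run_on_gpu(void);
void run_on_cpu(void);

static int have_cuda_device(void)
{
    int count = 0;
    /* Returns an error (leaving count at 0) on machines without
       a CUDA-capable device or driver. */
    return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
}

void run(void)
{
    if (have_cuda_device())
        run_on_gpu();   /* CUDA code path */
    else
        run_on_cpu();   /* portable CPU fallback */
}
```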

It is all a bit messy, though there are tools and libraries to simplify the task. For example, this morning we heard about GMAC, which makes host and device appear to use a single address space, though I imagine there are performance implications.

NVIDIA says it is democratizing supercomputing, bringing high performance computing within reach for almost anyone. There is something in that; but at the same time as a developer I would rather not think about whether my code will execute on the CPU or the GPU. Viewed at the highest level, I find it disappointing that to get great performance I need to bolster the capabilities of the CPU with a specialist add-on. The triumph of the GPU is in a sense the failure of the CPU. Convergence in some form or other strikes me as inevitable.