Multicore processor wars: NVIDIA squares up to Intel

I first became aware of NVIDIA’s propaganda war against Intel at the 2012 GPU Technology Conference in Beijing, where CEO Jen-Hsun Huang stated that CPUs are remarkably inefficient for multicore processing:

The CPU is fast and is terrific at single-threaded performance, but because so much of the electronics inside the CPU is dedicated to out-of-order execution, branch prediction, speculative execution – all of the technology that has gone into sustaining instruction throughput and making the CPU faster at single-threaded applications – the electronics necessary to enable it to do that have grown tremendously. With four cores, in order to execute an operation – a floating point add or a floating point multiply – 50 times more energy is dedicated to the scheduling of that operation than to the operation itself. If you look at the silicon of a CPU, the floating point unit is only a few per cent of the overall die, and that is consistent with the usage of the energy – to sequence, to schedule the instructions running complicated programs.
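That 50X figure is not absurd on its face. As a back-of-envelope check (the per-operation energies here are illustrative assumptions of mine, not figures from NVIDIA or Intel): if a double-precision floating point operation costs on the order of 20 picojoules, while fetching, decoding, scheduling and retiring an instruction through a deep out-of-order pipeline costs on the order of 1 nanojoule, the overhead-to-work ratio comes out at 1000 / 20 = 50.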

That figure of 50 times surprised me, and I asked Intel’s James Reinders for a comment. He was quick to respond, noting that:

50X is ridiculous if it encourages you to believe that there is an alternative which is 50X better. The argument he makes, for a power-efficient approach for parallel processing, is worth about 2X (give or take a little). The best example of this, it turns out, is the Intel MIC [Many Integrated Core] architecture.

Reinders went on to say:

Knights Corner is superior to any GPGPU type solution for two reasons: (1) we don’t have the extra power-sucking silicon wasted on graphics functionality when all we want to do is compute in a power efficient manner, and (2) we can dedicate our design to being highly programmable because we aren’t a GPU (we’re an x86 core – a Pentium-like core for “in order” power efficiency). These two turn out to be substantial advantages that the Intel MIC architecture has over GPGPU solutions that will allow it to have the power efficiency we all want for highly parallel workloads, but able to run an enormous volume of code that will never run on GPGPUs (and every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor).
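Reinders’ programmability argument is easier to see in code. Below is a minimal sketch of what offloading a loop to a MIC co-processor looked like with the Intel compiler’s offload pragmas – the function and array names are hypothetical, and this is my illustration of the approach rather than anything Intel has published:

    #include <omp.h>

    /* Hypothetical SAXPY: y = a*x + y over n elements.
       Built with the Intel compiler, the offload pragma ships the loop
       (and the named arrays) to the MIC co-processor; built for plain
       x86, the pragma is ignored and the same loop runs on the host. */
    void saxpy_offload(int n, float a, float *x, float *y)
    {
        #pragma offload target(mic) in(x:length(n)) inout(y:length(n))
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

The appeal is that the body of the loop is ordinary C, parallelised with ordinary OpenMP.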

So Intel is evangelising its MIC architecture against GPGPU solutions such as NVIDIA’s Tesla line. Yesterday NVIDIA’s Steve Scott spoke up to put the other case. If Intel’s point is that a Tesla is really a GPU pressed into service for general computing, then Scott’s first point is that the cores in MIC are really CPUs, albeit of an older, simpler design:

They don’t really have the equivalent of a throughput-optimized GPU core, but were able to go back to a 15+ year-old Pentium design to get a simpler processor core, and then marry it with a wide vector unit to get higher flops per watt than can be achieved by Xeon processors.
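The arithmetic behind that design is straightforward, even though Intel had not published final Knights Corner specifications at the time of writing. Each MIC core carries a 512-bit vector unit, which is eight double-precision values per instruction; assume, purely for illustration, 50 such cores at 1GHz with fused multiply-add, and you get 50 × 8 × 2 = 800 gigaflops of peak double-precision throughput from cores individually far simpler than a Xeon’s.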

Scott then takes on Intel’s most compelling claim, compatibility with existing x86 code. It does not matter much, says Scott, since you will have to change your code anyway:

The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code. No future processor or system (from Intel, NVIDIA, or anyone else) is going to relieve today’s programmers from the hard work of preparing their applications for the future.
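Scott’s point about porting effort is easy to illustrate. Here is the same hypothetical SAXPY loop restructured as a CUDA kernel – again my sketch of the generic CUDA pattern, not code from Scott’s post. Nothing about it is automatic: the programmer must recast the loop as thousands of threads and manage transfers between host and device memory explicitly:

    #include <cuda_runtime.h>

    // One thread per element replaces each loop iteration.
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    void saxpy_cuda(int n, float a, const float *x, float *y)
    {
        float *dx, *dy;
        cudaMalloc((void **)&dx, n * sizeof(float));   // explicit device buffers
        cudaMalloc((void **)&dy, n * sizeof(float));
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
        cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
    }

Whether the pragma version shown earlier really escapes an equivalent tuning effort is, of course, exactly the argument the two companies are having.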

What is the real story here? It would, of course, be most interesting to compare the performance of MIC against Tesla, or against the next generation of NVIDIA GPGPUs based on Kepler, and may the fastest and most power-efficient win. That will have to wait though; in the meantime we can see that Intel is not enjoying watching the world’s supercomputers install NVIDIA GPGPUs – the Oak Ridge National Laboratory Jaguar/Titan (the most powerful supercomputer in the USA) being a high-profile example:

In addition, 960 of Jaguar’s 18,688 compute nodes now contain an NVIDIA graphical processing unit (GPU). The GPUs were added to the system in anticipation of a much larger GPU installation later in the year.

Equally, NVIDIA may be rattled by the prospect of Intel offering strong competition for Tesla; the Tesla line has not had much competition in this space so far.

There is an ARM factor here too. When I spoke to Scott in Beijing, he hinted that NVIDIA would one day produce GPGPUs with ARM chips embedded for CPU duties, perhaps sharing the same memory.