Microsoft announced today at CES in Las Vegas that the next version of Windows will run on ARM as well as Intel CPUs. But why? The reason is that ARM CPUs have huge momentum in mobile computing, thanks to their low power consumption. Microsoft wants Windows to support System on a Chip (SoC) architectures such as NVIDIA’s Tegra 2, which has two ARM Cortex-A9 CPUs combined with an HD-capable graphics processor in a single package. In its press release, the company is careful not to upset established x86/x64 partners Intel and AMD too much, emphasising that Windows will run on SoC packages based on those CPUs as well.
It is an interesting announcement, but one that raises as many questions as answers. The first concerns Microsoft’s mobile strategy, with Windows now seeming to encroach on territory that you have thought belonged to its embedded operating system, Windows CE, which underlies both Windows Mobile and Windows Phone 7. With all its legacy APIs, full-blown Windows does not seem ideal for low-powered, resource-constrained mobile devices; yet the company seems set on bringing full Windows rather than something based on Windows Phone 7 to the emerging tablet market.
The second issue is that applications will need at least re-compiling, and in many cases some re-coding, in order to run on ARM CPUs. Microsoft says it will deliver Office for ARM:
Sinofsky: Microsoft Office is an important part of customers’ PC experience and ensuring it runs natively on ARM is a natural extension of our Windows commitment to SoC architectures.
Windows and Office alone is enough for a decent business device; but customers who buy Windows on ARM expecting their existing games or applications to run will be disappointed.
We have been here before. In the early days of Windows CE, devices ran a variety of processors such as MIPS or Hitachi SH3, and developers had to compile multiple binaries and create setups that installed the right one on each device. In an attempt to overcome the friction this created, Microsoft introduced the Common Executable Format (CEF) with Windows CE 3.0 in 2000. This was an intermediate language format which was translated to native code by a “translator” when it was installed onto a device.
It sounds a bit like .NET or Java; and it was indeed a forerunner of the .NET Common Language Runtime, which appeared in 2002. However, CEF never really caught on. Although it solved deployment issues, it introduced performance problems and was troublesome to debug. Most developers preferred to stick with true native code.
Today though .NET is mature; and we also have Silverlight, a cross-platform implementation of the .NET Framework combined with multimedia player and graphics framework. If Microsoft includes .NET and Silverlight in its ARM build of Windows, that would solve some of the deployment problems, especially for business devices. Many custom applications are built for .NET; and these would in principle run without any need to recompile, since a .NET executable is intermediate code which is compiled to native code at runtime, though any code which includes “platform invoke” calls to native APIs would not work.
It is surprising therefore that neither .NET nor Silverlight is mentioned in Windows president Steve Sinofsky’s Q&A about Windows on ARM. Still, we should not read too much into that. It would be madness if Microsoft did not support its .NET technologies on this new platform, would it not?
GPU programming in the cloud makes sense in cases where you need the performance of a super-computer, but not very often. It could also enable some powerful mobile applications, maybe in financial analysis, or image manipulation, where you use a mobile device to input data and view the results, but cloud processing to do the heavy lifting.
One of the ideas I discussed with someone from Adobe at the NVIDIA GPU conference was to integrate a cloud processing service with PhotoShop, so you could send an image to the cloud, have some transformative magic done, and receive the processed image back.
The snag with this approach is that in many cases you have to shift a lot of data back and forth, which means you need a lot of bandwidth available before it makes sense. Still, Amazon has now provided the infrastructure to make processing as a service easy to offer. It is now over to the rest of us to find interesting ways to use it.
The most eye-opening demonstration at the NVIDIA GPU Technology Conference last week was from Adobe’s David Salesin (Sr. Principal Scientist) and Todor Georgiev (Sr Research Scientist), who showed their Plenoptic Lens along with software for processing the resulting images.
There was a gasp of amazement from the audience when we saw what the process is capable of. We saw an image refocused after the event.
For anyone who has ever taken an out of focus picture – which I guess is everyone – the immediate reaction is to want one NOW. Another appealing idea is to take an image that has several items of interest, but at different depths, and shift the focus from one to another.
So how does it work? It starts with the plenoptic lens, which lets you “capture multiple views of the scene from slightly different viewpoints,” said Salesin:
If you have a high resolution sensor then each one of those images can be fairly high resolution. The neat thing is that with software, with computation, you can put this together into one large high-resolution image.
In a sense you are capturing a whole 4D lightfield. You’ve got two dimensions of the spatial position of the light ray, and also two dimensions of the orientation of the light ray.
With that 4D image, you can then after the fact use computation to take the place of optics. With computation you have a lot more flexibility. You can change the vantage point, the viewpoint a little bit, and you can also change the focus.
To resolve that, to take these individual little pieces of an image and put them together into one large image from any arbitrary view with any arbitrary focus, it turns out that texture mapping hardware is exactly what you need to do that. Using GPU chips we’ve been able to get speedups over the CPU of about 500 times.
Note that the image ends up being constructed in software. It is not just a matter of overlaying the small images in a certain way.
There is a good reason NVIDIA showed this at its conference. Suddenly we all want little cameras with GPUs powerful enough to do this on the fly.
Exhibiting here at the NVIDIA GPU Technology Conference is a Cambridge-based company called tidepowerd, whose product GPU.NET brings GPU programming to .NET developers. The product includes both a compiler and a runtime engine, and one of the advantages of this hybrid approach is that your code will run anywhere. It will use both NVIDIA and AMD GPUs, if they support GPU programming, or fall back to the CPU if neither is available. From the samples I saw, it also looks easier than coding CUDA C or OpenCL; just add an attribute to have a function run on the GPU rather than the CPU. Of course underlying issues remain – the GPU is still isolated and cannot access global variables available to the rest of your application – but the Visual Studio tools try to warn of any such issues at compile time.
GPU.NET is in development and will go into public beta “by October 31st 2010” – head to the web site for more details.
I am not sure what sort of performance or other compromises GPU.NET introduces, but it strikes me that this kind of tool is needed to make GPU programming accessible to a wider range of developers.
NVIDIA CEO Jen-Hsung Huang spoke to the press at the GPU Technology Conference and I took the opportunity to ask some questions.
I asked for his views on the cloud as a supercomputer and whether that would impact the need for local supercomputers of the kind GPU computing enables.
Although we expect more and more to happen in the cloud, in the meantime we’re going to keep buying devices with more and more solid state memory. The way to think about it is, storage is simply a surrogate for bandwidth. If we had infinite bandwidth none of us would need storage. As bandwidth improves the requirement for storage should reduce. But there’s another trend which is that the amount of data we collect is growing incredibly fast … It’s going to be quite a long time before our need for storage will reduce.
But what about local computing power, Gigaflops as opposed to storage?
Wherever there is storage, there’s GigaFlops. Local storage, local computing.
Next, I brought up a subject which has been puzzling me here at GTC. You can do GPU programming with NVIDIA’s CUDA C, which only works on NVIDIA GPUs, or with OpenCL which works with other vendor’s GPUs as well. Why is there more focus here on CUDA, when on the face of it developers would be better off with the cross-GPU approach? (Of course I know part of the answer, that NVIDIA does not mind locking developers to its own products).
The reason we focus all our evangelism and energy on CUDA is because CUDA requires us to, OpenCL does not. OpenCL has the benefit of IBM, AMD, Intel, and ourselves. Now CUDA is a little difference in that its programming approach is different. Instead of an API it’s a language extension. You program in C, it’s a different model.
The reason why CUDA is more adopted than OpenCL is because it is simply more advanced. We’ve invested in CUDA much longer. The quality of the compiler is much better. The robustness of the programming environment is better. The tools around it are better, and there are more people programming it. The ecosystem is richer.
People ask me how do we feel about the fact that it is proprietary. There’s two ways to think about it. There’s CUDA and there’s Tesla. Tesla’s not proprietary at all, Tesla supports OpenCL and CUDA. If you bought a server with Tesla in it, you’re not getting anything less, you’re getting CUDA more. That’s the reason Tesla has been adopted by all the OEMs. If you want a GPU cluster, would you want one that only does OpenCL? Or does OpenCL and CUDA? 80% of GPU computing today is CUDA, 20% is OpenCL. If you want to reach 100% of it, you’re better off using Tesla. Over time, if more people use OpenCL that’s fine with us. The most important thing is GPU computing, the next most important thing to us is NVIDIA’s GPUs, and the next is CUDA. It’s way down the list.
Next, a hot topic. Jen-Hsun Huang explained why he announced a roadmap for future graphics chip architectures – Kepler in 2011, Maxwell in 2013 – so that software developers engaged in GPU programming can plan their projects. I asked him why Fermi, the current chip architecture, had been so delayed, and whether there was good reason to have confidence in the newly announced dates.
He answered by explaining the Fermi delay in both technical and management terms.
The technical answer is that there’s a piece of functionality that is between the shared symmetric multiprocessors (SMs), 236 processors, that need to communicate with each other, and with memory of all different types. So there’s SMs up here, and underneath the memories. In between there is a very complicated inter-connecting system that is very fast. It’s nearly all wires, dense metal with very little logic … we call that the fabric.
When you have wires that are next to each other that closely they couple, they interfere … it’s a solid mesh of metal. We found a major breakdown between the models, the tools, and reality. We got the first Fermi back. That piece of fabric – imagine we are all processors. All of us seem to be working. But we can’t talk to each other. We found out it’s because the connection between us is completely broken. We re-engineered the whole thing and made it work.
Your question was deeper than that. Your question wasn’t just what broke with Fermi – it was the fabric – but the question is how would you not let it happen again? It won’t be fabric next time, it will be something else.
The reason why the fabric failed isn’t because it was hard, but because it sat between the responsibility of two groups. The fabric is complicated because there’s an architectural component, a logic design component, and there’s a physics component. My engineers who know physics and my engineers who know architecture are in two different organisations. We let it sit right in the middle. So the management lesson learned – there should always be a pilot in charge.
Huang spent some time discussing changes in the industry. He identifies mobile computing “superphones” and tablets as the focus of a major shift happening now. Someone asked “What does that mean for your Geforce business?”
I don’t think like that. The way I think is, “what is my personal computer business”. The personal computer business is Geforce plus Tegra. If you start a business, don’t think about the product you make. Think about the customer you’re making it for. I want to give them the best possible personal computing experience.
Tegra is NVIDIA’s complete system on a chip, including ARM processor and of course NVIDIA graphics, aimed at mobile devices. NVIDIA’s challenge is that its success with Geforce does not guarantee success with Tegra, for which it is early days.
The further implication is that the immediate future may not be easy, as traditional PC and laptop sales decline.
The mainstream business for the personal computer industry will be rocky for some time. The reason is not because of the economy but because of mobile computing. The PC … will be under disruption from tablets. The difference between a tablet and a PC is going to become very small. Over the next few years we’re going to see that more and more people use their mobile device as their primary computer.
[Holds up Blackberry] There’s no question right now that this is my primary computer.
The rise of mobile devices is a topic Huang has returned to on several occasions here. “ARM is the most important CPU architecture, instruction set architecture, of the future” he told the keynote audience.
Clearly NVIDIA’s business plans are not without risk; but you cannot fault Huang for enthusiasm or awareness of coming changes. It is clear to me that NVIDIA has the attention of the scientific and academic community for GPU computing, and workstation OEMs are scrambling to built Tesla GPU computing cards into their systems, but transitions in the market for its mass-market graphics cards will be tricky for the company.
Update: Huang’s comments about the reasons for Fermi’s delay raised considerable interest as apparently he had not spoken about this on record before. Journalist Nico Ernst captured the moment on video:
I’m at NVIDIA’s GPU tech conference in San Jose. The central theme of the conference is that the capabilities of modern GPUs enable substantial performance gains for general computing, not just for graphics, though most of the examples we have seen involve some element of graphical processing. The reason you should care about this is that the gains are huge.
Take Matlab for example, a popular language and IDE for algorithm development, data analysis and mathematical computation. We were told in the keynote here yesterday that Matlab is offering a parallel computing toolkit based on NVIDIA’s CUDA, with speed-ups from 10 to 40 times. Dramatic performance improvements opens up new possibilities in computing.
Why has GPU performance advanced so rapidly, whereas CPU performance has levelled off? The reason is that they use different computing models. CPUs are general-purpose. The focus is on fast serial computation, executing a single thread as rapidly as possible. Since many applications are largely single-thread, this is what we need, but there are technical barriers to increasing clock speed. Of course multi-core and multi-processor systems are now standard, so we have dual-core or quad-core machines, with big performance gains for multi-threaded applications.
By contrast, GPUs are designed to be massively parallel. A Tesla C1060 has not 2 or 4 or 8 cores, but 240; the C2050 has 448. These are not the same as CPU cores, but nevertheless do execute in parallel. The clock speed is only 1.3Ghz, whereas an Intel Core i7 Extreme is 3.3Ghz, but the Intel CPU has a mere 6 cores. An Intel Xeon 7560 runs at 2.266 Ghz and has 8 cores.The lower clock speed in the GPU is one reason it is more power-efficient.
NVIDIA’s CUDA initiative is about making this capability available to any application. NVIDIA made changes to its hardware to make it more amenable to standard C code, and delivered CUDA C with extensions to support it. In essence it is pretty simple. The extensions let you specify functions to execute on the GPU, allocate memory for pointers on the GPU, and copy memory between the GPU (called the device) and the main memory on the PC (called the host). You can also synchronize threads and use shared memory between threads.
The reward is great performance, but there are several disadvantages. One is the challenge of concurrent programming and the subtle bugs it can introduce.
Another is the hassle of copying memory between host and device. The device is in effect a computer within a computer. Shifting data between the two is relatively show.
A third is that CUDA is proprietary to NVIDIA. If you want your code to work with ATI’s equivalent, called Streams, then you should use the OpenCL library, though I’ve noticed that most people here seem to use CUDA; I presume they are able to specify the hardware and would rather avoid the compromises of a cross-GPU library. In the worst case, if you need to support both CUDA and non-CUDA systems, you might need to support different code paths depending on what is detected at runtime.
It is all a bit messy, though there are tools and libraries to simplify the task. For example, this morning we heard about GMAC, which makes host and device appear to use a single address space, though I imagine there are performance implications.
NVIDIA says it is democratizing supercomputing, bringing high performance computing within reach for almost anyone. There is something in that; but at the same time as a developer I would rather not think about whether my code will execute on the CPU or the GPU. Viewed at the highest level, I find it disappointing that to get great performance I need to bolster the capabilities of the CPU with a specialist add-on. The triumph of the GPU is in a sense the failure of the CPU. Convergence in some form or other strikes me as inevitable.
At the NVIDIA GPU Technology Conference in San Jose CEO Jen-Hsun Huang talked up the company’s progress in GPU computing, showed some example applications, and announced a high-level roadmap for future graphics chip architectures. NVIDIA has three areas of focus, he said: the Quadro line for visualisation, Tesla for parallel computing, and GeForce/Tegra for personal computing. Tegra is a system on a chip aimed at mobile devices. Mobile, says Huang, is “a completely disruptive force to all of computing.”
NVIDIA’s current chip architecture is called Fermi. The company is settling on a two-year product cycle and will deliver Kepler in 2011 with 3 to 4 times the performance (expressed as Gigaflops per watt) of Fermi. Maxwell in 2013 will have around 12 times the performance of Fermi. In between these architecture changes, NVIDIA will do “kicker” updates to refresh its products, with one for Fermi due soon.
The focus of the conference though is not on super-fast graphics cards in themselves, but rather on using the GPU for general purpose computing. GPUs are very, very good at doing mathematics fast and in parallel. If you have an application that does intensive calculations, then executing that part of the code on the GPU can offer impressive performance increases. NVIDIA’s CUDA library for C lets you do exactly that. Another option is OpenCL, a standard that works across GPUs from multiple vendors.
Adobe uses CUDA for the Mercury Playback engine in Creative Suite 5, greatly improving performance in After Effects, Premiere Pro and Photoshop, but with the annoyance that you have to use a compatible NVIDIA graphics card.
The performance gain from GPU programming is so great that it is unavoidable for applications in relevant areas, such as simulation or statistical analysis. Huang gave a compelling example during the keynote, bringing heart surgeon Dr Michael Black on stage to talk about his work. Operating on a beating heart is difficult because it presents a moving target. By combining robotic surgery with software that is able to predict the heart’s movement through simulation, he is researching how to operate on a heart almost as if it were stopped and with just a small incision.
Programming the GPU is compelling, but difficult. NVIDIA is keen to see it become part of mainstream programming, for obvious reasons, and there are new libraries and tools which help with this, like Parallel Nsight for Visual Studio 2010. Another interesting development, announced today, is CUDA for x86, being developed by PGI, which will let your CUDA code run even when an NVIDIA GPU is not present. Even if the performance gains are limited, it will mean developers who need to support diverse systems can run the same code, rather than having a different code path when no CUDA GPU is detected.
That said, GPU programming still has all the challenges of concurrent development, prone to race conditions and synchronization problems.
Stuffing a server full of GPUs is a cost-effective route to super-computing. I took a brief look at the exhibition, which includes this Colfax CXT8000 with 8 Tesla GPUs; it also has three 1200W power supplies. It may cost $25,000 but if you look at the performance you are getting for the price, machines like this are great value.