Amazon is into day three of a major failure of its Elastic Compute Cloud (EC2) at its Northern Virginia datacenter, and at the time of writing the service has still not fully recovered.
I am reminded of a prescient remark by Tony Lucas at Flexiant, a UK cloud provider, who told me a couple of years ago (with commendable honesty) that cloud failures will be rare, but when they occur will be on a grand scale.
It seems that it is hard to engineer around the possibility of cascading failure. I am not sure exactly what happened in Northern Virginia, but Amazon says on its status page:
"A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances."
It sounds like an automated recovery system built into the compute cloud actually became the problem, as a large number of volumes tried to fix themselves at the same time.
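To make that guess concrete, here is a toy model in Python. It is only a sketch of the general idea and not a description of how Amazon's EBS actually works: the function name, the parameters, and all the numbers are my own assumptions. Each affected volume needs one spare slot to re-mirror into, and a fixed number of new slots is added per round. When far more volumes need re-mirroring than there are spare slots, the backlog drains only as fast as new capacity arrives.

```python
def remirror_backlog(affected_volumes, spare_slots, slots_added_per_round, max_rounds):
    """Toy model (hypothetical, not Amazon's design).

    Returns a list with the number of volumes still waiting to
    re-mirror after each round.
    """
    waiting = affected_volumes
    free = spare_slots
    history = []
    for _ in range(max_rounds):
        # Every waiting volume tries to re-mirror at once; only as many
        # succeed as there are free slots.
        placed = min(waiting, free)
        waiting -= placed
        free -= placed
        # Assumed: operators bring a fixed amount of new capacity online.
        free += slots_added_per_round
        history.append(waiting)
        if waiting == 0:
            break
    return history


# A small failure fits in the spare capacity and clears in one round.
print(remirror_backlog(10, 100, 0, 5))
# A large failure exhausts the spare capacity immediately and then
# recovers only at the rate new capacity is added.
print(remirror_backlog(1000, 100, 50, 5))
```

The point of the sketch is the contrast between the two calls: the same recovery mechanism that handles a small failure invisibly leaves a long-lived backlog once demand exceeds the spare capacity.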
This is not the first Amazon outage, but I believe it is the most severe. It could have been worse, though: I have not heard that any data was lost. What are the implications?
Any computer system can fail. Many companies will be reflecting on this, though, both those directly affected and others, and realising that the cloud can be a single point of failure, despite the scale and expertise that a company like Amazon invests in high availability.
Is Amazon EC2 more or less likely to fail for an extended period than Salesforce.com? Or Microsoft Azure? Or Google App Engine, or Gmail, or IBM’s evolving SmartCloud? It is an excellent question, but I am not sure how we go about answering it other than by reviewing historical performance. I do not expect any of these companies to take advantage of Amazon’s problems to proclaim their own superior resiliency; they will all be too worried about the same thing happening on their own platforms.
My guess is that the industry will get better at this, and that at some unspecified future moment the chance of one of these cloud platforms failing for three days will become exceedingly small – of course risk can never be eliminated, only reduced.
It seems that the risk is not exceedingly small on Amazon’s cloud today, and we should probably assume that the same applies to other providers.
That is something we have always known, so in one sense nothing has changed. This outage is a sharp reminder, though, and planning for failure is a hidden cost of cloud computing that has now been brought into the light.