[LINK] Amazon's lengthy cloud outage shows the danger of complexity

Kim Holburn kim at holburn.net
Sun May 1 17:52:45 AEST 2011


http://arstechnica.com/business/news/2011/04/amazons-lengthy-cloud-outage-shows-the-danger-of-complexity.ars

> Amazon's lengthy cloud outage shows the danger of complexity

> Reddit, Foursquare, and Quora were among the many sites that went down recently due to a prolonged outage of Amazon's cloud services. On Thursday April 21, Amazon Elastic Block Store (EBS) went offline, leaving the many Web and database servers that depended on that storage broken. Not until Easter Sunday (April 24) was service restored to all users. Amazon has now published a lengthy account <http://aws.amazon.com/message/65648/> of what went wrong, and why the failure was both so catastrophic and so long-lasting.


.....

> As major vendors continue to push for greater use of cloud computing, incidents such as this are sure to raise many concerns. This is not the first time Amazon has suffered a substantial outage—an uncorrected transmission error caused several hours of downtime in 2008, for example—but this one was particularly severe, with prolonged unavailability and a small amount of data loss. The disruption to services that depended on the stricken Availability Zone was substantial.
> 
> With high availability one of the key selling points of cloud systems, this is a big problem. Some companies did avoid the problems; users of EBS that used multiple Availability Zones, or better yet, multiple regions, saw much less disruption to their service. But that's a move that incurs both extra costs and extra complexity, and certainly isn't something Amazon talks about when it describes its 99.95 percent availability target. With the Easter downtime, and assuming no more failures in the future, it will be more than 15 years before Amazon can boast of an average availability that high.
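
The 15-year figure is easy to sanity-check. Here is a rough back-of-the-envelope calculation (my own sketch, assuming roughly 72 hours of unavailability for the worst-affected users; these are not Amazon's official numbers):

# Rough check of the "more than 15 years" claim. Assumptions (mine,
# not Amazon's): about 72 hours of downtime, Thursday 21 April to
# Easter Sunday 24 April, against the 99.95 percent availability target.
downtime_hours = 72.0
target_availability = 0.9995

# Total operating time needed before 72 hours of downtime averages
# out to 99.95 percent uptime: downtime / (1 - target).
required_hours = downtime_hours / (1 - target_availability)
required_years = required_hours / (24 * 365)
print(round(required_years, 1))   # ~16.4 years, i.e. "more than 15 years"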

.....

> Such issues are the nature of the beast. Due to their scale, cloud systems must be designed to be in many ways self-monitoring and self-repairing. Under normal circumstances, this is a good thing—an EBS disk might fail, but the node will automatically ensure that it's properly replicated onto a new system so that data integrity is not jeopardized—but the behavior when things go wrong can be hard to predict, and in this case, detrimental to the overall health and stability of the platform. Testing the correct handling of failures is notoriously difficult, but as this problem shows, it's absolutely essential to the reliable running of cloud systems.
> 
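To make the "self-repairing" idea above concrete, here is a deliberately simplified toy sketch (my own illustration, not Amazon's actual EBS mechanism): a repair routine that notices a volume has lost a replica and copies it onto a healthy spare node.

import random

REPLICAS_WANTED = 2   # each volume should have two live copies

class Node:
    def __init__(self, name):
        self.name = name
        self.volumes = set()   # volumes this node holds a copy of
        self.healthy = True

def repair(nodes, volume):
    # Re-replicate `volume` until REPLICAS_WANTED healthy copies exist.
    holders = [n for n in nodes if n.healthy and volume in n.volumes]
    spares = [n for n in nodes if n.healthy and volume not in n.volumes]
    while len(holders) < REPLICAS_WANTED and spares:
        target = random.choice(spares)
        spares.remove(target)
        target.volumes.add(volume)   # stand-in for copying the data
        holders.append(target)
    return len(holders) >= REPLICAS_WANTED

# Usage: one node's disk fails; the monitor repairs every volume it held.
nodes = [Node("node-%d" % i) for i in range(4)]
nodes[0].volumes = {"vol-1"}
nodes[1].volumes = {"vol-1"}
nodes[0].healthy = False   # simulated failure
for vol in set(nodes[0].volumes):
    ok = repair(nodes, vol)
    print(vol, "repaired" if ok else "still under-replicated")

With one failed node and plenty of spares this converges quickly; when a large share of nodes is impaired at once, the same logic can generate a flood of copy traffic, which is the sort of hard-to-predict behaviour the article is warning about.
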
> This complexity is only ever going to increase, as providers develop richer capabilities and users place greater demands on cloud systems. Managing it—and more importantly, proving that it has been managed—will be a key challenge faced by cloud providers. Until that happens, doubts about the availability and reliability of cloud services will continue to be a major influence in the thinking of IT departments and CTOs everywhere.



-- 
Kim Holburn
IT Network & Security Consultant
T: +61 2 61402408  M: +61 404072753
mailto:kim at holburn.net  aim://kimholburn
skype://kholburn - PGP Public Key on request 

More information about the Link mailing list