[LINK] itNews: 'Amazon's botched backup causes cloud chaos'

Martin Barry marty at supine.com
Thu Aug 18 16:11:08 AEST 2011


$quoted_author = "Marghanita da Cruz" ;
> 
> Roger Clarke wrote:
> > You have to laugh, or you'd cry.
> > 
> > The whole point of cloud computing was supposed to be "Rapid 
> > elasticity (i.e. resources are scalable according to demand)".
> > 
> 
> Cloud Computing is more a case of economics of scale rather than
> scalability to infinity and beyond. Though of course with mega-sites amazon,
> google, ebay, yahoo, alibaba etc they are facing their own challenges which
> will translate to the cloud for lesser mortals and governments.

I think there is elasticity there for the consumer (under normal conditions),
but the providers are discovering new failure modes that come with such
intense use of infrastructure, along with the bog-standard software bugs and
hardware failures.

 
> > But Amazon's explanation for their disaster was:
> > "We ran out of spare capacity before all of the volumes were able to 
> > successfully re-mirror," Amazon explained.
> > 
> > The schemozzle began when the power failed.
> > 
> > Amazon's backup generators failed to kick in.
> 
> so what's new? How many times have you been into a bank - when the teller
> explains the computers are playing up.

The only question mark I had was that a single failure took out so much of
their infrastructure. There didn't appear to be sufficient
compartmentalisation to limit the impact of any single failure.
Their outage report¹ mentions this as well.

 
> > My interpretation of the fuzzy information provided is that foolishly 
> > designed (and obviously untested) database replication routines went 
> > mad trying to perform backups and recoveries at the same time, in all 
> > directions at once.
> 
> I think that is called panic....

Software is indeed capable of doing silly things if you don't tell it
otherwise. Seems Amazon keep discovering new forms of the "thundering herd"
problem².
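
For anyone who hasn't met it, the usual mitigation is to make each client
back off and add some randomness so they don't all retry at the same instant.
A rough sketch in Python follows. This is my own illustration, not anything
taken from Amazon's report; the function name backoff_delays and the base/cap
defaults are made up for the example.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one retry delay (in seconds) per attempt.

    Capped exponential backoff with random jitter: the ceiling doubles
    on each attempt up to `cap`, and the actual delay is drawn uniformly
    between 0 and that ceiling. All names and defaults here are
    hypothetical, chosen only for illustration.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Ceiling grows 1x, 2x, 4x, ... of `base`, but never past `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # The jitter spreads clients out instead of retrying in lockstep.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Three simulated clients that all failed at the same moment.
    for client in range(3):
        print(client, [round(d, 2) for d in backoff_delays(5)])
```

Without the random draw, every client that failed together would retry
together, which is the herd all over again.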
 

> > That resulted in overload, and chaos.
> > 
> > [Some] customers were forced to wait up to three days for Amazon to 
> > retrieve a snapshot [i.e. complete the recovery?]
> > 
> 
> Of course, like the banks, who is out of pocket due to these failures?
> I am always amused by the apologies offered by executives - it would be 
> good if we could get to a stage with consumer law, where these failures can 
> be costed and paid out - IT would get a much better reputation in the process.

They are providing fairly generous refunds³ but that just covers some of the
service paid for but not provided. I gather you can already buy "My cloud
failed" insurance.

cheers
Marty


¹ http://aws.amazon.com/message/2329B7/ search for 'prevent the loss of power'

² http://en.wikipedia.org/wiki/Thundering_herd_problem

³ http://aws.amazon.com/message/2329B7/ search for 'Service Credit'


