[LINK] ZDnet: Lightning Strikes Twice in Amazon Cloud

Roger Clarke Roger.Clarke at xamax.com.au
Fri Aug 12 07:08:03 AEST 2011


Amazon deletes customer data
Suzanne Tindal
ZDNet.com.au
August 11th, 2011 (21 hours ago)
http://www.zdnet.com.au/amazon-deletes-customer-data-339319187.htm

As if having lightning strike its EU datacentre wasn't enough, Amazon 
is now struggling with a software problem that saw some customer data 
deleted. It's working on restoring the data, but it may not be 
successful in all cases, and some customers will be out for much 
longer than they wanted. It looks bad for Amazon, especially after 
the company's April US outage.


AWS cloud accidentally deletes customer data
Jack Clark
ZDNet UK
10 August, 2011 15:44
http://www.zdnet.co.uk/news/cloud/2011/08/10/aws-cloud-accidentally-deletes-customer-data-40093665/

After lightning downed parts of Amazon's European cloud over the 
weekend, a fault appeared in the company's storage software that 
caused the system to accidentally delete customer data.

The software bug began deleting customer data after the outage on 
Sunday, according to Amazon Web Services (AWS). The cloud services 
provider was still attempting to recover customer data held in its 
Elastic Block Store (EBS) on Wednesday, meaning some customers were 
still experiencing downtime three days after the initial problem.

AWS's rentable computers - known as 'instances' - typically use EBS 
to store data. The data is placed on hardware separate from that 
running the instance, and the data is served to the instance via a 
network connection. The bug lies in the part of EBS that manages 
stored images of EBS data pools, known as 'snapshots'.

"Independent from the power issue in the affected availability zone, 
we've discovered an error in the EBS software that cleans up unused 
[EBS] snapshots," AWS wrote on its status page on Monday. "During a 
recent run of this EBS software in the EU-West Region, one or more 
blocks in a number of EBS snapshots were incorrectly deleted.

"The root cause was a software error that caused the snapshot 
references to a subset of blocks to be missed during the 
reference-counting process. As a result of the software error, the 
EBS snapshot management system in the EU-West Region incorrectly 
thought some of the blocks were no longer being used and deleted 
them," it added.
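
To illustrate the failure mode AWS describes, here is a minimal Python 
sketch of a reference-counting cleanup pass. It is not Amazon's code: 
the function names, the in-memory block store and the 'missed' 
parameter (which simulates references skipped during counting) are 
all assumptions made for illustration.

```python
from collections import Counter

def count_references(snapshots, missed=frozenset()):
    """Count how many snapshots reference each block ID.

    `missed` is a hypothetical set of (snapshot, block) pairs that the
    counter fails to see; it stands in for the software error.
    """
    counts = Counter()
    for name, blocks in snapshots.items():
        for block in set(blocks):
            if (name, block) not in missed:
                counts[block] += 1
    return counts

def cleanup(store, snapshots, missed=frozenset()):
    """Delete every stored block whose count is zero; return deleted IDs."""
    counts = count_references(snapshots, missed)
    deleted = sorted(b for b in store if counts[b] == 0)
    for block in deleted:
        del store[block]
    return deleted

def demo():
    snapshots = {"snap-a": ["b1", "b2"], "snap-b": ["b2", "b3"]}
    store = {b: b"data" for b in ["b1", "b2", "b3", "b4"]}
    # Correct run: only b4 is unreferenced.
    correct = cleanup(dict(store), snapshots)
    # Faulty run: the reference from snap-b to b3 is missed, so b3
    # looks unused and is deleted although snap-b still needs it.
    buggy = cleanup(dict(store), snapshots, missed={("snap-b", "b3")})
    return correct, buggy
```

In the faulty run the cleanup logic itself behaves as designed; the 
damage comes entirely from the incomplete count, which matches the 
root cause given in the AWS statement.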

Recovery snapshots

Since then, AWS has been working to create recovery snapshots for 
customers to help them resurrect the data volumes. This may not be a 
foolproof solution, as some of the data in the restored pools of 
data, or 'volumes', could be inconsistent, the company said. This 
could cause trouble for applications reliant on the data, it added.

Either way, it will take time for all the affected customers to 
receive their recovery snapshots, because creating them "requires 
[AWS] to move and process large amounts of data", Amazon said. This 
is "why it is taking a long time to complete, particularly for some 
of the larger volumes. As recovery snapshots become available, 
customers will see them appear in their accounts", it added.

Within Amazon's European region - EU-West - there are three 
availability zones. Each EBS volume is tied to a specific 
availability zone and is backed up to several storage devices. While 
there is redundancy within the zone, if the whole zone goes down, it 
can take EBS with it.

"I have been concerned [by the EBS problems]," Paul Armstrong, the 
business systems manager of AWS customer Haven Power, told ZDNet UK. 
"It has disrupted our service to some extent. It has been quite a 
long outage, I wouldn't expect that level of outage on any of our 
other systems."

AWS customers can also store their data in the company's Simple 
Storage Service (S3). This acts like a tape backup service in that it 
is good for storing large quantities of information, but is slower to 
deliver it when needed.

However, S3 cannot directly connect to instances, and EBS is 
typically used as a mediator between the two.

In addition, customers can use 'ephemeral' storage, which is directly 
attached to the individual instance. Ephemeral data has drawbacks, 
compared with EBS, because it co-exists with the instance and will 
disappear if the instance is hit by problems.  

Troubled history

EBS has attracted criticism in the past from customers over the 
quality of service provided, and the service saw failures in March 
and April that generated sharp responses from some.

"Amazon's EBSes are a barrel of laughs in terms of performance and 
reliability and are a constant (and the single largest) source of 
failure across Reddit," a former Reddit programmer wrote in March, 
after a cascading failure in EBS led to outages at Reddit, Quora and a 
host of other sites.

Critics also argue that because EBS is a shared storage environment, 
heavy use by one customer can get all the others on the same server 
into trouble.

"I've heard complaints about EBS suffering from 'noisy neighbour 
syndrome' here," Colin Percival, a security researcher and Amazon 
cloud user, told ZDNet UK. "I don't know if this is a problem with 
the underlying EBS storage or if it's just the (unavoidable) problem 
of EC2 nodes hosting multiple EC2 instances, and the EC2 nodes having 
limited network bandwidth."

EBS's main problem may stem from its lack of redundancy. Ewan Leith, 
founder of system migration company Nutmeg Data, noted that EBS 
images are locked into a single availability zone. If problems occur 
in that zone, the image cannot be moved to a zone in another region.

"When a zone goes down, EBS is almost always the last to be 
recovered," Leith said.

Amazon has a history of expanding its services to run on multiple 
availability zones, as it did with its virtual private cloud product 
on Thursday. However, it has not publicly disclosed any plans to 
allow a single EBS pool to straddle multiple availability zones.


-- 
Roger Clarke                                 http://www.rogerclarke.com/
			            
Xamax Consultancy Pty Ltd      78 Sidaway St, Chapman ACT 2611 AUSTRALIA
                    Tel: +61 2 6288 1472, and 6288 6916
mailto:Roger.Clarke at xamax.com.au                http://www.xamax.com.au/

Visiting Professor in the Cyberspace Law & Policy Centre      Uni of NSW
Visiting Professor in Computer Science    Australian National University


