[LINK] ZDnet: Lightning Strikes Twice in Amazon Cloud
Roger Clarke
Roger.Clarke at xamax.com.au
Fri Aug 12 07:08:03 AEST 2011
Amazon deletes customer data
Suzanne Tindal
ZDNet.com.au
August 11th, 2011
http://www.zdnet.com.au/amazon-deletes-customer-data-339319187.htm
As if having lightning strike its EU datacentre wasn't enough, Amazon
is now struggling with a software problem that saw some customer data
deleted. It's working on restoring the data, but it may not be
successful in all cases, and some customers will be out for much
longer than they wanted. It looks bad for Amazon, especially after
the company's April US outage.
AWS cloud accidentally deletes customer data
Jack Clark
ZDNet UK
10 August, 2011 15:44
http://www.zdnet.co.uk/news/cloud/2011/08/10/aws-cloud-accidentally-deletes-customer-data-40093665/
After lightning downed parts of Amazon's European cloud over the
weekend, a fault appeared in the company's storage software that
caused the system to accidentally delete customer data.
The software bug began deleting customer data after the outage on
Sunday, according to Amazon Web Services (AWS). The cloud services
provider was still attempting to recover customer data held in its
Elastic Block Store (EBS) on Wednesday, meaning some customers were
still experiencing downtime three days after the initial problem.
AWS's rentable computers - known as 'instances' - typically use EBS
to store data. The data is placed on hardware separate from that
running the instance, and the data is served to the instance via a
network connection. The bug lies in the part of EBS that manages
stored images of EBS data pools, known as 'snapshots'.
"Independent from the power issue in the affected availability zone,
we've discovered an error in the EBS software that cleans up unused
[EBS] snapshots," AWS wrote on its status page on Monday. "During a
recent run of this EBS software in the EU-West Region, one or more
blocks in a number of EBS snapshots were incorrectly deleted.
"The root cause was a software error that caused the snapshot
references to a subset of blocks to be missed during the
reference-counting process. As a result of the software error, the
EBS snapshot management system in the EU-West Region incorrectly
thought some of the blocks were no longer being used and deleted
them," it added.
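To illustrate the failure mode AWS describes, here is a minimal sketch
of reference-counted block cleanup. It is not AWS's implementation: the
function names, the data layout (snapshots as sets of block IDs), and
the `skip` parameter that simulates missed references are all
assumptions made for illustration.

```python
from collections import Counter


def count_references(snapshots, skip=frozenset()):
    """Count how many snapshots reference each block.

    `skip` is a hypothetical set of (snapshot_id, block_id) pairs whose
    references are missed, simulating the kind of bug described above.
    """
    counts = Counter()
    for snap_id, blocks in snapshots.items():
        for block_id in blocks:
            if (snap_id, block_id) in skip:
                continue  # simulated bug: this reference is never counted
            counts[block_id] += 1
    return counts


def collect_unreferenced(stored_blocks, snapshots, skip=frozenset()):
    """Return the blocks a cleanup pass would delete (zero counted references)."""
    counts = count_references(snapshots, skip)
    return {b for b in stored_blocks if counts[b] == 0}


# Illustrative data: block "b4" is genuinely unused; "b2" is shared.
stored_blocks = {"b1", "b2", "b3", "b4"}
snapshots = {"snap-a": {"b1", "b2"}, "snap-b": {"b2", "b3"}}

# Cleanup with every reference counted.
correct = collect_unreferenced(stored_blocks, snapshots)

# Cleanup when snap-b's reference to "b3" is missed during counting.
buggy = collect_unreferenced(stored_blocks, snapshots, skip={("snap-b", "b3")})
```

With every reference counted, only the unused block is selected for
deletion. When one reference is missed, a block that a snapshot still
needs is selected as well, which matches the symptom AWS reported.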
Recovery snapshots
Since then, AWS has been working to create recovery snapshots for
customers to help them resurrect the data volumes. This may not be a
foolproof solution, as some of the data in the restored pools of
data, or 'volumes', could be inconsistent, the company said. This
could cause trouble for applications reliant on the data, it added.
Either way, it will take time for all the affected customers to
receive their recovery snapshots, because creating them "requires
[AWS] to move and process large amounts of data", Amazon said. This
is "why it is taking a long time to complete, particularly for some
of the larger volumes. As recovery snapshots become available,
customers will see them appear in their accounts", it added.
Within Amazon's European region - EU-West - there are three
availability zones. Each EBS volume is tied to a specific
availability zone and is backed up to several storage devices. While
there is redundancy within the zone, if the whole zone goes down, it
can take EBS with it.
"I have been concerned [by the EBS problems]," Paul Armstrong, the
business systems manager of AWS customer Haven Power, told ZDNet UK.
"It has disrupted our service to some extent. It has been quite a
long outage, I wouldn't expect that level of outage on any of our
other systems."
AWS customers can also store their data in the company's Simple
Storage Service (S3). This acts like a tape backup service in that it
is good for storing large quantities of information, but is slower to
deliver it when needed.
However, S3 cannot directly connect to instances, and EBS is
typically used as a mediator between the two.
In addition, customers can use 'ephemeral' storage, which is directly
attached to the individual instance. Ephemeral data has drawbacks,
compared with EBS, because it co-exists with the instance and will
disappear if the instance is hit by problems.
Troubled history
EBS has attracted criticism in the past from customers over the
quality of service provided, and the service saw failures in March
and April that generated sharp responses from some.
"Amazon's EBSes are a barrel of laughs in terms of performance and
reliability and are a constant (and the single largest) source of
failure across Reddit," a former Reddit programmer wrote in March,
after a cascading failure in EBS led to outages at Reddit, Quora and a
host of other sites.
Critics also argue that because EBS is a shared storage environment,
heavy use by one customer can get all the others on the same server
into trouble.
"I've heard complaints about EBS suffering from 'noisy neighbour
syndrome' here," Colin Percival, a security researcher and Amazon
cloud user, told ZDNet UK. "I don't know if this is a problem with
the underlying EBS storage or if it's just the (unavoidable) problem
of EC2 nodes hosting multiple EC2 instances, and the EC2 nodes having
limited network bandwidth."
EBS's main problem may stem from its lack of redundancy. Ewan Leith,
founder of system migration company Nutmeg Data, noted that EBS
images are locked into a single availability zone. If problems occur
in that zone, the image cannot be moved to a zone in another region.
"When a zone goes down, EBS is almost always the last to be
recovered," Leith said.
Amazon has a history of expanding its services to run on multiple
availability zones, as it did with its virtual private cloud product
on Thursday. However, it has not publicly disclosed any plans to
allow a single EBS pool to straddle multiple availability zones.
--
Roger Clarke http://www.rogerclarke.com/
Xamax Consultancy Pty Ltd 78 Sidaway St, Chapman ACT 2611 AUSTRALIA
Tel: +61 2 6288 1472, and 6288 6916
mailto:Roger.Clarke at xamax.com.au http://www.xamax.com.au/
Visiting Professor in the Cyberspace Law & Policy Centre Uni of NSW
Visiting Professor in Computer Science Australian National University