[LINK] virgin blue outage II

Tue Feb 22 12:35:39 AEDT 2011

On 21/02/11 14:36, Philip Argy wrote:
> Could the use of solid state disk arrays make them abnormally vulnerable to
> a power outage?  It seems pretty amazing to me that something as simple as a
> power outage would be an issue in this day and age!
we have moved on a little from the 'olden days' it seems. and yet, not
really.

to my mind, not having a substantial ups capability built into data
centre infrastructure is just as bad as having no backup of your data,
or no alternative hardware available - seems the event of unexpected
events is so unexpected that expecting any expectation of them is not to
be expected .... *headdesk*

[last i heard, virgin blue were still using the same vendor/provider as
a couple of other local point-to-point airlines. and yet, it seems to be
vb that is getting all the 'bad luck' ...]

> -----Original Message-----
> From: link-bounces at mailman.anu.edu.au
> [mailto:link-bounces at mailman.anu.edu.au] On Behalf Of Rachel Polanskis
> Sent: Friday, October 08, 2010 5:29 PM
> To: Link list
> Subject: [LINK] virgin blue outage
>
> Hi,
> I was in a meeting today, with some product vendors whose name starts with
> the 15th letter of the alphabet.   We briefly discussed the virgin blue
> airline checkout crash.   Apparently, those in the know told us that the
> problem was caused by a netapp data server that uses solid state (ssd)
> disk drives in the array.  According to the guy that I spoke to, this was
> a new system that is arguably using bleeding edge hardware and the issue
> was caused by firmware mismatches on the drives themselves, vs the netapp
> RAID layer.  How true this is I do not know, but the people concerned did
> seem to have some knowledge of the event...
with the outsource vendor causing a much extended downtime in that (now
'first') system outage by trying to 'fix' the problem, instead of
bringing a 'backup' system online.

seems that learning from the past is just as hard for some in ict as it
is for some in much older industries.

only, in ict, all the 'big' events have occurred within a single human
lifetime ... and we're talking about systems specifically designed to
accumulate and process data and information, and we *still* manage to
fail to apply that to itself. o.O

[[i originally considered examining certain kinds of systemic failures
in large ict systems for my phd research. but i/we couldn't get anyone
to 'buy in' - no one wanted their dirty laundry aired (even though it is
quite possible to make parts/whole theses confidential ...). unless and
until we have an active, collective memory regarding the causes,
consequences, and resolutions of systems failures, we *will* continue to
see them occur again and again etc.]]

-- 
Steven