[LINK] Web Central outage - was The Sky didn't fall. Yet.
linda rouse
linda at databasics.com.au
Wed May 6 11:26:23 AEST 2009
May 2009, Tom Koltai wrote:
>
>> Well Linkers - it appears that I shouldn't give up my day job.
>> The market appears stable.
>> OK - resolutions - no more stock tips.
>> Upgrade PC so the 287 maths chip doesn't keep giving floating point
>> errors ;-)
>> Coz I put in the corrections and yesterdays retail numbers and it still
>> says the 5th.
>>
>> Stupid PC.
>
> Time to get a real computer! ;)
>
> rachel
Well said Rachel! :-)
Mind you, I have been far more concerned about the Web Central mail
outage last week that had our office without email for 48+ hours along
with 200,000 other users! Do we churn or not - THAT is the question :-)
I'm surprised no one else reported it on Link - there was some coverage
in the IT press:
http://www.smartcompany.com.au/index.php?option=com_content&task=view&id=32514&Itemid=290
http://mobile.itnews.com.au/Article.aspx?CIID=143657&type=News
http://mobile.itnews.com.au/Article.aspx?CIID=144015&type=News
Received a very apologetic email from Web Central which says it all...
*Mail Incident April 2009*
Dear Customer,
As you know, we experienced a major outage to our email services this
week, which has resulted in many of our customers having been
inconvenienced.
We do appreciate that email is critical to our customers' businesses and
sincerely apologise for any inconvenience you have suffered and for the
impact these issues have had on you and your business.
High Level Timeline
Tuesday, 28 April 2009
Our technical teams worked around the clock since the initial issue with
our storage system was discovered at 1:00am on Tuesday morning, 28
April. A Critical Incident Manager was appointed and an incident team
assembled directly after the mail system failed at 9:30am. The
underlying issue with the storage system was resolved on Tuesday evening.
Wednesday, 29 April 2009
The email system was brought back into service on Wednesday morning. A
new issue was detected shortly after that which indicated data
corruption was now present in the mail system itself. We had to take
this system offline to ensure that the integrity of our customers' email
data was protected and to avoid any email data loss. Full data
integrity checks were performed on all 10 message stores throughout the
day and night on Wednesday and all data corruption was fixed.
If data corruption had not been able to be resolved in this way,
customer data could still have been restored from the backup copies of
our mail system although the volume of data involved is so large that it
would have taken significantly longer to restore and our priority was to
avoid making a bad situation worse.
Thursday, 30 April 2009
The system was returned to service at about 12:40am on Thursday.
Customers were able to access the mail platform on Thursday morning and
the backlog of email that had been queued for delivery was being
successfully delivered to customers' mailboxes, with about 90% of the
backlog delivered by 9am. About 9:30am we became aware that some
customers were having connection issues again and further
investigations were undertaken. These connection issues were caused by
the load on the system from our customers reconnecting and downloading
large volumes of mail that had queued over the previous two days. We
made some further changes to our network which helped to alleviate these
issues for most customers and after lunch time on Thursday we saw the
system performance improve for most customers. We continued to monitor
and tweak the system throughout Thursday night.
Friday, 1 May 2009
From Friday morning, the mail system was fully operational and
customers were again able to access their email. We continued to
monitor performance throughout the day.
Next Steps & Lessons
Over the course of this outage we have received a lot of feedback from
our customers. We would like to assure you that we plan to make some
significant changes going forward, in particular to our communication
with our customers during major incidents and more especially, if email
is impacted.
Here are some of the things that we plan to implement to help to address
the concerns that our customers have raised this week:
* Establish a service status page on the www.webcentral.com.au web
site where information will be posted on the health of all of the
services that we provide each day. Customer updates during incidents
will also be posted here as well as in Mission Control System News
* Implement changes to our phone system to enable us to better
handle exceptionally high call volumes
* Investigate additional ways to deliver a recorded message to
customers who call for information in times of crisis as our current
system was so overwhelmed by the volume of received calls, that this
feature stopped working
In the past 12 months, prior to this major incident, our POP email
platform experienced uptime in excess of 99.99% following a range of
investments to improve the redundancy and performance of this system.
We are confident that we can return the stability of this system to the
same levels going forward, and are currently reviewing the steps we plan
to take on the operational side to help to ensure a service failure of
this nature does not recur. We will provide an update to you on these
steps next week.
Thank you for your patience and forbearance this week.
Kind Regards
The WebCentral Team
regards
Linda
--
Linda Rouse, Information Manager
DataBasics Pty Limited
1300 886 238 (bus hrs) linda at databasics.com.au
http://www.databasics.com.au