[LINK] Web Central outage - was The Sky didn't fall. Yet.

linda rouse linda at databasics.com.au
Wed May 6 11:26:23 EST 2009


May 2009, Tom Koltai wrote:
> 
>> Well Linkers - it appears that I shouldn't give up my day job.
>> The market appears stable.
>> OK - resolutions - no more stock tips.
>> Upgrade PC so the 287 maths chip doesn't keep giving floating point
>> errors ;-)
>> Coz I put in the corrections and yesterday's retail numbers and it still
>> says the 5th.
>>
>> Stupid PC.
> 
> Time to get a real computer!   ;)
> 
> rachel

Well said Rachel! :-)

Mind you, I have been far more concerned about the Web Central mail 
outage last week that had our office without email for 48+ hours along 
with 200,000 other users! Do we churn or not - THAT is the question :-)
I'm surprised no one else reported it on Link - there was some coverage
in the IT press:

http://www.smartcompany.com.au/index.php?option=com_content&task=view&id=32514&Itemid=290
http://mobile.itnews.com.au/Article.aspx?CIID=143657&type=News
http://mobile.itnews.com.au/Article.aspx?CIID=144015&type=News


Received a very apologetic email from Web Central which says it all...

*Mail Incident April 2009*

Dear Customer,
As you know, we experienced a major outage to our email services this 
week, which has resulted in many of our customers having been 
inconvenienced.

We do appreciate that email is critical to our customers’ businesses and
sincerely apologise for any inconvenience you have suffered and for the 
impact these issues have had on you and your business.

High Level Timeline

Tuesday, 28 April 2009

Our technical teams worked around the clock since the initial issue with 
our storage system was discovered at 1:00am on Tuesday morning, 28 
April.  A Critical Incident Manager was appointed and an incident team 
assembled directly after the mail system failed at 9:30am. The 
underlying issue with the storage system was resolved on Tuesday evening.

Wednesday, 29 April 2009

The email system was brought back into service on Wednesday morning.  A 
new issue was detected shortly after that which indicated data 
corruption was now present in the mail system itself.  We had to take 
this system offline to ensure that the integrity of our customers’ email 
data was protected and to avoid any email data loss.  Full data 
integrity checks were performed on all 10 message stores throughout the 
day and night on Wednesday and all data corruption was fixed.

If data corruption had not been able to be resolved in this way, 
customer data could still have been restored from the backup copies of 
our mail system although the volume of data involved is so large that it 
would have taken significantly longer to restore and our priority was to 
avoid making a bad situation worse.

Thursday, 30 April 2009

The system was returned to service at about 12:40am on Thursday. 
Customers were able to access the mail platform on Thursday morning and 
the backlog of email that had been queued for delivery was being 
successfully delivered to customers’ mailboxes, with about 90% of the 
backlog delivered by 9am.  About 9:30am we became aware that some 
customers were having connection issues again and further
investigations were undertaken.  These connection issues were caused by 
the load on the system from our customers reconnecting and downloading 
large volumes of mail that had queued over the previous two days.  We 
made some further changes to our network which helped to alleviate these 
issues for most customers and after lunch time on Thursday we saw the 
system performance improve for most customers.  We continued to monitor 
and tweak the system throughout Thursday night.

Friday, 1 May 2009

From Friday morning, the mail system was fully operational and
customers were again able to access their email.  We continued to 
monitor performance throughout the day.

Next Steps & Lessons

Over the course of this outage we have received a lot of feedback from 
our customers.  We would like to assure you that we plan to make some 
significant changes going forward, in particular to our communication 
with our customers during major incidents and more especially, if email 
is impacted.

Here are some of the things that we plan to implement to help to address 
the concerns that our customers have raised this week:

     * Establish a service status page on the www.webcentral.com.au web 
site where information will be posted on the health of all of the 
services that we provide each day. Customer updates during incidents 
will also be posted here as well as in Mission Control ‘System News’

     * Implement changes to our phone system to enable us to better 
handle exceptionally high call volumes

     * Investigate additional ways to deliver a recorded message to 
customers who call for information in times of crisis as our current 
system was so overwhelmed by the volume of received calls, that this 
feature stopped working

In the past 12 months, prior to this major incident, our POP email 
platform experienced uptime in excess of 99.99% following a range of 
investments to improve the redundancy and performance of this system.

We are confident that we can return the stability of this system to the 
same levels going forward, and are currently reviewing the steps we plan 
to take on the operational side to help to ensure a service failure of 
this nature does not recur.  We will provide an update to you on these 
steps next week.

Thank you for your patience and forbearance this week.
Kind Regards
The WebCentral Team
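
For a sense of scale, here is a rough sketch of the arithmetic behind the 
99.99% uptime figure quoted above. It assumes a 365-day year and treats 
the outage as a flat 48 hours, which is our office's experience rather 
than a figure from Web Central; the function names are purely 
illustrative.

```python
# Assumption: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year that a given uptime percentage permits."""
    return HOURS_PER_YEAR * 60 * (1 - uptime_percent / 100)


def uptime_percent(outage_hours: float) -> float:
    """Yearly uptime percentage if the only downtime is one outage."""
    return 100 * (1 - outage_hours / HOURS_PER_YEAR)


# Downtime budget implied by 99.99% uptime over a year.
print(round(allowed_downtime_minutes(99.99), 1))  # 52.6 (minutes)

# Yearly uptime if a single 48-hour outage is the only downtime.
print(round(uptime_percent(48), 2))  # 99.45 (percent)
```

On those assumptions, 99.99% allows under an hour of downtime a year, 
so a single 48-hour outage uses up roughly 55 years' worth of that budget.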

regards
Linda

-- 
Linda Rouse, Information Manager
DataBasics Pty Limited
1300 886 238 (bus hrs) linda at databasics.com.au
http://www.databasics.com.au

