[LINK] itN: 'Salesforce users in 16 hour outage outrage'

Martin Barry marty at supine.com
Sat May 14 05:28:42 AEST 2016


$quoted_author = "Roger Clarke" ;
> 
> Salesforce chief Marc Benioff has been forced to apologise to the
> company's customers after a 16 hour outage that is still ongoing [due?] to
> a North American instance [which?] downed operations around the country.
> 
> [I'm having trouble parsing that sentence.]

Salesforce shard their customer base across a number of clusters, which they
refer to as instances. You can see them listed here:
https://trust.salesforce.com/trust/instances
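
As a rough illustration of what that sharding means in practice (a sketch
only: the customer names and the lookup table are invented, and this is not
Salesforce's actual scheme), each customer lives on exactly one instance, so
an outage on one instance only hits the customers assigned to it:

```python
# Hypothetical customer -> instance assignment. The customer names are
# made up; the instance names are just examples of the NAxx naming style.
CUSTOMER_INSTANCE = {
    "acme": "NA11",
    "globex": "NA12",
    "initech": "NA14",
    "umbrella": "NA14",
}


def affected_customers(down_instances):
    """Return the customers whose instance is in the set of downed instances."""
    down = set(down_instances)
    return sorted(c for c, inst in CUSTOMER_INSTANCE.items() if inst in down)


print(affected_customers(["NA14"]))
```

Customers on other instances don't appear in the result, which is why an
outage like this is reported per instance rather than for Salesforce as a
whole.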

Click on "history" in the top right and you can see that there was a short
outage early on May 10 (UTC) which appears to have affected 3
production instances (NA11, NA12 and NA14) and 3 test instances (CS9, CS10
and CS11).

NA14 then experienced about 20 hours of downtime, and has been listed as
suffering "degradation" since.


> Salesforce had moved the NA14 instance to a new site in Washington DC
> around eight hours before the outage, after a circuit breaker failure
> caused two hours of downtime at its former primary data centre in Herndon,
> Virginia.
> 
> [Maybe I'm naive, but I thought instances were managed by software rather
> than by people, and that failure of instances was normal, and that
> recovery from failed instances was too.  Whatever happened to rollback and
> recovery techniques?]

It might just be the language here: instance == cluster. So while there
would be internal redundancy and automation within an instance, cutting the
entire instance over to a DR site would be triggered by a person after a
proper triage of the primary site problems.


> [So they *do* maintain occasional frozen mirrors and actionable logs -
> which would enable restart from a prior state and re-run forwards.  But
> they *don't* exercise the recovery routines, and hence are making it up as
> they go along?!] 

It sounds like they have some kind of replication to the DR site as well as
backups. The initial recovery path was a switch to the DR site which appears
to have been scuppered by some kind of catastrophic failure. They've now
switched to recovering from backups.
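
To make that ordering concrete, here is a minimal sketch of "try the fast
path, fall back to the slow one". The function names and the simulated
failure are placeholders of mine, not anything Salesforce has published
about their tooling:

```python
def recover(instance, steps):
    """Try each recovery step in order; return the name of the first that succeeds."""
    for name, step in steps:
        try:
            step(instance)
            return name
        except RuntimeError as err:
            print(f"{name} failed for {instance}: {err}")
    raise RuntimeError(f"no recovery path left for {instance}")


def switch_to_dr(instance):
    # Placeholder for the DR cutover, simulated here as failing.
    raise RuntimeError("DR site failure")


def restore_from_backup(instance):
    # Placeholder for the slower restore-from-backup path, simulated as succeeding.
    pass


print(recover("NA14", [("DR failover", switch_to_dr),
                       ("backup restore", restore_from_backup)]))
```

The catch with the second path is that a restore from backups usually takes
much longer and only gets you back to the time of the last backup, which is
why it comes after the DR switch in the list.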

cheers
Marty


