[LINK] itN: 'Salesforce users in 16 hour outage outrage'

Sat May 14 05:28:42 AEST 2016

Roger Clarke wrote:
> 
> Salesforce chief Marc Benioff has been forced to apologise to the
> company's customers after a 16 hour outage that is still ongoing [due?] to
> a North American instance [which?] downed operations around the country.
> 
> [I'm having trouble parsing that sentence.]

Salesforce shard their customer base across a number of clusters, which they
refer to as instances. You can see them listed here:
https://trust.salesforce.com/trust/instances

Click on "history" in the top right and you can see that there was a short
outage early on May 10 (UTC), which appears to have affected 3 production
instances (NA11, NA12 and NA14) and 3 test instances (CS9, CS10 and CS11).

NA14 has since experienced about 20 hours of downtime and "degradation".

> Salesforce had moved the NA14 instance to a new site in Washington DC
> around eight hours before the outage, after a circuit breaker failure
> caused two hours of downtime at its former primary data centre in Herndon,
> Virginia.
> 
> [Maybe I'm naive, but I thought instances were managed by software rather
> than by people, and that failure of instances was normal, and that
> recovery from failed instances was too.  Whatever happened to rollback and
> recovery techniques?]

It might just be the language here. Instance == cluster. So while there
would be internal redundancy and automation within an instance, cutting the
entire instance over to a DR site is something a person would trigger after
proper triage of the problems at the primary site.
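
To make that distinction concrete, here is a rough sketch in Python. It is
purely illustrative: the class, method names and site labels are made up for
this example and say nothing about how Salesforce actually implements it. The
idea is only that a node failure inside the cluster is absorbed
automatically, while a cutover of the whole instance to the DR site is
refused unless a person has signed off on it.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical model of an 'instance', i.e. a whole cluster."""
    name: str
    active_site: str
    dr_site: str
    healthy_nodes: int
    log: list = field(default_factory=list)

    def handle_node_failure(self):
        # Internal redundancy: losing one node is routine and needs no human.
        if self.healthy_nodes > 0:
            self.healthy_nodes -= 1
        self.log.append("node failed; handled automatically")

    def fail_over_to_dr(self, approved_by=None):
        # Site-level cutover: refuse unless someone approved it after triage.
        if not approved_by:
            self.log.append("site failover requested but not approved")
            return False
        self.active_site, self.dr_site = self.dr_site, self.active_site
        self.log.append(f"site failover approved by {approved_by}")
        return True


# Example run with made-up values.
na14 = Instance("NA14", active_site="primary", dr_site="dr", healthy_nodes=4)
na14.handle_node_failure()                # absorbed inside the cluster
na14.fail_over_to_dr()                    # refused, nobody approved it
na14.fail_over_to_dr("on-call engineer")  # cutover goes ahead
```

The manual gate is the point of the sketch: the automation covers the routine
failures, and the expensive, hard-to-reverse step waits for a human decision.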

> [So they *do* maintain occasional frozen mirrors and actionable logs -
> which would enable restart from a prior state and re-run forwards.  But
> they *don't* exercise the recovery routines, and hence are making it up as
> they go along?!] 

It sounds like they have some kind of replication to the DR site as well as
backups. The initial recovery path was a switch to the DR site which appears
to have been scuppered by some kind of catastrophic failure. They've now
switched to recovering from backups.

cheers
Marty