[LINK] Optus Outage
Kim Holburn
kim at holburn.net
Thu Nov 9 17:35:13 AEDT 2023
Great post.
>But here's the NEXT problem - all Optus networking is offshore. There's almost no-one in Australia who can physically fix it.
>So what do you do when your offshore outsourced network guys break your core network infrastructure, and you've retrenched everyone who can fix it locally?
>You have a 7 hour outage, that's what you do.
Ouch!
We have government and other services depending on a foreign company with critical infrastructure overseas!!!!
On 2023/11/9 11:33 am, Kate Lance wrote:
> Hi Narelle,
>
> An interesting post on Mastodon from Rob Thomas, supporting the idea it was
> a route reflector overload -
> https://mastodon.au/@xrobau/111376847362633903
>
> The problem yesterday started at about 4am, when Optus told the world 'I no
> longer have any internet connectivity', and 'Do not send any internet traffic
> to me, at all'. The technical description is that they withdrew ALL of their
> routes from the #DFZ (Which is "The Internet", as seen by all the core routers
> that ACTUALLY control the internet).
>
> However, as a precursor at about 3am there was a hint that things weren't
> perfect, as there was a flurry of changes from Optus to the outside world
> saying, roughly, 'Something has changed inside my network, but you can still
> keep sending me stuff'.
>
> Now, as two final bits of possibly relevant information, the default for
> maximum-prefix on #Cisco #ASR9000 is 1048576 (this number is 'the number of
> routes that can be accepted by this router'), and MOST IMPORTANTLY the DFZ
> ("the internet") has about 980,000 routes in it at the moment. That's only 90k
> odd routes LESS than the default maximum.
>
> I'd be amazed if Optus has less than 100k internal routes that aren't visible
> to the internet, but are visible internally.
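
A quick check of the arithmetic in the quoted post, as a minimal sketch. All three figures are taken from the post at face value and are not independently verified; the function names are only illustrative. On those numbers the gap between the DFZ table and the quoted default limit is 68,576 routes, somewhat tighter than the "90k odd" mentioned, and the guessed 100k internal routes would push the total past the limit.

```python
# Figures as quoted in the post above; treat them as assumptions, not verified values.
MAX_PREFIX_DEFAULT = 1_048_576   # quoted default maximum-prefix (equal to 2**20)
DFZ_ROUTES = 980_000             # quoted approximate size of the DFZ routing table
INTERNAL_ROUTES_GUESS = 100_000  # the post's guess at internal-only routes


def headroom(limit: int, external: int) -> int:
    """Routes that can still be accepted before the limit is reached."""
    return limit - external


def exceeds_limit(limit: int, external: int, internal: int) -> bool:
    """True if the combined external and internal routes go over the limit."""
    return external + internal > limit


if __name__ == "__main__":
    print("headroom:", headroom(MAX_PREFIX_DEFAULT, DFZ_ROUTES))
    print("over limit:", exceeds_limit(MAX_PREFIX_DEFAULT, DFZ_ROUTES, INTERNAL_ROUTES_GUESS))
```

If the internal route count really is near that guess, a router holding both tables would see more than a million routes, which is the condition the post goes on to blame.
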
>
> So here's what I think happened. At 3am, the first core #router was
> upgraded, and a new config was put in place. This did not join the network
> correctly, and things were half broken. What SHOULD have happened is that all
> the changes should have stopped, and either rolled back, or waited for further
> investigation (the cause being that more than 1mil routes were visible, causing
> it to shut down)
>
> However, someone decided 'Well, maybe if we upgrade the SECOND one, that'll fix
> the first one' at 4am. That broke the SECOND one, and took Optus completely off
> the internet.
>
> (Continued, see next for why this is far worse than it should have been)
> .....
>
>
> Regards, Kate
>
>
> On Wed, Nov 08, 2023 at 05:33:43PM +1100, Narelle Clark wrote:
>> Rumour has it that it was a BGP update from an external source that wasn't filtered
>> properly with which the BGP route reflectors then overloaded the internal
>> routers. Persistently.
>>
>> It was clearly an internal transport problem arising from an underlying IP
>> protocol. BGP fits that bill completely as it would be redistributed, and
>> clearly their management network isn't sufficiently out of band. Once a
>> network of that scale goes down like that, you can't just turn it back on
>> and expect it to all work fine - millions of devices all want to
>> re-register at once, and all those state changes across the network have to
>> converge...
>>
>> Narelle
>>
>> On Wed, 8 Nov 2023 at 10:30, Alex (Maxious) Sadleir <maxious at gmail.com>
>> wrote:
>>
>>> Around 4am, Optus networks re-announced all their BGP routes at once
>>> https://radar.cloudflare.com/routing/as7474
>>> https://radar.cloudflare.com/routing/as4804
>>> This is indicative of a change management malfunction, akin to IBM's
>>> 2016 eCensus routers restarting with no routes
>>>
>>> https://www.itnews.com.au/news/ibm-treasury-in-settlement-talks-over-census-failure-440066
>>>
>>> The VoWifi infrastructure seems to be online but unable to connect any
>>> calls
>>> https://goughlui.com/2023/11/08/breaking-optus-nationwide-outage-08-11-2023/
>>> More alarmingly 000 doesn't work on landlines
>>> https://twitter.com/lucethoughts/status/1722029287727825124 contrary
>>> to advice from emergency services
>>> https://twitter.com/nswpolice/status/1722028862161449151
>>>
>>> On Wed, Nov 8, 2023 at 10:08 AM Tom Worthington
>>> <tom.worthington at tomw.net.au> wrote:
>>>> Any more news on what caused the Optus network outage? On ABC Canberra
>>>> Radio this morning I suggested it was most likely a software upgrade
>>>> which went wrong, and would be fixed by 6pm.
>>>>
>>>> Is VoWiFi working?
>>>>
>>>> I use Telstra, but when COVID-19 struck, I purchased an Optus 4G modem,
>>>> with an Optus SIM. This was in case Telstra went down.
>>>>
>>>>
>>>> --
>>>> Tom Worthington http://www.tomw.net.au
>>>> _______________________________________________
>>>> Link mailing list
>>>> Link at anu.edu.au
>>>> https://mailman.anu.edu.au/mailman/listinfo/link
>>>
>>
>> --
>>
>>
>> Narelle
>> narellec at gmail.com
--
Kim Holburn
IT Network & Security Consultant
+61 404072753
mailto:kim at holburn.net aim://kimholburn
skype://kholburn - PGP Public Key on request