[LINK] The DDoS That Almost Broke the Internet
stephen at melbpc.org.au
Thu Mar 28 12:27:12 AEDT 2013
The DDoS That Almost Broke the Internet
http://blog.cloudflare.com March 27th, 2013
The New York Times this morning published a story about the Spamhaus DDoS
attack and how CloudFlare helped mitigate it and keep the site online.
The Times calls the attack the largest known DDoS attack ever on the
Internet.
We wrote about the attack last week. <http://blog.cloudflare.com/the-ddos-
that-knocked-spamhaus-offline-and-ho> "The DDoS That Knocked Spamhaus
Offline (And How We Mitigated It)"
At the time, it was a large attack, sending 85Gbps of traffic. Since then,
the attack got much worse. Here are some of the technical details of what
we've seen.
Growth Spurt
On Monday, March 18, 2013, Spamhaus contacted CloudFlare regarding an attack
they were seeing against their website spamhaus.org. They signed up for
CloudFlare and we quickly mitigated the attack. The attack was initially
approximately 10Gbps, generated largely from open DNS recursors. On March
19, the attack increased in size, peaking at approximately 90Gbps. It
fluctuated between 90Gbps and 30Gbps until 01:15 UTC on March 21.
The attackers were quiet for a day. Then, on March 22 at 18:00 UTC, the
attack resumed, peaking at 120Gbps of traffic hitting our network. As we
discussed in the previous blog post, CloudFlare uses Anycast technology
which spreads the load of a distributed attack across all our data centers.
This allowed us to mitigate the attack without it affecting Spamhaus or any
of our other customers. The attackers ceased their attack against the
Spamhaus website four hours after it started.
Other than the scale, which was already among the largest DDoS attacks
we've seen, there was nothing particularly unusual about the attack to this
point. Then the attackers changed their tactics. Rather than attacking our
customers directly, they started going after the network providers
CloudFlare uses for bandwidth. More on that in a second; first, a bit about
how the Internet works.
Peering on the Internet
The "inter" in Internet refers to the fact that it is a collection of
independent networks connected together. CloudFlare runs a network, Google
runs a network, and bandwidth providers like Level3, AT&T, and Cogent run
networks. These networks then interconnect through what are known as
peering relationships.
When you surf the web, your browser sends and receives packets of
information. These packets are sent from one network to another. You can
see this by running a traceroute. Here's one from Stanford University's
network to the New York Times' website (nytimes.com):
1 rtr-servcore1-serv01-webserv.slac.stanford.edu (134.79.197.130) 0.572 ms
2 rtr-core1-p2p-servcore1.slac.stanford.edu (134.79.252.166) 0.796 ms
3 rtr-border1-p2p-core1.slac.stanford.edu (134.79.252.133) 0.536 ms
4 slac-mr2-p2p-rtr-border1.slac.stanford.edu (192.68.191.245) 25.636 ms
5 sunncr5-ip-a-slacmr2.es.net (134.55.36.21) 3.306 ms
6 eqxsjrt1-te-sunncr5.es.net (134.55.38.146) 1.384 ms
7 xe-0-3-0.cr1.sjc2.us.above.net (64.125.24.1) 2.722 ms
8 xe-0-1-0.mpr1.sea1.us.above.net (64.125.31.17) 20.812 ms
9 209.249.122.125 (209.249.122.125) 21.385 ms
There are three networks in the above traceroute: stanford.edu, es.net, and
above.net. The request starts at Stanford. Between lines 4 and 5 it passes
from Stanford's network to their peer es.net. Then, between lines 6 and 7,
it passes from es.net to above.net, which appears to provide hosting for
the New York Times. This means Stanford has a peering relationship with
ES.net. ES.net has a peering relationship with Above.net. And Above.net
provides connectivity for the New York Times.
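To make those boundary crossings concrete, here is a minimal Python sketch
(illustrative only) that groups the hops from the traceroute above by owning
network, approximating each network by the last two DNS labels of the hop's
hostname:

    # Group traceroute hops by owning network, approximated here by the
    # last two DNS labels of each hostname (hop 9 returned no hostname).
    hops = [
        "rtr-servcore1-serv01-webserv.slac.stanford.edu",
        "rtr-core1-p2p-servcore1.slac.stanford.edu",
        "rtr-border1-p2p-core1.slac.stanford.edu",
        "slac-mr2-p2p-rtr-border1.slac.stanford.edu",
        "sunncr5-ip-a-slacmr2.es.net",
        "eqxsjrt1-te-sunncr5.es.net",
        "xe-0-3-0.cr1.sjc2.us.above.net",
        "xe-0-1-0.mpr1.sea1.us.above.net",
    ]

    def network(hostname):
        """Approximate the owning network by the last two DNS labels."""
        return ".".join(hostname.split(".")[-2:])

    current = None
    for i, hop in enumerate(hops, start=1):
        if network(hop) != current:
            current = network(hop)
            print("hop %d: entering %s" % (i, current))

Run as-is, this prints the same three transitions described above: the path
starts in stanford.edu, enters es.net at hop 5, and enters above.net at hop 7.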
CloudFlare connects to a large number of networks. You can get a sense of
some, although not all, of the networks we peer with through a tool like
Hurricane Electric's BGP looking glass. CloudFlare connects to peers in two
ways. First, we connect directly to certain large carriers and other
networks to which we send a large amount of traffic. In this case, we
connect our router directly to the router at the border of the other
network, usually with a piece of fiber optic cable. Second, we connect to
what are known as Internet Exchanges, IXs for short, where a number of
networks meet in a central point.
Most major cities have an IX. The model for IXs differs in different
parts of the world. Europe runs some of the most robust IXs, and CloudFlare
connects to several of them including LINX (the London Internet Exchange),
AMS-IX (the Amsterdam Internet Exchange), and DE-CIX (the Frankfurt
Internet Exchange), among others. The major networks that make up the
Internet -- Google, Facebook, Yahoo, etc. -- connect to these same exchanges
to pass traffic between each other efficiently. When the Spamhaus attacker
realized he couldn't go after CloudFlare directly, he began targeting our
upstream peers and exchanges.
Headwaters
Once the attackers realized they couldn't knock CloudFlare itself offline
even with more than 100Gbps of DDoS traffic, they went after our direct
peers. In this case, they attacked the providers from whom CloudFlare buys
bandwidth. We primarily contract with what are known as Tier 2 providers
for CloudFlare's paid bandwidth. These companies peer with other providers
and also buy bandwidth from so-called Tier 1 providers.
There are approximately a dozen Tier 1 providers on the Internet. The
nature of these providers is that they don't buy bandwidth from anyone.
Instead, they engage in what is known as settlement-free peering with the
other Tier 1 providers. Tier 2 providers interconnect with each other and
then buy bandwidth from the Tier 1 providers in order to ensure they can
connect to every other point on the Internet. At the core of the Internet,
if all else fails, it is these Tier 1 providers that ensure that every
network is connected to every other network. If one of them fails, it's a
big deal.
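A toy model makes the hierarchy concrete. In the sketch below (all network
names invented), two Tier 2 networks have no direct link, but the
settlement-free mesh between Tier 1s still connects them:

    # Toy peering graph: edges mean "can exchange traffic directly".
    # Tier 1s peer settlement-free with each other; Tier 2s buy transit
    # from Tier 1s. All names are invented for illustration.
    links = {
        "tier1-A": {"tier1-B", "tier2-X"},
        "tier1-B": {"tier1-A", "tier2-Y"},
        "tier2-X": {"tier1-A"},
        "tier2-Y": {"tier1-B"},
    }

    def reachable(src, dst):
        """Breadth-first search across peering and transit links."""
        seen, queue = {src}, [src]
        while queue:
            node = queue.pop(0)
            if node == dst:
                return True
            for peer in links[node] - seen:
                seen.add(peer)
                queue.append(peer)
        return False

    print(reachable("tier2-X", "tier2-Y"))  # True, via the Tier 1 mesh

Remove the tier1-A to tier1-B edge and the two Tier 2s can no longer reach
each other, which is why a Tier 1 failure is a big deal.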
Anycast means that if the attackers hit the last step in the traceroute,
their attack would be spread across CloudFlare's worldwide network, so
instead they attacked the second-to-last step, which concentrated the attack
on a single point. This wouldn't cause a network-wide outage, but it could
potentially cause regional problems.
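As a rough back-of-the-envelope (the data-center count below is a
placeholder, not CloudFlare's actual footprint), the difference between the
two targets looks like this:

    # Anycast dilutes an attack across every location announcing the
    # prefix; hitting one upstream hop concentrates it at a single point.
    attack_gbps = 120   # peak attack size reported above
    datacenters = 20    # hypothetical number of Anycast locations

    print("Anycast target: ~%.0f Gbps per location"
          % (attack_gbps / float(datacenters)))
    print("One upstream hop: %d Gbps at a single point" % attack_gbps)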
We carefully select our bandwidth providers to ensure they have the ability
to deal with attacks like this. Our direct peers quickly filtered attack
traffic at their edge. This pushed the attack upstream to their direct
peers, largely Tier 1 networks. Tier 1 networks don't buy bandwidth from
anyone, so the majority of the weight of the attack ended up being carried
by them. While we don't have direct visibility into the traffic loads they
saw, we have been told by one major Tier 1 provider that they saw more than
300Gbps of attack traffic related to this attack. That would make this
attack one of the largest ever reported.
The challenge with attacks at this scale is they risk overwhelming the
systems that link together the Internet itself. The largest routers that
you can buy have, at most, 100Gbps ports. It is possible to bond more than
one of these ports together to create capacity greater than 100Gbps;
however, at some point, there are limits to how much these routers can
handle. If that limit is exceeded, the network becomes congested and
slows down.
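A sketch of that ceiling, with invented numbers: a bundle of four 100Gbps
ports tops out at 400Gbps, and because routers typically pin each flow to
one member port by hashing, a single member can congest before the bundle is
nominally full:

    # Link aggregation: N bonded ports give N x 100Gbps nominal capacity,
    # but each flow is hashed onto one member, so load can be uneven.
    # All figures are illustrative, not measured.
    import random
    random.seed(7)

    members, port_gbps = 4, 100
    flows = [random.expovariate(1 / 3.0) for _ in range(100)]  # Gbps each

    load = [0.0] * members
    for flow in flows:
        load[random.randrange(members)] += flow  # stand-in for a flow hash

    print("bundle ceiling: %d Gbps, offered: %.0f Gbps"
          % (members * port_gbps, sum(flows)))
    for i, gbps in enumerate(load):
        print("member %d: %.0f Gbps%s"
              % (i, gbps, " (congested)" if gbps > port_gbps else ""))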
Over the last few days, as these attacks have increased, we've seen
congestion across several major Tier 1s, primarily in Europe where most of
the attacks were concentrated, that would have affected hundreds of
millions of people even as they surfed sites unrelated to Spamhaus or
CloudFlare. If the Internet felt a bit more sluggish for you over the last
few days in Europe, this may be part of the reason why.
Attacks on the IXs
In addition to CloudFlare's direct peers, we also connect with other
networks over the so-called Internet Exchanges (IXs). These IXs are, at
their most basic level, switches into which multiple networks connect and
can then exchange traffic. In Europe, these IXs are run as non-profit
entities and are considered critical infrastructure. They interconnect
hundreds of the world's largest networks including CloudFlare, Google,
Facebook, and just about every other major Internet company.
Beyond attacking CloudFlare's direct peers, the attackers also attacked the
core IX infrastructure on the London Internet Exchange (LINX), the
Amsterdam Internet Exchange (AMS-IX), the Frankfurt Internet Exchange (DE-
CIX), and the Hong Kong Internet Exchange (HKIX). From our perspective, the
attacks had the largest effect on LINX, impacting both the exchange itself
and the systems LINX uses to monitor it, as visible in the drop in traffic
recorded by those monitoring systems. (Corrected: see below for original
phrasing.)
The congestion impacted many of the networks on the IXs, including
CloudFlare's. As problems were detected on the IX, we would route traffic
around them. However, several London-based CloudFlare users reported
intermittent issues over the last several days. This is the root cause of
those problems.
The attacks also exposed some vulnerabilities in the architecture of some
IXs. We, along with many other network security experts, worked with the
team at LINX to help them better secure their infrastructure. In doing so,
we developed a list of best practices to make any IX less vulnerable to
attacks.
Two specific suggestions to limit attacks like this involve making it more
difficult to attack the IP addresses that members of the IX use to
exchange traffic with each other. We are working with IXs to ensure that:
1) these IP addresses are not announced as routable across the public
Internet; and 2) packets destined for these IP addresses are only permitted
from other IX IP addresses. We've been very impressed with the
team at LINX and how quickly they've worked to implement these changes and
add additional security to their IX and are hopeful other IXs will quickly
follow their lead.
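As a sketch of the second rule, the filter logic is simple enough to express
in a few lines of Python. The prefix below is a documentation range standing
in for a real exchange's peering LAN; in practice this would be an ACL on
the exchange's routers, not application code:

    # Suggestion 2 as a predicate: packets addressed to the IX peering LAN
    # are permitted only if they also come from the peering LAN.
    from ipaddress import ip_address, ip_network

    IX_PEERING_LAN = ip_network("203.0.113.0/24")  # hypothetical IX LAN

    def permit(src, dst):
        """Drop traffic aimed at the exchange fabric from outside it."""
        if ip_address(dst) in IX_PEERING_LAN:
            return ip_address(src) in IX_PEERING_LAN
        return True  # traffic not destined for the IX LAN is unaffected

    assert permit("203.0.113.7", "203.0.113.9")       # peer to peer: allowed
    assert not permit("198.51.100.1", "203.0.113.9")  # outside DDoS: dropped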
The Full Impact of the Open Recursor Problem
At the bottom of this attack we once again find the problem of open DNS
recursors. The attackers were able to generate more than 300Gbps of traffic,
likely with a network of their own that had access to only 1/100th of that
amount of traffic themselves. We've written about how these misconfigured
DNS recursors are a bomb waiting to go off, one that threatens the
stability of the Internet itself. We've now seen an attack that begins to
illustrate the full extent of the problem.
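The arithmetic behind that 1/100th figure is straightforward amplification:
a small spoofed UDP query elicits a much larger response, all of it aimed at
the forged source address. The packet sizes below are illustrative, not
measurements from this attack:

    # DNS reflection/amplification arithmetic. A spoofed ~60-byte query
    # can trigger a multi-kilobyte response (for example, an ANY answer
    # carrying DNSSEC records), delivered to the victim, not the sender.
    query_bytes = 60        # illustrative query size
    response_bytes = 6000   # illustrative response size

    amplification = response_bytes / float(query_bytes)  # 100x here
    attacker_gbps = 3.0                                   # attackers' own capacity
    print("amplification: ~%.0fx" % amplification)
    print("%.0f Gbps of spoofed queries -> ~%.0f Gbps at the victim"
          % (attacker_gbps, attacker_gbps * amplification))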
While lists of open recursors have been passed around on network security
lists for the last few years, on Monday the full extent of the problem was,
for the first time, made public. The Open Resolver Project made available
the full list of the 21.7 million open resolvers online in an effort to
shut them down.
We'd debated doing the same thing ourselves for some time but worried about
the collateral damage of what would happen if such a list fell into the
hands of the bad guys. The last five days have made clear that the bad guys
have the list of open resolvers and they are getting increasingly brazen in
the attacks they are willing to launch. We are in full support of the Open
Resolver Project and believe it is incumbent on all network providers to
work with their customers to close any open resolvers running on their
networks.
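For providers auditing their own address space, a minimal open-recursor
probe needs nothing beyond the Python standard library. This is a hedged
sketch, with an arbitrary probe name and timeout; point it only at hosts you
are responsible for:

    # Probe whether a DNS server answers recursive queries from outside
    # (an "open recursor"). Only test addresses you operate.
    import socket
    import struct

    def is_open_recursor(server_ip, name="example.com"):
        # Minimal DNS query: 12-byte header with the RD bit set, then
        # one question for an A record.
        header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
        qname = b"".join(
            bytes([len(p)]) + p.encode() for p in name.split(".")
        ) + b"\x00"
        question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(3)
        try:
            sock.sendto(header + question, (server_ip, 53))
            data, _ = sock.recvfrom(512)
            flags = struct.unpack(">H", data[2:4])[0]
            # RA bit set plus RCODE 0 means it recursed for an outsider.
            return bool(flags & 0x0080) and (flags & 0x000F) == 0
        except OSError:  # timeout or unreachable: not open (or down)
            return False
        finally:
            sock.close()

    print(is_open_recursor("192.0.2.1"))  # documentation IP; replace as needed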
Unlike traditional botnets, which could only generate limited traffic
because of the modest Internet connections of the home PCs they typically
run on, these open resolvers typically run on big servers with fat pipes.
They are like bazookas, and the events of the last week have shown
the damage they can cause. What's troubling is that, compared with what is
possible, this attack may prove to be relatively modest.
As someone in charge of DDoS mitigation at one of the Internet giants
emailed me this weekend: "I've often said we don't have to prepare for the
largest-possible attack, we just have to prepare for the largest attack the
Internet can send without causing massive collateral damage to others. It
looks like you've reached that point, so... congratulations!"
At CloudFlare one of our goals is to make DDoS something you only read
about in the history books. We're proud of how our network held up under
such a massive attack and are working with our peers and partners to ensure
that the Internet overall can stand up to the threats it faces.
Correction: The original sentence about the impact on LINX was "From our
perspective, the attacks had the largest effect on LINX which for a little
over an hour on March 23 saw the infrastructure serving more than half of
the usual 1.5Tbps of peak traffic fail." That was not well phrased, and has
been edited, with notation in place.
Posted by Matthew Prince
--
Cheers,
Stephen