[LINK] The DDoS That Almost Broke the Internet
stephen at melbpc.org.au
Thu Mar 28 12:27:12 AEDT 2013
The DDoS That Almost Broke the Internet
http://blog.cloudflare.com March 27th, 2013
The New York Times this morning published a story about the Spamhaus DDoS
attack and how CloudFlare helped mitigate it and keep the site online.
The Times calls the attack the largest known DDoS attack ever on the
Internet.
We wrote about the attack last week. <http://blog.cloudflare.com/the-ddos-
that-knocked-spamhaus-offline-and-ho> "The DDoS That Knocked Spamhaus
Offline (And How We Mitigated It)"
At the time, it was a large attack, sending 85Gbps of traffic. Since then,
the attack got much worse. Here are some of the technical details of what
we've seen.
Growth Spurt
On Monday, March 18, 2013, Spamhaus contacted CloudFlare regarding an attack
they were seeing against their website spamhaus.org. They signed up for
CloudFlare and we quickly mitigated the attack. The attack was initially
approximately 10Gbps, generated largely from open DNS recursors. On March
19, the attack increased in size, peaking at approximately 90Gbps. It
fluctuated between 90Gbps and 30Gbps until 01:15 UTC on March 21.
The attackers were quiet for a day. Then, on March 22 at 18:00 UTC, the
attack resumed, peaking at 120Gbps of traffic hitting our network. As we
discussed in the previous blog post, CloudFlare uses Anycast technology
which spreads the load of a distributed attack across all our data centers.
This allowed us to mitigate the attack without it affecting Spamhaus or any
of our other customers. The attackers ceased their attack against the
Spamhaus website four hours after it started.
Other than the scale, which was already among the largest DDoS attacks
we've seen, there was nothing particularly unusual about the attack to this
point. Then the attackers changed their tactics. Rather than attacking our
customers directly, they started going after the network providers
CloudFlare uses for bandwidth. More on that in a second; first, a bit about
how the Internet works.
Peering on the Internet
The "inter" in Internet refers to the fact that it is a collection of
independent networks connected together. CloudFlare runs a network, Google
runs a network, and bandwidth providers like Level3, AT&T, and Cogent run
networks. These networks then interconnect through what are known as
peering relationships.
When you surf the web, your browser sends and receives packets of
information. These packets are sent from one network to another. You can
see this by running a traceroute. Here's one from Stanford University's
network to the New York Times' website (nytimes.com):
1 rtr-servcore1-serv01-webserv.slac.stanford.edu (134.79.197.130) 0.572 ms
2 rtr-core1-p2p-servcore1.slac.stanford.edu (134.79.252.166) 0.796 ms
3 rtr-border1-p2p-core1.slac.stanford.edu (134.79.252.133) 0.536 ms
4 slac-mr2-p2p-rtr-border1.slac.stanford.edu (192.68.191.245) 25.636 ms
5 sunncr5-ip-a-slacmr2.es.net (134.55.36.21) 3.306 ms
6 eqxsjrt1-te-sunncr5.es.net (134.55.38.146) 1.384 ms
7 xe-0-3-0.cr1.sjc2.us.above.net (64.125.24.1) 2.722 ms
8 xe-0-1-0.mpr1.sea1.us.above.net (64.125.31.17) 20.812 ms
9 209.249.122.125 (209.249.122.125) 21.385 ms
There are three networks in the above traceroute: stanford.edu, es.net, and
above.net. The request starts at Stanford. Between lines 4 and 5 it passes
from Stanford's network to their peer es.net. Then, between lines 6 and 7,
it passes from es.net to above.net, which appears to provide hosting for
the New York Times. This means Stanford has a peering relationship with
ES.net. ES.net has a peering relationship with Above.net. And Above.net
provides connectivity for the New York Times.
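To make those boundary crossings concrete, here is a minimal Python sketch
(illustrative only) that groups the hops from the traceroute above by owning
network, approximating each network by the last two DNS labels of the hop's
hostname:

    # Group traceroute hops by owning network, approximated here by the
    # last two DNS labels of each hostname (hop 9 returned no hostname).
    hops = [
        "rtr-servcore1-serv01-webserv.slac.stanford.edu",
        "rtr-core1-p2p-servcore1.slac.stanford.edu",
        "rtr-border1-p2p-core1.slac.stanford.edu",
        "slac-mr2-p2p-rtr-border1.slac.stanford.edu",
        "sunncr5-ip-a-slacmr2.es.net",
        "eqxsjrt1-te-sunncr5.es.net",
        "xe-0-3-0.cr1.sjc2.us.above.net",
        "xe-0-1-0.mpr1.sea1.us.above.net",
    ]

    def network(hostname):
        """Approximate the owning network by the last two DNS labels."""
        return ".".join(hostname.split(".")[-2:])

    current = None
    for i, hop in enumerate(hops, start=1):
        if network(hop) != current:
            current = network(hop)
            print("hop %d: entering %s" % (i, current))

Run as-is, this prints the same three transitions described above: the path
starts in stanford.edu, enters es.net at hop 5, and enters above.net at hop 7.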
CloudFlare connects to a large number of networks. You can get a sense of
some, although not all, of the networks we peer with through a tool like
Hurricane Electric's BGP looking glass. CloudFlare connects to peers in two
ways. First, we connect directly to certain large carriers and other
networks to which we send a large amount of traffic. In this case, we
connect our router directly to the router at the border of the other
network, usually with a piece of fiber optic cable. Second, we connect to
what are known as Internet Exchanges, IXs for short, where a number of
networks meet in a central point.
Most major cities have an IX. The model for IXs differs in different
parts of the world. Europe runs some of the most robust IXs, and CloudFlare
connects to several of them including LINX (the London Internet Exchange),
AMS-IX (the Amsterdam Internet Exchange), and DE-CIX (the Frankfurt
Internet Exchange), among others. The major networks that make up the
Internet -- Google, Facebook, Yahoo, etc. -- connect to these same exchanges
to pass traffic between each other efficiently. When the Spamhaus attacker
realized he couldn't go after CloudFlare directly, he began targeting our
upstream peers and exchanges.
Headwaters
Once the attackers realized they couldn't knock CloudFlare itself offline
even with more than 100Gbps of DDoS traffic, they went after our direct
peers. In this case, they attacked the providers from whom CloudFlare buys
bandwidth. We primarily contract with what are known as Tier 2 providers
for CloudFlare's paid bandwidth. These companies peer with other providers
and also buy bandwidth from so-called Tier 1 providers.
There are approximately a dozen Tier 1 providers on the Internet. The
nature of these providers is that they don't buy bandwidth from anyone.
Instead, they engage in what is known as settlement-free peering with the
other Tier 1 providers. Tier 2 providers interconnect with each other and
then buy bandwidth from the Tier 1 providers in order to ensure they can
connect to every other point on the Internet. At the core of the Internet,
if all else fails, it is these Tier 1 providers that ensure that every
network is connected to every other network. If one of them fails, it's a
big deal.
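A toy model makes the hierarchy concrete. In the sketch below (all network
names invented), two Tier 2 networks have no direct link, but the
settlement-free mesh between Tier 1s still connects them:

    # Toy peering graph: edges mean "can exchange traffic directly".
    # Tier 1s peer settlement-free with each other; Tier 2s buy transit
    # from Tier 1s. All names are invented for illustration.
    links = {
        "tier1-A": {"tier1-B", "tier2-X"},
        "tier1-B": {"tier1-A", "tier2-Y"},
        "tier2-X": {"tier1-A"},
        "tier2-Y": {"tier1-B"},
    }

    def reachable(src, dst):
        """Breadth-first search across peering and transit links."""
        seen, queue = {src}, [src]
        while queue:
            node = queue.pop(0)
            if node == dst:
                return True
            for peer in links[node] - seen:
                seen.add(peer)
                queue.append(peer)
        return False

    print(reachable("tier2-X", "tier2-Y"))  # True, via the Tier 1 mesh

Remove the tier1-A to tier1-B edge and the two Tier 2s can no longer reach
each other, which is why a Tier 1 failure is a big deal.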
Anycast means that if the attackers hit the last step in the traceroute,
their attack would be spread across CloudFlare's worldwide network, so
instead they attacked the second-to-last step, which concentrated the attack
on a single point. This wouldn't cause a network-wide outage, but it could
potentially cause regional problems.
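As a rough back-of-the-envelope (the data-center count below is a
placeholder, not CloudFlare's actual footprint), the difference between the
two targets looks like this:

    # Anycast dilutes an attack across every location announcing the
    # prefix; hitting one upstream hop concentrates it at a single point.
    attack_gbps = 120   # peak attack size reported above
    datacenters = 20    # hypothetical number of Anycast locations

    print("Anycast target: ~%.0f Gbps per location"
          % (attack_gbps / float(datacenters)))
    print("One upstream hop: %d Gbps at a single point" % attack_gbps)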
We carefully select our bandwidth providers to ensure they have the ability
to deal with attacks like this. Our direct peers quickly filtered attack
traffic at their edge. This pushed the attack upstream to their direct
peers, largely Tier 1 networks. Tier 1 networks don't buy bandwidth from
anyone, so the majority of the weight of the attack ended up being carried
by them. While we don't have direct visibility into the traffic loads they
saw, we have been told by one major Tier 1 provider that they saw more than
300Gbps of attack traffic related to this attack. That would make this
attack one of the largest ever reported.
The challenge with attacks at this scale is they risk overwhelming the
systems that link together the Internet itself. The largest routers that
you can buy have, at most, 100Gbps ports. It is possible to bond more than
one of these ports together to create capacity greater than 100Gbps;
however, at some point, there are limits to how much these routers can
handle. If that limit is exceeded, the network becomes congested and
slows down.
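A sketch of that ceiling, with invented numbers: a bundle of four 100Gbps
ports tops out at 400Gbps, and because routers typically pin each flow to
one member port by hashing, a single member can congest before the bundle is
nominally full:

    # Link aggregation: N bonded ports give N x 100Gbps nominal capacity,
    # but each flow is hashed onto one member, so load can be uneven.
    # All figures are illustrative, not measured.
    import random
    random.seed(7)

    members, port_gbps = 4, 100
    flows = [random.expovariate(1 / 3.0) for _ in range(100)]  # Gbps each

    load = [0.0] * members
    for flow in flows:
        load[random.randrange(members)] += flow  # stand-in for a flow hash

    print("bundle ceiling: %d Gbps, offered: %.0f Gbps"
          % (members * port_gbps, sum(flows)))
    for i, gbps in enumerate(load):
        print("member %d: %.0f Gbps%s"
              % (i, gbps, " (congested)" if gbps > port_gbps else ""))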
Over the last few days, as these attacks have increased, we've seen
congestion across several major Tier 1s, primarily in Europe where most of
the attacks were concentrated, that would have affected hundreds of
millions of people even as they surfed sites unrelated to Spamhaus or
CloudFlare. If the Internet felt a bit more sluggish for you over the last
few days in Europe, this may be part of the reason why.
Attacks on the IXs
In addition to CloudFlare's direct peers, we also connect with other
networks over the so-called Internet Exchanges (IXs). These IXs are, at
their most basic level, switches into which multiple networks connect and
can then exchange traffic. In Europe, these IXs are run as non-profit
entities and are considered critical infrastructure. They interconnect
hundreds of the world's largest networks including CloudFlare, Google,
Facebook, and just about every other major Internet company.
Beyond attacking CloudFlare's direct peers, the attackers also attacked the
core IX infrastructure on the London Internet Exchange (LINX), the
Amsterdam Internet Exchange (AMS-IX), the Frankfurt Internet Exchange (DE-
CIX), and the Hong Kong Internet Exchange (HKIX). From our perspective, the
attacks had the largest effect on LINX, impacting both the exchange itself
and the systems LINX uses to monitor it, as visible in the drop in traffic
recorded by those monitoring systems. (Corrected: see below for original
phrasing.)
The congestion impacted many of the networks on the IXs, including
CloudFlare's. As problems were detected on the IX, we would route traffic
around them. However, several London-based CloudFlare users reported
intermittent issues over the last several days. This is the root cause of
those problems.
The attacks also exposed some vulnerabilities in the architecture of some
IXs. We, along with many other network security experts, worked with the
team at LINX to help them better secure their infrastructure. In doing so,
we developed a list of best practices to make any IX less vulnerable to
attacks.
Two specific suggestions to limit attacks like this involve making it more
difficult to attack the IP addresses that members of the IX use to
exchange traffic with each other. We are working with IXs to ensure that:
1) these IP addresses are not announced as routable across the public
Internet; and 2) packets destined for these IP addresses are only permitted
from other IX IP addresses. We've been very impressed with the
team at LINX and how quickly they've worked to implement these changes and
add additional security to their IX and are hopeful other IXs will quickly
follow their lead.
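As a sketch of the second rule, the filter logic is simple enough to express
in a few lines of Python. The prefix below is a documentation range standing
in for a real exchange's peering LAN; in practice this would be an ACL on
the exchange's routers, not application code:

    # Suggestion 2 as a predicate: packets addressed to the IX peering LAN
    # are permitted only if they also come from the peering LAN.
    from ipaddress import ip_address, ip_network

    IX_PEERING_LAN = ip_network("203.0.113.0/24")  # hypothetical IX LAN

    def permit(src, dst):
        """Drop traffic aimed at the exchange fabric from outside it."""
        if ip_address(dst) in IX_PEERING_LAN:
            return ip_address(src) in IX_PEERING_LAN
        return True  # traffic not destined for the IX LAN is unaffected

    assert permit("203.0.113.7", "203.0.113.9")       # peer to peer: allowed
    assert not permit("198.51.100.1", "203.0.113.9")  # outside DDoS: dropped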
The Full Impact of the Open Recursor Problem
At the bottom of this attack we once again find the problem of open DNS
recursors. The attackers were able to generate more than 300Gbps of traffic,
likely with a network of their own that had access to only 1/100th of that
amount of traffic themselves. We've written about how these misconfigured
DNS recursors are a bomb waiting to go off, one that threatens the
stability of the Internet itself. We've now seen an attack that begins to
illustrate the full extent of the problem.
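The arithmetic behind that 1/100th figure is straightforward amplification:
a small spoofed UDP query elicits a much larger response, all of it aimed at
the forged source address. The packet sizes below are illustrative, not
measurements from this attack:

    # DNS reflection/amplification arithmetic. A spoofed ~60-byte query
    # can trigger a multi-kilobyte response (for example, an ANY answer
    # carrying DNSSEC records), delivered to the victim, not the sender.
    query_bytes = 60        # illustrative query size
    response_bytes = 6000   # illustrative response size

    amplification = response_bytes / float(query_bytes)  # 100x here
    attacker_gbps = 3.0                                   # attackers' own capacity
    print("amplification: ~%.0fx" % amplification)
    print("%.0f Gbps of spoofed queries -> ~%.0f Gbps at the victim"
          % (attacker_gbps, attacker_gbps * amplification))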
While lists of open recursors have been passed around on network security
lists for the last few years, on Monday the full extent of the problem was,
for the first time, made public. The Open Resolver Project made available
the full list of the 21.7 million open resolvers online in an effort to
shut them down.
We'd debated doing the same thing ourselves for some time but worried about
the collateral damage of what would happen if such a list fell into the
hands of the bad guys. The last five days have made clear that the bad guys
have the list of open resolvers and they are getting increasingly brazen in
the attacks they are willing to launch. We are in full support of the Open
Resolver Project and believe it is incumbent on all network providers to
work with their customers to close any open resolvers running on their
networks.
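For providers auditing their own address space, a minimal open-recursor
probe needs nothing beyond the Python standard library. This is a hedged
sketch, with an arbitrary probe name and timeout; point it only at hosts you
are responsible for:

    # Probe whether a DNS server answers recursive queries from outside
    # (an "open recursor"). Only test addresses you operate.
    import socket
    import struct

    def is_open_recursor(server_ip, name="example.com"):
        # Minimal DNS query: 12-byte header with the RD bit set, then
        # one question for an A record.
        header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
        qname = b"".join(
            bytes([len(p)]) + p.encode() for p in name.split(".")
        ) + b"\x00"
        question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(3)
        try:
            sock.sendto(header + question, (server_ip, 53))
            data, _ = sock.recvfrom(512)
            flags = struct.unpack(">H", data[2:4])[0]
            # RA bit set plus RCODE 0 means it recursed for an outsider.
            return bool(flags & 0x0080) and (flags & 0x000F) == 0
        except OSError:  # timeout or unreachable: not open (or down)
            return False
        finally:
            sock.close()

    print(is_open_recursor("192.0.2.1"))  # documentation IP; replace as needed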
Unlike traditional botnets, which could only generate limited traffic
because of the modest Internet connections of the home PCs they typically
run on, these open resolvers typically run on big servers with fat pipes.
They are like bazookas, and the events of the last week have shown
the damage they can cause. What's troubling is that, compared with what is
possible, this attack may prove to be relatively modest.
As someone in charge of DDoS mitigation at one of the Internet giants
emailed me this weekend: "I've often said we don't have to prepare for the
largest-possible attack, we just have to prepare for the largest attack the
Internet can send without causing massive collateral damage to others. It
looks like you've reached that point, so... congratulations!"
At CloudFlare one of our goals is to make DDoS something you only read
about in the history books. We're proud of how our network held up under
such a massive attack and are working with our peers and partners to ensure
that the Internet overall can stand up to the threats it faces.
Correction: The original sentence about the impact on LINX was "From our
perspective, the attacks had the largest effect on LINX which for a little
over an hour on March 23 saw the infrastructure serving more than half of
the usual 1.5Tbps of peak traffic fail." That was not well phrased, and has
been edited, with notation in place.
Posted by Matthew Prince
--
Cheers,
Stephen