[dns-operations] omnibus reply (Re: solutions for DDoS mitigation of DNS)

Thu Apr 2 20:03:41 UTC 2020

there has been quite a bit of factual confusion on this thread while i slept; 
so much so that i can't really figure out where to chime in most usefully. so 
i'll answer three questions which seem most pertinent, choosing the best 
example of each question from the thread before me.

---

first: On Thursday, 2 April 2020 12:54:41 UTC Tessa Plum wrote:
> On 2020/4/2 5:39 PM, Ray Bellis wrote:
> > If it's an authoritative server, turn on Response Rate Limiting (RRL) if
> > it's BIND, or the equivalent feature if it isn't.
> 
> Yes they are authoritative servers.
> Does RRL work based on IP addr? but the requesting IP seems spoofed.

the authoritative reference to all things DNS RRL related is here:

http://www.redbarn.org/dns/ratelimits

which refers to this document:

http://family.redbarn.org/~vixie/isc-tn-2012-1.txt

which answers your question as follows:

> ISC-TN-2012-1-Draft1   DNS Response Rate Limiting             April-2012
> 
>    3 - Responder Behaviour
>    
>    3.1. When generating a response, a server will take the requestor's IP
>    address and mask it according to either IPV4-PREFIX-LENGTH or
>    IPV6-PREFIX-LENGTH, and then impute a domain name which is either a
>    wildcard name (if a wildcard match occurred) or the zone name (if no
>    match occurred) or the query name, and a boolean error indicator (was
>    the response code REFUSED, FORMERR or SERVFAIL, or was it not?), and use
>    this tuple <mask(IP), imputed(NAME), errorstatus> to select a state
>    blob, creating this if necessary.
>    
>    3.2. If the selected state blob indicates that this response has been
>    sent too often to requestors on this network, then consider whether to
>    send a truncated response, or a leaked response, or no response. In any
>    case increment a counter to indicate that the response has been
>    considered.
>    
>    3.3. When a state blob's age goes over WINDOW, and its counter has not
>    been incremented within WINDOW, then discard the state blob.
>    
>    3.4. In the event that the creation of a new state blob would cause the
>    table to exceed MAX-TABLE-SIZE, the least recently used state blob
>    should be discarded.
>    
>    3.5. Noting: Conceptually speaking, a state blob is either filling,
>    full, or draining.  To be filling means that the rate limit has not been
>    exceeded. To be full means that the rate limit has been exceeded. To be
>    draining means that the rate limit was once exceeded and the rate has
>    not yet returned to zero.

the document is short, and worthy of reading or re-reading.
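
for readers who think better in code, sections 3.1 through 3.4 quoted above can 
be sketched roughly as follows. this is a minimal illustration, not the 
accounting used by BIND or any other implementation: the class name, the 
parameter values, and the token-bucket style balance are all assumptions made 
for the example, and the caller is assumed to have already imputed the name 
(wildcard, zone, or query name).

```python
import ipaddress
from collections import OrderedDict

# Hypothetical parameter values, chosen only for illustration.
IPV4_PREFIX_LENGTH = 24
IPV6_PREFIX_LENGTH = 56
RESPONSES_PER_SECOND = 5
WINDOW = 15.0            # seconds
MAX_TABLE_SIZE = 100_000
ERROR_RCODES = {"REFUSED", "FORMERR", "SERVFAIL"}


def mask_ip(addr):
    """3.1: mask the requestor's address to its IPv4 or IPv6 prefix."""
    ip = ipaddress.ip_address(addr)
    plen = IPV4_PREFIX_LENGTH if ip.version == 4 else IPV6_PREFIX_LENGTH
    return str(ipaddress.ip_network(f"{ip}/{plen}", strict=False))


class RateLimiter:
    def __init__(self):
        # key -> state blob; insertion order doubles as LRU order
        self.table = OrderedDict()

    def check(self, src, imputed_name, rcode, now):
        """Return "send" or "limit" for one response about to be sent."""
        # 3.1: the tuple <mask(IP), imputed(NAME), errorstatus>
        key = (mask_ip(src), imputed_name.lower(), rcode in ERROR_RCODES)
        blob = self.table.get(key)
        # 3.3: a blob not touched within WINDOW is discarded (lazily here)
        if blob is not None and now - blob["last"] > WINDOW:
            del self.table[key]
            blob = None
        if blob is None:
            # 3.4: evict the least recently used blob when the table is full
            if len(self.table) >= MAX_TABLE_SIZE:
                self.table.popitem(last=False)
            blob = {"balance": float(RESPONSES_PER_SECOND), "last": now}
            self.table[key] = blob
        # Assumed accounting: credit refills per elapsed second, capped at
        # one second's worth, and every response costs one credit.
        elapsed = now - blob["last"]
        blob["balance"] = min(RESPONSES_PER_SECOND,
                              blob["balance"] + elapsed * RESPONSES_PER_SECOND)
        # 3.2: the response is counted in any case, limited or not
        blob["balance"] = max(blob["balance"] - 1,
                              -RESPONSES_PER_SECOND * WINDOW)
        blob["last"] = now
        self.table.move_to_end(key)
        return "send" if blob["balance"] >= 0 else "limit"
```

a result of "limit" is where 3.2 applies: the server then chooses between a 
truncated response, a leaked response, or no response. note that the key is 
the masked prefix, not the individual address, so every requestor in the same 
/24 shares one state blob per name and error status.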

---

second, On Thursday, 2 April 2020 10:22:21 UTC Jim Reid wrote:
> > On 2 Apr 2020, at 11:10, Davey Song <songlinjian at gmail.com> wrote:
> > I'm very confused why people on the list are suggesting RRL (even
> > BCP38) to the victim of a DoS attack. If I remember correctly, the goal
> > of both RRL and BCP38 is to reduce the chance of participating in the
> > attack as an innocent helper.
> RRL won’t help with the volume of incoming queries. It will however reduce
> the volume of outgoing responses which may well be DoS’ing another innocent
> victim.

this is true as far as it goes, but does not go far enough. some attacks are 
against distant victims, whose addresses are spoofed as the source-ip's of the 
queries. in that case the source-ip's fall in a narrow band, and DNS RRL will 
prevent an authority server from participating as an amplifier of that attack.

other attacks are against the authority server itself, and so the spoofed-
source IP addresses are somewhat irrelevant; they do not identify a victim, 
they are merely randomized in order to hide the identity of the attacker. in 
this case the attack is against the authority server's capacity, which can be 
seen as three critical resources: inbound network path, outbound network path, 
and server CPU.

if the attack is large enough to congest your inbound network path, then your 
only fix is to add more servers in other locations (having different inbound 
network paths.) you may also want to consider a service like "akamai 
cleanfeed" which can, with your cooperation, advertise via global BGP the 
address of your attacked servers, and route that traffic through a "scrubbing 
center". akamai has competitors in this arena, but they're the one i know 
well.

if the attack is not large enough to congest your inbound network path, then 
it may be possible to protect your outbound network path or your CPU using 
some kind of filtering. DNS RRL is an example of such filtering. by not 
answering predicted-to-be-spoofed queries, you save the CPU time used to 
assemble responses, and you save the outbound network capacity needed to 
transmit such responses.

some have asked, isn't this a trivial obstruction that a correctly functioning 
attacker can bypass with creative randomness in their spoofed-source IP 
address generator? and the answer is, "yes that's usually true". however, not 
all attackers function correctly in this regard, and of those who do, there is 
a maximum number of /24 (or /56) flow buckets they can use, which at 20 Gb/s 
in IPv4 requires reuse, which will lead DNS RRL to attenuate non-victim flows.
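
the reuse claim can be checked with back-of-envelope arithmetic. the sketch 
below assumes an 80-byte query packet on the wire, which is a hypothetical 
round number rather than a measurement, and it assumes the attacker spreads 
queries perfectly evenly over every possible /24, including unrouted space.

```python
def ipv4_bucket_reuse(attack_bps, query_bytes, prefix_len=24):
    """Return (bucket count, queries per second, queries per bucket per second)."""
    buckets = 2 ** prefix_len                 # number of distinct /24 prefixes
    qps = attack_bps / (query_bytes * 8)      # packets per second at this bit rate
    return buckets, qps, qps / buckets


# 20 Gb/s of 80-byte queries (the packet size is an assumption)
buckets, qps, per_bucket = ipv4_bucket_reuse(20e9, 80)
print(f"{buckets} buckets, {qps:.0f} queries/s, {per_bucket:.2f} per bucket per second")
```

under those assumptions there are more queries per second than there are /24 
buckets, so every bucket is hit more than once per second even with ideal 
randomization. smaller packets, or spoofing only routed address space, make 
the reuse heavier.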

in IPv6 the number of possible buckets is far greater than DNS RRL's state 
capacity and so in that case you'll need something like strict-mode uRPF, 
which requires a full routing table (no default route) so that you can drop 
all packets whose source address isn't covered by an explicit BGP path. this 
way lies madness, and if you're up against a correctly functioning attacker, 
you'll lose more often than you'll win, and you'll need a "scrubbing service" 
that can take over the advertisement of your network's reachability during 
times of DDoS, and which does not require you to host your servers elsewhere.
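
as an illustration of the source check described above, here is a minimal 
sketch of a strict-mode uRPF decision. the routing table contents and 
interface names are hypothetical, and a real router does this lookup in its 
forwarding plane, not in python. the per-interface comparison reflects the 
usual meaning of strict mode and is an addition to the coverage test 
described in the text.

```python
import ipaddress

# Hypothetical routing table: prefix -> interface. There is no default route,
# so a source address not covered by any explicit prefix has no match at all.
ROUTES = {
    ipaddress.ip_network("2001:db8:a000::/36"): "eth0",
    ipaddress.ip_network("2001:db8:b000::/36"): "eth1",
}


def urpf_accept(src, in_iface, routes=ROUTES):
    """Accept a packet only if the best route back to src uses in_iface."""
    ip = ipaddress.ip_address(src)
    matches = [net for net in routes if net.version == ip.version and ip in net]
    if not matches:
        return False                      # no explicit path covers the source
    best = max(matches, key=lambda net: net.prefixlen)   # longest-prefix match
    return routes[best] == in_iface
```

with this table, a packet claiming source 2001:db8:a123::1 is accepted on 
eth0 and dropped on eth1, and any source outside both prefixes is dropped 
everywhere.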

---

third and last, On Thursday, 2 April 2020 11:37:36 UTC Klaus Darilion wrote:
> ... It is not that I
> argue against rate limiting, but that admins should be aware when it
> actually helps, and when not. ...
> 
> We also used rate limiting with dnsdist, but due to the mentioned
> problems we switched to high performance backends for the zones which
> are under constant attack.

there is never a time when DNS RRL won't help, but it may not be _enough_.

DNS RRL should be the default for all authority servers, subject to tuning, 
but never requiring knowledge or action by operators.

if you turn on DNS RRL on an authority server that you didn't think was being 
abused or attacked, you will see a drop in your egress traffic.

turn it on and keep it on. use the default recommended settings unless you're 
interested in operational research.

once that's been done, solve whatever problems you still have, along the lines 
i explained last night:

* subscribe to a "DDoS scrubbing service"

* add more network capacity

* use local anycast to increase the per-logical-server capacity

* add more secondary servers

open source DNS software and OSPF ECMP are adequate here, you do not need a 
commercial load balancer nor a commercial DNS appliance.
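
to show why ECMP spreads load without a load balancer, here is a rough sketch 
of per-flow next-hop selection. routers use their own vendor-specific hash 
functions; the sha256 hash, the function name, and the server names below are 
assumptions made purely for illustration.

```python
import hashlib


def ecmp_pick(src_ip, src_port, dst_ip, dst_port, proto, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow 5-tuple."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical servers all announcing the same anycast service address.
servers = ["ns-a", "ns-b", "ns-c", "ns-d"]
print(ecmp_pick("192.0.2.10", 53124, "203.0.113.53", 53, "udp", servers))
```

the same flow always hashes to the same server, while different source 
addresses and ports spread across all of them, which is what lets several 
ordinary servers behind one address act as one larger logical server.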

again, DNS RRL has no downside. i hereby call upon all DNS vendors to make it 
their default.

-- 
Paul