[dns-operations] rate-limiting state

Fri Feb 7 17:35:24 UTC 2014

Colm MacCárthaigh wrote:
> ... But [RRL] also increases the collateral damage to your legitimate
> users. If an attacker spoofs a popular resolver, then for just 5-10
> queries per second they cause a degradation in service to the real
> legitimate users of that resolver. With the default settings, 12.5% of
> queries from that resolver may not get any answer at all, even with
> three attempts, and the lookup time is increased by about 1.3 RTTs on
> average.  With the resolver trying 3 authoritative nameservers, the
> availability hit diminishes to about 0.2% (which brings us to two
> nines), but the RTT hit gets worse.

You're calling it damage for some reason, even though it's not
stub-visible. RRL does of course rely on retries and TC=1 and other
known recursive-dns behaviour. That's because RRL is a protocol-specific
method. A non-DNS protocol would probably call for a different method.
If we take your 30% average RTT impact to heart, we're moving the needle
for a stub transaction from 20ms to 25ms. I'm unwilling to call that
damage, collateral or otherwise.
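
For readers checking the quoted figures: the 12.5% and 0.2% numbers can be
reproduced with a minimal sketch, assuming each response to the spoofed
resolver is dropped independently with probability 0.5 and that the resolver
makes three attempts per authoritative server. Both the 0.5 drop probability
and the independence are assumptions made here for illustration; the quoted
text only says "default settings". The function name is hypothetical.

```python
def p_no_answer(drop_prob: float, attempts: int, servers: int = 1) -> float:
    """Probability that every attempt to every server goes unanswered,
    assuming independent drops (an illustrative assumption)."""
    return drop_prob ** (attempts * servers)

# Assumed: half of the responses are dropped, three attempts per server.
single = p_no_answer(0.5, attempts=3)             # one authoritative server
triple = p_no_answer(0.5, attempts=3, servers=3)  # three authoritative servers

print(f"one server:    {single:.1%}")   # 12.5%
print(f"three servers: {triple:.2%}")   # 0.20%
```

Under these assumptions the failure rate falls from 0.5^3 to 0.5^9, which
matches the quoted drop from 12.5% to about 0.2%.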

> Now if I have a botnet or client that can generate 1M PPS (this is
> small, but adjust to any number), I can try to spoof 66,666 popular
> resolvers (this is a knowable set) at 5 QPS each to 3 auth servers, I
> can use RRL to degrade service in a more widespread way. 
>
> Now, let's say you have the capacity to answer these queries (which is
> realistic for some) which behavior is better for your users? Just
> answering the responses? Or rate-limiting the responses?

Rate limiting is always better, given that recursive servers will retry,
will act on TC=1, and will stop asking once they cache the result.
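
The 66,666 figure in the quoted scenario is simple division of the attacker's
packet budget by the per-resolver cost. A short sketch of that arithmetic,
using only the numbers given in the quote (the variable names are
illustrative):

```python
attacker_pps = 1_000_000   # attacker's total packets per second
qps_per_resolver = 5       # spoofed queries per second per resolver, per server
auth_servers = 3           # authoritative servers targeted per resolver

# Each spoofed resolver costs qps_per_resolver * auth_servers packets/second.
resolvers = attacker_pps // (qps_per_resolver * auth_servers)
print(resolvers)  # 66666
```

As the quote says, the result scales linearly with whatever packet rate the
attacker actually has.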

> My overall point is that with RRL there is some trade-off between
> protecting innocent reflection victims and opening yourself to an
> attack that degrades service to your real users in some way. Were RRL
> to be widely deployed, attacks could shift to table-exhaustion and
> popular-resolver spoofing and be effective in different ways.

There is no operable trade-off of the kind you're proposing. RRL makes
everyone's life better except the attackers, in all cases. The "degrade"
you're describing is far better than the non-RRL case, and is in any
case not user-visible. Are you criticizing RRL for using the known
behaviour of recursive servers (retrying, respecting TC=1, ceasing to
ask once an answer is obtained) deliberately to increase resiliency?

Separately, I dispute your implication that there's a table-exhaustion
condition that can be hit. The design of RRL takes table size into
account. I am, as before, ready to evaluate your experimental results if
you can show otherwise.

Vixie