[dns-operations] DNS ANY record queries - Reflection Attacks

Thu Sep 13 01:28:59 UTC 2012

good morning from tokyo. this omnibus reply only includes two original
messages.

On 9/12/2012 7:53 PM, Eric Osterweil wrote:
> As I said to Vernon: I am happy to be wrong in my concerns, but I continue to feel that there is no substitute for solid analysis and measurements when infrastructure availability is on the line.

i don't agree. factually speaking, that's not how DNS was made, or how
the Internet was made, nor how the Internet is periodically destroyed
and remade in place. the methodology of everything from ethernet to IPv6
fragmentation starts as a best effort and then gets fixed on the fly.

i am not disparaging what matt called "labbing". and vernon and i have
done plenty, well before we ever let outsiders run this code. but we
stopped well short of publishing a peer reviewed journal paper before we
made the code public (though such a paper is forthcoming.)

more:

> At the very least, as you are proposing something, you ought to do proper analysis.  It is not up to me to come up with something new just because a proposal has holes.  Even if, as you contend, those holes are just a lack of proper evaluation. ;)

there's nothing improper about the evaluation we've done, other than
that it wasn't formal and public. the reason i'm taking exception to
your lack of reasoning here is that you're claiming that there are
problems but you're not describing them. you're asking questions that
were answered in the technote. you're saying "but i don't understand
XYZ" and asking for clarification, on a public mailing list, without
having either done your homework (reading what HAS been published) or
giving a problem statement.

on the business side of ISC we call this "fud" and if i appear to be
short tempered here that may be why.

> I think you owe it to the community to support your own proposal with
> real analysis and corresponding measurements. 

i think you've hit on the reason these patches are not in BIND yet.

>>> How do you distinguish a netblock with multiple resolvers, or anycast resolvers?  Perhaps more directly, are you dropping responses from legitimate clients and how do you feel about them being collateral damage?
>> those would be false negatives, which are low, for the
>> statistics-related reasons vernon has given. if i found them to be high
>> or thought that they could be high then i would be more concerned.
>> (iptables based solutions have this problem; DNS RRL does not.)
> That is conjecture.  We don't live on the back of an envelope, we live in an operational world: measurements are what matter.

well spoken. you are definitely not a candidate for running patched
versions of critical infrastructure software. i'm not sure i can agree
with your word "conjecture" since i splattered a lot of packets against
the DNS RRL wall to see what would stick. criticize my informal methods
as you see fit, but don't say i'm making this stuff up.

moreover, the code is right there. if you think it might suck, give it a
try. my test was, set up a name server running this patch, start two
clients from the same distant /24, one of which just sent back to back
queries, ignoring the responses, the other did a normal lookup then
flushed its cache. loss rate for the second client was zero after an
hour of 100Mbit/sec attacks, because the second client retried,
understood what to do with TC=1, and had a cache (in between times i
flushed the cache.)

YMMV. i haven't done 'science' on this since my results and methods were
not public. that field is green.

> If you drop legit traffic, cause timeouts, and unreachability to someone's zone because they have deployed RRL under ANY-type reflector attacks, and then A-type reflector attacks cause RRL to shutdown their zone, I'd say you have caused harm.

and if pigs had wings, i'd say they could fly.

legit traffic gets dropped all the time. timeouts and retries happen. we
are living in the shadow of the caching recursive logic that's already
out there. what matters to me is whether a legit end-user query fails
because of DNS RRL, not whether it's slowed down because it got caught
in DNS RRL's storm-response.

> An attacker would then have the ability to cause your name servers to stop being productive to any other org's netblock, this would be a dangerous new attack vector.

thank you for this problem statement, theoretical and untested though it
is. my experience as the recipient of a many-gigabit attack is that
congestion makes *all* of my wide area dns fetches unreliable, not just
the ones that happen to produce similar results to the storm DNS RRL is
stopping.

more importantly, i have not been able to deliberately deny service by
attacking the /24 i was testing from. if you can i'd like to see your
methods and results, which will either cause me to change my claims or
will cause vernon and me to improve our methods.

>> second, if we were hurting legitimate clients, the damage would be to us
>> (the authority, since we'd be muting our content), whereas the cost of
>> doing nothing is borne primarily by the DDoS victims who we answer even
>> though they are not querying us. whether this is better or worse than
>> doing nothing depends on who you're trying to protect, and the above
>> observation ("i think that's worse than doing nothing") is a total
>> non sequitur.
> I totally agree that the DDoS threat is important.  On the other hand, opening new attack vectors that may not even address the real problem is also dangerous.

we don't open any new attack vectors, and in what possible sense are we
not addressing the real problem? (never mind that the new attack vector,
if any, would be dangerous either way.)

> ...
> This analysis misses a lot.  Resolver retx'ing 4 times in order to be likely to get a response is a big change... How long does a resolver have before the stub times out?  I know that _you_ know the stub runs the show as far as timeliness.

the stub isn't my biggest worry by a long shot. web browsers (and INNd
for some reason?) have their own caches, and operating system dns layers
do too. we're a long way from the world where my (BIND4's)
gethostbyname() was prevalent. this means the most important names are
mostly cached in and reused from three places: the app, the library, and
the recursive resolver. so their risk of failure is from new names and
rarely used names. simply put, it is _unlikely_ that a real attack,
unless very carefully targeted against a particular rare and/or new
name, will have any operational impact on "stubs".

what i worry about is starving the recursive server, which would
lengthen what i call "the kaminsky window" during which cache pollution
attacks can be effective in far less time. but as i said i was not able
to weaponize that vector so to me it remains theoretical. (help wanted.)

and no, resolver retrying in order to get a response is not a big
change. all recursive servers do this, even if the threat they usually
face is "BGP path changed -- let's keep trying until the tables coalesce."
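
a rough sketch of the retry arithmetic, not a measurement. it assumes the
worst case for a legitimate client (its response lands in a state blob that
is already over the limit), that each rate-limited response is independently
either dropped or sent as TC=1 with probability 1/SLIP (the real SLIP counter
is deterministic, so this is only an approximation), and that a client which
sees TC=1 retries over TCP and succeeds. the function name is made up for
illustration.

```python
from fractions import Fraction


def p_all_dropped(attempts, slip=2):
    """Chance that every UDP attempt is dropped (no TC=1 ever arrives).

    Assumption: each rate-limited response is independently sent as TC=1
    with probability 1/slip and dropped otherwise.
    """
    p_drop = 1 - Fraction(1, slip)
    return p_drop ** attempts


if __name__ == "__main__":
    for n in range(1, 5):
        p = p_all_dropped(n)
        print(f"{n} attempt(s): P(no answer and no TC=1) = {p} ({float(p):.4f})")
```

under those assumptions four UDP attempts leave a 1/16 chance of hearing
nothing at all, and that only applies to names whose responses are caught in
the storm and are not already cached somewhere upstream of the stub.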

>>>> So, every identical response either gets dropped or gets its TC bit set?
>>> No, every *excessive* identical response is either not sent (dropped)
>>> or a tiny TC=1 response is sent instead.
>> moreover, the definition of the word "identical" is not what one would
>> expect. perhaps we should say "vastly similar" rather than "identical".
>> one of the things DNS RRL counts is the number of times a negative
>> answer is generated, per-client-netblock, per-SOA-apex. these responses
>> are not identical but they all flow from the same SOA. another thing we
>> count is the number of times a wildcard is used per-client-netblock.
>> these responses are in no way identical but we treat them as such for
>> the purpose of rate limiting. these are things i do not think a firewall
>> can do unless it's so DNS-aware that it knows where the apex is, knows
>> what names exist, and knows what wildcards exist. (more on that in my
>> response to colm's thread.)
> This has all been very fluffy and nonspecific.

did you read the technote, and have you found it fluffy and nonspecific?
or are you complaining that my answers here are fluffy and nonspecific?
as the primary author of the technote i accept responsibility for any
part of it that's fluffy or nonspecific. the part that corresponds to
the above-quoted paragraph is:

   3 - Responder Behaviour

   3.1. When generating a response, a server will take the requestor's IP
   address and mask it according to either IPV4-PREFIX-LENGTH or
   IPV6-PREFIX-LENGTH, and then impute a domain name which is either a
   wildcard name (if a wildcard match occurred) or the zone name (if no
   match occurred) or the query name, and a boolean error indicator (was
   the response code REFUSED, FORMERR or SERVFAIL, or was it not?), and use
   this tuple <mask(IP), imputed(NAME), errorstatus> to select a state
   blob, creating one if necessary.

   3.2. If the selected state blob indicates that this response has been
   sent too often to requestors on this network, then consider whether to
   send a truncated response (due to SLIP) or no response. In any case
   increment a counter to indicate that the response has been considered.

   3.3. When a state blob's counter has not been incremented within WINDOW
   seconds then discard the state blob.

   3.4. In the event that the creation of a new state blob would cause the
   table to exceed MAX-TABLE-SIZE, the least recently used state blob
   should be discarded.

   3.5. Noting: Conceptually speaking, a state blob is either filling,
   full, or draining.  To be filling means that the rate limit has not been
   exceeded. To be full means that the rate limit has been exceeded. To be
   draining means that the rate limit was once exceeded and the rate has
   not yet returned to zero.

this text comes from http://www.redbarn.org/dns/ratelimits/

if it's unclear, please question it, i would like to improve clarity and completeness.
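
one way to read steps 3.1 through 3.4 as code. this is a minimal sketch, not
the BIND patch: the class and function names, the RESPONSES_PER_SECOND
parameter, the numeric defaults, and the credit-style accounting are all
assumptions made for illustration. the quoted text only names
IPV4-PREFIX-LENGTH, IPV6-PREFIX-LENGTH, SLIP, WINDOW and MAX-TABLE-SIZE and
does not say how the counter is kept.

```python
import ipaddress
from collections import OrderedDict

IPV4_PREFIX_LENGTH = 24      # assumed default
IPV6_PREFIX_LENGTH = 56      # assumed default
WINDOW = 15                  # seconds, assumed default
MAX_TABLE_SIZE = 100000      # assumed default
SLIP = 2                     # every SLIPth limited response goes out as TC=1
RESPONSES_PER_SECOND = 5     # hypothetical limit, not named in the quoted text
ERROR_RCODES = {"REFUSED", "FORMERR", "SERVFAIL"}


def mask(ip):
    """3.1: mask the requestor address to its client netblock."""
    addr = ipaddress.ip_address(ip)
    plen = IPV4_PREFIX_LENGTH if addr.version == 4 else IPV6_PREFIX_LENGTH
    return ipaddress.ip_network(f"{addr}/{plen}", strict=False)


def impute(qname, zone, wildcard=None, matched=True):
    """3.1: wildcard name, else zone name if nothing matched, else qname."""
    if wildcard is not None:
        return wildcard
    return qname if matched else zone


class Blob:
    def __init__(self, now):
        self.credit = RESPONSES_PER_SECOND   # one second of allowance
        self.updated = now
        self.limited = 0                     # responses considered over limit


class RateLimiter:
    def __init__(self):
        self.table = OrderedDict()           # insertion order doubles as LRU

    def decide(self, ip, name, rcode, now):
        """Return 'send', 'truncate' (TC=1) or 'drop' for one response."""
        key = (mask(ip), name.lower(), rcode in ERROR_RCODES)
        blob = self.table.get(key)
        if blob is not None and now - blob.updated > WINDOW:
            del self.table[key]              # 3.3, checked lazily on access
            blob = None
        if blob is None:
            if len(self.table) >= MAX_TABLE_SIZE:
                self.table.popitem(last=False)   # 3.4: evict least recent
            blob = self.table[key] = Blob(now)
        self.table.move_to_end(key)
        # Refill credit for elapsed time, then charge this response.
        elapsed = now - blob.updated
        blob.credit = min(RESPONSES_PER_SECOND,
                          blob.credit + elapsed * RESPONSES_PER_SECOND)
        blob.updated = now
        blob.credit = max(blob.credit - 1, -WINDOW * RESPONSES_PER_SECOND)
        if blob.credit >= 0:
            return "send"
        blob.limited += 1                    # 3.2: count it either way
        return "truncate" if blob.limited % SLIP == 0 else "drop"
```

in this sketch two clients in the same /24 asking for the same name share one
state blob, while a client in a different /24 gets its own. negative answers
would be keyed by passing impute(qname, zone, matched=False) as the name, so
that random nonexistent names under one apex all count against the same blob.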

>> it's possible that you've imagined a weakness by which a new kind of
>> attacker could target the DNS RRL machinery in a way that mutes goodput,
>> where this muting, and not DDoS, is the goal of the attack. i invite you
>> to code this up and demonstrate it. my concern in this regard is muting
>> an authority server during a kaminsky-style attack on some caching
>> resolver in order to lengthen the poison-attack window. but i was not
>> able to make it work in the current DNS RRL design. "help wanted."
> I fear that the attacks we already see will do this for me... ANY-type attacks are the flavor de jour, but they are not the only ones out there...

i invite weaponization of this concept. also note this section of the
technote:

   5 - Attacker Behaviour

   5.1. A forged-source reflective amplifying attacker who wants to be
   successful will either have to select authority servers who do not
   practice rate limiting yet, or will have to select a large number of
   authority servers and use round robin to distribute the attack flows.
   Each authority server will have to be asked a question within one of
   that server's zones chosen at random in order to get an amplification
   effect. An attacker would do well to select DNSSEC-signed zones and to
   use DNSSEC signalling in their forged queries to maximize response size.
   This will be more effective than QTYPE ANY queries which are often
   blocked altogether due to their diagnostic rather than operational
   purpose.

if possible i'd like to add other attacker options to this section.

>> thus you can see that DNS rate limiting's design is rooted in economics
>> while still governed by technology. we are coming up with solutions that
>> we can ask involved parties to implement. what's your idea?
> Also thus, you can see that the potential collateral damage done by under-analyzed approaches can outweigh optimistic appraisals... 

you have not demonstrated this or even given an untested theory under
which it might be true. i have no way to evaluate this baseless claim.
please send diffs to the technote or code, or answer vernon's math quiz
differently, or weaponize your concern into an effective demonstration.

> My idea to to verify our work so that we actually know it's merits. :)

my idea is rough consensus and running code, where peer review includes
people who read and run source code patches.

On 9/12/2012 7:44 PM, Eric Osterweil wrote:
> On Sep 11, 2012, at 7:28 PM, Vernon Schryver wrote:
>
>> The money-back warranty on the BIND RRL patch only covers its installation
>> on authority DNS servers, although I have received positive reports
>> from resolver operators.  The current version includes additional
>> features suggested by operators of combined authorities/resolvers.
> Have those reports mostly been surrounding ANY queries (as this thread does)?

no.

> This, I believe, is a very fundamental confirmation bias.  ANY queries are not widely used for much of anything (qmail not withstanding).

it wouldn't matter. we're looking at duplicate responses, so qtype is
just part of the response from our point of view. even a meta qtype like
ANY is just bits in a hash to us. in other words we care about qtype
because it is part of the identity of the response, not because it may
have some special processing associated with it (as ANY does).

> ... otoh, if you apply this to _all_ qtypes (including A), and there is a large scale A reflector attack, I _strongly_ suspect the people who have deployed RRL will have a great many angry clients...

and yet, they do not.

> I feel the confirmation of this approach is far too narrow if it only includes ANY query attacks.

so would i. that's not the situation.

>> ...  With default parameters, the
>> BIND RRL drops 50% of responses and substitutes a small TC=1 response
>> for the other 50%.  That gives an amplification for the responses it
>> sends of <= 1.0 responses sent and an overall amplification <= 0.5.
>> (It currently forgets about EDNS in the TC=1 responses giving a default
>> amplification of < 0.5.  An influential commentator calls that a bug.)
> OK, this is beginning to become clearer... But I have to admit, this still seems worrisome to me.  If you drop 50% of legit traffic (a generous assumption as it assumes a uniform distribution, which is not established by any of the analysis I have seen), and the other 50% (that you service as TC-bit mini-responses) comes back to you as TCP.

no. only one or two real clients will come back with TCP. the other ~50%
of TC=1 responses fall on deaf ears. only a real client who is waiting
on the right <ip,port> for the right TXID will hear their TC=1. the rest
is just small-packet DDoS (reflected with attenuation, and truncated
rather than amplified).
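
the amplification numbers vernon quotes can be checked with a few lines of
arithmetic. this is a sketch under stated assumptions: the byte sizes are
hypothetical, IP and UDP header overhead is ignored, the TC=1 response is
assumed to be no larger than the query it answers, and the handful of full
responses sent before the limit trips is ignored. the function names are made
up for illustration.

```python
def amplification_unlimited(query_bytes, response_bytes):
    """Bytes reflected at the victim per forged query byte, no rate limit."""
    return response_bytes / query_bytes


def amplification_limited(query_bytes, tc_bytes, slip=2):
    """Same ratio once the limit is exceeded.

    One response in `slip` is sent as a small TC=1 reply and the rest are
    dropped, so only 1/slip of the forged queries produce any output.
    """
    return (1 / slip) * (tc_bytes / query_bytes)


if __name__ == "__main__":
    query = 64        # hypothetical forged query size in bytes
    big = 3000        # hypothetical large signed response in bytes
    print("no rate limit:", amplification_unlimited(query, big))
    print("rate limited :", amplification_limited(query, query))
```

with a TC=1 response the same size as the query, the per-response ratio is
1.0 and the overall ratio at SLIP 2 is 0.5, matching the "<= 1.0" and
"<= 0.5" figures above. the forged victim still receives packets, but fewer
bytes than the attacker spent sending the queries.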

> Thus, you have taken your own processing requirements way up (as your clients will now all hit you over TCP instead of UDP).

no. just, no. please think these things through before addressing a
whole mailing list like this. vernon or i (or both of us) are available
for free telephone or e-mail or whiteboard consulting with anyone who
wants to know how DNS RRL works or who wants to walk through their list
of "but what about XYZ". (i now wish that i had notes from the
year-or-so of private conversation between vernon and me about "foreach
XYZ" that led to the design we finally published.)

paul