[dns-operations] resolvers considered harmful

Matthew Pounsett matt at conundrum.com
Thu Oct 23 15:43:31 UTC 2014


On Oct 22, 2014, at 23:03 , Mark Allman <mallman at icir.org> wrote:

> 
> 
> The paper quantifies this cost for .com.  We find that something like 1%
> of the records change each week.  So, while increasing the TTL from the
> current two days to one week certainly sacrifices some possible
> flexibility, in practical terms the flexibility isn't being used.

I think your definition of “used” is flawed.  1% of the records in .com is actually over a million delegations.  That seems like quite a lot of use, to me.  

But, let’s consider the actual effects of changing TTLs in a delegation-centric zone.

Today, the TTLs of NS records at the parent are, by and large, ignored by resolvers.  My understanding is that, because the parent NS set is non-authoritative data, most resolvers overwrite its TTL with the TTL of the authoritative NS records in the child zone.  Under current normal operation, changing the TTL in the TLD would therefore have little or no effect on the frequency of queries to the TLD.
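This is easy to see on the wire.  A rough sketch with dnspython 2.x (the server address and zone name are just placeholders; substitute whatever you like):

import dns.flags
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

PARENT = "192.5.6.30"      # a.gtld-servers.net, one of the .com servers
ZONE = "example.com."      # stand-in for any delegated zone

# Non-recursive query to the parent: the NS set comes back in the AUTHORITY
# section as a (non-authoritative) referral.
q = dns.message.make_query(ZONE, "NS")
q.flags &= ~dns.flags.RD
resp = dns.query.udp(q, PARENT, timeout=5)
delegation = next(r for r in resp.authority if r.rdtype == dns.rdatatype.NS)
print("parent-side (delegation) NS TTL:", delegation.ttl)

# Ask one of the delegated servers directly: this copy of the NS set is
# authoritative, and its TTL is the one most resolvers end up caching.
child_server = dns.resolver.resolve(str(delegation[0].target), "A")[0].address
resp2 = dns.query.udp(dns.message.make_query(ZONE, "NS"), child_server, timeout=5)
print("child-side (authoritative) NS TTL:", resp2.answer[0].ttl)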

If resolvers were to begin paying more attention to the parent TTL, then raising the TTL in the TLD zone would mean that that 1% of customers (over a million a week) would have to wait up to a week for the old records to expire from caches every time they update their NS set.  That’s a significant operational change for zone operators.

It’s even worse for DNSSEC, where the DS record in the parent zone is actually authoritative data.  Forcing people to take a minimum of a week to do a security roll isn’t going to go over well.  Even the standard one or two days that most TLDs use today is too long for some people in this case.
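The difference is visible with the same kind of query: ask a .com server for a delegation’s DS set and it answers authoritatively, TTL and all (dnspython again; address and zone are just placeholders):

import dns.flags
import dns.message
import dns.query
import dns.rdatatype

PARENT = "192.5.6.30"              # a.gtld-servers.net; any .com server will do
q = dns.message.make_query("example.com.", "DS")
q.flags &= ~dns.flags.RD
resp = dns.query.udp(q, PARENT, timeout=5)
for rrset in resp.answer:
    if rrset.rdtype == dns.rdatatype.DS:
        # Validating caches may hold the old DS for up to this many seconds
        # after the parent publishes a new one, so this TTL is the floor on
        # how long a KSK rollover has to wait.
        print("DS TTL at the parent:", rrset.ttl, "seconds")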

> 
>  - As noted in the paper 93% of the zones see no increase in our
>    trace-driven simulations.  That is, they are accessed by at most one
>    end host per TTL and therefore see no benefit from the shared cache
>    and hence will see the same load regardless of whether it is an end
>    host or a shared resolver asking the questions.

How does this compare to resolvers with one or two (or four) orders of magnitude more clients behind them?  You were watching a network with roughly 100 clients behind a resolver; this doesn’t seem to be representative of the Internet at large, where a very large number of clients are served by a very small number of recursive servers.  Have a look at recent work from Geoff Huston.  While I can’t put my hands on the reference at the moment, I seem to recall him having data suggesting that ~25% of clients sit behind ~1% of resolvers (I can find the reference[1] that puts 16% of the Internet behind Google alone).  That’s a very different world from the one extrapolated from 100 users behind each resolver.

> 
>  - Or, put differently ... We are not pretending that there is no
>    additional cost at some auth servers.  But, this additional cost
>    does buy us things.  So, it is simply a different tradeoff than we
>    are making now.

It’s externalizing costs, not a trade-off.  One entity is not making a change and then gaining some things and losing others.  One entity is making a change (e.g. an ISP shutting down its resolver) and gaining a reduction in expenses.  Meanwhile, other entities (e.g. its end users) lose some processor cycles to their new private resolvers, and some time to increased RTT due to cache misses (see above for why I don’t accept that this is not a problem).  A third set of entities (authoritative operators) loses quite a lot to a significant increase in operational costs.

> - There is also a philosophical-yet-practical argument here.  That is,
>    if I want to bypass all the shared resolver junk between my laptop
>    and the auth servers I can do that now.  And, it seems to me that
>    even given all the arguments against bypassing a shared resolver
>    that should be viewed as at least a rational choice.  So, in this
>    case the auth zones just have to cope with what shows up.  So, do we
>    believe that it is incumbent upon (say) AT&T to provide shared
>    resolvers to shield (say) Google from a portion of the DNS load?
>    Or, put differently, the results in the paper suggest that there
>    really isn't much for AT&T to gain from providing those resolvers,
>    so why should it?  One argument here could be that AT&T is trying to
>    provide its customers better performance.  But, the paper shows this
>    is really not happening (which is largely a function of pervasive
>    DNS prefetching).  So, if I am AT&T I'd be thinking "hey, what am I
>    or my customers actually gaining from this complexity I have in my
>    network?!".  And, if the answer is little-to-nothing then it seems
>    rational to not provide this service.  Or, so it seems to me.

It doesn’t look to me like your paper has done anything to capture what things look like behind AT&T’s resolvers, so I’m not sure how you can come to that sort of conclusion.  In 2012, AT&T had around 107 million mobile users alone (I found that number[2] more easily than a figure for their home Internet users, so I’m using it).  I can guarantee you AT&T isn’t running a million separate recursive resolvers.  It’s at most in the very low thousands of servers (likely fewer than 1,000 country-wide).  The cache hit/miss ratios in that environment are entirely different from those in your study, and far more representative of the average user’s experience.
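To make the scale point concrete, here’s a deliberately crude toy model (pure Python, every parameter invented, no claim to match anyone’s real traffic): one shared cache, Zipf-ish name popularity, a fixed TTL, and varying numbers of clients behind it.  The only point is the trend: hit rates that look unimpressive with 100 clients look very different a couple of orders of magnitude up.

import random

def hit_rate(clients, names=100_000, queries_per_client=30, ttl=300,
             window=3600, zipf_s=1.0, seed=1):
    """Fraction of queries answered from the shared cache (toy model)."""
    rng = random.Random(seed)
    total = clients * queries_per_client
    # Zipf-style popularity: cumulative weights for rank 1..names.
    cum, acc = [], 0.0
    for rank in range(1, names + 1):
        acc += 1.0 / rank ** zipf_s
        cum.append(acc)
    picks = rng.choices(range(names), cum_weights=cum, k=total)
    times = sorted(rng.uniform(0, window) for _ in range(total))
    cache = {}                     # name -> expiry time of the cached answer
    hits = 0
    for t, name in zip(times, picks):
        if cache.get(name, -1.0) >= t:
            hits += 1
        else:
            cache[name] = t + ttl  # miss: fetch upstream, cache for one TTL
    return hits / total

for n in (100, 1_000, 10_000):
    print(f"{n:>6} clients: ~{hit_rate(n):.0%} of queries answered from cache")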

I actually support end users having their own iterative resolver, as it fixes all of the last-mile problems in DNSSEC validation.  However, the benefit of the shared cache cannot be denied; or, at least, it hasn’t been denied by this study.


[1]: Measuring DNSSEC, Geoff Huston 
     <http://www.potaroo.net/presentations/2014-06-03-dns-measurements.pdf>

[2]: <http://www.fiercewireless.com/special-reports/grading-top-10-us-carriers-fourth-quarter-2012>