[dns-operations] The (very) uneven distribution of DNS root servers on the Internet

Wed May 16 15:24:13 UTC 2012

Hi Andrew,

On 2012-05-15, at 13:03, Andrew Sullivan wrote:

> On the root-servers.org site, however (which seems
> to be the source of at least some of the data pingdom is using), there
> isn't anywhere one can find a description of how sites are chosen by
> the various operators nor what factors determine provisioning in those
> sites.  

Since you asked: for us, right now, the criteria for an organisation to host an anycast instance of L-Root are:

 - ability to speak BGP
 - willingness and ability to enter into a $0 contract with ICANN
 - willingness to buy a server to ICANN's spec and host it

The question of "how sites are chosen" for us is, flippantly, that they are not. If someone thinks it is worthwhile hosting a copy of L-Root and they meet the criteria above, that's good enough for us.

I agree that it would be good to be able to find this kind of information more easily from www.root-servers.org.

> Let me make up an implausible scenario to illustrate why this might
> matter, and how additional coverage could in principle get worse by
> adding more nodes (an issue not clear in the pingdom article).
> Suppose that in Viet Nam there is an IX, that most of the ISPs in Viet
> Nam have a presence in that IX, and that all the participants in the IX
> have free and easy peering policies.  Suppose also that the ISPs in
> that IX all have extremely good connections to Singapore and Malaysia.
> As a result of all of this, ISPs all have extremely good connectivity
> to F, I, and J.  Suppose L puts a node in the Vietnamese IX.  Service
> is improved in the sense that RTT on root queries goes down when
> they're directed to L.  However, if L accidentally puts an
> underprovisioned node in Viet Nam, and it is sometimes overwhelmed,
> then service actually gets _worse_: the overwhelmed node sometimes
> drops queries or crashes or traffic gets routed elsewhere; in any
> case, there is additional latency that results from having to recover
> from the overload condition.
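
To put rough numbers on that scenario, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a measurement: the function name, the RTTs, the drop rate and the retry timeout are all made up, and it assumes a resolver that always tries the nearby node first and succeeds on a single retry to a more distant server. Real resolvers track server response times and would shift load away from a misbehaving node, so this is an upper-bound style illustration only.

```python
def expected_latency_ms(rtt_near_ms, rtt_far_ms, drop_prob, retry_timeout_ms):
    """Mean time to an answer when the nearby node drops some queries.

    Assumed model: a query to the nearby node is answered after
    rtt_near_ms with probability (1 - drop_prob). Otherwise the resolver
    waits retry_timeout_ms, then retries a more distant server, which
    answers after rtt_far_ms.
    """
    answered = (1.0 - drop_prob) * rtt_near_ms
    dropped = drop_prob * (retry_timeout_ms + rtt_far_ms)
    return answered + dropped


# Hypothetical numbers: 40 ms to the existing regional nodes, 5 ms to the
# new local node, an 800 ms retry timeout, and 5% of queries dropped.
baseline = 40.0
healthy = expected_latency_ms(5, 40, 0.0, 800)      # new node behaves well
overloaded = expected_latency_ms(5, 40, 0.05, 800)  # new node is overwhelmed

print(f"no local node:   {baseline:.2f} ms")
print(f"healthy node:    {healthy:.2f} ms")
print(f"overloaded node: {overloaded:.2f} ms")
```

Under these assumed values, a 5% drop rate is already enough for the mean to exceed the 40 ms baseline, because each dropped query costs the full retry timeout on top of the longer round trip.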

We're interested in the stability of the system overall, and are not fixated on optimising every variable for L. By taking a different approach from other root server operators (e.g. focussing on deployment in ISP networks rather than at exchange points, deploying many small boxes instead of a smaller number of bigger ones, deploying in new locations using different criteria) we're adding diversity of approach to the system.

There are undoubtedly scenarios such as the one you described that will cause a root server's service to be degraded for some users. I think the important thing, however, is that the extent to which each root server's infrastructure is affected by any given scenario be usefully different. It's not possible to prevent every failure mode; what we aim for is to make the system as a whole more robust by incorporating as much diversity as possible. That diversity is important in operations and deployment, just as it's important in (for example) operating systems, network equipment vendors, DNS software, etc.

Joe