[dns-operations] Do Unix stubs round robin nameserver addresses?
Chuck Anderson
cra at WPI.EDU
Fri Apr 17 22:44:48 UTC 2015
On Fri, Apr 17, 2015 at 03:06:45PM -0700, Doug Barton wrote:
> I have always believed (based on both the man pages, and what I've
> seen in the field) that Unix stub resolvers follow the behavior
> described in the man page. That is, they try the first 'nameserver'
> address listed, and if it doesn't get a response before the timeout
> value expires it then moves on to the next one in line.
>
> I was having a discussion with someone about that issue today who
> insists that they have empirical evidence that this is not the case,
> that they have seen stubs that round robin the addresses. So, I'm
> wondering if y'all have seen the same thing?
It is configurable. See resolv.conf(5), specifically the "rotate"
option.
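As a rough illustration of what "rotate" changes, here is a minimal Python sketch. It is a simplified model, not the libc implementation: the sample addresses (from the 192.0.2.0/24 documentation range) and the helper names parse_resolv_conf and first_server_for_query are made up for this example, and a real resolver may pick the starting server differently.

```python
# Hypothetical resolv.conf contents, used only for this illustration.
SAMPLE = """\
nameserver 192.0.2.1
nameserver 192.0.2.2
options rotate
"""

def parse_resolv_conf(text):
    """Return (nameservers, options) from resolv.conf-style text."""
    nameservers, options = [], []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and comment lines starting with '#' or ';'.
        if not fields or fields[0][0] in "#;":
            continue
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            options.extend(fields[1:])
    return nameservers, options

def first_server_for_query(nameservers, options, query_index):
    """Simplified model: which server is tried first for the Nth query."""
    if "rotate" in options:
        # With rotate, the starting server advances round-robin per query.
        return nameservers[query_index % len(nameservers)]
    # Without rotate, every query starts at the first listed server.
    return nameservers[0]

servers, opts = parse_resolv_conf(SAMPLE)
for i in range(4):
    print(i, first_server_for_query(servers, opts, i))
```

Removing the "options rotate" line from SAMPLE makes every query start at 192.0.2.1, which is the default behavior described in the question.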
Unix stub resolvers are a mess--it is hopeless to rely on any kind of
sane failover behavior with multiple nameservers listed in
/etc/resolv.conf. Many servers/applications will hang if the first
listed nameserver is down, or at least take so long to fail over to the
next nameserver that your service/application is effectively dead.
Usually each new incoming request will start over at the first
nameserver. Finally, most long-running processes won't bother to
re-read /etc/resolv.conf if it changes, so even if you change the
order during an outage (see [1]), it won't help.
[1] http://kvz.io/blog/2013/03/27/poormans-way-to-decent-dns-failover/
"nsfailover" is a nice idea, but it doesn't work in practice for
long-running server processes. It might be okay for desktop systems.
The problem results from the fact that there is no system-wide state
that is kept to maintain the status of each of the nameservers listed
in /etc/resolv.conf. The C library keeps this state for each process
and/or thread. If you have a server process that spawns a new thread
or process for each incoming request, each process/thread will start
over at the first nameserver and go through the timeout process until
it finds a working nameserver. It may even be as bad as every new DNS
request in the SAME process starting over from the first one.
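The cost of not remembering which nameserver is down can be sketched with a toy Python model. Everything here is an assumption made for illustration: the function names resolve_once and total_wait are hypothetical, each server is tried only once per request, and the "remember" flag stands in for the shared state that the C library does not keep across processes.

```python
def resolve_once(servers_up, timeout, start=0):
    """Try servers in order beginning at index `start`.

    servers_up is a list of booleans (True = server answers).
    Returns (index of the server that answered or None, seconds waited).
    """
    waited = 0
    n = len(servers_up)
    for step in range(n):
        i = (start + step) % n
        if servers_up[i]:
            return i, waited
        waited += timeout  # a dead server costs one full timeout
    return None, waited

def total_wait(requests, servers_up, timeout, remember):
    """Total seconds spent waiting across a number of requests."""
    waited_total = 0
    start = 0
    for _ in range(requests):
        idx, waited = resolve_once(servers_up, timeout, start)
        waited_total += waited
        if remember and idx is not None:
            start = idx  # shared state: begin at the last working server
    return waited_total

# First nameserver down, second up, 5-second timeout, 100 requests.
print(total_wait(100, [False, True], 5, remember=False))
print(total_wait(100, [False, True], 5, remember=True))
```

In this model, 100 requests handled by fresh processes (remember=False) each pay the 5-second timeout, while a resolver with shared state pays it only once.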
RES_TIMEOUT defaults to 5 seconds, and RES_DFLRETRY defaults to 2. So
each DNS query could potentially hang for up to 10 seconds unless you
have a really smart application that does the right thing and/or
implements its own stub resolver.
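The arithmetic above, written out as a small Python sketch. It reproduces only the estimate given here (timeout multiplied by retries); how a particular C library combines these values when several nameservers are listed varies, so treat the function as an illustration rather than a description of any one resolver. The helper names are hypothetical. The "timeout:n" and "attempts:n" options are the resolv.conf(5) settings that override these defaults.

```python
RES_TIMEOUT = 5    # default seconds to wait for a response
RES_DFLRETRY = 2   # default number of tries

def worst_case_wait(timeout=RES_TIMEOUT, attempts=RES_DFLRETRY):
    """Estimated worst-case hang per query, following the estimate above."""
    return timeout * attempts

def options_line(timeout, attempts):
    """Build a resolv.conf options line that overrides the defaults."""
    return "options timeout:%d attempts:%d" % (timeout, attempts)

print(worst_case_wait())       # defaults
print(worst_case_wait(1, 2))   # shorter timeout
print(options_line(1, 2))
```

Lowering the timeout shrinks the worst case but does not remove the underlying problem that each process rediscovers the dead server on its own.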
Windows doesn't have this problem because it comes with a
system-wide DNS cache by default.
I'm not sure about OS X; it may also come with a cache.
The Linux folks are working on solutions. One attempt is
systemd-resolved.
But you can't rely on such nice client-side solutions/behavior because
most Unix systems are still broken out-of-the-box. As a DNS resolver
operator for my campus, I've come to this unfortunate conclusion after
months of research and testing. The only sane thing to do is:
1. Run a system-wide DNS caching resolver on 127.0.0.1, and point
/etc/resolv.conf to that.
or
2. Use anycast to make your multiple DNS servers appear as one IP, and
put that one IP in /etc/resolv.conf. You can have multiple IPs,
but each one should still be anycasted.
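For option 1, a quick sanity check that a host really points only at a local caching resolver might look like the following Python sketch. The function name uses_only_local_resolver is hypothetical, and the check looks only at "nameserver" lines; it does not verify that anything is actually listening on the loopback address.

```python
import ipaddress

def uses_only_local_resolver(text):
    """True if resolv.conf-style text lists only loopback nameservers."""
    servers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) > 1 and fields[0] == "nameserver":
            servers.append(fields[1])
    # At least one nameserver must be listed, and all must be loopback.
    return bool(servers) and all(
        ipaddress.ip_address(s).is_loopback for s in servers
    )

# Reads the real file when present; adjust the path if yours differs.
try:
    with open("/etc/resolv.conf") as f:
        print(uses_only_local_resolver(f.read()))
except OSError:
    print("could not read /etc/resolv.conf")
```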
I know my answer was way more than you asked for, but I had to take
this chance to get the word out :-)