[dns-operations] Best practices for Linux/UNIX stub resolver failover

Fri May 2 09:04:41 UTC 2014

On 22 Apr 2014, at 20:04, Chuck Anderson <cra at WPI.EDU> wrote:

> Is it really expected that the first DNS server listed in
> /etc/resolv.conf should never go down?  Operationally speaking, who
> can actually rely on listing multiple nameservers in /etc/resolv.conf
> and using libc's failover mechanism in any kind of production server?
> Because the failover behavior in libc is atrocious--each new or
> existing process has to re-do the failover after timing out, and even
> long-running processes have to call res_init() to re-read resolv.conf.
> It seems that the only sensible way to run a datacenter (or a network
> full of Linux workstations for that matter) is to either:
> 
> 1. Make sure the first nameserver listed in resolv.conf never goes
>   down by using Anycast DNS or some other failover mechanism like
>   VRRP or CARP on the DNS server side.
> 
> or:
> 
> 2. Use a local DNS daemon on every server with forwarders configured
>   to the network's nameservers, and fix resolv.conf to 127.0.0.1.
> 

I'm not an expert either, but I remember having bad experiences with this on Solaris 8/10 and early versions of Linux many years ago. I was under the impression that modern versions of *nix had this covered, and the quick test I just did on a RHEL6 server makes me think this is no longer an issue.

The server I have has a resolv.conf configured as follows:

nameserver	192.168.1.244
nameserver	192.168.1.1

If I stop the resolver running on 192.168.1.244 and then generate a query, I see the following in a tcpdump:

09:57:01.348084 IP 192.168.1.173.47658 > 192.168.1.244.domain: 35734+ A? pool.ntp.org. (30)
09:57:01.348160 IP 192.168.1.173.47658 > 192.168.1.244.domain: 61935+ AAAA? pool.ntp.org. (30)
09:57:01.348223 IP 192.168.1.173.42169 > 192.168.1.1.domain: 35734+ A? pool.ntp.org. (30)
09:57:01.348267 IP 192.168.1.173.42169 > 192.168.1.1.domain: 61935+ AAAA? pool.ntp.org. (30)

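This is very quick failover to the second resolver: the first query to 192.168.1.1 goes out about 140 microseconds after the first query to 192.168.1.244.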
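
To put a number on the failover, the gap can be read straight from the capture timestamps. Below is a small Python sketch that does this, using only the first query sent to each server from the tcpdump output above. The function names are my own and the parsing assumes tcpdump's default one-line format, so treat it as an illustration rather than a general parser.

```python
from datetime import datetime, timedelta

# The first query sent to each resolver, copied from the tcpdump output above.
CAPTURE = """\
09:57:01.348084 IP 192.168.1.173.47658 > 192.168.1.244.domain: 35734+ A? pool.ntp.org. (30)
09:57:01.348223 IP 192.168.1.173.42169 > 192.168.1.1.domain: 35734+ A? pool.ntp.org. (30)
"""

def first_seen(capture):
    """Map each destination, as printed by tcpdump, to its first timestamp."""
    seen = {}
    for line in capture.splitlines():
        fields = line.split()
        # Expected layout: <time> IP <src> > <dst>: <rest of the query>
        if len(fields) < 5 or fields[1] != "IP":
            continue
        stamp = datetime.strptime(fields[0], "%H:%M:%S.%f")
        dest = fields[4].rstrip(":")
        seen.setdefault(dest, stamp)  # keep only the earliest packet
    return seen

def failover_gap_us(capture):
    """Microseconds between the first packet to the earliest and latest server."""
    times = sorted(first_seen(capture).values())
    return (times[-1] - times[0]) // timedelta(microseconds=1)

print(failover_gap_us(CAPTURE))  # prints 139
```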

Or did I misunderstand the problem?
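
If anyone wants to repeat the test from the client side instead of reading a capture, something like the Python sketch below could time a single lookup through the system's stub resolver. It assumes that Python's socket.getaddrinfo() ends up in libc's resolver and that no local cache (nscd or similar) answers first. The helper name time_lookup and its resolve parameter are my own, added so the timing logic can be exercised without a network.

```python
import socket
import time

def time_lookup(name, resolve=socket.getaddrinfo):
    """Return (elapsed seconds, error or None) for one lookup of name.

    resolve is injectable (hypothetical convenience) so a stand-in can be
    used when no network is available.
    """
    start = time.monotonic()
    error = None
    try:
        resolve(name, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return time.monotonic() - start, error
```

Calling time_lookup("pool.ntp.org") once with both resolvers up and once with the first one stopped should show how much delay a new process actually sees. My test only stopped the daemon, so the host was still there to reject the packet; the result may well differ if the first nameserver's host is unreachable altogether, since then nothing answers and the client has to wait out its timeout.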

Brett