[dns-operations] Bind 9.8.0 intermittent problem with non-recursive responses
Carlos Vicente
cvicente.lists at gmail.com
Thu May 19 17:58:19 UTC 2011
Dear lists [apologies if you receive two copies of this message],
I am in the process of implementing anycast recursive DNS service for
our campus using a combination of servers running Bind 9.8.0 and Cisco's
IP SLA feature. There are three identical Red Hat servers connected to
three different routers with point-to-point /30 links. The servers are
configured with an anycast address attached to an alias of the loopback
interface:
[note: these are not the actual IP addresses]
lo:1 Link encap:Local Loopback
inet addr:192.168.32.32 Mask:255.255.255.255
UP LOOPBACK RUNNING MTU:16436 Metric:1
These caching servers are also configured as stealth slaves for our
zones (using Bind's 'also-notify' option in our master). This allows us
to serve the latest contents of our zones without having to wait for
TTLs to expire.
In our tests, we've come across a very interesting but annoying problem.
After several hours of operation, the servers start to respond inconsistently
to queries for names that are CNAMEs. For example:
# dig @192.168.32.32 www.uoregon.edu
; <<>> DiG 9.8.0-RedHat-9.8.0-4.uopel5 <<>> @192.168.32.32 www.uoregon.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14280
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 6, ADDITIONAL: 4
;; QUESTION SECTION:
;www.uoregon.edu. IN A
;; ANSWER SECTION:
www.uoregon.edu. 600 IN CNAME uowc-www.uoregon.edu.
uowc-www.uoregon.edu. 86400 IN A 192.168.142.125
;; AUTHORITY SECTION:
uoregon.edu. 86400 IN NS phloem.uoregon.edu.
uoregon.edu. 86400 IN NS bigdog.lsu.edu.
uoregon.edu. 86400 IN NS sns-pb.isc.org.
uoregon.edu. 86400 IN NS arizona.edu.
uoregon.edu. 86400 IN NS ruminant.uoregon.edu.
uoregon.edu. 86400 IN NS dns.cs.uoregon.edu.
;; ADDITIONAL SECTION:
phloem.uoregon.edu. 86400 IN A 192.168.32.35
phloem.uoregon.edu. 86400 IN AAAA 2001:468:d01:20::80df:2023
ruminant.uoregon.edu. 86400 IN A 192.168.60.22
ruminant.uoregon.edu. 86400 IN AAAA 2001:468:d01:3c::80df:3c16
;; Query time: 0 msec
;; SERVER: 192.168.32.32#53(192.168.32.32)
;; WHEN: Wed May 18 12:51:06 2011
;; MSG SIZE rcvd: 300
# dig @192.168.32.32 www.uoregon.edu
; <<>> DiG 9.8.0-RedHat-9.8.0-4.uopel5 <<>> @192.168.32.32 www.uoregon.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34776
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;www.uoregon.edu. IN A
;; ANSWER SECTION:
www.uoregon.edu. 600 IN CNAME uowc-www.uoregon.edu.
As you can see, the second response contains only the CNAME record: the A
record for the CNAME target is missing from the ANSWER section, and the
AUTHORITY and ADDITIONAL sections are empty. This causes our users' machines
to fail to resolve the name because the resolver library does not query a
second time for the target. In this second type of response the server
appears to be acting as an authoritative-only server, not as a caching
recursive server.
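The difference between the two responses can be checked mechanically. What
follows is only a minimal sketch in Python, assuming dig's default text
output as pasted above; the function name classify_answer and its three
return labels are made up for illustration and are not part of Bind or dig.

```python
def classify_answer(dig_output: str) -> str:
    """Classify the ANSWER section of dig's default text output.

    Returns "complete" if the section holds at least one A or AAAA record,
    "cname-only" if it holds nothing but CNAME records, and "other" in
    every other case (including a missing or empty ANSWER section).
    """
    types = []
    in_answer = False
    for raw in dig_output.splitlines():
        line = raw.strip()
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            # A blank line or the next ";;" header ends the section.
            if not line or line.startswith(";"):
                break
            fields = line.split()
            # Record lines look like: name ttl class type rdata...
            if len(fields) >= 5:
                types.append(fields[3])
    if any(t in ("A", "AAAA") for t in types):
        return "complete"
    if types and all(t == "CNAME" for t in types):
        return "cname-only"
    return "other"
```

Fed the first response above, this should return "complete"; fed the second,
"cname-only".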
Here are the most interesting details:
- We have only observed this happening when querying the anycast
address, not the address associated with the ethernet interface.
- The behavior is independent of the network. We can replicate it by
querying the anycast address from the server itself.
- Our production (non-anycast) servers run the exact same version of
Bind with the exact same configuration, and we have never observed this
problem.
- Bind's debugging output is exactly the same in both cases, so
it offers no clues about the difference in responses.
- After restarting Bind, the problem goes away for several hours. It only
comes back if the server receives query traffic during those hours;
otherwise the problem does not happen.
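Since the problem only shows up on the anycast address and only after some
hours, one way to catch the moment it starts is to query both addresses of a
server side by side at intervals. Below is a rough sketch in Python, assuming
dig is in the PATH. The helper names (dig_answer, has_address,
compare_servers) are hypothetical, and whichever address you pass for the
ethernet interface is yours to fill in, since it is not shown in this message.

```python
import subprocess


def dig_answer(server: str, name: str) -> str:
    """Run dig against one server and return only the answer records."""
    result = subprocess.run(
        ["dig", "+noall", "+answer", f"@{server}", name],
        capture_output=True, text=True, timeout=10,
    )
    return result.stdout


def has_address(answer_text: str) -> bool:
    """True if any record line in the answer is of type A or AAAA."""
    for line in answer_text.splitlines():
        if line.startswith(";"):
            continue  # skip dig comment lines
        fields = line.split()
        # Record lines look like: name ttl class type rdata...
        if len(fields) >= 5 and fields[3] in ("A", "AAAA"):
            return True
    return False


def compare_servers(name, servers, query=dig_answer):
    """Map each server address to whether its answer included an address.

    The query argument can be replaced by a stub for offline testing.
    """
    return {server: has_address(query(server, name)) for server in servers}
```

Calling compare_servers("www.uoregon.edu", [anycast_addr, nic_addr]) from
cron or a loop and logging the first time the two values differ would give a
timestamp to line up against the query logs.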
Here's the options section of the config:
options {
    version "9999.9.9";
    recursive-clients 5000;
    directory "/etc/named";
    allow-transfer { none; };
    blackhole { attackers; };
    listen-on-v6 { any; };
    allow-recursion { customers; };
    allow-query { any; };
    dnssec-enable yes;
    dnssec-validation yes;
};
Bind is listening on the anycast address (in addition to its NIC IP
address):
# netstat -lnp |grep 192.168.32.32
tcp 0 0 192.168.32.32:53 0.0.0.0:* LISTEN 30771/named
udp 0 0 192.168.32.32:53 0.0.0.0:* 30771/named
These are the details of our Bind daemon (custom-built RPM, based on
Fedora's source RPM):
# named -V
BIND 9.8.0-RedHat-9.8.0-4.uopel5 built with
'--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu'
'--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr'
'--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
'--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
'--libdir=/usr/lib64' '--libexecdir=/usr/libexec'
'--sharedstatedir=/usr/com' '--mandir=/usr/share/man'
'--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var'
'--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static'
'--disable-openssl-version-check' '--enable-exportlib'
'--with-export-libdir=/usr/lib64'
'--with-export-includedir=/usr/include'
'--includedir=/usr/include/bind9' 'build_alias=x86_64-redhat-linux-gnu'
'host_alias=x86_64-redhat-linux-gnu'
'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS=
-DDIG_SIGCHASE' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic' 'FFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic'
using OpenSSL version: OpenSSL 0.9.8e-rhel5 01 Jul 2008
using libxml2 version: 2.6.26
# uname -a
Linux adns1 2.6.18-238.9.1.el5 #1 SMP Fri Mar 18 12:42:39 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
I would really appreciate any help with this.
Thanks in advance,