[dns-operations] Bind 9.8.0 intermittent problem with non-recursive responses

Thu May 19 18:49:05 UTC 2011

Carlos,

I've observed the same behavior with BIND 9.8.0 listening on an ordinary IPv4 address assigned to an Ethernet interface, with no anycast address on a loopback alias.  Odds are good this is a software bug in BIND.  I saw the same behavior on two nearly identical platforms, while on two others I've not run into it.

As best I could determine, the problem appears only after the server has been running for some time and/or has handled a certain number or volume of queries.  On servers that only handle inside "trusted" users I've not seen the problem at all, and they're still running 9.8.0 today.  On external Internet-facing servers, where the problem was triggered almost daily, we rolled back to 9.7.x until a fix is released (or until 9.8.1 comes out, when we'll try again).

FYI, the server OS in my case is CentOS 5.6 32-bit, which should be equivalent to Red Hat.

Hopefully an ISC POC will contact you directly.  Send configs and they'll probably assist in debugging.

-----Original Message-----
From: dns-operations-bounces at lists.dns-oarc.net [mailto:dns-operations-bounces at lists.dns-oarc.net] On Behalf Of Carlos Vicente
Sent: Thursday, May 19, 2011 1:58 PM
To: bind-users at lists.isc.org; dns-operations at lists.dns-oarc.net
Subject: [dns-operations] Bind 9.8.0 intermittent problem with non-recursive responses

Dear lists [apologies if you receive two copies of this message],

I am in the process of implementing anycast recursive DNS service for
our campus using a combination of servers running Bind 9.8.0 and Cisco's
IP SLA feature. There are three identical Red Hat servers connected to
three different routers with point-to-point /30 links. The servers are
configured with an anycast address attached to an alias of the loopback
interface:

[note: these are not the actual IP addresses]

lo:1      Link encap:Local Loopback
          inet addr:192.168.32.32  Mask:255.255.255.255
          UP LOOPBACK RUNNING  MTU:16436  Metric:1

These caching servers are also configured as stealth slaves for our
zones (using Bind's 'also-notify' option in our master). This allows us
to serve the latest contents of our zones without having to wait for
TTLs to expire.

In our tests, we've come across a very interesting but annoying problem.
After several hours of operation, the servers start to respond to CNAME
queries in an inconsistent manner. For example:

# dig @192.168.32.32 www.uoregon.edu

; <<>> DiG 9.8.0-RedHat-9.8.0-4.uopel5 <<>> @192.168.32.32 www.uoregon.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14280
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 6, ADDITIONAL: 4

;; QUESTION SECTION:
;www.uoregon.edu.		IN	A

;; ANSWER SECTION:
www.uoregon.edu.	600	IN	CNAME	uowc-www.uoregon.edu.
uowc-www.uoregon.edu.	86400	IN	A	192.168.142.125

;; AUTHORITY SECTION:
uoregon.edu.		86400	IN	NS	phloem.uoregon.edu.
uoregon.edu.		86400	IN	NS	bigdog.lsu.edu.
uoregon.edu.		86400	IN	NS	sns-pb.isc.org.
uoregon.edu.		86400	IN	NS	arizona.edu.
uoregon.edu.		86400	IN	NS	ruminant.uoregon.edu.
uoregon.edu.		86400	IN	NS	dns.cs.uoregon.edu.

;; ADDITIONAL SECTION:
phloem.uoregon.edu.	86400	IN	A	192.168.32.35
phloem.uoregon.edu.	86400	IN	AAAA	2001:468:d01:20::80df:2023
ruminant.uoregon.edu.	86400	IN	A	192.168.60.22
ruminant.uoregon.edu.	86400	IN	AAAA	2001:468:d01:3c::80df:3c16

;; Query time: 0 msec
;; SERVER: 192.168.32.32#53(192.168.32.32)
;; WHEN: Wed May 18 12:51:06 2011
;; MSG SIZE  rcvd: 300

# dig @192.168.32.32 www.uoregon.edu

; <<>> DiG 9.8.0-RedHat-9.8.0-4.uopel5 <<>> @192.168.32.32 www.uoregon.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34776
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.uoregon.edu.		IN	A

;; ANSWER SECTION:
www.uoregon.edu.	600	IN	CNAME	uowc-www.uoregon.edu.

As you can see, the second response contains only the CNAME record: the
A record for the CNAME target is missing from the ANSWER section, and
the AUTHORITY and ADDITIONAL sections are empty. This causes our users'
machines to fail to resolve the name because the resolver library does
not query a second time. In this second type of response the server
appears to be acting as an authoritative-only server, not as a caching
recursive server.
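
One way to tell the two kinds of response apart when collecting many
samples is to parse dig's text output. The following is only a sketch,
assuming dig's default output format as shown above; the function names
dig_section_counts and looks_incomplete are made up for illustration and
are not part of any BIND tool.

```python
import re

# Matches the counts on dig's ";; flags: ...; QUERY: 1, ANSWER: 2, ..." line.
COUNTS_RE = re.compile(
    r"QUERY:\s*(\d+),\s*ANSWER:\s*(\d+),\s*AUTHORITY:\s*(\d+),\s*ADDITIONAL:\s*(\d+)"
)


def dig_section_counts(dig_output: str) -> dict:
    """Return the four section counts reported in dig's header line."""
    match = COUNTS_RE.search(dig_output)
    if match is None:
        raise ValueError("no section counts found in dig output")
    query, answer, authority, additional = map(int, match.groups())
    return {
        "query": query,
        "answer": answer,
        "authority": authority,
        "additional": additional,
    }


def looks_incomplete(dig_output: str) -> bool:
    """True if the ANSWER section holds a CNAME but no A record."""
    in_answer = False
    record_types = []
    for line in dig_output.splitlines():
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            # A blank line or the next ";;" header ends the section.
            if not line.strip() or line.startswith(";"):
                break
            fields = line.split()
            if len(fields) >= 5:
                record_types.append(fields[3])  # name, TTL, class, TYPE, rdata
    return "CNAME" in record_types and "A" not in record_types
```

Feeding it the saved output of each dig run (for example from a cron
job) would show when the CNAME-only responses begin.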

Here are the most interesting details:

- We have only observed this happening when querying the anycast
address, not the address associated with the Ethernet interface.
- The behavior is independent of the network. We can replicate it by
querying the anycast address from the server itself.
- Our production (non-anycast) servers run the exact same version of
Bind with the exact same configuration, and we have never observed this
problem.
- Bind's debugging output is exactly the same in both cases, so
it offers no clues about the difference in responses.
- After restarting Bind, the problem goes away for several hours. It
returns only if the server receives query traffic during those hours;
otherwise the problem does not happen.
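
To pin down when the behavior starts, the anycast address and the
Ethernet interface address could be polled side by side and their
section counts compared. The following is a minimal sketch using only
the Python standard library. All function names are hypothetical, the
query is sent without EDNS, and there are no retries and no TCP
fallback, so it is only suitable for small responses like the ones
shown above.

```python
import socket
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Encode a recursive (RD=1) query for `name`, class IN, type A by default."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def section_counts(response: bytes) -> dict:
    """Read the four section counts from the 12-byte DNS header."""
    _id, _flags, qd, an, ns, ar = struct.unpack("!HHHHHH", response[:12])
    return {"question": qd, "answer": an, "authority": ns, "additional": ar}


def probe(server: str, name: str, timeout: float = 2.0) -> dict:
    """Send one UDP query to `server` and return the response's section counts."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _addr = sock.recvfrom(4096)
    return section_counts(data)


def responses_differ(anycast_counts: dict, unicast_counts: dict) -> bool:
    """True if the two servers returned a different number of answer records."""
    return anycast_counts["answer"] != unicast_counts["answer"]
```

Calling probe("192.168.32.32", "www.uoregon.edu") and the same for the
interface address every minute or so, and logging a timestamp whenever
responses_differ() returns True, would give a rough time of onset to
correlate with query volume.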

Here's the options section of the config:

options {
   version "9999.9.9";
   recursive-clients 5000;
   directory "/etc/named";
   allow-transfer { none; };
   blackhole { attackers; };
   listen-on-v6 { any; };
   allow-recursion { customers; };
   allow-query { any; };
   dnssec-enable yes;
   dnssec-validation yes;

};

Bind is listening on the anycast address (in addition to its NIC IP
address):

# netstat -lnp  |grep 192.168.32.32
tcp        0      0 192.168.32.32:53            0.0.0.0:*                   LISTEN      30771/named
udp        0      0 192.168.32.32:53            0.0.0.0:*                               30771/named

These are the details of our Bind daemon (custom-built RPM, based on
Fedora's source RPM):

# named -V
BIND 9.8.0-RedHat-9.8.0-4.uopel5 built with
'--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu'
'--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr'
'--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
'--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
'--libdir=/usr/lib64' '--libexecdir=/usr/libexec'
'--sharedstatedir=/usr/com' '--mandir=/usr/share/man'
'--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var'
'--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static'
'--disable-openssl-version-check' '--enable-exportlib'
'--with-export-libdir=/usr/lib64'
'--with-export-includedir=/usr/include'
'--includedir=/usr/include/bind9' 'build_alias=x86_64-redhat-linux-gnu'
'host_alias=x86_64-redhat-linux-gnu'
'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS=
-DDIG_SIGCHASE' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic' 'FFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic'
using OpenSSL version: OpenSSL 0.9.8e-rhel5 01 Jul 2008
using libxml2 version: 2.6.26

# uname -a
Linux adns1 2.6.18-238.9.1.el5 #1 SMP Fri Mar 18 12:42:39 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

I would really appreciate any help with this.

Thanks in advance,