[dns-operations] Bind 9.8.0 intermittent problem with non-recursive responses

Carlos Vicente cvicente.lists at gmail.com
Thu May 19 22:45:37 UTC 2011


Hi Patrick,

This is interesting. I just realized that the problem is not exclusive
to my anycast servers. I noticed that my authoritative-only servers
were not returning the ADDITIONAL section either, so I restarted BIND,
and they started doing so.

So this does look more clearly like some kind of bug in BIND. I'll try
to open a case with ISC.

Thanks for your reply.

cv

On Thu, May 19, 2011 at 11:49 AM, Patrick, Robert (CONTR)
<Robert.Patrick at hq.doe.gov> wrote:
> Carlos,
>
> I've observed the same behavior with BIND 9.8.0 running on an ordinary IPv4 address assigned to an Ethernet interface, not using a loopback alias with anycast.  The odds are good that this is a software bug in BIND.  I observed the same behavior on two nearly identical platforms, while on two others I have not run into these issues.
>
> As best I could determine, the problem became apparent after some duration of runtime and/or some number or volume of queries.  On servers that only handle inside "trusted" users I've not seen the problem at all, and they're still running 9.8.0 today.  On external Internet-facing servers, where the problem was triggered almost daily, we rolled back to 9.7.x until a fix is released (or until 9.8.1, when we'll try again).
>
> FYI, server O/S in my case is CentOS 5.6 32-bit, should be equivalent to Red Hat.
>
> Hopefully an ISC POC will contact you directly.  Send configs and they'll probably assist in debugging.
>
> -----Original Message-----
> From: dns-operations-bounces at lists.dns-oarc.net [mailto:dns-operations-bounces at lists.dns-oarc.net] On Behalf Of Carlos Vicente
> Sent: Thursday, May 19, 2011 1:58 PM
> To: bind-users at lists.isc.org; dns-operations at lists.dns-oarc.net
> Subject: [dns-operations] Bind 9.8.0 intermittent problem with non-recursive responses
>
> Dear lists [apologies if you receive two copies of this message],
>
> I am in the process of implementing anycast recursive DNS service for
> our campus using a combination of servers running Bind 9.8.0 and Cisco's
> IP SLA feature. There are three identical Red Hat servers connected to
> three different routers with point-to-point /30 links. The servers are
> configured with an anycast address attached to an alias of the loopback
> interface:
>
> [note: these are not the actual IP addresses]
>
> lo:1      Link encap:Local Loopback
>          inet addr:192.168.32.32  Mask:255.255.255.255
>          UP LOOPBACK RUNNING  MTU:16436  Metric:1
>
> These caching servers are also configured as stealth slaves for our
> zones (using Bind's 'also-notify' option in our master). This allows us
> to serve the latest contents of our zones without having to wait for
> TTLs to expire.
>
> In our tests, we've come across a very interesting but annoying problem.
> After several hours of operation, the servers start to respond to CNAME
> queries in an inconsistent manner. For example:
>
> # dig @192.168.32.32 www.uoregon.edu
>
> ; <<>> DiG 9.8.0-RedHat-9.8.0-4.uopel5 <<>> @192.168.32.32 www.uoregon.edu
> ; (1 server found)
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14280
> ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 6, ADDITIONAL: 4
>
> ;; QUESTION SECTION:
> ;www.uoregon.edu.               IN      A
>
> ;; ANSWER SECTION:
> www.uoregon.edu.        600     IN      CNAME   uowc-www.uoregon.edu.
> uowc-www.uoregon.edu.   86400   IN      A       192.168.142.125
>
> ;; AUTHORITY SECTION:
> uoregon.edu.            86400   IN      NS      phloem.uoregon.edu.
> uoregon.edu.            86400   IN      NS      bigdog.lsu.edu.
> uoregon.edu.            86400   IN      NS      sns-pb.isc.org.
> uoregon.edu.            86400   IN      NS      arizona.edu.
> uoregon.edu.            86400   IN      NS      ruminant.uoregon.edu.
> uoregon.edu.            86400   IN      NS      dns.cs.uoregon.edu.
>
> ;; ADDITIONAL SECTION:
> phloem.uoregon.edu.     86400   IN      A       192.168.32.35
> phloem.uoregon.edu.     86400   IN      AAAA    2001:468:d01:20::80df:2023
> ruminant.uoregon.edu.   86400   IN      A       192.168.60.22
> ruminant.uoregon.edu.   86400   IN      AAAA    2001:468:d01:3c::80df:3c16
>
> ;; Query time: 0 msec
> ;; SERVER: 192.168.32.32#53(192.168.32.32)
> ;; WHEN: Wed May 18 12:51:06 2011
> ;; MSG SIZE  rcvd: 300
>
>
> # dig @192.168.32.32 www.uoregon.edu
>
> ; <<>> DiG 9.8.0-RedHat-9.8.0-4.uopel5 <<>> @192.168.32.32 www.uoregon.edu
> ; (1 server found)
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34776
> ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;www.uoregon.edu.               IN      A
>
> ;; ANSWER SECTION:
> www.uoregon.edu.        600     IN      CNAME   uowc-www.uoregon.edu.
>
>
> As you can see, the second response does not include the AUTHORITY or
> the ADDITIONAL sections. This causes our users' machines to fail
> to resolve the A records because the resolver library does not query a
> second time. This second type of response appears to be the server
> acting as an authoritative-only server, not as a caching recursive server.
>
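
One way to tell the two kinds of response apart automatically is to check whether the ANSWER section follows the CNAME all the way to an address record. The sketch below is only an illustration and is not part of the original report: it assumes dig's default text output has been captured as a string, and the function names parse_answer_section and chain_is_complete are made up for this example.

```python
def parse_answer_section(dig_output):
    """Return (owner, rtype, rdata) tuples from the ANSWER SECTION of dig text output."""
    records = []
    in_answer = False
    for raw in dig_output.splitlines():
        line = raw.strip()
        if line.startswith(";; ANSWER SECTION"):
            in_answer = True
            continue
        if not in_answer:
            continue
        # A blank line or another ";" comment line ends the section.
        if not line or line.startswith(";"):
            in_answer = False
            continue
        fields = line.split()
        if len(fields) >= 5:
            owner, _ttl, _cls, rtype = fields[:4]
            rdata = " ".join(fields[4:])
            records.append((owner.lower().rstrip("."), rtype.upper(), rdata))
    return records


def chain_is_complete(records, qname):
    """Follow CNAMEs from qname; True only if an A or AAAA record is reached."""
    name = qname.lower().rstrip(".")
    seen = set()
    while name not in seen:  # the seen set guards against CNAME loops
        seen.add(name)
        if any(o == name and t in ("A", "AAAA") for o, t, _ in records):
            return True
        targets = [rd for o, t, rd in records if o == name and t == "CNAME"]
        if not targets:
            return False
        name = targets[0].lower().rstrip(".")
    return False
```

Fed the ANSWER sections of the two responses quoted above, this check should pass for the first and fail for the second, which is the case where a client that does not re-query is left without an address.
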
> Here are the most interesting details:
>
> - We have only observed this happening when querying the anycast
> address, not the address associated with the ethernet interface.
> - The behavior is independent of the network. We can replicate it by
> querying the anycast address from the server itself.
> - Our production (non-anycast) servers run the exact same version of
> Bind with the exact same configuration, and we have never observed this
> problem.
> - Bind's debugging output is exactly the same in both cases, so
> it offers no clues about the difference in responses.
> - After restarting Bind, the problem goes away for several hours. It
> recurs only if the server receives query traffic during those hours;
> otherwise the problem does not happen.
>
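
Because the truncated responses only appear after some hours of query traffic, one way to catch the onset is to compare the section counts in dig's header against a response captured right after a restart. This is a rough sketch under that assumption; the names section_counts and sections_dropped are hypothetical, and the regular expression only targets the "QUERY: n, ANSWER: n, ..." line as it appears in the output above.

```python
import re

# Matches the counts on dig's ";; flags: ..." header line.
COUNTS_RE = re.compile(
    r"QUERY: (\d+), ANSWER: (\d+), AUTHORITY: (\d+), ADDITIONAL: (\d+)"
)


def section_counts(dig_output):
    """Return the four section counts as a dict, or None if no header is found."""
    match = COUNTS_RE.search(dig_output)
    if match is None:
        return None
    keys = ("query", "answer", "authority", "additional")
    return dict(zip(keys, (int(n) for n in match.groups())))


def sections_dropped(baseline, current):
    """List sections that were populated in baseline but are empty in current."""
    return [
        key
        for key in ("answer", "authority", "additional")
        if baseline[key] > 0 and current[key] == 0
    ]
```

A non-empty result for the same query against the same server would flag the behavior described here. Note that dig versions which send EDNS by default include the OPT pseudo-record in the ADDITIONAL count, so the zero test may need adjusting there.
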
> Here's the options section of the config:
>
> options {
>   version "9999.9.9";
>   recursive-clients 5000;
>   directory "/etc/named";
>   allow-transfer { none; };
>   blackhole { attackers; };
>   listen-on-v6 { any; };
>   allow-recursion { customers; };
>   allow-query { any; };
>   dnssec-enable yes;
>   dnssec-validation yes;
>
> };
>
>
> Bind is listening on the anycast address (in addition to its NIC IP
> address):
>
> # netstat -lnp  |grep 192.168.32.32
> tcp        0      0 192.168.32.32:53            0.0.0.0:*                   LISTEN      30771/named
> udp        0      0 192.168.32.32:53            0.0.0.0:*                               30771/named
>
> These are the details of our Bind daemon (custom-built RPM, based on
> Fedora's source RPM):
>
> # named -V
> BIND 9.8.0-RedHat-9.8.0-4.uopel5 built with
> '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu'
> '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr'
> '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
> '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
> '--libdir=/usr/lib64' '--libexecdir=/usr/libexec'
> '--sharedstatedir=/usr/com' '--mandir=/usr/share/man'
> '--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var'
> '--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static'
> '--disable-openssl-version-check' '--enable-exportlib'
> '--with-export-libdir=/usr/lib64'
> '--with-export-includedir=/usr/include'
> '--includedir=/usr/include/bind9' 'build_alias=x86_64-redhat-linux-gnu'
> 'host_alias=x86_64-redhat-linux-gnu'
> 'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall
> -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> --param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS=
> -DDIG_SIGCHASE' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic' 'FFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic'
> using OpenSSL version: OpenSSL 0.9.8e-rhel5 01 Jul 2008
> using libxml2 version: 2.6.26
>
> # uname -a
> Linux adns1 2.6.18-238.9.1.el5 #1 SMP Fri Mar 18 12:42:39 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
>
> # cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 5.6 (Tikanga)
>
>
> I would really appreciate any help with this.
>
> Thanks in advance,


