[dns-operations] Intermittent failure on slave zone

Tue Mar 1 06:57:03 UTC 2022

I'm stuck here.

I have Bind 9.16 configured on Ubuntu 20.04. The idea is for it to act 
as recursor and cache for all servers on the internal network. Also it 
is configured as secondary for internal zones. The primary nameserver 
does not not recieve any queries from clients - it's a hidden master. 
SOA records point to the secondary nameserver.

Also I have a caching nameserver to handle queries to the internet.

All external queries work. At times it can be rather busy handling 
between 600000 to 1000000 requests over 5 minutes with no problems.

But queries on internal domains fail from from time to time and I have a 
hard time figuring out why.

rndc dumpdb -zones shows the internal zones - all hosts are there. 
Changes to a zone on the master server is visible on the secondary 
server after the zone transfers.

The setup:

Master server (Hidden,internal zones) 10.100.10.7
      |
      |
Secondary (recursor, cache, Internal zones) 10.100.10.32
      |
      |
Cache 10.100.10.34
      |
      |
Internet

Only the secondary is known by the servers.

Config on secondary:

acl "myservers" {
         10.0.0.0/8;
};
options {
     directory "/var/cache/bind";
     forwarders {
          10.100.10.34;
     };
     dnssec-validation auto;
     recursion yes;
     empty-zones-enable no;
         allow-recursion {
                 localhost;
                 agillicservers;
         };
         listen-on port 53 {
                 localhost;
                 0.0.0.0;
         };
         allow-query {
                 localhost;
                 myservers;
         };
         allow-transfer {
                 none;
         };
};

Logging is configured:

logging {
         channel b_log {
                 file "/var/log/named/bind.log" versions 20 size 20m;
                 print-time yes;
                 print-category yes;
                 print-severity yes;
                 severity debug 3;
         };

         channel b_query {
                 file "/var/log/named/query.log" versions 20 size 100m;
                 print-time yes;
                 severity debug 3;
         };
         category default { b_log; };
         category config { b_log; };
         category queries { b_query; };
};

Slave config:

zone "int.myzone.eu"  {
         type slave;
         file "int.myzone.eu.zone";
         masters {
                 10.100.10.7;
         };
     allow-transfer {
                 10.100.10.7;
         };
};

zone "myzone.eu"  {
         type slave;
         file "myzone.eu.zone";
         masters {
                 10.100.10.7;
         };
     allow-transfer {
                 10.100.10.7;
         };
};

Most of the time it works but once or twice during the day suddenly a 
query fails for a while. Maybe 15 seconds - maybe a minute. I'm not sure 
how long time it takes before it works again. It could be a query for 
influx.int.myzone.eu - an internal host all the servers use all the time.

We have extensive logging on applications that rely on DNS, so errors 
are visible almost immediately. But even if I'm actively monitoring the 
errors, I cannot reproduce the error with dig on the commandline - which 
makes sense, since queries again are getting the correct response after 
a very short while.

Often I see subsequent queries for influx.int.myzone.eu.myzone.eu. That 
makes sense, but I cannot figure out why it fails in the first place. I 
see nothing in the logs. It happens also when the secondary server is 
almost idle, so I doubt it has anything to do with load.

As far as I can see, requests to the internal zones are not cached. It 
makes sense since the secondary server has the zone in memory already.

Is there an error log I haven't discovered yet? Any pointers are much 
appreciated.

Best regards!