[dns-operations] Intermittent failure on slave zone
Kristian Vilmann
kristian.vilmann at agillic.com
Tue Mar 1 06:57:03 UTC 2022
I'm stuck here.
I have Bind 9.16 configured on Ubuntu 20.04. The idea is for it to act
as recursor and cache for all servers on the internal network. Also it
is configured as secondary for internal zones. The primary nameserver
does not not recieve any queries from clients - it's a hidden master.
SOA records point to the secondary nameserver.
Also I have a caching nameserver to handle queries to the internet.
All external queries work. At times it can be rather busy handling
between 600000 to 1000000 requests over 5 minutes with no problems.
But queries on internal domains fail from from time to time and I have a
hard time figuring out why.
rndc dumpdb -zones shows the internal zones - all hosts are there.
Changes to a zone on the master server is visible on the secondary
server after the zone transfers.
The setup:
Master server (Hidden,internal zones) 10.100.10.7
|
|
Secondary (recursor, cache, Internal zones) 10.100.10.32
|
|
Cache 10.100.10.34
|
|
Internet
Only the secondary is known by the servers.
Config on secondary:
acl "myservers" {
10.0.0.0/8;
};
options {
directory "/var/cache/bind";
forwarders {
10.100.10.34;
};
dnssec-validation auto;
recursion yes;
empty-zones-enable no;
allow-recursion {
localhost;
agillicservers;
};
listen-on port 53 {
localhost;
0.0.0.0;
};
allow-query {
localhost;
myservers;
};
allow-transfer {
none;
};
};
Logging is configured:
logging {
channel b_log {
file "/var/log/named/bind.log" versions 20 size 20m;
print-time yes;
print-category yes;
print-severity yes;
severity debug 3;
};
channel b_query {
file "/var/log/named/query.log" versions 20 size 100m;
print-time yes;
severity debug 3;
};
category default { b_log; };
category config { b_log; };
category queries { b_query; };
};
Slave config:
zone "int.myzone.eu" {
type slave;
file "int.myzone.eu.zone";
masters {
10.100.10.7;
};
allow-transfer {
10.100.10.7;
};
};
zone "myzone.eu" {
type slave;
file "myzone.eu.zone";
masters {
10.100.10.7;
};
allow-transfer {
10.100.10.7;
};
};
Most of the time it works but once or twice during the day suddenly a
query fails for a while. Maybe 15 seconds - maybe a minute. I'm not sure
how long time it takes before it works again. It could be a query for
influx.int.myzone.eu - an internal host all the servers use all the time.
We have extensive logging on applications that rely on DNS, so errors
are visible almost immediately. But even if I'm actively monitoring the
errors, I cannot reproduce the error with dig on the commandline - which
makes sense, since queries again are getting the correct response after
a very short while.
Often I see subsequent queries for influx.int.myzone.eu.myzone.eu. That
makes sense, but I cannot figure out why it fails in the first place. I
see nothing in the logs. It happens also when the secondary server is
almost idle, so I doubt it has anything to do with load.
As far as I can see, requests to the internal zones are not cached. It
makes sense since the secondary server has the zone in memory already.
Is there an error log I haven't discovered yet? Any pointers are much
appreciated.
Best regards!
More information about the dns-operations
mailing list