[dns-operations] BIND, Knot and NSD behaviour when serial number goes backwards

Sun Feb 19 11:27:04 UTC 2017

Hello folks,

We run a mixture of BIND, Knot and NSD on our name servers, and
sometimes this gives us the opportunity to see how differently they
behave in corner cases. We've had one today. The serial number of a
zone that we slave went backwards, from 2017021712 to 2017021701.

Here's what BIND 9.10 did:

...
...
19-Feb-2017 06:33:04.547 general: zone va/IN/main: serial number
(2017021701) received from master 193.0.19.190#53 < ours (2017021712)
19-Feb-2017 07:25:46.495 general: zone va/IN/main: expired
19-Feb-2017 07:25:46.559 general: zone va/IN/main: Transfer started.
19-Feb-2017 07:25:46.572 general: zone va/IN/main: transferred serial
2017021701: TSIG 'main.ripe.net'

Here's what Knot 2.3 did:

...
...
2017-02-18T20:59:21 info: [va.] refresh, outgoing, 193.0.19.190 at 53: zone
is up-to-date
2017-02-19T00:59:21 info: [va.] refresh, outgoing, 193.0.19.190 at 53: zone
is up-to-date
2017-02-19T04:59:21 info: [va.] refresh, outgoing, 193.0.19.190 at 53: zone
is up-to-date
2017-02-19T08:59:21 info: [va.] refresh, outgoing, 193.0.19.190 at 53: zone
is up-to-date

And here's what NSD 4.1 did:

...
...
[2017-02-19 07:44:52.590] nsd[4756]: info: xfrd: zone va. ignoring old
serial from 193.0.19.190
[2017-02-19 07:44:52.590] nsd[4756]: info: xfrd: zone va. bad transfer 0
from 193.0.19.190
[2017-02-19 07:44:55.660] nsd[4756]: error: xfrd: zone va. has expired

When BIND sees a lower serial number, it ignores it, treats that as a
failure to refresh, and retries using the retry timer (which is 1 hour
for this zone). Eventually, it expires the zone, at which point it
pragmatically disregards its old copy, retransfers the zone, and recovers.
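
For reference, "lower" here is normally judged with the serial number
arithmetic of RFC 1982 rather than a plain integer comparison, so that
serials can wrap around 2^32. A minimal sketch of that comparison
follows. The function name serial_lt is made up for illustration and is
not taken from any of the three servers' code:

```python
# Sketch of RFC 1982 serial number comparison for 32-bit SOA serials.
# serial_lt is a hypothetical helper name, not an API of BIND, Knot or NSD.
SERIAL_BITS = 32
HALF = 2 ** (SERIAL_BITS - 1)


def serial_lt(s1: int, s2: int) -> bool:
    """Return True if s1 is 'less than' s2 in RFC 1982 serial arithmetic."""
    return (s1 < s2 and s2 - s1 < HALF) or (s1 > s2 and s1 - s2 > HALF)


ours = 2017021712    # serial held by the secondary
theirs = 2017021701  # serial now offered by the master

# The master's serial is older than ours, so a refresh is not a valid update.
print(serial_lt(theirs, ours))  # True
```

The two serials are only 11 apart, so there is no wrap-around ambiguity:
the master's serial is simply older than the one the secondaries hold.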

Knot just thinks there's nothing to do, logs the zone as up-to-date, and
happily chugs along, blissfully unaware.

NSD, like BIND, ignores the lower serial number, and keeps trying to
refresh, but on a somewhat more irregular schedule (I think it
deliberately slows down, to avoid DoSing the master in case the retry
timer is too small). Eventually, it gives up and expires the zone, but
does not attempt to retransfer. It starts to answer with SERVFAIL.

As an operator, in order to fix this, I have to force a transfer for
both Knot and NSD, like this:

knotc zone-retransfer va.
nsd-control force_transfer va.
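
Since a secondary may report such a zone as up-to-date, it can help to
compare SOA serials from the outside. The following is only a sketch of
the classification step, assuming the two serials have already been
fetched (for example with SOA queries against the master and the
secondary). The function names and the three labels are made up for
illustration:

```python
# Sketch: classify a master/secondary serial pair from a monitoring check.
# serial_delta and classify are hypothetical names; fetching the serials
# (e.g. via SOA queries) is deliberately left out.
MOD = 2 ** 32
HALF = 2 ** 31


def serial_delta(master: int, secondary: int) -> int:
    """Signed distance from secondary to master in 32-bit serial space."""
    diff = (master - secondary) % MOD
    if diff == HALF:
        # RFC 1982 leaves the ordering undefined for this case.
        raise ValueError("serials are exactly 2^31 apart")
    return diff - MOD if diff > HALF else diff


def classify(master: int, secondary: int) -> str:
    delta = serial_delta(master, secondary)
    if delta == 0:
        return "in sync"
    if delta > 0:
        return "secondary behind"
    return "master went backwards"


print(classify(2017021701, 2017021712))  # master went backwards
```

A "secondary behind" result is normal for a short while after an update;
it is the "master went backwards" case that calls for a forced transfer
on servers that do not recover by themselves.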

BIND's behaviour here is the most pragmatic, because it recovers
automatically. NSD's behaviour is also fine, in my opinion, because this
really is an error condition that requires some intervention. Knot's
behaviour is probably the worst of the three, because it is blissfully
unaware of the problem.

The pluses and minuses of these behaviours can of course be debated,
and I'm sure there would be many opinions. I personally prefer the NSD
behaviour. BIND's is also okay, but it sort of hides the problem (it is
only visible if you look at the logs). I'll open an issue for Knot and
see what its developers think.

Regards,
Anand Buddhdev
RIPE NCC