[dns-operations] BIND, Knot and NSD behaviour when serial number goes backwards

Mon Feb 20 11:31:46 UTC 2017

Anand,

[ apologies for the rambling reply ]

At 2017-02-19 12:27:04 +0100
Anand Buddhdev <anandb at ripe.net> wrote:

> We run a mixture of BIND, Knot and NSD on our name servers, and
> sometimes this offers the opportunity to witness their different
> behaviours for corner cases. We've had one today. The serial number of a
> zone that we slave went backwards, from 2017021712 to 2017021701.
> When BIND sees a lower serial number, it ignores it, and considers that
> a failure to refresh, and retries using the retry timer (which is 1 hour
> for this zone). Eventually, it expires the zone, and then pragmatically
> ignores the zone content and retransfers it, and recovers.

...snip...

> Knot just thinks there's nothing to do, and happily chugs along,
> blissfully unaware.
> 
> NSD, like BIND, ignores the lower serial number, and keeps trying to
> refresh, but with a somewhat more irregular schedule (I think it
> deliberately slows down, to avoid DoSsing the master in case the retry
> timer was too small). Eventually, it gives up and expires the zone, but
> does not attempt to retransfer. It starts to SERVFAIL.

...snip...

> BIND's behaviour here is the most pragmatic, because it recovers
> automatically. NSD's behaviour is also fine, in my opinion, because this
> really is an error condition that requires some intervention. Knot's
> behaviour is probably the worst of the three, because it is blissfully
> unaware of the problem.
> 
> The plusses and minuses of these behaviours can of course be debated,
> and I'm sure there would be many opinions. I personally prefer the NSD
> behaviour. BIND's is also okay, but it sort of hides the problem (only
> visible if you look at logs). Knot's behaviour is probably the worst.
> I'll open an issue and see what its developers think.

BIND's behavior is correct, I think. There might be some details to
look at regarding which masters are checked, but it implements the
protocol as designed, and should be workable for administrators.

Some discussion...

First, to be clear, we cannot rely on NOTIFY very much to help us. Not
all servers have NOTIFY configured, plus not all servers use TSIG to
secure NOTIFY (indeed I vaguely remember BIND 9 not supporting TSIG on
NOTIFY packets, although I see ways to configure it in the BIND 9 ARM
now). So while the serial number in a NOTIFY packet might be a
helpful hint, we have to solve the problem as if we did not get NOTIFY.

I think there are two separate, related, behaviors here.

1. What do you do when your zone passes the EXPIRE timeout?
   According to RFC 1034: 

   "If the secondary finds it impossible to perform a serial check for
   the EXPIRE interval, it must assume that its copy of the zone is
   obsolete an discard it."

   I'm not exactly sure what "discard" means. Presumably that means that
   you should start responding with... REFUSED? Or SERVFAIL?

   Honestly I think that there is no good behavior here. In some cases
   you may prefer REFUSED or SERVFAIL if the zone is stale, in other
   cases you may prefer to serve the last known good version of the
   zone. (And of course eventually DNSSEC will break a zone.)

   I think the recommendation for a zone administrator who wants to
   keep serving a zone no matter what would be to just set EXPIRE to
   68 years.

2. How do you know if the serial goes backwards?
   Knowing if a serial has gone backwards in the general case is hard.

   One issue here is that you may have multiple masters, with different
   versions of the zone.

   Imagine you have two servers:

   time 0: server A updates to serial 10
   time 1: server A updates to serial 20
   time 2: server B updates to serial 10
   time 3: server A updates to serial 30
   time 4: server B updates to serial 20
   time 5: server B updates to serial 30

   If a slave is at version 30 and queries server B when it is at
   version 10, did the serial go backwards? From the point of view of
   server A... no. From the point of view of server B... no. 

   The only way to know if a serial has gone backwards is to check all
   masters. Even then, it may be problematic. For example, several root
   server clusters will return different serials for the root zone for
   repeated queries. This is expected behavior in a load-balancing
   cluster, since SOA queries are just "normal" queries and can be
   answered by any node. (Of course, this affects slaves looking for
   new versions, but eventually every node in the cluster will return
   a new version of the zone.)
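
   As a rough illustration of the two-master timeline above, here is a
   small Python sketch. The names (EVENTS, serial_at, views,
   is_monotonic) are made up for this example, and it uses plain
   integer comparison, ignoring serial wraparound.

```python
# Timeline from the example: (time, server, new serial).
EVENTS = [
    (0, "A", 10), (1, "A", 20), (2, "B", 10),
    (3, "A", 30), (4, "B", 20), (5, "B", 30),
]

def serial_at(server, time):
    """Serial a server reports at `time` (None before its first update)."""
    serial = None
    for t, s, value in EVENTS:
        if s == server and t <= time:
            serial = value
    return serial

def views(time):
    """What each master would answer to an SOA query at `time`."""
    return {server: serial_at(server, time) for server in ("A", "B")}

def is_monotonic(server):
    """True if this server's own serials never decrease over the timeline."""
    serials = [v for _, s, v in EVENTS if s == server]
    return serials == sorted(serials)

# At time 3 a slave that already has serial 30 (from A) sees 10 on B,
# although neither master ever moved its own serial backwards.
print(views(3), is_monotonic("A"), is_monotonic("B"))
```

   Only by comparing the answers from all masters can the slave tell
   that B is merely behind rather than rolled back.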

My own take on it is that slaves should:

1. Try all masters until they find a new version of the zone, whenever
   they are refreshing. (This is just to get new versions as quickly as
   possible.)

2. Expire a zone after the EXPIRE timeout, but then immediately try to
   AXFR a version from all masters. Or perhaps try an AXFR beforehand
   and expire the zone slightly in advance to avoid ever having an
   expired zone.
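
The two rules above could be sketched roughly as follows. This is only
an illustration, not how any of the three servers implement it: the
function names are hypothetical, the dict of master serials stands in
for real SOA queries (None meaning "unreachable"), and serials are
compared with RFC 1982 arithmetic.

```python
def serial_gt(a, b):
    """True if 32-bit serial a is newer than b under RFC 1982 arithmetic."""
    return a != b and ((a - b) % 2**32) < 2**31

def pick_master(current, master_serials):
    """Rule 1: check every master, return one with a newer serial (or None)."""
    best_name, best_serial = None, current
    for name, serial in master_serials.items():
        if serial is not None and serial_gt(serial, best_serial):
            best_name, best_serial = name, serial
    return best_name

def on_expire(master_serials):
    """Rule 2: after expiry, transfer from any reachable master,
    whatever serial it has."""
    for name, serial in master_serials.items():
        if serial is not None:
            return name
    return None

# Serials from the original report: the master went back to 2017021701.
masters = {"m1": 2017021701}
print(pick_master(2017021712, masters))  # no newer version, refresh fails
print(on_expire(masters))                # after expiry, retransfer anyway
```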

If you are running a zone and really want your serial to go backwards,
you can do this in two steps by using the bizarre serial number
arithmetic and increasing the serial number twice. Or just force a
transfer, like you are doing. :)
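
For the curious, a minimal Python sketch of the two-step trick, using
the serials from the original report. It assumes RFC 1982 comparison on
32-bit serials, where a single step may add at most 2^31 - 1; the
helper names are made up for this example.

```python
MOD = 2**32          # SOA serials are 32-bit
MAX_STEP = 2**31 - 1 # largest increment that still counts as "newer"

def serial_gt(a, b):
    """True if serial a is newer than b under RFC 1982 arithmetic."""
    return a != b and ((a - b) % MOD) < 2**31

def intermediate_serial(old):
    """Step 1: jump as far forward as one increment allows."""
    return (old + MAX_STEP) % MOD

old, target = 2017021712, 2017021701
mid = intermediate_serial(old)

# A direct change from old to target looks like going backwards...
print(serial_gt(target, old))
# ...but old -> mid and mid -> target are both "increases".
print(serial_gt(mid, old), serial_gt(target, mid))
```

Each step has to reach all slaves before the next one is published,
otherwise a slave that missed the intermediate serial still sees the
final one as older.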

Cheers,

--
Shane