[dns-operations] DNS zone monitoring

Mon Jun 14 06:19:33 UTC 2010

On Sun, 13 Jun 2010 23:29:22 -0500 (CDT),
Joe Greco <jgreco at ns.sol.net> wrote:

> > On 2010-06-13, at 22:56, Joe Greco wrote:
> > 
> > > I was just in a discussion elsewhere that brought up an old topic:
> > > 
> > > How do people monitor for secondary servers that are having
> > > trouble updating a zone from the master?
> > 
> > We direct an apex/IN/SOA query to all servers for each zone we are 
> > checking, and if we see inconsistent serial numbers we sound alarms.
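The serial comparison described above could be sketched roughly as
follows. This is only an illustration, not anyone's actual tooling: it
assumes the SOA records have already been fetched (for instance with
"dig +short SOA <zone> @<server>", whose one-line output is mname,
rname, serial, refresh, retry, expire, minimum), and the function names
parse_serial and lagging_servers are made up for this example.

```python
def parse_serial(dig_short_soa):
    """Return the serial from one line of `dig +short SOA` output.

    Expected field order: mname rname serial refresh retry expire minimum.
    """
    fields = dig_short_soa.split()
    if len(fields) != 7:
        raise ValueError("not a one-line SOA record: %r" % dig_short_soa)
    return int(fields[2])


def lagging_servers(serials):
    """Given {server: serial}, return the servers behind the highest serial.

    Assumption: plain integer comparison, so serial wraparound
    (RFC 1982 arithmetic) is deliberately ignored in this sketch.
    """
    if not serials:
        return {}
    newest = max(serials.values())
    return {name: s for name, s in serials.items() if s != newest}


if __name__ == "__main__":
    # Hypothetical data; the server names and serials are placeholders.
    observed = {
        "ns1.example.com": parse_serial(
            "ns1.example.com. hostmaster.example.com. "
            "2010061401 3600 900 604800 86400"),
        "ns2.example.com": 2010061301,
    }
    print(lagging_servers(observed))
```

An empty result means all servers agree, which, as noted in the thread,
says nothing about whether a secondary can still reach its master.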
> 
> Yes, but that's only useful if your SOA's are changing.  For many
> zones, there's no need for the serials to change.  Besides, I already
> indicated we did that.  :-)
> 
> > > Obviously, we do all the normal sanity checks (SOA's match, etc)
> > > but other than monitoring the log file and watching for errors
> > > such as
> > 
> > If SOA serials match then no zone transfers will happen and you
> > have no errors to look for.
> 
> False; I'm *specifically* interested in a case where that's clearly
> not the case, which would be a secondary which has become unable to
> reach its master (think: because of a firewall rule, etc., that was
> inadvertently and sloppily applied).  In such a case, the secondary
> will wait for possibly as long as an entire $refresh period before
> initiating a check, but once that happens and it cannot reach the
> master, you only get $expire more seconds.  This means that your zone
> lives for something between $expire and ($refresh + $expire) before
> going all kablooey, and in the meantime serving either outdated or
> current information, depending on whether or not the master's data
> has been twiddled.
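To put numbers on that window, here is a small worked example. It
follows the model described in the paragraph above (the expire
countdown starts at the first failed check); a server that counts
$expire from the last successful refresh would instead give a window
of ($expire - $refresh) to $expire. The SOA values used are made-up
placeholders, and expiry_window is a name invented for this sketch.

```python
def expiry_window(refresh, expire):
    """Return (earliest, latest) seconds from losing the master to expiry.

    Model taken from the message above: the secondary notices the
    failure somewhere between 0 and `refresh` seconds after the master
    becomes unreachable, then serves the zone for `expire` more seconds.
    """
    return expire, refresh + expire


if __name__ == "__main__":
    # Hypothetical SOA timers: refresh of 1 hour, expire of 7 days.
    refresh, expire = 3600, 604800
    earliest, latest = expiry_window(refresh, expire)
    print("zone expires between %d and %d seconds after the break"
          % (earliest, latest))
    print("that is %.2f to %.2f days" % (earliest / 86400, latest / 86400))
```

With timers like these the zone keeps answering for about a week with
no externally visible symptom, which is exactly the cliff being
described.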
> 
> In the case the master's data is updated, that failure is detectable,
> given reasonable polling and SOA values, but what about the case where
> the zone isn't changing?  It seems to be working... seems to be
> working.. then suddenly falls off a cliff when it expires and is
> suddenly not working.  That's a train wreck if your nameservers all
> go in short order and stop serving a zone.
> 
> Relying on monitoring the logfiles for such failures seems like a
> kludge, but I'm not seeing the alternative.  In particular, I cannot
> think of any way in which the remaining expiration period for a zone
> is usefully exposed, say something along the lines of what a TTL is.
> 
> > If you want to test the transfer machinery, then arrange for the
> > SOA serial number to be increased regularly and look for signs that
> > the updated zones are not being served where they should be.
> > Depending on the signature validity periods and re-signing
> > intervals you choose, simply signing your zones might be enough to
> > provide SOA serial increases sufficient to do useful monitoring.
> 
> That strikes me as ugly and dangerous in some ways (though possibly
> offset as an improvement in others); it's akin to writing all your
> web pages in PHP even though most of them could be served up with a
> static page.  You introduce other possible failures into the mix.
> 
> I thought maybe I was missing something, but it seems to me (and I've
> spent a little time looking now) that there's no real method for this
> and nobody's really doing it.  I guess I should just not be so
> paranoid and let this bug me for another ten years...?  Heh.
> 
> ... JG

So how about running rndc retransfer on your slaves every
$refresh/10? That way you would see timeouts etc. well before $refresh
or $expire has elapsed.
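A minimal sketch of that schedule in Python, under a few assumptions:
rndc is on the PATH and already configured to talk to the local
server, and the helper names (retransfer_interval, force_retransfer)
are invented here. The sketch only reports whether the rndc command
itself ran cleanly; it does not assume that a failed transfer shows up
in rndc's exit status, so the server log or a follow-up SOA check
would still be needed to see the transfer result.

```python
import subprocess


def retransfer_interval(refresh_seconds, divisor=10):
    """Seconds between forced retransfers: $refresh / divisor, at least 1."""
    return max(1, refresh_seconds // divisor)


def retransfer_command(zone):
    """Argument list for `rndc retransfer <zone>`."""
    return ["rndc", "retransfer", zone]


def force_retransfer(zone, run=subprocess.run, timeout=30):
    """Run the command once; True if rndc itself exited with status 0.

    `run` is injectable so the function can be exercised without a
    name server present.
    """
    try:
        result = run(retransfer_command(zone),
                     capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # rndc missing, not executable, or it hung past the timeout.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical zone and refresh value, for illustration only.
    zone, refresh = "example.com", 3600
    print("would run %s every %d seconds"
          % (" ".join(retransfer_command(zone)),
             retransfer_interval(refresh)))
```

In practice the call would more likely sit in cron than in a
long-running loop, with a False result feeding whatever alerting is
already in place.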

Ciao
Torsten