[dns-operations] Monitoring anycast nodes for automatic route withdrawal

Wed Aug 4 19:52:18 UTC 2010

On 08/04/2010 02:16 PM, David Coulthart wrote:
> Recent outages of one of our caching recursive nameservers have reminded me how sensitive clients are when their primary nameserver is unavailable, even if the second configured nameserver is available & can easily handle the load.  This has me thinking about finally biting the bullet & convincing management to give me the time to implement anycast.  But in all my reading about configuring anycast, I've seen little discussion of how real-world implementations deal with monitoring for node failure.  The only relevant info I've found so far are a Linux Journal article [1] suggesting to execute a local test of the DNS service on each anycast node & a LUG presentation from 2006 [2] suggesting to only test that the named process is running so that a DoS attack only affects that node.
> 
> Would any of you doing this in the real world be willing to describe your monitoring setups?  Some questions I've thought of so far:
> 
>   * How frequently do you check that a node is responding?
>   * Are your checks controlled via a central monitoring system or do they operate independently on each node (e.g., via cron)?

We use a local health check in addition to external monitoring so that
we get both perspectives. The local check can down the interface and
withdraw the route if it suspects the name server process is
misbehaving. The external check just throws an alert if it suspects
something is wrong, and a human follows up on it. Giving only the local
check the ability to act prevents contention between the local and
remote checks. The local check is a daemon (a shell script in a while
loop) that runs constantly; the external checks run on a 2 minute
interval.
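
As a rough illustration of the local side of that split (not the actual
script, which is a shell loop), a polling loop might look like the
sketch below. It is written in Python so the logic can be tested; the
function name, the callables passed in, and the 5 second interval are
all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def monitor(probe: Callable[[], bool],
            withdraw: Callable[[], None],
            announce: Callable[[], None],
            interval: float = 5.0,              # assumed; not from the post
            max_iterations: Optional[int] = None,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll `probe` and act only when the health state changes.

    Runs forever when max_iterations is None, which is the daemon case.
    """
    announced = True  # assume the route is being announced at start-up
    count = 0
    while max_iterations is None or count < max_iterations:
        healthy = probe()
        if announced and not healthy:
            withdraw()        # e.g. bring the anycast interface down
            announced = False
        elif not announced and healthy:
            announce()        # e.g. bring the anycast interface back up
            announced = True
        sleep(interval)
        count += 1
```

Only this local loop is given the withdraw and announce actions. An
external monitor would run the same kind of probe but do nothing more
than raise an alert.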

>   * What criteria do you use to decide whether to announce/withdraw the route?

We check for a record in a low TTL zone we host on other servers within
similar infrastructure. Because the TTL is low the record expires from
the cache quickly, so we're not only testing the cache but also the
proper fetching of the record. If for whatever reason the query fails
repeatedly, the local and remote checks flag the system as failed.
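
A minimal sketch of such a probe, assuming Python and only the standard
library: it sends one A query over UDP, treats the node as answering
correctly only if the reply has RCODE 0 and at least one answer, and
counts consecutive failures. The record name, the threshold of 3, and
the 2 second timeout are hypothetical values, not ones given in this
post.

```python
import socket
import struct


def build_query(name: str, txid: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a minimal DNS query: RD flag set, one question, class IN."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def response_ok(reply: bytes, txid: int = 0x1234) -> bool:
    """True if the reply matches txid, is a response, has RCODE 0 and
    carries at least one answer record."""
    if len(reply) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return (rid == txid and bool(flags & 0x8000)
            and (flags & 0x000F) == 0 and ancount > 0)


def probe(server: str, name: str, timeout: float = 2.0) -> bool:
    """Send one query to `server` on UDP port 53; False on error or timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _addr = sock.recvfrom(512)
        except OSError:  # covers timeouts and ICMP errors
            return False
    return response_ok(reply)


class FailureCounter:
    """Flag the node only after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3):  # assumed threshold
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True when the node is failed."""
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.threshold
```

A caller might run probe("127.0.0.1", "healthcheck.example.net") on each
pass (both values are placeholders) and feed the result to
FailureCounter.record(). A fixed transaction ID is good enough for a
sketch; a real check would probably randomise it.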

>   * Do you take any special precautions to protect against route flapping due to a misbehaving node?  

There is a 60 second delay from the time the system begins to respond
normally to the queries to when the OSPF interface is brought back to an
"up" state. This has worked well for us as a flap prevention method.

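The 60 second delay described above could be expressed as a small
hold-down timer. This is only a sketch in Python under the assumption
that any failed check during the delay restarts the wait; the class and
method names are made up for the example.

```python
import time
from typing import Callable, Optional


class HoldDown:
    """Report 'safe to re-announce' only after continuous good health."""

    def __init__(self, delay: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.delay = delay
        self.clock = clock
        self.healthy_since: Optional[float] = None

    def update(self, healthy: bool) -> bool:
        """Feed in one check result.

        Returns True once `healthy` has held for `delay` seconds without
        interruption. Any failure resets the timer (an assumption).
        """
        if not healthy:
            self.healthy_since = None
            return False
        now = self.clock()
        if self.healthy_since is None:
            self.healthy_since = now
        return now - self.healthy_since >= self.delay
```

The interface would be brought back up only when update() returns True,
so a node that recovers and then fails again inside the window never
re-announces the route.
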
> I would be very grateful for answers to any of these questions or any other guidance or pointers to references for implementing a reliable anycast DNS infrastructure.

We too have found that if the primary resolver is not functioning it's
almost useless to have a secondary resolver listed. The delay has such
an impact on the systems (or clients) doing the resolution that they
believe there is an outage. You can adjust the resolution timeout, but
only if you have control of the resolving client.

In our setups we use FreeBSD and the disc interface driver. This permits
us to keep the anycast'ed address(es) on a virtual interface, so a
script can up/down that interface to trigger route withdrawals while
management access to the nameserver itself is unaffected by anything
the script executes.
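
For illustration, the interface handling might be wrapped as below. This
is a sketch only, kept in Python for testability: the interface name
disc0 and the address 192.0.2.53/32 (from the documentation range) are
placeholders, and whether taking the interface down actually withdraws
the route depends on how the routing daemon is configured.

```python
import subprocess
from typing import List

IFACE = "disc0"                 # hypothetical interface name
ANYCAST_ADDR = "192.0.2.53/32"  # placeholder from the documentation range


def setup_cmds(iface: str = IFACE, addr: str = ANYCAST_ADDR) -> List[List[str]]:
    """Commands to create the virtual interface and assign the address."""
    return [
        ["ifconfig", iface, "create"],
        ["ifconfig", iface, "inet", addr],
    ]


def state_cmd(up: bool, iface: str = IFACE) -> List[str]:
    """Command to mark the anycast interface up or down."""
    return ["ifconfig", iface, "up" if up else "down"]


def run(cmd: List[str]) -> None:
    """Execute one command, raising CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)
```

A health check would call run(state_cmd(False)) to take the address out
of service and run(state_cmd(True)) to restore it. The management
address lives on a different interface and is never touched.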

There have been other articles that use the IP SLA feature to perform a
"health check" of sorts. I dislike that method because it requires a
specific type of hardware and doesn't check everything we're interested
in. There's more to a healthy system than just reachability.

Regards,

	Chris