[dns-operations] Monitoring anycast nodes for automatic route withdrawal

Alexander Gall gall at switch.ch
Mon Aug 9 15:47:13 UTC 2010


On Wed, 4 Aug 2010 14:16:12 -0400, David Coulthart <davec at columbia.edu> said:

> Recent outages of one of our caching recursive nameservers have reminded me how sensitive clients are when their primary nameserver is unavailable, even if the second configured nameserver is available & can easily handle the load.  This has me thinking about finally biting the bullet & convincing management to give me the time to implement anycast.  But in all my reading about configuring anycast, I've seen little discussion of how real-world implementations deal with monitoring for node failure.  The only relevant info I've found so far are a Linux Journal article [1] suggesting to execute a local test of the DNS service on each anycast node & a LUG presentation from 2006 [2] suggesting to only test that the named process is running so that a DoS attack only affects that node.
> Would any of you doing this in the real world be willing to describe your monitoring setups?  Some questions I've thought of so far:

We use BGP (Quagga) to announce the anycast addresses to the adjacent
routers and distribute them into the IGP (OSPFv2/v3) from there
(currently, we have two anycast addresses served by two physical nodes
each).  I've used OSPF directly on the host before, but it turned out
to be too difficult to avoid side-effects.

The anycast addresses are configured on the loopback interface.  I use
"ifconfig up/down" to control BGP announcements (actually, for IPv6 on
Linux you have to add/delete the addresses because you can't configure
IPv6 addresses on separate loopback subinterfaces).
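
A minimal sketch of the add/delete approach described above.  It is
an illustration, not the script from this post: it uses the iproute2
"ip addr add/del" command for both address families instead of
"ifconfig up/down", and the function names, the "lo" device default
and the addresses in the usage note are assumptions.

```python
import ipaddress
import subprocess


def address_command(action, address, device="lo"):
    """Build the iproute2 argv that adds or deletes a host address."""
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    ip = ipaddress.ip_address(address)
    family_flag = "-6" if ip.version == 6 else "-4"
    # Anycast service addresses are configured as host routes (/32 or /128).
    prefix = f"{ip}/{ip.max_prefixlen}"
    return ["ip", family_flag, "addr", action, prefix, "dev", device]


def set_anycast_address(action, address, device="lo", run=subprocess.run):
    """Run the command (needs root); raises CalledProcessError on failure."""
    return run(address_command(action, address, device), check=True)
```

For example, set_anycast_address("del", "2001:db8::53") would remove a
(documentation-range) address from the loopback interface, which is
the event that is meant to stop the announcement.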

>   * How frequently do you check that a node is responding?

Every 4 seconds.

>   * Are your checks controlled via a central monitoring system or do they operate independently on each node (e.g., via cron)?

Independently on each node.  This is done by a Perl script that
implements a simple finite state machine and runs continuously.  A
general-purpose monitoring mechanism is used to check whether this
process is running.

>   * What criteria do you use to decide whether to announce/withdraw the route?

- Loopback interface associated with the anycast address must be up
- Anycast address must be pingable locally
- A test query for a name in a zone used for this purpose with TTL 0
  must succeed.  I use three different queries in independent zones
  served by distinct authoritative servers.  The cache is considered
  to be working if at least one of these queries succeeds.
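
The query criterion above can be sketched as follows.  This is a
hedged illustration using only the Python standard library, not the
Perl script itself; the function names, the 1-second timeout and the
probe names in the usage note are assumptions.

```python
import random
import socket
import struct


def build_query(name, qtype=1):
    """Return (txid, packet) for a recursive DNS query (qtype 1 = A, class IN)."""
    txid = random.randint(0, 0xFFFF)
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return txid, header + qname + struct.pack("!HH", qtype, 1)


def query_succeeds(server, name, timeout=1.0):
    """True if `server` returns a NOERROR response with at least one answer."""
    txid, packet = build_query(name)
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.connect((server, 53))
            sock.send(packet)
            reply = sock.recv(4096)
    except OSError:  # includes timeouts and unreachable addresses
        return False
    if len(reply) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return rid == txid and is_response and rcode == 0 and ancount > 0


def cache_is_working(server, test_names, probe=query_succeeds):
    """The cache counts as working if at least one test query succeeds."""
    return any(probe(server, name) for name in test_names)
```

A call such as cache_is_working("192.0.2.53", ["probe.example.net",
"probe.example.org", "probe.example.com"]) would mirror the "one out
of three" rule; the names and the address here are placeholders.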

In case of a failure, the system tries to recover by
reconfiguring/restarting the cache and resetting the loopback
interfaces.

>   * Do you take any special precautions to protect against route flapping due to a misbehaving node?  

The system uses an exponential backoff when trying to recover from a
failure.
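
One way to picture a state machine with exponential backoff is the
sketch below.  The post does not describe the actual states or
timers, so the two states, the doubling factor and the 600-second
ceiling are all assumptions; only the 4-second base interval comes
from the answer above.

```python
class NodeMonitor:
    """Two-state sketch: 'announced' while healthy, 'withdrawn' while failing."""

    def __init__(self, initial=4.0, factor=2.0, ceiling=600.0):
        self.initial = initial    # normal check interval in seconds
        self.factor = factor      # multiplier applied after each failure
        self.ceiling = ceiling    # upper bound on the wait between attempts
        self.delay = initial
        self.state = "announced"

    def step(self, healthy):
        """Feed one check result; return seconds to wait before the next check."""
        if healthy:
            # Service is fine: (re-)announce and reset the backoff.
            self.state = "announced"
            self.delay = self.initial
            return self.initial
        # Service failed: withdraw, then wait longer after each failed attempt.
        self.state = "withdrawn"
        wait = self.delay
        self.delay = min(self.delay * self.factor, self.ceiling)
        return wait
```

With the defaults, consecutive failures yield waits of 4, 8, 16, 32
seconds and so on up to the ceiling, so a node that keeps failing
makes progressively fewer recovery attempts.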

The script currently supports BIND and unbound on Linux and Solaris
but it should be easy to add support for other systems.  Feel free to
contact me if you'd like to play with it.  I've never released it to
the public but it should be in a decent enough shape to not be
embarrassing for me :) There is even a package for Debian GNU/Linux.

Unfortunately, my documentation is on an internal Wiki, but I've
attached a PDF of it as well as of the manpage for the script itself.

I've been using it for years and so far it has worked really well.

BTW, I also found it useful to tweak the options in /etc/resolv.conf.
We use two "nameserver" declarations for our two anycast nodes.  The
options

options timeout:1 rotate

make sure that there are no long timeouts and queries are sent to the
next server in the list for each query.  This will help in the worst
case when one broken anycast instance is not properly withdrawn and
blackholes one of the addresses.

-- 
Alex

-------------- next part --------------
A non-text attachment was scrubbed...
Name: anycast-mon.pdf
Type: application/pdf
Size: 111403 bytes
Desc: not available
URL: <https://lists.dns-oarc.net/pipermail/dns-operations/attachments/20100809/07c52f28/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: anycast-manpage.pdf
Type: application/pdf
Size: 76706 bytes
Desc: not available
URL: <https://lists.dns-oarc.net/pipermail/dns-operations/attachments/20100809/07c52f28/attachment-0001.pdf>

