[dns-operations] Monitoring anycast nodes for automatic route withdrawal

Wed Aug 4 18:16:12 UTC 2010

Recent outages of one of our caching recursive nameservers have reminded me how sensitive clients are when their primary nameserver is unavailable, even if the second configured nameserver is available & can easily handle the load.  This has me thinking about finally biting the bullet & convincing management to give me the time to implement anycast.  But in all my reading about configuring anycast, I've seen little discussion of how real-world implementations monitor for node failure.  The only relevant info I've found so far is a Linux Journal article [1], which suggests running a local test of the DNS service on each anycast node, & a LUG presentation from 2006 [2], which suggests testing only that the named process is running, so that a DoS attack affects only that node.

Would any of you doing this in the real world be willing to describe your monitoring setups?  Some questions I've thought of so far:

  * How frequently do you check that a node is responding?
  * Are your checks controlled via a central monitoring system or do they operate independently on each node (e.g., via cron)?
  * What criteria do you use to decide whether to announce/withdraw the route?
  * Do you take any special precautions to protect against route flapping due to a misbehaving node?  
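
To make the questions concrete, here is a rough sketch of the kind of per-node check I have in mind.  It is only an illustration, not something I have running: the names (build_query, probe, RouteState), the test name example.com, the 127.0.0.1 target & the thresholds are all placeholders I made up.  It sends one DNS query to the local resolver & only changes the announce/withdraw decision after several consecutive results in a row, which is one simple way to damp flapping.  How the decision is actually turned into a route announcement or withdrawal is deliberately left out.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary fixed ID for this sketch


def build_query(name, qid=QUERY_ID):
    """Build a minimal DNS query (type A, class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server="127.0.0.1", name="example.com", timeout=2.0):
    """Return True if the server replies with our ID and RCODE 0 (NOERROR)."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(build_query(name), (server, 53))
            reply, _ = s.recvfrom(512)
    except OSError:  # covers timeouts and ICMP port-unreachable errors
        return False
    if len(reply) < 12:
        return False
    rid, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return rid == QUERY_ID and is_response and rcode == 0


class RouteState:
    """Track consecutive results; flip state only after a full streak."""

    def __init__(self, fail_threshold=3, ok_threshold=5):
        self.fail_threshold = fail_threshold
        self.ok_threshold = ok_threshold
        self.announced = False  # start withdrawn until proven healthy
        self.fails = 0
        self.oks = 0

    def update(self, healthy):
        """Record one check result and return whether to announce the route."""
        if healthy:
            self.oks += 1
            self.fails = 0
            if not self.announced and self.oks >= self.ok_threshold:
                self.announced = True
        else:
            self.fails += 1
            self.oks = 0
            if self.announced and self.fails >= self.fail_threshold:
                self.announced = False
        return self.announced
```

A cron job or small daemon on each node could call state.update(probe()) every few seconds & act on the returned value.  Requiring more consecutive successes to re-announce than failures to withdraw is an assumption on my part; I don't know whether that matches what people do in practice.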

I would be very grateful for answers to any of these questions, or for any other guidance or references on implementing a reliable anycast DNS infrastructure.

Thanks,
Dave Coulthart

[1] http://www.linuxjournal.com/magazine/ipv4-anycast-linux-and-quagga
[2] http://www.linuxsa.org.au/meetings/2006-07/anycast-dns.pdf