[dns-operations] DNS load-balancing/failover using an ASR 9xxx (few questions)

Fri Aug 15 09:38:28 UTC 2014

We do the same with Quagga or BIRD on Linux, using the OSPF daemon, for georedundancy and proximity-based load sharing of customer access to recursive BIND resolvers.

Leaving aside the tedious details specific to our setup, the primary/secondary DNS IPs are configured as loopback addresses on each server and announced from there.
We don't have any automatic monitoring that brings OSPF down on a server, since we have lots of them (4 per POP) plus scripted and human monitoring 24x7, so if a server has an issue the customer barely notices before a human acts and takes the affected server out of service.
We also had a power surge in a POP that took its DNS side entirely offline (the network was on DC, while the problem affected AC power only, for some racks), and 30 seconds later the service was up again using the DNS servers of another POP. Very effective given the giant failure we had.

About the timeouts: you don't need to wait if you bring down the loopbacks instead of the OSPF daemon. Once the loopbacks are down, OSPF advertises that the server no longer has those IPs, and the upstream routers load-share only across the remaining servers.
Then you can shut the daemon down.
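
To make the order of operations concrete, here is a minimal sketch of that drain sequence in Python. It is not our actual tooling: the addresses (taken from the documentation range), the `service ospfd stop` command, the 5 second settle time and the function names are all assumptions for illustration. The only point is the ordering: remove the service addresses from the loopback first, give the routing daemon a moment to advertise the change, then stop the daemon.

```python
import subprocess
import time
from typing import Callable, List, Optional, Sequence

# Hypothetical values: replace with the real service addresses and with
# whatever command stops the routing daemon on your system.
SERVICE_ADDRS = ["192.0.2.53/32", "192.0.2.54/32"]
STOP_DAEMON_CMD = ["service", "ospfd", "stop"]


def drain_commands(addrs: Sequence[str],
                   stop_cmd: Sequence[str]) -> List[List[str]]:
    """Return the commands in drain order: loopback addresses first, daemon last."""
    cmds = [["ip", "addr", "del", addr, "dev", "lo"] for addr in addrs]
    cmds.append(list(stop_cmd))
    return cmds


def drain(addrs: Sequence[str],
          stop_cmd: Sequence[str],
          run: Optional[Callable[[List[str]], object]] = None,
          settle_seconds: float = 5.0,
          sleep: Callable[[float], object] = time.sleep) -> None:
    """Remove the service addresses, wait for routing to settle, stop the daemon."""
    if run is None:
        # Default runner really executes the commands (needs root).
        run = lambda cmd: subprocess.run(cmd, check=True)
    *remove_cmds, stop = drain_commands(addrs, stop_cmd)
    for cmd in remove_cmds:
        run(cmd)            # the daemon stops announcing this address
    sleep(settle_seconds)   # assumed grace period for the update to propagate
    run(stop)               # nothing is announced any more, safe to stop


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for cmd in drain_commands(SERVICE_ADDRS, STOP_DAEMON_CMD):
        print(" ".join(cmd))
```

Passing your own `run` and `sleep` callables lets you dry-run or log the sequence before trusting it on a production resolver.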

I considered using a probe, but found it was overkill in our case, since a simple transient hang in the network (STP issue, mismatched cabling) could have brought down an entire POP for a minor event. We preferred human monitoring instead, since a 24x7 service was already there for network alarms and could easily correlate a DNS alarm with other causes or with a real server issue.
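
For anyone who wants a probe anyway, one common way to reduce the risk described above is to damp it, so that a single failed check never withdraws a server. The sketch below is not something we run; the class name and the `fall`/`rise` thresholds are made up for illustration. A server is only declared down after `fall` consecutive failed probes, and only declared up again after `rise` consecutive good ones.

```python
class DampedHealth:
    """Track server health with hysteresis so transient blips are ignored."""

    def __init__(self, fall: int = 3, rise: int = 2) -> None:
        self.fall = fall      # consecutive failures needed to go down
        self.rise = rise      # consecutive successes needed to come back up
        self.healthy = True
        self._streak = 0      # consecutive results disagreeing with current state

    def update(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the damped health state."""
        if probe_ok == self.healthy:
            # Result agrees with the current state: reset the counter.
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.rise if probe_ok else self.fall
            if self._streak >= threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


if __name__ == "__main__":
    h = DampedHealth(fall=3, rise=2)
    # Two failures followed by a success do not change the state;
    # three failures in a row do.
    for ok in [False, False, True, False, False, False, True, True]:
        print(ok, "->", h.update(ok))
```

This does not address the case where the whole POP fails the probe at once because of a network problem, which is exactly where a human correlating alarms does better.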

We haven't had a single software failure in more than 7 years across four different installations (RHEL 3, CentOS 4, 5, 6), in an environment that is very complex due to efficiency and legal constraints (we have upstream DNS servers providing DNS poisoning as required by law, and a shared cache for all the anycast DNS servers).

Ciao,
A.

On 15 Aug 2014, at 09:46, "Anand Buddhdev" <anandb at ripe.net> wrote:

> On 15/08/2014 00:00, Nat Morris wrote:
>
>> BGP sessions between the ASR 9xxxx and each DNS server in the cluster,
>> ExaBGP running on them announcing their loopback/service /32 + /128
>> address(es).
>>
>> Health check scripts on each service to probe for service ability,
>> retract the announcement upon failure.
>
> We are doing this exact same thing on many RIPE NCC DNS servers, and it
> works very well. The other advantage of BGP is that as soon as you
> withdraw the announcement, the router stops sending traffic to the
> server. With OSPF, you have timeouts of several seconds before traffic
> stops arriving at a dead server.
>
> Regards,
>
> Anand
