[dns-operations] dynect.net outage

Mon May 30 06:34:37 UTC 2022

Ralf Weber wrote:
> Moin!
> 
> On 30 May 2022, at 1:12, Robert Edmonds wrote:
> > Simon Arlott via dns-operations wrote:
> >> I currently have this cached list of nameservers for dynect.net:
> >>
> >> ;; AUTHORITY SECTION:
> >> dynect.net.		14931	IN	NS	cgydc01dnsext01.us.oracle.com.
> >> dynect.net.		14931	IN	NS	tvp02dnsext02.tvp.oracle.com.
> >> dynect.net.		14931	IN	NS	sydc01dns03.au.oracle.com.
> >> dynect.net.		14931	IN	NS	trdc01dnsext01.us.oracle.com.
> >> dynect.net.		14931	IN	NS	adc08dnsext02.us.oracle.com.
> >> dynect.net.		14931	IN	NS	rmdc02dnsext01.us.oracle.com.
> >> dynect.net.		14931	IN	NS	llg07dnsext02.llg.oracle.com.
> >> dynect.net.		14931	IN	NS	llg07dnsext01.llg.oracle.com.
> >> dynect.net.		14931	IN	NS	iad-dns-master.oraclecorp.com.
> >> dynect.net.		14931	IN	NS	adc08dnsext01.us.oracle.com.
> >> dynect.net.		14931	IN	NS	rmdc02dnsext02.us.oracle.com.
> >> ;; WHEN: Fri May 27 17:10:08 BST 2022
> >>
> >> All of these hostnames are NXDOMAIN in the oracle.com/oraclecorp.com
> >> zones. Looks like someone has reconfigured the nameservers for
> >> dynect.net and then immediately pulled the A/AAAA records for the old
> >> names without waiting out the TTL on the old NS records.
> >
> > This was https://www.dynstatus.com/incidents/1xlbp98xr3y2.
> So how do you expect the domain to be resolved if all of your out
> of bailiwick name server names no longer point to an IP address?

By using the working nameservers with resolvable names specified in the
delegation from the parent zone, which never changed in this particular
case. This is what Unbound's resolution algorithm does if there are not
too many nonexistent nameserver target names in the child's NS RRset,
and it is what other resolver algorithms do as well.
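
As a rough illustration of that fallback, here is a toy sketch in Python.
It is not Unbound's actual code: the function name, the dict standing in
for address lookups, and the parent-side nameserver names and addresses
are all hypothetical.

```python
def usable_nameservers(child_ns, parent_ns, addresses):
    """Return the nameserver names that have addresses.

    Names from the child's NS RRset are tried first, then names from the
    parent's delegation. `addresses` is a stand-in for real A/AAAA
    lookups: a name missing from it is treated as NXDOMAIN.
    """
    usable = []
    for name in list(child_ns) + list(parent_ns):
        if name in addresses and name not in usable:
            usable.append(name)
    return usable


# Two of the stale names from the cached child NS RRset (both NXDOMAIN here).
child_ns = [
    "cgydc01dnsext01.us.oracle.com.",
    "iad-dns-master.oraclecorp.com.",
]
# Hypothetical names standing in for the unchanged parent delegation.
parent_ns = ["ns1.parent-side.example.", "ns2.parent-side.example."]
addresses = {
    "ns1.parent-side.example.": ["192.0.2.1"],
    "ns2.parent-side.example.": ["192.0.2.2"],
}

print(usable_nameservers(child_ns, parent_ns, addresses))
```

In this sketch the zone stays resolvable as long as the parent-side names
still have addresses, regardless of how many child-side names are dead.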

> >> Unbound gives up and returns SERVFAIL for anything using dynect.net
> >> because it exceeds the maximum number of NXDOMAIN responses for
> >> nameserver hostnames.
> Maybe this is happening where you still have the A/AAAA record
> cached for delegation, but you can’t rely on that. If a domain
> cannot be resolved from a cold/empty cache it is broken,
> and the domain owner has to deal with the consequences. End of story.

There is more than one resolver implementation, and they differ in the
results of resolving a zone with this type of misconfiguration, and none
of them are the reference implementation of DNS. So just looking at a
particular resolver algorithm returning SERVFAIL when encountering a
particular data pattern starting from a cold cache cannot tell us
whether the algorithm or the data is at fault.

RFC 1034 (section 5.3.3) recommends that the resolver implementer
prioritize the following:

   1. Bound the amount of work (packets sent, parallel processes
      started) so that a request can't get into an infinite loop or
      start off a chain reaction of requests or queries with other
      implementations EVEN IF SOMEONE HAS INCORRECTLY CONFIGURED
      SOME DATA.

   2. Get back an answer if at all possible.

   3. Avoid unnecessary transmissions.

   4. Get the answer as quickly as possible.

(This list appears to be in order of most important to least important.
Amusingly, "Get the answer as correctly as possible" is not on the
list.)

This particular case seems to be a straightforward trade-off between
#1, #2 and #3.
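
To make the trade-off concrete, a small Python sketch, assuming a
resolver that walks NS target names in order and gives up once a
configurable number of them have come back NXDOMAIN. The function name,
the limits of 5 and 20, and the example names are made up for
illustration and are not taken from any real resolver.

```python
def resolve(ns_names, addresses, max_nxdomain):
    """Try NS target names in order under an NXDOMAIN budget.

    Returns (result, lookups_sent). A name missing from `addresses`
    counts as an NXDOMAIN response. Once more than `max_nxdomain` names
    have failed, the resolver stops and reports SERVFAIL (goal #1 wins
    over goal #2).
    """
    lookups = 0
    nxdomains = 0
    for name in ns_names:
        lookups += 1
        if name in addresses:
            return ("ANSWER", lookups)
        nxdomains += 1
        if nxdomains > max_nxdomain:
            return ("SERVFAIL", lookups)
    return ("SERVFAIL", lookups)


# Eleven dead names (as in the cached NS RRset) followed by one live name.
ns_names = [f"dead{i}.example." for i in range(11)] + ["ns1.example."]
addresses = {"ns1.example.": "192.0.2.1"}

for budget in (5, 20):
    print(budget, resolve(ns_names, addresses, budget))
```

With the tight budget the sketch stops after 6 lookups and returns
SERVFAIL; with the loose one it sends 12 lookups and gets an answer.
Neither setting is wrong by the RFC's list; they just weigh #1 and #3
against #2 differently.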

-- 
Robert Edmonds