[dns-operations] dynect.net outage
edmonds at mycre.ws
Mon May 30 06:34:37 UTC 2022
Ralf Weber wrote:
> On 30 May 2022, at 1:12, Robert Edmonds wrote:
> > Simon Arlott via dns-operations wrote:
> >> I currently have this cached list of nameservers for dynect.net:
> >> ;; AUTHORITY SECTION:
> >> dynect.net. 14931 IN NS cgydc01dnsext01.us.oracle.com.
> >> dynect.net. 14931 IN NS tvp02dnsext02.tvp.oracle.com.
> >> dynect.net. 14931 IN NS sydc01dns03.au.oracle.com.
> >> dynect.net. 14931 IN NS trdc01dnsext01.us.oracle.com.
> >> dynect.net. 14931 IN NS adc08dnsext02.us.oracle.com.
> >> dynect.net. 14931 IN NS rmdc02dnsext01.us.oracle.com.
> >> dynect.net. 14931 IN NS llg07dnsext02.llg.oracle.com.
> >> dynect.net. 14931 IN NS llg07dnsext01.llg.oracle.com.
> >> dynect.net. 14931 IN NS iad-dns-master.oraclecorp.com.
> >> dynect.net. 14931 IN NS adc08dnsext01.us.oracle.com.
> >> dynect.net. 14931 IN NS rmdc02dnsext02.us.oracle.com.
> >> ;; WHEN: Fri May 27 17:10:08 BST 2022
> >> All of these hostnames are NXDOMAIN in the oracle.com/oraclecorp.com
> >> zones. Looks like someone has reconfigured the nameservers for
> >> dynect.net and then immediately pulled the A/AAAA records for the old
> >> names without waiting out the TTL on the old NS records.
> > This was https://www.dynstatus.com/incidents/1xlbp98xr3y2.
> So how do you expect the domain to be resolved if all of your out
> of bailiwick name server names no longer point to an IP address?
By using the working nameservers with resolvable names specified in the
delegation from the parent zone, which never changed in this particular
case. This is what Unbound's resolution algorithm does if there are not
too many nonexistent nameserver target names in the child's NS RRset,
and what other resolver algorithms do.
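To make the fallback concrete, here is a minimal sketch of the idea, not Unbound's actual code. It assumes a hypothetical `addresses` table in which a missing name stands for NXDOMAIN, and the parent-side names and addresses are placeholders rather than the real dynect.net delegation.

```python
from typing import Dict, List

# Hypothetical address data: a name missing from this dict models NXDOMAIN.
Addresses = Dict[str, List[str]]


def pick_servers(child_ns: List[str], parent_ns: List[str],
                 addresses: Addresses) -> List[str]:
    """Collect server addresses from the child's NS names first, then from
    the names in the parent's delegation, skipping names that do not exist."""
    found: List[str] = []
    seen = set()
    for name in child_ns + parent_ns:
        if name in seen:
            continue  # do not look up the same target twice
        seen.add(name)
        found.extend(addresses.get(name, []))  # NXDOMAIN contributes nothing
    return found


# Stale child-side names (all NXDOMAIN) plus placeholder parent-side names.
child_ns = ["cgydc01dnsext01.us.oracle.com.", "adc08dnsext01.us.oracle.com."]
parent_ns = ["ns1.example.net.", "ns2.example.net."]
addresses = {
    "ns1.example.net.": ["192.0.2.1"],
    "ns2.example.net.": ["192.0.2.2"],
}

print(pick_servers(child_ns, parent_ns, addresses))
# ['192.0.2.1', '192.0.2.2']
```

With only the stale child-side names available the function returns an empty list, which is the cold-cache failure being argued about.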
> >> Unbound gives up and returns SERVFAIL for anything using dynect.net
> >> because it exceeds the maximum number of NXDOMAIN responses for
> >> nameserver hostnames.
> Maybe this is happening where you still have the A/AAAA record
> cached for delegation, but you can’t rely on that. If a domain is
> not being able to be resolved from a cold/empty cache it is broken,
> and the domain owner has to deal with the consequences. End of story.
There is more than one resolver implementation, and they differ in the
results of resolving a zone with this type of misconfiguration, and none
of them are the reference implementation of DNS. So just looking at a
particular resolver algorithm returning SERVFAIL when encountering a
particular data pattern starting from a cold cache cannot tell us
whether the algorithm or the data is at fault.
RFC 1034 (section 5.3.3) recommends that the resolver implementer
prioritize the following:
1. Bound the amount of work (packets sent, parallel processes
   started) so that a request can't get into an infinite loop or
   start off a chain reaction of requests or queries with other
   implementations EVEN IF SOMEONE HAS INCORRECTLY CONFIGURED
   SOME DATA.
2. Get back an answer if at all possible.
3. Avoid unnecessary transmissions.
4. Get the answer as quickly as possible.
(This list appears to be in order of most important to least important.
Amusingly, "Get the answer as correctly as possible" is not on the list.)
This particular case seems to be a straightforward trade-off between
#1, #2 and #3.
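A rough illustration of that trade-off, under stated assumptions: the function name, the `max_nx` budget and the placeholder names below are all hypothetical and do not come from any particular resolver. The 11 dead names mirror the 11 stale NS targets in the cached RRset quoted above.

```python
from typing import Dict, List, Optional, Tuple


def resolve_targets(targets: List[str], addresses: Dict[str, List[str]],
                    max_nx: int) -> Tuple[Optional[str], int]:
    """Look up NS target names in order.

    Returns (first address found or None, number of lookups performed).
    Gives up once more than max_nx names have turned out not to exist.
    """
    nx_seen = 0
    lookups = 0
    for name in targets:
        lookups += 1
        addrs = addresses.get(name)  # a missing name models NXDOMAIN
        if addrs:
            return addrs[0], lookups
        nx_seen += 1
        if nx_seen > max_nx:
            return None, lookups  # work budget exhausted: SERVFAIL
    return None, lookups


# 11 nonexistent targets followed by one working, placeholder name.
dead = [f"old{i}.example.com." for i in range(11)]
targets = dead + ["ns1.example.net."]
addresses = {"ns1.example.net.": ["192.0.2.1"]}

print(resolve_targets(targets, addresses, max_nx=5))   # (None, 6)
print(resolve_targets(targets, addresses, max_nx=11))  # ('192.0.2.1', 12)
```

A small budget favours #1 and #3 at the cost of #2; a large budget gets the answer but spends twice as many lookups on data that someone else misconfigured.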