[dns-operations] DNS .com/.net resolution problems in the Asia/Pacific region

Thu Jul 13 19:16:37 UTC 2023

On Wed, Jul 12, 2023, 5:28 PM Olafur Gudmundsson <ogud at ogud.com> wrote:

>
>
> On Jul 11, 2023, at 8:24 PM, Gavin McCullagh <gmccullagh at gmail.com> wrote:
>
> That is true of course, but the magnitude of this event was made much
> worse by dnssec.  The entire COM and NET zones being bogus (including the
> unsigned delegations) is very different to the few that saw record changes
> in the prior 1-2 days.
>
>
> As explained in other messages, this is not true because of the rolling
> expiry. On the other hand, DNSSEC may have exposed the problem much sooner
> than it would otherwise have surfaced. I once helped a TLD operator find a
> single anycast instance that they assumed had been offline for a few weeks
> but that was still serving an expired zone.
>

Agreed, the "entire zone" claim isn't true (thanks for explaining), but the
magnitude being "much worse" does still seem true.  Maybe only 25% of the
zone went bogus, but that's surely far more than the set of domains that
had changed during that time.  I guess it is true that we will discover
more severe outages more quickly than less severe ones, but I'm not sure
that justifies making events more severe.  A stale COM PoP pre-DNSSEC was
harmful to O(modified domains).  Under DNSSEC, a stale PoP is harmful to
O(all domains) [25%].  It's unclear that it needs to be that way.

When faced with ~4x obviously bogus, broken nameservers (the stale PoP) and
~9x fresh, working nameservers with valid signatures, the DNSSEC RFCs appear
to specify (and Unbound appears to implement) that resolvers must accept
and cache the obviously bogus (expired RRSIG) answers and return SERVFAIL
to clients until those bogus answers expire (apparently 24 hours later in
this case?).  The alternative would be to immediately treat those responses
as invalid and those nameservers as lame, and to retry against another
nameserver, on the assumption that such a response is coming either from an
attacker or from a broken nameserver.

Obviously we need to be careful about creating retry storms, but that retry
storm is pretty much equivalent to what would happen if that PoP were to
return SERVFAIL or not respond, so that doesn't seem like a new problem.
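
To make the proposal concrete, here is a rough sketch of the retry behaviour
I have in mind.  It is not how Unbound or any other resolver works today, and
every name in it (Answer, resolve_with_retry, query_fn, max_attempts) is
made up for illustration.  It treats an expired RRSIG exactly like a timeout
or SERVFAIL: mark that nameserver as lame for this lookup and move on to the
next one, with a hard cap on attempts so the retry cost is no worse than for
an unresponsive PoP.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Answer:
    """Hypothetical, simplified view of a signed response."""
    server: str
    rrsig_expiration: int  # signature expiry, seconds since the epoch
    data: str


def resolve_with_retry(
    servers: List[str],
    query_fn: Callable[[str], Optional[Answer]],
    now: int,
    max_attempts: int = 3,
) -> Tuple[Optional[Answer], List[str]]:
    """Try nameservers in order, skipping dead or stale ones.

    query_fn is an assumed callback that returns None when a server
    times out or answers SERVFAIL.  Returns the first answer whose
    signature has not expired, plus the servers skipped along the way.
    """
    skipped: List[str] = []
    attempts = 0
    for server in servers:
        if attempts >= max_attempts:
            break  # bounded retries: no worse than for a dead PoP
        attempts += 1
        answer = query_fn(server)
        if answer is None:
            skipped.append(server)  # unresponsive or SERVFAIL
            continue
        if answer.rrsig_expiration < now:
            skipped.append(server)  # stale signature: treat as lame
            continue
        return answer, skipped
    return None, skipped  # caller would return SERVFAIL to the client
```

The point of the sketch is only that the stale case and the dead case share
one code path; a real resolver would also need per-server lameness caching
and the rest of the validation chain.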

I assume lots of us on this mailing list operate authoritative DNS
servers.  When one of our PoPs or nameservers is unresponsive, most of us
rely on retries against other nameservers (aka PoPs) to ensure this is a
non-event.  When one of our PoPs is serving stale data, it is varying levels
of bad depending on how stale it is.  But under DNSSEC, stale RRSIGs are
obvious, and it seems like a stale server could be a non-event if the
resolver is allowed to retry.
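
As a small illustration of why a stale RRSIG is "obvious": the inception and
expiration fields of an RRSIG are written as YYYYMMDDHHmmSS in UTC in zone
file presentation format, so checking the validity window is a plain
timestamp comparison.  The helper names and the example timestamps below are
invented for illustration and are not taken from the actual incident.

```python
from datetime import datetime, timezone


def parse_rrsig_time(text: str) -> datetime:
    """Parse an RRSIG timestamp in YYYYMMDDHHmmSS presentation form (UTC)."""
    return datetime.strptime(text, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def rrsig_is_current(inception: str, expiration: str, now: datetime) -> bool:
    """True if `now` falls inside the signature validity window."""
    return parse_rrsig_time(inception) <= now <= parse_rrsig_time(expiration)


# Hypothetical values: a signature that expired a few days before `now`.
now = datetime(2023, 7, 13, 19, 0, 0, tzinfo=timezone.utc)
stale = rrsig_is_current("20230701000000", "20230708000000", now)
fresh = rrsig_is_current("20230710000000", "20230717000000", now)
```

Here `stale` comes out False and `fresh` comes out True, which is all a
resolver would need in order to decide to try another nameserver.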

Gavin