[dns-operations] DNS .com/.net resolution problems in the Asia/Pacific region

Thu Jul 13 21:21:00 UTC 2023

Hi,

On Thu, Jul 13, 2023 at 1:18 PM Viktor Dukhovni <ietf-dane at dukhovni.org>
wrote:

> On Thu, Jul 13, 2023 at 12:16:37PM -0700, Gavin McCullagh wrote:
>
> > When faced with ~4x obviously bogus, broken nameservers (the stale pop) and
> > ~9x fresh working nameservers with valid signatures, the DNSSEC RFCs appear
> > to specify (and Unbound appears to implement) that resolvers must accept
> > and cache the obviously bogus (expired rrsig) answers and return SERVFAIL
> > to clients until those bogus answers expire (apparently 24 hours later in
> > this case?), rather than immediately considering those responses invalid,
> > those nameservers as lame and retrying against another - assuming this
> > response is either coming from an attacker or from a broken nameserver.
>
> I don't believe that's a valid reading of the DNSSEC RFCs.  Bogus
> answers are not cached by the resolver I'm working on these days, beyond
> the usual query rate limits for failures.
>
> And to the extent that only some servers have stale data, I am also
> working to make sure that we'll try to get a better answer from another
> server.  Some resolvers already do.
>

Good to hear, thanks.  The text seems open to the interpretation that a
response with bad RRSIGs should result in RCODE 2
https://datatracker.ietf.org/doc/html/rfc4035#section-5.5 :

   If for whatever reason none of the RRSIGs can be validated, the
   response SHOULD be considered BAD.  If the validation was being done
   to service a recursive query, the name server MUST return RCODE 2 to
   the originating client.  However, it MUST return the full response if
   and only if the original query had the CD bit set.  Also see
   Section 4.7 <https://datatracker.ietf.org/doc/html/rfc4035#section-4.7>
   on caching responses that do not validate.

Do you think it's worth clarifying?   Or maybe I am just looking at the
wrong RFC?

The impact on Unbound, and likely on at least one other implementation,
suggests that either a) those resolvers were not retrying or b) they were
retrying, but not often enough to reliably get past the 4 of 13 stale
nameservers.  The logs indicate the failure was being cached too.  The
only mitigation was essentially to disable validation and restart.
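
As a rough illustration of case b), here is a back-of-the-envelope sketch of
how likely a resolver is to see only stale answers within a given retry
budget.  It assumes each attempt picks one of the 13 nameservers uniformly
at random, which real resolvers do not do (they usually weight by RTT, and
anycast routing decides which PoP answers).  The function name and
parameters are made up for this example.

```python
from fractions import Fraction


def p_all_stale(stale: int, total: int, attempts: int, distinct: bool = False) -> Fraction:
    """Probability that every one of `attempts` tries lands on a stale server.

    Assumes uniform random server selection (a simplification).
    distinct=False: a server may be picked again on a later attempt.
    distinct=True:  a server that was already tried is never picked again.
    """
    # Without repeats there are at most `total` distinct servers to try.
    tries = min(attempts, total) if distinct else attempts
    p = Fraction(1)
    for i in range(tries):
        if distinct:
            # One stale server has been used up on each earlier attempt.
            p *= Fraction(max(stale - i, 0), total - i)
        else:
            p *= Fraction(stale, total)
    return p


if __name__ == "__main__":
    for k in range(1, 6):
        repeat = float(p_all_stale(4, 13, k))
        no_repeat = float(p_all_stale(4, 13, k, distinct=True))
        print(f"{k} attempts: repeats allowed {repeat:.4f}, no repeats {no_repeat:.4f}")
```

Under these assumptions a single attempt fails 4 times out of 13, and a
resolver that never re-asks a server it has already tried cannot fail after
5 attempts, since only 4 servers are stale.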

 Jul 10 07:59:48 host2 unbound: [36427:1] info: validation failure
<domain.in.question.xxxx.com A IN>: key for validation xxxx.com. is marked
as invalid because of a previous validation failure
<some.other.xxxx.com. A IN>: signature expired from 192.26.92.30 for DS
xxxx.com. while building chain of trust

I suspect the impact was mild primarily because many do not enable DNSSEC
validation in critical places today.  At least one big provider that had
validation enabled had to re-route away from the bad PoP (awesome that
they could do it, but that's not a reasonable expectation of everyone).
Had validation been widely enabled, I suspect this would have been a
significant event.

We've experienced stale anycast PoPs ourselves over the years.  It's not an easy
problem to completely avoid, so I'm sympathetic and think we should try to
be resilient to it.

> Within reasonable retry limit counts and error response
> hold-downs, bad/stale/... data from a single server should not be final
> whether it is DNSSEC-related or not.
>

I see.  We had guessed that the stale answer was being treated as valid,
cacheable authoritative data, similar to how an NXDOMAIN would be - and a
resolver would not retry.  If that's not the case, that's reassuring, but
it would be great to make sure the RFCs and implementations all agree on it.
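
To make the behaviour I'm hoping for concrete, here is a minimal sketch of
"treat a bogus answer as a problem with that server, not as final data".
It is not how Unbound or any other resolver is actually written, and it is
not a reading of RFC 4035.  The names (Response, ServFail, resolve, ask)
are hypothetical, and the rrsig_valid flag stands in for real DNSSEC
validation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Response:
    server: str
    rrsig_valid: bool  # stand-in for the result of full DNSSEC validation
    data: str


class ServFail(Exception):
    """No validatable answer was obtained within the retry budget."""


def resolve(servers: Sequence[str],
            ask: Callable[[str], Response],
            max_attempts: int = 3) -> Response:
    """Query servers in order until one answer validates.

    A response that fails validation is discarded rather than cached, and
    the next server is tried, up to max_attempts queries in total.
    """
    for attempt, server in enumerate(servers):
        if attempt >= max_attempts:
            break  # retry budget exhausted
        response = ask(server)
        if response.rrsig_valid:
            return response  # only validated data would go into the cache
        # Bogus answer: do not cache it, move on to another server.
    raise ServFail("no server returned a validatable answer")


if __name__ == "__main__":
    stale = {"ns1", "ns2"}  # hypothetical servers behind the stale PoP
    result = resolve(["ns1", "ns2", "ns3"],
                     lambda s: Response(s, s not in stale, "192.0.2.1"))
    print("validated answer from", result.server)
```

With a budget of 3 the example gets past the two stale servers.  With
max_attempts=2 it raises ServFail, which is the point of keeping a retry
limit and a hold-down rather than retrying forever.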

Gavin