[dns-operations] DNS .com/.net resolution problems in the Asia/Pacific region

Tue Jul 18 18:18:57 UTC 2023

Hi,

sorry to dredge this back up, but I want to give everyone a chance to
object.

My read of what Viktor and others have indicated here is that, when a
validating resolver receives a response with expired RRSIGs, it's okay (and
encouraged?) for that resolver to treat that as an invalid response and
retry against other nameservers, similarly to how it would handle a REFUSED
or SERVFAIL response from an authority (i.e. with similar care to limit
retry storms).
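
To make the proposed behaviour concrete, here is a rough Python sketch of
the retry logic. It is a toy model, not Unbound's code: the Response type,
is_usable(), resolve(), the hold-down table and the limits of 3 attempts and
60 seconds are all hypothetical, and real validation checks signatures
cryptographically rather than just comparing an expiration time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Response:
    server: str
    rcode: str                              # e.g. "NOERROR", "SERVFAIL", "REFUSED"
    rrsig_expiration: Optional[int] = None  # epoch seconds; None if unsigned


def is_usable(resp: Response, now: int) -> bool:
    """Treat an expired signature like SERVFAIL/REFUSED: worth trying elsewhere."""
    if resp.rcode in ("SERVFAIL", "REFUSED"):
        return False
    if resp.rrsig_expiration is not None and resp.rrsig_expiration < now:
        return False
    return True


def resolve(servers: List[str],
            query: Callable[[str], Response],
            now: int,
            hold_down: Dict[str, int],
            max_attempts: int = 3,      # assumed cap, to limit retry storms
            hold_down_secs: int = 60    # assumed hold-down for a bad server
            ) -> Optional[Response]:
    """Return the first usable response, or None (caller answers SERVFAIL)."""
    attempts = 0
    for server in servers:
        if attempts >= max_attempts:
            break
        if hold_down.get(server, 0) > now:
            continue  # recently bad: skip it without spending an attempt
        attempts += 1
        resp = query(server)
        if is_usable(resp, now):
            return resp
        hold_down[server] = now + hold_down_secs
    return None
```

The attempt cap and the per-server hold-down are what keep this from turning
into a retry storm: a server that returned expired signatures is simply not
asked again until its hold-down passes.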

The purpose of this is so that a single stale PoP or authoritative host
would not cause an outage for DNSSEC-signed domains, as resolvers will retry
against others.
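
As a rough illustration of how much retrying helps, the sketch below
computes the chance that every attempt lands on a stale server, using the
approximate 4-of-13 figure from the thread quoted below. The function name
p_all_stale() is made up, and uniform random server selection is an
assumption: real resolvers tend to prefer low-latency servers, so a resolver
close to the stale PoP would likely do worse than this.

```python
from fractions import Fraction
from math import comb


def p_all_stale(stale: int, total: int, attempts: int,
                distinct: bool = True) -> Fraction:
    """Probability that all attempts hit stale servers (uniform selection)."""
    if not 0 <= stale <= total or not 1 <= attempts <= total:
        raise ValueError("need 0 <= stale <= total and 1 <= attempts <= total")
    if distinct:
        # Each retry goes to a server not yet tried (no replacement).
        return Fraction(comb(stale, attempts), comb(total, attempts))
    # Each attempt picks independently, so a server may be asked twice.
    return Fraction(stale, total) ** attempts


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, p_all_stale(4, 13, k), p_all_stale(4, 13, k, distinct=False))
```

Under these assumptions a single attempt fails 4 times in 13, two attempts
at distinct servers both fail 1 time in 13, and a fifth distinct attempt
cannot fail at all, since only four servers are stale.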

I'd like to reach out to NLnet Labs about changing Unbound to do this, so I
want to make sure people have a chance to disagree.  Feel free to voice your
disagreement (and reasons) here if you do.

Gavin

On Thu, Jul 13, 2023 at 2:21 PM Gavin McCullagh <gmccullagh at gmail.com>
wrote:

> Hi,
>
> On Thu, Jul 13, 2023 at 1:18 PM Viktor Dukhovni <ietf-dane at dukhovni.org>
> wrote:
>
>> On Thu, Jul 13, 2023 at 12:16:37PM -0700, Gavin McCullagh wrote:
>>
>> > When faced with ~4x obviously bogus, broken nameservers (the stale pop) and
>> > ~9x fresh working nameservers with valid signatures, the DNSSEC RFCs appear
>> > to specify (and Unbound appears to implement) that resolvers must accept
>> > and cache the obviously bogus (expired rrsig) answers and return SERVFAIL
>> > to clients until those bogus answers expire (apparently 24 hours later in
>> > this case?), rather than immediately considering those responses invalid,
>> > those nameservers as lame and retrying against another - assuming this
>> > response is either coming from an attacker or from a broken nameserver.
>>
>> I don't believe that's a valid reading of the DNSSEC RFCs.  Bogus
>> answers are not cached by the resolver I'm working on these days, beyond
>> the usual query rate limits for failures.
>>
>> And to the extent that only some servers have stale data, I am also
>> working to make sure that we'll try to get a better answer from another
>> server.  Some resolvers already do.
>>
>
> Good to hear, thanks.  The text seems open to the interpretation that a
> response with bad RRSIGs should result in RCODE 2
> https://datatracker.ietf.org/doc/html/rfc4035#section-5.5 :
>
>    If for whatever reason none of the RRSIGs can be validated, the
>    response SHOULD be considered BAD.  If the validation was being done
>    to service a recursive query, the name server MUST return RCODE 2 to
>    the originating client.  However, it MUST return the full response if
>    and only if the original query had the CD bit set.  Also see Section
>    4.7 on caching responses that do not validate.
>
>
> Do you think it's worth clarifying?   Or maybe I am just looking at the
> wrong RFC?
>
> The impact, seemingly on Unbound and likely on at least one other
> implementation, suggests that either a) those resolvers were not
> retrying or b) they were retrying but not often enough to reliably get past
> 4/13 stale nameservers.  The logs indicate the failure was being cached
> too.  The only mitigation was basically to disable validation and restart.
>
>  Jul 10 07:59:48 host2 unbound: [36427:1] info: validation failure <domain.in.question.xxxx.com A IN>: key for validation xxxx.com. is marked as invalid because of a previous validation failure <some.other.xxxx.com. A IN>: signature expired from 192.26.92.30 for DS xxxx.com. while building chain of trust
>
>
> I suspect the impact was mild primarily because many do not
> enable DNSSEC validation in critical places today.  At least one big
> provider who had validation enabled had to re-route away from the bad PoP
> (awesome they could do it, but that's not a reasonable expectation of
> everyone).   Had validation been widely enabled, I suspect this would have
> been a significant event.
>
> We've experienced stale anycast PoPs over the years.  It's not an easy
> problem to completely avoid, so I'm sympathetic and think we should try to
> be resilient to it.
>
>> Within reasonable retry limit counts and error response
>> hold-downs, bad/stale/... data from a single server should not be final
>> whether it is DNSSEC-related or not.
>>
>
> I see.  We had guessed that the stale answer was being treated as valid,
> cacheable authoritative data, similar to how an NXDOMAIN would be - and a
> resolver would not retry.  If that's not the case, that's reassuring, but
> it would be great to make sure the RFCs and implementations all agree on it.
>
> Gavin
>
>