[dns-operations] DNS .com/.net resolution problems in the Asia/Pacific region

Tue Jul 18 22:43:22 UTC 2023

Except BIND does exactly this.  It retries, and if all the servers for the zone fail, the <name,type> is flagged as bad for 10 minutes.  Any validation that depends on that lookup then fails with DNS_R_BROKENCHAIN, which results in SERVFAIL rather than a retry.  This was how we dealt with the so-called “rollover and die” issue.

                } else if (result == DNS_R_BROKENCHAIN) {
                        isc_result_t tresult;
                        isc_time_t expire;
                        isc_interval_t i;

                        isc_interval_set(&i, DNS_RESOLVER_BADCACHETTL(fctx), 0);
                        tresult = isc_time_nowplusinterval(&expire, &i);
                        if (negative &&
                            (fctx->type == dns_rdatatype_dnskey ||
                             fctx->type == dns_rdatatype_ds) &&
                            tresult == ISC_R_SUCCESS)
                        {
                                dns_resolver_addbadcache(res, fctx->name,
                                                         fctx->type, &expire);
                        }
                        done = true;
                        goto cleanup_fetchctx;
                } else {
                        fctx_try(fctx, true, true);
                        goto cleanup_fetchctx;
                }

The world doesn’t fall over with limited retries.  We had zero reports of resolution failures due to this incident.  This also allows a validator behind a validator to work reliably, because the validator that talks directly to the authoritative servers filters out the garbage responses.  Always sending CD=1 is STUPID.
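
To make the bad-cache behaviour described above concrete, here is a minimal sketch in C.  It is not BIND's implementation: every name in it (badcache_t, badcache_add, badcache_check, the fixed 64-slot table) is hypothetical and chosen only for illustration.  It assumes owner names are already normalised to lower case and that the caller supplies the current time in seconds.  The only details taken from the message are the <name,type> key, the 10 minute lifetime, and the SERVFAIL-instead-of-retry outcome.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BC_SLOTS       64   /* illustrative fixed table size */
#define BC_TTL_SECONDS 600  /* 10 minutes, as stated in the message */
#define BC_NAME_MAX    255  /* maximum length of a presentation-format name */

typedef struct {
    char    name[BC_NAME_MAX + 1]; /* owner name, assumed lower case */
    uint16_t type;                 /* RR type, e.g. 48 = DNSKEY, 43 = DS */
    int64_t expire;                /* absolute expiry time in seconds */
    bool    used;
} badcache_entry_t;

typedef struct {
    badcache_entry_t slots[BC_SLOTS];
} badcache_t;

typedef enum { LOOKUP_TRY_SERVERS, LOOKUP_SERVFAIL } lookup_action_t;

static void badcache_init(badcache_t *bc) {
    memset(bc, 0, sizeof(*bc));
}

/* Linear search for a <name,type> entry; returns NULL if absent. */
static badcache_entry_t *badcache_find(badcache_t *bc, const char *name,
                                       uint16_t type) {
    for (size_t i = 0; i < BC_SLOTS; i++) {
        badcache_entry_t *e = &bc->slots[i];
        if (e->used && e->type == type && strcmp(e->name, name) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Flag <name,type> as bad until now + BC_TTL_SECONDS.
 * Returns false if the name is too long or the table is full. */
static bool badcache_add(badcache_t *bc, const char *name, uint16_t type,
                         int64_t now) {
    assert(bc != NULL && name != NULL);
    if (strlen(name) > BC_NAME_MAX) {
        return false;
    }
    badcache_entry_t *e = badcache_find(bc, name, type);
    for (size_t i = 0; e == NULL && i < BC_SLOTS; i++) {
        /* Reuse the first free or already expired slot. */
        if (!bc->slots[i].used || bc->slots[i].expire <= now) {
            e = &bc->slots[i];
        }
    }
    if (e == NULL) {
        return false;
    }
    strcpy(e->name, name);
    e->type = type;
    e->expire = now + BC_TTL_SECONDS;
    e->used = true;
    return true;
}

/* Decide what to do with a lookup: fail fast while the entry is live,
 * otherwise go back to the authoritative servers. */
static lookup_action_t badcache_check(badcache_t *bc, const char *name,
                                      uint16_t type, int64_t now) {
    badcache_entry_t *e = badcache_find(bc, name, type);
    if (e == NULL) {
        return LOOKUP_TRY_SERVERS;
    }
    if (e->expire <= now) {
        e->used = false; /* entry has aged out; allow a fresh attempt */
        return LOOKUP_TRY_SERVERS;
    }
    return LOOKUP_SERVFAIL;
}
```

In this sketch a caller would invoke badcache_add() once every server for the zone has failed for a DNSKEY or DS lookup, and badcache_check() before starting a new fetch.  While the entry is live the answer is SERVFAIL with no further queries sent, which is what bounds the retry load.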

> On 19 Jul 2023, at 04:54, Ondřej Surý <ondrej at sury.org> wrote:
> 
> With my implementor’s hat on, I think this is the wrong approach. It (again) adds complexity to the resolvers, and yet again it is based (mostly) on an isolated incident. I really don’t want yet another “serve-stale” in the resolvers. I have yet to see evidence that serve-stale has helped with anything since the original incident, but now every resolver has to have it because people want it.
> 
> And operationally, it will just paper over the issue, which might then go unnoticed for a longer period of time rather than being fixed right away.
> 
> Ondrej
> --
> Ondřej Surý <ondrej at sury.org> (He/Him)
> 
>> On 18. 7. 2023, at 20:38, Gavin McCullagh <gmccullagh at gmail.com> wrote:
>> 
>> I'd like to reach out to NLNet about changing Unbound to do this, so I want to make sure people have a chance to disagree.  Feel free to voice your disagreement (and reasons) here if you do.

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742              INTERNET: marka at isc.org