[dns-operations] OpenDNS, Google, Nominet - New delegation update failure mode

Shumon Huque shuque at gmail.com
Fri Apr 3 17:20:10 UTC 2020


On Fri, Apr 3, 2020 at 11:59 AM Ralf Weber <dns at fl1ger.de> wrote:

> Well it was you think and others (including me) disagree for valid reasons.
> There is absolutely no reason to issue queries for some validation, when
> you already got good results.
>
> I see this is a workaround for people too lazy to update the delegations,
> and put more complexity and work on resolvers.
>

Dear Ralf,

It is possible that there exist some people who want this because they
are "too lazy" to update delegations. But I strongly suspect there are other
reasons.

Let me explain why I am personally interested in having this behavior
implemented widely. We can take care to make sure the contents of
parent/child NS sets are always in sync (and we do). What we cannot
control is the TTL value of the NS RRset in the parent. Many TLDs are
quite inflexible in this regard and only support long (~2 day) TTLs. (I
know some people will immediately say let's fix this. But we have to live
in the real world of TLDs. Folks have been asking for this forever, and
there has been no movement. And if there is movement, that will happen
on the timescale of many years or longer.)

Normally, this is not an issue for us, as we prefer long TTLs on zone
infrastructure records for stability and performance reasons. The issue
arises when we are making changes to the infrastructure, such as
migrating to another DNS provider or deploying DNSSEC. We want
to be able to back out changes very quickly if we encounter
unanticipated problems, by temporarily deploying a short TTL.

To give you a real case, sometime last year, we signed and migrated
some of our important zones to a set of new providers, after extensive
testing (verifying the zones were correctly deployed and signed, detailed
pre-delegation testing, distributed monitoring of the provider footprints,
etc.). A couple of days after pulling the trigger, we discovered breakage
in a particular region of the world where one provider's servers were
misconfigured. We weren't able to catch this pre-deployment, since our
distributed monitoring did not include nodes in the anycast catchment
area(s) of these broken servers. So, we had to back out the change, and
then deal with the lingering up-to-2-day effect of the parent NS TTL (for
parent-centric resolvers).
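
To put rough numbers on that lingering effect, here is a minimal sketch
in Python. It is a deliberately simplified model, not a description of
any particular resolver: it assumes a parent-centric resolver keeps the
NS set from the referral for the full parent TTL, while a child-centric
resolver replaces it with the child's NS set and TTL. The function name
and the 300-second child TTL are illustrative assumptions; only the
~2 day parent TTL comes from the discussion above.

```python
# Simplified model of rollback exposure. All names and the child TTL value
# are hypothetical; only the ~2 day parent TTL is taken from the text.
PARENT_NS_TTL = 2 * 24 * 3600  # seconds; long TTL fixed by the TLD
CHILD_NS_TTL = 300             # seconds; assumed short TTL set in the child zone


def worst_case_stale_seconds(parent_ttl: int, child_ttl: int,
                             parent_centric: bool) -> int:
    """Upper bound on how long a resolver may keep using the NS set it
    cached just before the change was backed out."""
    # Parent-centric: the referral's NS set is cached for the parent's TTL.
    # Child-centric: the child's authoritative NS set (and TTL) wins.
    return parent_ttl if parent_centric else child_ttl


if __name__ == "__main__":
    for label, parent_centric in (("parent-centric", True),
                                  ("child-centric", False)):
        seconds = worst_case_stale_seconds(PARENT_NS_TTL, CHILD_NS_TTL,
                                           parent_centric)
        print(f"{label}: stale for up to {seconds / 3600:.2f} hours")
```

Under these assumptions, the short TTL we deploy in the child zone only
bounds the exposure for resolvers that honor the child's NS set.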

To fend off these kinds of issues, there are some well-known infrastructure
operators that configure their resolvers to enforce a maximum cache TTL
of only 60 seconds. Should we be advocating things like that? :)
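
As a rough illustration of what such a cap does, the sketch below clamps
a record's TTL to a resolver-side maximum and counts how often a single
cached RRset could be refetched per day. The function names are
hypothetical, and the model assumes the RRset is always in demand, so it
is refreshed each time the clamped TTL expires.

```python
# Sketch of a resolver-side maximum cache TTL. Names are illustrative
# assumptions; the 60-second cap is the value mentioned in the text.
MAX_CACHE_TTL = 60      # seconds
SECONDS_PER_DAY = 86400


def effective_ttl(record_ttl: int, max_cache_ttl: int = MAX_CACHE_TTL) -> int:
    """TTL the cache actually applies: the published TTL, capped."""
    return min(record_ttl, max_cache_ttl)


def max_refreshes_per_day(record_ttl: int,
                          max_cache_ttl: int = MAX_CACHE_TTL) -> int:
    """Upper bound on refetches of one RRset per day, assuming constant demand."""
    ttl = effective_ttl(record_ttl, max_cache_ttl)
    if ttl <= 0:
        raise ValueError("TTL must be positive for this estimate")
    return SECONDS_PER_DAY // ttl


if __name__ == "__main__":
    two_days = 2 * SECONDS_PER_DAY
    print("effective TTL:", effective_ttl(two_days), "seconds")
    print("refreshes per day (capped):", max_refreshes_per_day(two_days))
    print("refreshes per day (uncapped):",
          max_refreshes_per_day(two_days, max_cache_ttl=two_days))
```

The cap shortens the stale window to a minute, but it does so by giving
up most of the caching benefit of the long TTL for every zone, not just
the ones being migrated.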

(There is a larger philosophical question that I will avoid for now, about
why resolvers should prefer non-authoritative glue, which cannot be signed,
over signed authoritative data, and whether or not we should redesign
the DNS delegation mechanism to fix that. The security of DNSSEC does
not currently rely on signed nameserver records, but why not try to catch
spoofed delegation data as early as possible, at its source?)

Shumon Huque