[dns-operations] Quad9 DNSSEC Validation?
John Todd
jtodd at quad9.net
Mon Mar 1 15:41:51 UTC 2021
TL;DR:
- We agree: Quad9 should be more transparent about its NTA list and
policy; that will be forthcoming, and we hope others will do the same.
It’s time to do that.
- NTAs are terrible, and we wish they didn't have to exist, but...
they do, at the moment, and not just for Quad9
- Is anyone interested in being a central NTA manager so this can be
less arbitrary and fractured?
- If not, can we develop a best practice on publishing NTAs and NTA
policies for everyone to follow?
- Better yet: Can we (recursive DNS operators) agree to just get rid
of NTAs entirely?
Long form:
This email is a condensed summary of a conversation Bill and I had based
on the issues mentioned in this thread, so this text is a mix of both
his and my comments from here on down, and several thread topics are
combined.
Bill adds: “First of all, let me say that my reply near the
beginning of this thread was admittedly exasperated and I took a tone
which was too short and too snide, and I apologize for that. This is an
issue that we’ve been trying to get people to pay attention to for
many years, and it’s immensely frustrating when we finally get someone
to notice… and they lay it at our doorstep. But that doesn’t make it
any less of an issue.”
So, first things first. The comments about a lack of publishing the NTA
list are correct and we are falling short on that, and that is something
we need to remedy. It's been on the "to-do" list, but has not ranked high
enough to be completed amid our constantly large list of
operational work with (relatively) small non-profit resources, but we'll
change that. We’ll have our NTA list up on our website shortly, after
some discussion of policy with the team here about what gets domains
put on that list and how/when they should be taken off. We've recently
undertaken extensive review of our privacy policies and transparency
statements, and NTAs seem to be a reasonable thing to add to the list of
review and publication. The addition process for NTAs to date has been
subjective, and that needs to be better documented and published, and
the domains listed in a way that can be discovered on our website. This
needs to be done both as assurance to our users as to the exceptions to
our validation claims, and also hopefully as an additional indication to
domain operators who are important enough to except but also broken
enough to fail validation.
Adding NTAs is driven by direct complaints by end users that they cannot
reach the resource they are trying to access - this is interrupt-driven.
Removing NTAs has been driven by time, and testing, and available cycles
of humans to evaluate and determine that the fault is no longer in
place. Sometimes NTAs stay past their necessary duration, as there are
limited resources to focus on non-interrupt items; we apologize for that
lag in removal for some of these domains, and we think the publication
of the list will allow others to help us remove repaired domains when
they note that the underlying issue is no longer apparent.
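As a rough illustration of the time-driven removal described above, the sketch below gives each NTA an explicit lifetime and separates entries that are still active from those due for human review. This is not Quad9's actual tooling; the class name, field names, and the seven-day default are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical default lifetime; an assumption for illustration only.
DEFAULT_LIFETIME = timedelta(days=7)


@dataclass(frozen=True)
class NegativeTrustAnchor:
    """One NTA entry (hypothetical structure, not any resolver's real format)."""
    domain: str
    reason: str
    added: datetime
    lifetime: timedelta = DEFAULT_LIFETIME

    @property
    def expires(self) -> datetime:
        # An NTA is meant to be temporary, so every entry carries an end time.
        return self.added + self.lifetime


def split_expired(ntas, now):
    """Return (active, due_for_review) lists based on each entry's expiry."""
    active = [n for n in ntas if n.expires > now]
    due = [n for n in ntas if n.expires <= now]
    return active, due


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    entries = [
        NegativeTrustAnchor("broken.example", "expired signatures",
                            now - timedelta(days=10)),
        NegativeTrustAnchor("other.example", "DS/DNSKEY mismatch",
                            now - timedelta(days=1)),
    ]
    active, due = split_expired(entries, now)
    print("active:", [n.domain for n in active])
    print("due for review:", [n.domain for n in due])
```

Entries in the "due for review" list would still need a person (or an automated validation test) to confirm the fault is gone before removal; the sketch only surfaces them so they do not linger unnoticed.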
As we will be undergoing this transparency process, we would hope that
others providing similar DNS recursive services would do the
same. Kudos to Cisco for calling that out as an intended NTA publication
concept in their policy
(https://learn-umbrella.cisco.com/i/1202769-support-for-dnssec-in-umbrella/0?)
but we're unable to find this dashboard (sorry if we've just not dug
deeply enough, or perhaps it's only available to paying customers.)
We're not able to find even a policy statement from Cloudflare, Google,
Comcast, Deutsche Telekom, KPN, Reliance Jio or others who are actively
enforcing strict validation that covers what NTAs they have in place or when
they are added/deleted, though there are certainly discussions about
some of those providers having NTAs in threads similar to this one over
time. Perhaps some of these providers have public NTA lists, but some
quick searching did not find anything obvious - does anyone have
pointers?
So, let’s all do this.(*) That will help people understand the scope
of the problem, and we hope that it will get the discussion moving
again. We would actually like to see some sort of "best practices"
policy for NTA implementation, or at least NTA declaration, or perhaps
our publication of our methods might move towards that as an agreeable
first attempt at a best practice. Ideally, the best possible case would
be to have no NTAs at all, but it's clear that most resolver operators
have NTAs in place in a non-zero volume. We hope we can come up with a
way to use them as levers to improve security with those domains, rather
than just create hidden exceptions.
Is anyone else here interested in the discussion about a standardized
method of NTA publication and policy statement publication? The
discussions about privacy policy went exceptionally well in that regard
leading to RFC8932, though this topic of NTA transparency is a much
smaller slice of policy framing. There may be some better
forum in which to hold that discussion, though making it an IETF Draft
discussion or BCP may be somewhat heavy for the need.
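To make the idea of a standardized NTA publication more concrete, here is one possible shape for a machine-readable list, offered only as a sketch. No such format has been agreed; the field names ("operator", "domain", "added", "review_by", "reason") and the choice of JSON are assumptions for illustration.

```python
import json
from datetime import date


def render_nta_list(operator, entries):
    """Serialize NTA entries to JSON, sorted by domain so published diffs stay stable.

    Each entry is a dict with hypothetical keys: domain, added, review_by, reason.
    """
    doc = {
        "operator": operator,
        "ntas": [
            {
                # Normalize to lower case without a trailing dot.
                "domain": e["domain"].lower().rstrip("."),
                "added": e["added"].isoformat(),
                "review_by": e["review_by"].isoformat(),
                "reason": e["reason"],
            }
            for e in sorted(entries, key=lambda e: e["domain"].lower())
        ],
    }
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    print(render_nta_list("example-resolver", [
        {"domain": "Broken.Example.", "added": date(2021, 3, 1),
         "review_by": date(2021, 3, 8), "reason": "expired RRSIGs"},
    ]))
```

A flat, sorted list like this would let domain operators and other resolver operators check whether a name is excepted, and why, without needing to contact the publisher.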
> On Feb 28, 2021, at 8:38 PM, Paul Vixie <paul at redbarn.org> wrote:
> the technology of negative trust anchors is exactly as wrongheaded as
> it can possibly be. the pressure to not break stuff should be
> unrelenting, and the cost of breaking it should be extreme.
Yep, this is exactly correct. Honestly, we wouldn’t have started all
this if we’d thought that we were going to be relying on NTAs. We
launched with DNSSEC strict validation three years ago. We were naively
optimistic, and got lucky to some degree - there were only a few problem
domains (though some were still quite large, depth-wise, such as .gov
and .mil) and overall the process has been good with few complaints that
warranted NTAs, though sporadic exceptions needed to be made. It's been
encouraging to see strict validation becoming the standard for most
large resolvers, which is progress! But we (meaning "large strict
DNSSEC resolver operators") are all making do with a few NTAs, because
although the world isn’t as bad a place as many DNSSEC naysayers
thought it was, it’s also not as good a place as we hoped it’d be,
either.
So to your point: Yes, we would very much like to see a world without
NTAs, where everyone validated DNSSEC in a strict fashion such that
problems were painful and immediate to domain operators with faults.
Let's see what we can do to move towards that goal - we really like that
idea. However, if that isn't the immediate result, can we all agree on a
method to publish data that makes these exceptions less frequent and
shorter in duration? We pledge to have more transparency, but it would
be disappointing if we were the only ones to do so.
> also, negative trust anchors aren’t part of the global MIB, and lead
> to different
> behaviour for different users.
Well, kind of. But only incidentally for different users. Really,
behavior is different based on which resolver the user is pointed at.
As long as each recursive resolver implements NTAs silently and
independently, there’s not 100% overlap between them, and users just
shop resolvers until they find one with the NTA that allows them to
still reach af.mil or the CDC or mail.mil, or whatever. The user blames
the resolver that doesn’t have an NTA and praises the one that does
have an NTA (or which doesn't do DNSSEC at all!) No pressure is exerted
on the actual offending party, and resolver operators wind up having to
juggle the subjective risks and benefits of NTAs versus user
departure/complaints/confusion.
Again to your point: Consistent failures are explainable; inconsistent
failures are not. "Well, it works on a.b.c.d but not on 9.9.9.9" is a
difficult problem to solve when the white-hot anger of tens or hundreds
of thousands of end users is applied to the support structures of a
platform which can no longer resolve an important address that has just
broken, whether through DNSSEC or some other authoritative-side issue which can be
worked around by resolver operators jumping through hoops. Even if the
problem is explainable ("The domain operator broke their own DNSSEC,") a
result that leads to end users moving to a non-DNSSEC platform or
NTA-excepted platform is a less than ideal result, but that's what we
face. Other providers have NTAs, so we have NTAs.
> On Feb 28, 2021, at 9:14 PM, Vladimír Čunát
> <vladimir.cunat+ietf at nic.cz> wrote:
> My (naive?) hope is that large validating services could form some
> agreement to start
> acting stricter in this respect. Of course it's often hard to argue
> that a breakage is the
> domain's fault as long as it works almost everywhere else, but
> dnsflagday.net has shown that similar arrangements are possible to
> pull off.
Yes, exactly. This is a prisoner's dilemma problem, and everyone is
defecting on their own terms - not a good situation.
There have been several hallway discussions at DNS-OARC and other
forums, back when hallway discussions were a thing (or did it make it
into a list discussion?) about creating shared NTA lists or at least
everyone publicly publishing or stating their NTAs in some standardized
way, so that the "greater DNS community" could see what might need temporary
workarounds. We’d very much like to be using a list that was publicly
available and was formed and managed through public discussion. That
would serve two goals: first, it would name-and-shame the folks who are
so broken that they have to be put on the list; second, it would take
care of all the resolver-shopping by users. If something caused a DNSSEC
failure on one, it would DNSSEC fail on the others as well. Then there
would no longer be competitive pressure to add NTAs. It seems unlikely
however that there could be a centralized NTA list - there were fears
voiced of responsibility (aka: lawsuit,) mis-use or fault, and security.
Though if some neutral party could create it, we would closely evaluate
using such a list if it was responsive to our specific customer
requests, and was secure. It would be surprising but welcome to see
someone step up to this task, though DNS-OARC would be on the short list
of candidates. As noted above, we would really just prefer a world where
NTAs were entirely abandoned by enough of the significant operational
community that it became impossible for a domain operator to continue
with faults. Are we there yet?
> On Feb 28, 2021, at 7:09 PM, Scott Morizot <tmorizot at gmail.com> wrote:
> It is supposed to be temporary and domain name specific. In fact, the
> informational
> RFC states that technical personnel should ensure it is due to a
> misconfiguration
> and not the sort of attack DNSSEC is intended to prevent and that they
> should make every reasonable attempt to contact the domain owner.
Yep, all those are the case. Quad9 implements NTAs specifically,
temporarily, after determining that it’s a misconfiguration, and then
also making a reasonable attempt to contact the domain name owner (SOA
email addresses or RFC2142 addresses are typically used, but that is
another thread of woe, so we end up scraping websites and often in
languages that are not typically used by our support desk - we do make
the effort.) We are quite often successful in reaching domain operators
and informing them that their DNSSEC is not functioning as expected, and
that typically precludes any NTA addition - I think the summary here is
that NTAs are quite rare, and we do try to help authoritative operators
identify their problems. Most NTAs can be removed after a short period
in place, once the domain operator has made repairs.
Zones under .GOV have been a continuous challenge, as have those within
.MIL. There were wide-ranging faults in those TLDs for some time,
creating continuous and new support threads. The move towards mandatory
DNSSEC for those zones was admirable, and we think was the right
fundamental decision, but the operational reality of a first-mover
project caused many lumps in the process. There are fewer issues now,
and we're encouraged to see so much of this domain space signed. Is it
time to remove those NTAs? Almost certainly, and we agree that today
those are too broad a set of exceptions. The remaining zones that are
failing strict validation under those top-level domains will have to be
contacted as the faults arise, and possibly more specific NTAs
re-implemented if they continue to cause a high enough complaint ratio.
Or maybe we reinstall no NTAs in those TLDs if the problems have
subsided to a level that allows more specific focus on just a few faulty
zones, to produce the pain required for repair.
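The difference between a TLD-wide NTA and a zone-specific one can be sketched as follows. An NTA placed at a name suppresses validation for that name and everything beneath it, so an entry at "gov" excepts every zone under .gov, while an entry at a single zone excepts only that subtree. The function names here are hypothetical, and real resolvers implement this matching internally.

```python
def normalize(name):
    """Lower-case a domain name and split it into labels, ignoring a trailing dot."""
    return tuple(label for label in name.lower().rstrip(".").split(".") if label)


def covering_nta(name, ntas):
    """Return the most specific NTA at or above `name`, or None if none applies."""
    labels = normalize(name)
    best = None
    for nta in ntas:
        nta_labels = normalize(nta)
        # Match on whole labels, so "notgov" is not covered by an NTA at "gov".
        if nta_labels and labels[-len(nta_labels):] == nta_labels:
            if best is None or len(nta_labels) > len(normalize(best)):
                best = nta
    return best


if __name__ == "__main__":
    broad = ["gov", "mil"]
    narrow = ["broken-agency.gov"]  # hypothetical faulty zone
    for qname in ["www.irs.gov", "www.broken-agency.gov", "example.com"]:
        print(qname, "| broad:", covering_nta(qname, broad),
              "| narrow:", covering_nta(qname, narrow))
```

With the broad list, a correctly signed zone such as one under .gov loses validation along with the broken ones; with the narrow list, only the faulty zone is excepted.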
Perversely, the more users one has who are in US government sector
areas, the more severe the problems when zones within .gov failed
previously due to DNSSEC errors, and the more rapidly the users shifted
away to non-DNSSEC resolvers in those problem events. As many of our
beta-user base several years ago were US-based state, local, and small
federal offices, this led to Quad9 being more than normally sensitive to
faults on zones within those TLDs. This is not an excuse, but is some
background on why those two particular zones were so broadly excepted.
> At the IRS, most of our DNS is signed.
We are in fervent agreement that important domains like IRS.gov
should be signed, and all domains ultimately, and we've been
disappointed that there was enough breakage in .GOV that caused
continual support challenges. Too much time has passed since a full NTA
review on our side, and we need to focus on just the domains that
continue to be faulty and which cause our end users the most difficulty.
We agree that needs to be a more transparent list, and a more
transparent policy, and we'll make that happen soon - thank you for
calling us out on this, and we'll do better, and we hope that leads to
everyone else moving in that same direction of transparency.
(*) Can we short-circuit this whole issue, perhaps? Have we reached a
world where strict validation of DNSSEC is now viable, with no NTAs? I
think it is worth evaluating, because even if that day is not today or
this year, then when would it be? How could we determine the viability of
such a shift? If NTA elimination was a DNS Flag Day event for
strict-validating recursive operators, where some significant portion of
the largest resolvers agreed on that policy, I know that would make
everyone here exceptionally happy. This whole subjective-decision issue
could go away and functional comparisons against other large recursive
resolver arrays (open or closed) would not have any differences in
DNSSEC results, at least none that would be able to be blamed on "manual
exceptions." I think this deserves to be broken out into a separate
thread of discussion if anyone wishes to continue the conversation, as
this is not a Quad9-specific aspiration.
--
John Todd - jtodd at quad9.net - +1-415-831-3123
General Manager - Quad9 Recursive Resolver