[dns-operations] Quad9 DNSSEC Validation?

Mon Mar 1 15:41:51 UTC 2021

TL;DR:
   - We agree: Quad9 should be more transparent about its NTA list and 
policy; that will be forthcoming, and we hope others will do the same. 
It’s time to do that.
   - NTAs are terrible, and we wish they didn't have to exist, but... 
they do, at the moment, and not just for Quad9
   - Is anyone interested in being a central NTA manager so this can be 
less arbitrary and fractured?
   - If not, can we develop a best practice on publishing NTAs and NTA 
policies for everyone to follow?
   - Better yet: Can we (recursive DNS operators) agree to just get rid 
of NTAs entirely?

Long form:
This email is a condensed summary of a conversation Bill and I had based 
on the issues mentioned in this thread, so this text is a mix of both 
his and my comments from here on down, and several thread topics are 
combined.

Bill includes: “First of all, let me say that my reply near the 
beginning of this thread was admittedly exasperated and I took a tone 
which was too short and too snide, and I apologize for that. This is an 
issue that we’ve been trying to get people to pay attention to for 
many years, and it’s immensely frustrating when we finally get someone 
to notice… and they lay it at our doorstep. But that doesn’t make it 
any less of an issue.”

So, first things first. The comments about a lack of publishing the NTA 
list are correct and we are falling short on that, and that is something 
we need to remedy. It's been on the "to-do" list, but has not ranked high 
enough for completion in our constantly large list of operational work 
with (relatively) small non-profit resources; we'll change that. We’ll 
have our NTA list up on our website shortly, after some discussion with 
the team here about the policy for what gets domains put on that list 
and how/when they should be taken off. We've recently 
undertaken extensive review of our privacy policies and transparency 
statements, and NTAs seem to be a reasonable thing to add to the list of 
review and publication. The addition process for NTAs to date has been 
subjective, and that needs to be better documented and published, and 
the domains listed in a way that can be discovered on our website. This 
needs to be done both as assurance to our users as to the exceptions to 
our validation claims, and also hopefully as an additional signal to 
operators of domains that are important enough to except but also broken 
enough to fail validation.

Adding NTAs is driven by direct complaints by end users that they cannot 
reach the resource they are trying to access - this is interrupt-driven. 
Removing NTAs has been driven by time, and testing, and available cycles 
of humans to evaluate and determine that the fault is no longer in 
place. Sometimes NTAs stay past their necessary duration, as there are 
limited resources to focus on non-interrupt items; we apologize for that 
lag in removal for some of these domains, and we think the publication 
of the list will allow others to help us remove repaired domains when 
they note that the underlying issue is no longer apparent.

As we will be undergoing this transparency process, we would hope that 
others providing similar DNS recursive services would do the same. 
Kudos to Cisco for calling that out as an intended NTA publication 
concept in their policy 
(https://learn-umbrella.cisco.com/i/1202769-support-for-dnssec-in-umbrella/0?) 
but we're unable to find this dashboard (sorry if we've just not dug 
deeply enough, or perhaps it's only available to paying customers.)  
We're not able to find even a policy statement for Cloudflare, Google, 
Comcast, Deutsche Telekom, KPN, Reliance Jio or others who are actively 
enforcing strict validation about what NTAs they have in place or when 
they are added/deleted, though there are certainly discussions about 
some of those providers having NTAs in threads similar to this one over 
time. Perhaps some of these providers have public NTA lists, but some 
quick searching did not find anything obvious - does anyone have 
pointers?

So, let’s all do this.(*) That will help people understand the scope 
of the problem, and we hope that it will get the discussion moving 
again. We would actually like to see some sort of "best practices" 
policy for NTA implementation, or at least NTA declaration, or perhaps 
our publication of our methods might move towards that as an agreeable 
first attempt at a best practice. Ideally, the best possible case would 
be to have no NTAs at all, but it's clear that most resolver operators 
have a non-zero number of NTAs in place. We hope we can come up with a 
way to use them as levers to improve security with those domains, rather 
than just create hidden exceptions.

Is anyone else here interested in the discussion about a standardized 
method of NTA publication and policy statement publication? The 
discussions about privacy policy went exceptionally well in that regard 
leading to RFC8932, though this topic of NTA transparency is a much 
smaller slice of policy framing. There may be some better forum in 
which to hold that discussion, though making it an IETF Draft 
discussion or BCP may be somewhat heavy for the need.
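
As a purely illustrative sketch of what a machine-readable NTA 
publication could look like: the field names, the example domains, and 
the JSON layout below are all assumptions made up for this example, not 
Quad9's format or any agreed standard. The one idea it tries to capture 
is that every entry carries a reason and an expiry, so stale NTAs drop 
out of the published list on their own.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class NtaEntry:
    """One published negative trust anchor (all field names are hypothetical)."""
    domain: str      # zone at which validation is disabled
    reason: str      # observed fault, e.g. "expired RRSIG"
    added: date      # day the NTA was installed
    expires: date    # day the NTA must be removed or re-reviewed
    contacted: bool  # whether the domain operator was notified

    def is_active(self, today: date) -> bool:
        # An NTA is meant to be temporary: active only inside its window.
        return self.added <= today < self.expires


def publish(entries, today):
    """Return a JSON document listing only the NTAs still in force."""
    active = [e for e in entries if e.is_active(today)]
    rows = []
    for e in sorted(active, key=lambda e: e.domain):
        row = asdict(e)
        row["added"] = e.added.isoformat()
        row["expires"] = e.expires.isoformat()
        rows.append(row)
    return json.dumps({"generated": today.isoformat(), "ntas": rows}, indent=2)
```

Anyone consuming such a list could compare it against their own 
resolver's results and report entries whose underlying fault has been 
repaired.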

>  On Feb 28, 2021, at 8:38 PM, Paul Vixie <paul at redbarn.org> wrote:
>  the technology of negative trust anchors is exactly as wrongheaded as 
>  it can possibly be. the pressure to not break stuff should be 
> unrelenting,  and the cost of breaking it should be extreme.

Yep, this is exactly correct. Honestly, we wouldn’t have started all 
this if we’d thought that we were going to be relying on NTAs. We 
launched with DNSSEC strict validation three years ago. We were naively 
optimistic, and got lucky to some degree - there were only a few problem 
domains (though some were still quite large, depth-wise, such as .gov 
and .mil) and overall the process has been good with few complaints that 
warranted NTAs, though sporadic exceptions needed to be made. It's been 
encouraging to see strict validation becoming the standard for most 
large resolvers, which is progress!  But we (meaning "large strict 
DNSSEC resolver operators") are all making do with a few NTAs, because 
although the world isn’t as bad a place as many DNSSEC naysayers 
thought it was, it’s also not as good a place as we hoped it’d be, 
either.

So to your point: Yes, we would very much like to see a world without 
NTAs, where everyone validated DNSSEC in a strict fashion such that 
problems were painful and immediate to domain operators with faults. 
Let's see what we can do to move towards that goal - we really like that 
idea. However, if that isn't the immediate result, can we all agree on a 
method to publish data that makes these exceptions less frequent and 
shorter in duration?  We pledge to have more transparency, but it would 
be disappointing if we were the only ones to do so.

> also, negative trust anchors aren’t part of the global MIB, and lead 
> to different
> behaviour for different users.

Well, kind of.  But only incidentally for different users.  Really, 
behavior is different based on which resolver the user is pointed at.

As long as each recursive resolver implements NTAs silently and 
independently, there’s not 100% overlap between them, and users just 
shop resolvers until they find one with the NTA that allows them to 
still reach af.mil or the CDC or mail.mil, or whatever. The user blames 
the resolver that doesn’t have an NTA and praises the one that does 
have an NTA (or which doesn't do DNSSEC at all!) No pressure is exerted 
on the actual offending party, and resolver operators wind up having to 
juggle the subjective risks and benefits of NTAs versus user 
departure/complaints/confusion.

Again to your point: Consistent failures are explainable; inconsistent 
failures are not.  "Well, it works on a.b.c.d but not on 9.9.9.9" is a 
difficult problem to solve when the white-hot anger of tens or hundreds 
of thousands of end users is applied to the support structures of a 
platform which can no longer resolve an important address that has just 
broken, whether through DNSSEC or some other authoritative-side issue 
that resolver operators could work around by jumping through hoops. 
Even if the 
problem is explainable ("The domain operator broke their own DNSSEC,") a 
result that leads to end users moving to a non-DNSSEC platform or 
NTA-excepted platform is a less than ideal result, but that's what we 
face.  Other providers have NTAs, so we have NTAs.

> On Feb 28, 2021, at 9:14 PM, Vladimír Čunát 
> <vladimir.cunat+ietf at nic.cz> wrote:
> My (naive?) hope is that large validating services could form some 
> agreement to start
> acting stricter in this respect.  Of course it's often hard to argue 
> that a breakage is the
> domain's fault as long as it works almost everywhere else, but 
> dnsflagday.net has shown that similar arrangements are possible to 
> pull off.

Yes, exactly.  This is a prisoner's dilemma problem, and everyone is 
defecting on their own terms - not a good situation.

There have been several hallway discussions at DNS-OARC and other 
forums, back when hallway discussions were a thing (or did it make it 
into a list discussion?) about creating shared NTA lists, or at least 
everyone publishing or stating their NTAs in some standardized way so 
that the "greater DNS community" could see what might need temporary 
workarounds. We’d very much like to be using a list that was publicly 
available and was formed and managed through public discussion. That 
would serve two goals: first, it would name-and-shame the folks who are 
so broken that they have to be put on the list; second, it would take 
care of all the resolver-shopping by users. If something caused a DNSSEC 
failure on one resolver, it would fail on the others as well. Then there 
would no longer be competitive pressure to add NTAs. It seems unlikely 
however that there could be a centralized NTA list - there were fears 
voiced of responsibility (aka: lawsuit,) mis-use or fault, and security. 
Though if some neutral party could create it, we would closely evaluate 
using such a list if it was responsive to our specific customer 
requests, and was secure. It would be surprising but welcome to see 
someone step up to this task, though DNS-OARC would be on the short list 
of candidates. As noted above, we would really just prefer a world where 
NTAs were entirely abandoned by enough of the significant operational 
community that it became impossible for a domain operator to continue 
with faults. Are we there yet?

> On Feb 28, 2021, at 7:09 PM, Scott Morizot <tmorizot at gmail.com> wrote:
> It is supposed to be temporary and domain name specific. In fact, the 
> informational
> RFC states that technical personnel should ensure it is due to a 
> misconfiguration
> and not the sort of attack DNSSEC is intended to prevent and that they 
> should make every reasonable attempt to contact the domain owner.

Yep, all those are the case.  Quad9 implements NTAs specifically, 
temporarily, after determining that it’s a misconfiguration, and then 
also making a reasonable attempt to contact the domain name owner (SOA 
email addresses or RFC2142 addresses are typically used, but that is 
another thread of woe, so we end up scraping websites and often in 
languages that are not typically used by our support desk - we do make 
the effort.)  We are quite often successful in reaching domain operators 
and informing them that their DNSSEC is not functioning as expected, and 
that typically precludes any NTA addition - I think the summary here is 
that NTAs are quite rare, and we do try to help authoritative operators 
identify their problems. Most NTAs can be removed after a short period 
in place, once the domain operator has made the repair.
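
For anyone wanting to automate the first pass of that contact step, 
here is a minimal sketch of deriving candidate addresses from the SOA 
RNAME field and a few RFC2142 role mailboxes. The function names and the 
choice of which role mailboxes to try are assumptions for illustration; 
this is not a description of Quad9's tooling, and it does nothing about 
addresses that bounce.

```python
# A subset of the role mailboxes named in RFC 2142 (selection is arbitrary).
RFC2142_MAILBOXES = ("hostmaster", "postmaster", "abuse", "noc", "security")


def rname_to_email(rname: str) -> str:
    """Convert an SOA RNAME (e.g. 'hostmaster.example.com.') to an email address.

    The first unescaped dot separates the mailbox local part from the domain;
    a dot inside the local part appears as a backslash-escaped dot.
    """
    name = rname.rstrip(".")
    local = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "\\" and i + 1 < len(name):
            local.append(name[i + 1])  # escaped character stays in the local part
            i += 2
        elif ch == ".":
            return "".join(local) + "@" + name[i + 1:]
        else:
            local.append(ch)
            i += 1
    raise ValueError(f"no domain part in RNAME: {rname!r}")


def contact_candidates(zone: str, soa_rname: str) -> list:
    """List addresses to try, SOA contact first, then role mailboxes."""
    zone = zone.rstrip(".")
    candidates = [rname_to_email(soa_rname)]
    for box in RFC2142_MAILBOXES:
        addr = f"{box}@{zone}"
        if addr not in candidates:
            candidates.append(addr)
    return candidates
```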

Zones under .GOV have been a continuous challenge, as have those within 
.MIL. There were wide-ranging faults in those TLDs for some time, 
creating continuous and new support threads. The move towards mandatory 
DNSSEC for those zones was admirable, and we think was the right 
fundamental decision, but the operational reality of a first-mover 
project caused many lumps in the process. There are fewer issues now, 
and we're encouraged to see so much of this domain space signed. Is it 
time to remove those NTAs?  Almost certainly, and we agree that today 
those are too broad a set of exceptions. The remaining zones that are 
failing strict validation under those top-level domains will have to be 
contacted as the faults arise, and possibly more specific NTAs 
re-implemented if they continue to cause a high enough complaint ratio. 
Or maybe we reinstall no NTAs in those TLDs if the problems have 
subsided to a level that allows more specific focus on just a few faulty 
zones, to produce the pain required for repair.
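
To make the "too broad" point concrete, here is a small sketch of how 
NTA scope works, assuming the usual behavior that an NTA placed at a 
name disables validation for that name and everything beneath it. The 
function name and the example NTA sets are hypothetical; real resolvers 
each have their own configuration syntax for this.

```python
def covering_nta(qname, ntas):
    """Return the most specific NTA that covers qname, or None.

    Matching is done on whole labels, walking from the full name up
    toward the root, so the first hit is the closest enclosing NTA.
    """
    labels = qname.lower().rstrip(".").split(".")
    anchors = {n.lower().rstrip(".") for n in ntas}
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in anchors:
            return candidate
    return None


# A TLD-wide NTA switches off validation for every name under it:
print(covering_nta("www.af.mil.", {"mil"}))
# A zone-specific NTA leaves correctly signed siblings validated:
print(covering_nta("www.af.mil.", {"zone.example"}))
```

With the TLD-wide set, every query under that TLD matches an NTA; with 
the narrow set, only names under the one listed zone do, which is what 
lets failures elsewhere stay visible to the operators responsible.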

Perversely, the more users one has who are in US government sector 
areas, the more severe the problems when zones within .gov failed 
previously due to DNSSEC errors, and the more rapidly the users shifted 
away to non-DNSSEC resolvers in those problem events. As many of our 
beta-user base several years ago were US-based state, local, and small 
federal offices, this led to Quad9 being more than normally sensitive to 
faults on zones within those TLDs. This is not an excuse, but is some 
background on why those two particular zones were so broadly excepted.

> At the IRS, most of our DNS is signed.

We are in fervent agreement that important domains like IRS.gov 
should be signed, and all domains ultimately, and we've been 
disappointed that there was enough breakage in .GOV to cause 
continual support challenges. Too much time has passed since a full NTA 
review on our side, and we need to focus on just the domains that 
continue to be faulty and which cause our end users the most difficulty. 
We agree that needs to be a more transparent list, and a more 
transparent policy, and we'll make that happen soon - thank you for 
calling us out on this, and we'll do better, and we hope that leads to 
everyone else moving in that same direction of transparency.

(*) Can we short-circuit this whole issue, perhaps? Have we reached a 
world where strict validation of DNSSEC is now viable, with no NTAs? I 
think it is worth evaluating, because even if that day is not today or 
this year, when would it be? How could we determine the viability of 
such a shift?  If NTA elimination was a DNS Flag Day event for 
strict-validating recursive operators, where some significant portion of 
the largest resolvers agreed on that policy, I know that would make 
everyone here exceptionally happy. This whole subjective-decision issue 
could go away and functional comparisons against other large recursive 
resolver arrays (open or closed) would not have any differences in 
DNSSEC results, at least none that would be able to be blamed on "manual 
exceptions." I think this deserves to be broken out into a separate 
thread of discussion if anyone wishes to continue the conversation, as 
this is not a Quad9-specific aspiration.

--
John Todd - jtodd at quad9.net - +1-415-831-3123
General Manager - Quad9 Recursive Resolver