[dns-operations] dnssec-failed.org and dns.google
lists at mn0.us
Fri Aug 16 00:15:01 UTC 2019
On Thu, Aug 15, 2019 at 9:13 PM Puneet Sood via dns-operations
<dns-operations at dns-oarc.net> wrote:
> On Thu, Aug 15, 2019 at 9:38 AM Livingood, Jason
> <Jason_Livingood at comcast.com> wrote:
> > Hi Puneet - Given the extraordinary level of use of Google Public DNS as a recursive resolver, reliability, predictability, standards compliance, and transparency are especially important. So I wonder if you wouldn't mind more fully describing the change that was rolled out (it seems to be related to DNSSEC validation), or at least what it was intended to do. Clearly in this case, a purposely broken DNSSEC domain should never resolve when validation is performed.
> We have been making changes to improve our handling of unsupported
> algorithms and other corner cases. The breakage here was due to a bug
> in the case when there is a DS record in the parent but no matching
> DNSKEY record in the child zone.
> We have monitoring using public DNSSEC test domains (including
> dnssec-failed.org) to catch such problems early. Due to an unfortunate
> coincidence the monitoring system had a misconfiguration at the same
> time as the rollout of the DNSSEC validation changes.
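The failure mode described above (a DS RRset in the parent with no matching DNSKEY in the child) is one a validator must classify as Bogus, not Insecure, per the DNSSEC specs (RFC 4035). A minimal Python sketch of just that key-matching step, using made-up key tags and omitting the digest and signature checks a real validator performs:

```python
# Illustrative sketch only: classifies a child zone's DNSSEC status
# from the parent's DS RRset and the child's DNSKEY RRset.
# Key tags/algorithms below are hypothetical; a real validator also
# verifies DS digests and RRSIGs before trusting any match.

from dataclasses import dataclass

@dataclass(frozen=True)
class DS:
    key_tag: int
    algorithm: int

@dataclass(frozen=True)
class DNSKEY:
    key_tag: int
    algorithm: int

def classify_child_zone(parent_ds, child_dnskeys):
    """Return 'insecure', 'secure', or 'bogus' for the child zone."""
    if not parent_ds:
        # No DS in the signed parent: the child is provably unsigned.
        return "insecure"
    matches = [
        ds for ds in parent_ds
        for key in child_dnskeys
        if ds.key_tag == key.key_tag and ds.algorithm == key.algorithm
    ]
    # DS present but no matching DNSKEY: validation must fail,
    # and the resolver answers SERVFAIL rather than serving data.
    return "secure" if matches else "bogus"

# A deliberately broken chain, like dnssec-failed.org's:
print(classify_child_zone([DS(12345, 8)], []))  # bogus
```

The bug Puneet describes would correspond to mishandling the "bogus" branch of this decision.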
This thread has focused on dnssec-failed.org, but it sounds like this
may have been a serious bug in DNSSEC validation.
Is that true? If an attacker poisoned a zone's DNSKEY record set, or
something, was this exploitable?
From Public DNS's available logs, can Google determine how many
queries might have been affected by this bug?
Or if there are indications that the bug -- maliciously or
accidentally -- allowed bogus data to be served?
> > Also, I am not well acquainted with how your management systems work, but hopefully you have a mechanism for more rapid rollback/deployment of new code when there are critical security flaws. Certainly in this case I think this is a low-priority security issue, but a higher-priority issue should probably be fixed more rapidly than 24 hours, plus longer for some cached responses. I'd imagine you could both force a cache flush across all systems on a domain-level basis and push code more aggressively if needed (given the right risk/reward balance, which is not really met in this instance).
> We do have the ability to rollback changes faster and to flush
> specific bad domains from our caches. In this case, we decided the
> situation did not warrant using the faster deployment options.
> > On a related note, I have noticed that Google Public DNS and some other resolvers will in some DNSSEC conditions provide valid responses when strictly speaking a validation failure would be the standards-based response - perhaps by using automated Negative Trust Anchors (NTAs) in certain circumstances (e.g. recent key rollover detected). My guess is that this is likely considered friendly for users in some scenarios and probably helpful to furthering DNSSEC deployment in the short-term. IMO it may be useful for any resolvers doing that sort of thing to write it up in an informational I-D so that folks can understand what's happening and it can inform future standards development for how validation should work.
> We do not use automated negative trust anchors - we configure them
> manually on an infrequent basis. If you notice real problems, do let
> us know through one of the ways (mailing list or issue tracker)
> described at https://developers.google.com/speed/public-dns/groups.
> for the Google Public DNS team