[dns-operations] [Ext] Something happening in the root?

Damian Menscher damian at google.com
Fri Jan 31 17:45:42 UTC 2020


On Fri, Jan 31, 2020 at 9:21 AM Ralf Weber <dns at fl1ger.de> wrote:

> On 30 Jan 2020, at 14:30, Ray Bellis wrote:
> > On 30/01/2020 11:19, Bill Woodcock wrote:
> >
> >> So it’s been a week…  Cloudflare folks, have you published a
> >> post-incident analysis yet?
> >
> > ISC has scheduled a post-mortem meeting with Cloudflare during NANOG.
> >
> > We won't be publishing anything before then, but in practice any public
> > report is unlikely to say much that wasn't already disclosed on the day
> > (see my email of 21:44 UTC on 2020/01/23).
> >
> > The one detail I can add now is that we only received reports of one
> > (unnamed) resolver implementation being affected and that its failure to
> > cope was also considered a bug.  It was only the combination of the two
> > bugs together that caused any operational issue.
>
> Hmm, can we at least agree that the root cause was the non-RFC-compliant
> answer from these root server instances? While it may have been only
> one implementation now, there is AFAIK nothing in the DNS RFCs (please
> correct me if I’m wrong there) that would consider giving out SERVFAIL
> to a referral from an authoritative server that it cannot possibly follow
> non-RFC-compliant.
>
> Now resolvers want to answer, which is why they usually have code (some
> of which we want to get rid of at DNS flag day ;-) to work around this
> and other non-RFC-compliant answers, which is why this may be considered
> a bug, but let's be clear about the root cause.
>

Outages of high-availability systems rarely have a single root cause.  In
this case there were (at least) two root causes:
  - Cloudflare's root servers returned an incorrect response
  - Comcast's recursive servers failed to properly handle the unexpected
    response

A better way of phrasing what you're trying to say is that the bug at the
roots was the "trigger" for the outage.  But we shouldn't focus exclusively
on that -- I'd like Comcast to publish a postmortem explaining their bug
(and fixes) so we can all learn from the mistake.  I personally find it
fascinating that a bug in a system with 13x redundancy managed to break
their resolvers.

Damian