[dns-operations] [Ext] Something happening in the root?
Damian Menscher
damian at google.com
Fri Jan 31 17:45:42 UTC 2020
On Fri, Jan 31, 2020 at 9:21 AM Ralf Weber <dns at fl1ger.de> wrote:
> On 30 Jan 2020, at 14:30, Ray Bellis wrote:
> > On 30/01/2020 11:19, Bill Woodcock wrote:
> >
> >> So it’s been a week… Cloudflare folks, have you published a
> >> post-incident analysis yet?
> >
> > ISC has scheduled a post-mortem meeting with Cloudflare during NANOG.
> >
> > We won't be publishing anything before then, but in practice any
> > public report is unlikely to say much that wasn't already disclosed
> > on the day (see my email of 21:44 UTC on 2020/01/23).
> >
> > The one detail I can add now is that we only received reports of one
> > (unnamed) resolver implementation being affected, and that its
> > failure to cope was also considered a bug. It was only the
> > combination of the two bugs together that caused any operational
> > issue.
> Hmm, can we at least agree that the root cause was the non-RFC-compliant
> answer from these root server instances? While it may have been only
> one implementation this time, AFAIK there is nothing in the DNS RFCs
> (please correct me if I'm wrong) that would consider it non-RFC-compliant
> for a resolver to return SERVFAIL on a referral from an authoritative
> server that it cannot possibly follow.
>
> Now, resolvers want to give answers, which is why they usually have code
> (some of which we want to get rid of on DNS Flag Day ;-) to work around
> this and other non-RFC-compliant answers, and which is why this may be
> considered a bug. But let's be clear about the root cause.
>
Outages of high-availability systems rarely have a single root cause. In
this case there were (at least) two root causes:
- Cloudflare's root servers returned an incorrect response
- Comcast's recursive servers failed to properly handle the unexpected
response
A better way of phrasing what you're trying to say is that the bug at the
roots was the "trigger" for the outage. But we shouldn't focus exclusively
on that -- I'd like Comcast to publish a postmortem explaining their bug
(and fixes) so we can all learn from the mistake. I personally find it
fascinating that a system with 13x redundancy managed to break them.
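The fallback behavior at issue can be illustrated with a minimal sketch. This is hypothetical code, not any real resolver's implementation: it assumes a `query_fn` callback standing in for a DNS query, and shows a resolver that only gives up after trying all 13 root server addresses, rather than returning SERVFAIL on the first malformed referral.

```python
# Hypothetical sketch of root-server fallback in a recursive resolver.
# The 13 root server names are real; query_fn is an assumed stand-in
# for an actual DNS query, raising ValueError on a malformed
# (non-RFC-compliant) response.

ROOT_SERVERS = [f"{letter}.root-servers.net" for letter in "abcdefghijklm"]

def resolve_via_roots(qname, query_fn):
    """Try each root server in turn; return its referral, or
    "SERVFAIL" only after every root has failed."""
    for server in ROOT_SERVERS:
        try:
            referral = query_fn(server, qname)
        except ValueError:
            # Malformed answer from this root: try the next one
            # instead of failing the whole query.
            continue
        if referral:
            return referral
    return "SERVFAIL"  # all 13 roots exhausted
```

The bug under discussion was, in effect, skipping the `continue` path: one malformed referral produced an immediate SERVFAIL despite twelve other roots being available.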
Damian