<div dir="ltr"><div dir="ltr"></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 31, 2020 at 9:21 AM Ralf Weber <<a href="mailto:dns@fl1ger.de">dns@fl1ger.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 30 Jan 2020, at 14:30, Ray Bellis wrote:<br>> On 30/01/2020 11:19, Bill Woodcock wrote:<br>
><br>
>> So it’s been a week… Cloudflare folks, have you published a <br>
>> post-incident analysis yet?<br>
><br>
> ISC has scheduled a post-mortem meeting with Cloudflare during NANOG.<br>
><br>
> We won't be publishing anything before then, but in practice any <br>
> public<br>
> report is unlikely to say much that wasn't already disclosed on the <br>
> day<br>
> (see my email of 21:44 UTC on 2020/01/23).<br>
><br>
> The one detail I can add now is that we only received reports of one<br>
> (unnamed) resolver implementation being affected and that its failure <br>
> to<br>
> cope was also considered a bug. It was only the combination of the <br>
> two<br>
> bugs together that caused any operational issue.<br>
Hmm, can we at least agree that the root cause was the non-RFC-compliant<br>
answer from these root server instances? While it may have been only<br>
one implementation this time, AFAIK there is nothing in the DNS RFCs<br>
(please correct me if I'm wrong there) that would consider it<br>
non-RFC-compliant for a resolver to return SERVFAIL on a referral from<br>
an authoritative server that it cannot possibly follow.<br>
<br>
Now, resolvers want to give answers, which is why they usually have code<br>
(some of which we want to get rid of at DNS Flag Day ;-) to work around<br>
this and other non-RFC-compliant answers. That is why this may be considered<br>
a bug, but let's be clear about the root cause.<br></blockquote><div><br></div><div>Outages of high-availability systems rarely have a single root cause. In this case there were (at least) two root causes:</div><div> - Cloudflare's root servers returned an incorrect response</div><div> - Comcast's recursive servers failed to handle the unexpected response properly</div><div><br></div><div>A better way of phrasing what you're trying to say is that the bug at the roots was the "trigger" for the outage. But we shouldn't focus exclusively on that -- I'd like Comcast to publish a postmortem explaining their bug (and fixes) so we can all learn from the mistake. I personally find it fascinating that a system with 13x redundancy managed to break them.</div><div><br></div><div>Damian</div></div></div>
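[Editor's note: the retry behaviour discussed above -- treating one server's unusable answer as a reason to try another server for the zone, rather than to return SERVFAIL immediately -- can be sketched as below. This is purely illustrative pseudocode in Python; the affected resolver implementation was not named, and this is not its code. The server names and the "usable answer" check are stand-ins.]

```python
# Hypothetical sketch: with 13 root server letters, a non-compliant answer
# from one instance need not fail the whole query -- the resolver can fall
# back to the remaining servers and SERVFAIL only if all of them fail.

def resolve(servers, query):
    """Return the first usable answer; SERVFAIL only if every server fails."""
    for server in servers:
        answer = query(server)
        if answer is not None:  # usable, RFC-compliant answer
            return answer
        # Unusable or non-compliant answer: try the next server instead
        # of giving up immediately.
    return "SERVFAIL"

# Simulated root letters: one returns a broken answer, the next one works.
responses = {"a.example-root": None, "b.example-root": "referral to .com"}
result = resolve(["a.example-root", "b.example-root"],
                 lambda s: responses.get(s))
# -> "referral to .com"
```

Under this fallback logic, only the combination of a bad answer from every reachable server produces SERVFAIL, which matches the observation that it took two bugs together to cause an operational issue.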