<div dir="ltr"><div dir="ltr"></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 31, 2020 at 9:21 AM Ralf Weber <<a href="mailto:dns@fl1ger.de">dns@fl1ger.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 30 Jan 2020, at 14:30, Ray Bellis wrote:<br>> On 30/01/2020 11:19, Bill Woodcock wrote:<br>

><br>

>> So it’s been a week…  Cloudflare folks, have you published a <br>

>> post-incident analysis yet?<br>

><br>

> ISC has scheduled a post-mortem meeting with Cloudflare during NANOG.<br>

><br>

> We won't be publishing anything before then, but in practice any <br>

> public<br>

> report is unlikely to say much that wasn't already disclosed on the <br>

> day<br>

>  (see my email of 21:44 UTC on 2020/01/23).<br>

><br>

> The one detail I can add now is that we only received reports of one<br>

> (unnamed) resolver implementation being affected and that its failure <br>

> to<br>

> cope was also considered a bug.  It was only the combination of the <br>

> two<br>

> bugs together that caused any operational issue.<br>

Hmm can we at least agree that the root cause was the non RFC compliant<br>

answer from these root server instances? While it may have been only<br>

one implementation now, there AFAIK is nothing in the DNS RFCs (please<br>

correct me if I’m wrong there) that would consider giving out SERVFAIL<br>

to a referral from an authoritative server it cannot possibly follow<br>

non RFC compliant.<br>

<br>

Now resolvers want to give answers, which is why they usually have code (some<br>

of which we want to get rid of at DNS flag day ;-) to work around this<br>

and other non RFC compliant answers, which is why this may be considered<br>

a bug, but let's be clear about the root cause.<br></blockquote><div><br></div><div>Outages of high-availability systems rarely have a single root cause.  In this case there were (at least) two root causes:</div><div>  - Cloudflare's root servers returned an incorrect response</div><div>  - Comcast's recursive servers failed to properly handle the unexpected response</div><div><br></div><div>A better way of phrasing what you're trying to say is that the bug at the roots was the "trigger" for the outage.  But we shouldn't focus exclusively on that -- I'd like Comcast to publish a postmortem explaining their bug (and fixes) so we can all learn from the mistake.  I personally find it fascinating that a system with 13x redundancy managed to break them.</div><div><br></div><div>Damian</div></div></div>