<div dir="ltr"><div dir="auto"><br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Wed, Jul 12, 2023, 5:28 PM Olafur Gudmundsson <<a href="mailto:ogud@ogud.com" target="_blank">ogud@ogud.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="line-break:after-white-space"><br><div><br><blockquote type="cite"><div>On Jul 11, 2023, at 8:24 PM, Gavin McCullagh <<a href="mailto:gmccullagh@gmail.com" rel="noreferrer" target="_blank">gmccullagh@gmail.com</a>> wrote:</div><br><div><span style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline!important">That is true of course, but the magnitude of this event was made much worse by dnssec. The entire COM and NET zones being bogus (including the unsigned delegations) is very different to the few that saw record changes in the prior 1-2 days. </span></div></blockquote></div><div><br></div><div>As explained in other messages this is not true due to the rolling expiry </div><div>but on the other hand DNSSEC may have exposed the problem much sooner than otherwise, I once helped a TLD operator to find a single anycast instance that they assumed was off line for the few weeks but it was still serving an expired zone. </div></div></blockquote></div><div dir="auto"></div><div dir="auto"><br></div><div>Agreed, the "entire zone" claim isn't true (thanks for explaining), but the magnitude being "much worse" still seems true. Maybe only 25% of the zone went bogus, but that is surely far more than the number of domains that had changed during that time. It is true that we will discover more severe outages sooner than less severe ones, but I'm not sure that justifies making events more severe. 
Pre-DNSSEC, a stale COM PoP was harmful to O(modified domains). Under DNSSEC, a stale PoP is harmful to O(all domains) [perhaps 25% in this case]. It's unclear that it needs to be that way.<br></div><div><br></div><div>Consider a resolver faced with ~4 obviously bogus, broken nameservers (the stale PoP) and ~9 fresh, working nameservers with valid signatures. The DNSSEC RFCs appear to specify (and Unbound appears to implement) that the resolver must accept and cache the obviously bogus (expired RRSIG) answers and return SERVFAIL to clients until those bogus answers expire (apparently 24 hours later in this case?). The alternative would be to immediately treat those responses as invalid and those nameservers as lame, on the assumption that such a response comes either from an attacker or from a broken nameserver, and then retry against another nameserver. <br></div><div><br></div><div>Obviously we need to be careful about creating retry storms, but the retry load here is pretty much equivalent to what would happen if that PoP were to return SERVFAIL or not respond at all, so it doesn't seem like a new problem. <br></div><div><br></div><div>I assume lots of us on this mailing list operate authoritative DNS servers. When one of our PoPs or nameservers is unresponsive, most of us rely on retries against other nameservers (aka PoPs) to ensure this is a non-event. When one of our PoPs is serving stale data, how bad that is depends on how stale the data is. But under DNSSEC, stale RRSIGs are obvious, and it seems like a stale server could be a non-event if the resolver is allowed to retry. <br></div><div dir="auto"><br></div><div dir="auto">Gavin</div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto"><br></div><div class="gmail_quote" dir="auto"></div></div>
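<div>To make the retry idea above concrete, here is a minimal sketch of what I have in mind. It is not what the RFCs specify and not how Unbound behaves. The names SignedAnswer, rrsig_time_valid, resolve_with_retry and max_attempts are all made up for illustration, the nameservers are simulated with a dict, and the check only looks at the RRSIG validity window (plain integer comparison, no cryptographic verification, no serial number arithmetic). The attempt cap is there to bound the retry load.</div>

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SignedAnswer:
    """Hypothetical stand-in for a DNSSEC-signed response."""
    server: str
    rrsig_inception: int   # Unix time, simplified from the RRSIG field
    rrsig_expiration: int  # Unix time, simplified from the RRSIG field


def rrsig_time_valid(answer: SignedAnswer, now: int) -> bool:
    # Only the validity window is checked here; a real validator
    # would also verify the signature itself.
    return answer.rrsig_inception <= now <= answer.rrsig_expiration


def resolve_with_retry(
    servers: Sequence[str],
    query: Callable[[str], Optional[SignedAnswer]],
    now: int,
    max_attempts: int = 3,
) -> Tuple[Optional[SignedAnswer], List[str]]:
    """Try servers in order; treat an expired-RRSIG answer like a timeout.

    Returns the first time-valid answer (or None) plus the servers that
    were marked lame along the way. max_attempts caps the retries.
    """
    lame: List[str] = []
    for server in servers[:max_attempts]:
        answer = query(server)  # None models a timeout or SERVFAIL
        if answer is None or not rrsig_time_valid(answer, now):
            lame.append(server)  # do not cache it, move on to the next server
            continue
        return answer, lame
    return None, lame  # caller would return SERVFAIL to the client


if __name__ == "__main__":
    NOW = 1_689_200_000
    DAY = 86_400
    # Simulated nameservers: two stale (expired RRSIG), one fresh.
    canned = {
        "stale-a": SignedAnswer("stale-a", NOW - 30 * DAY, NOW - DAY),
        "stale-b": SignedAnswer("stale-b", NOW - 30 * DAY, NOW - DAY),
        "fresh-a": SignedAnswer("fresh-a", NOW - DAY, NOW + 7 * DAY),
    }
    answer, lame = resolve_with_retry(list(canned), canned.get, NOW)
    print(answer.server if answer else None, lame)
```

<div>In this sketch a stale server costs the resolver the same extra queries as an unresponsive one, which is the equivalence I was arguing for. Whether a real resolver could safely do this is exactly the open question.</div>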
</div>