[dns-operations] Experiences with a post 2019 Flag Day Resolver

Mark Andrews marka at isc.org
Mon Sep 16 22:15:32 UTC 2019



> On 17 Sep 2019, at 4:21 am, Shumon Huque <shuque at gmail.com> wrote:
> 
> Hi folks,
> 
> I haven't seen too much discussion here about operational experiences with post 2019 DNS Flag Day resolvers, so I thought I'd share ours. Would be interesting to hear from others on this topic.
> 
> We recently upgraded some of our resolvers to BIND 9.14.x. Soon after, we started getting complaints about numerous sites that could not be resolved (the response from the resolver to clients is SERVFAIL). We assumed this was related to the post-flag-day removal of workarounds for broken authoritative servers, so we expected the same sites to fail on software deployed by the other flag day participants, namely Google Public DNS, Cloudflare, Quad9, OpenDNS, and recent versions of Unbound, PowerDNS, and Knot. But the sites resolved fine on all of these platforms (I haven't actually tested Knot yet; the few minutes I spent on its unfamiliar Meson/Ninja build tooling weren't enough, so I'll come back to it later).
> 
> Some quick debugging revealed that this is because BIND sends outbound queries with DNS Cookies by default, and none of the other implementations do. These non-resolving sites don't answer any query that carries a cookie. We tested sending a variety of other EDNS options to them, and the non-response behavior is the same; but all of them do respond to EDNS-enabled queries containing no options.
> 
> (Of course BIND has used cookies by default in earlier versions too, but presumably those versions retried without them on non-response.)
> 
> The proportion of these sites, relative to the total population of zones our resolvers talk to, is small but not trivial. We have been contacting the zone owners as we discover them, pointing them at the various DNS compliance testing tools/sites, but this became burdensome enough that we ended up turning off outbound cookies ("send-cookie no;" in the global options).

Do you have actual discovery rates?
Did the discoveries slow down over time?
Did you install per-server "server" clauses for the broken servers?
Did you go back and re-test those servers after a while?
Would you be willing to list the broken servers publicly?

My recursive server currently has 4 servers listed. Of these, 2 are now fixed.

        server 193.184.54.212 { send-cookie false; };
        server 88.208.234.46 { send-cookie false; }; // gatwickaviationsociety.org.uk
        server 46.163.66.150 { send-cookie false; }; // easyweb.dk
        server 202.142.133.116 { send-cookie false; }; // esupport.net.au
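For comparison, the global workaround Shumon mentioned would look something like this (a minimal sketch of the options statement; unlike the per-server clauses above, it turns off outbound cookies for every server):

        options {
                // Stop adding the EDNS COOKIE option to all outgoing queries.
                // The per-server clauses above are the narrower alternative.
                send-cookie no;
        };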

> Google Public DNS sends the EDNS Client Subnet (ECS) option to authoritative servers that we run, and presumably to those broken servers too. We cannot observe the conversation between Google and the broken sites, but since they do resolve, we assume Google at least has a workaround of retrying such servers without ECS (or perhaps a dynamically maintained ECS blacklist is in use). Perhaps a Google Public DNS operator can confirm or deny this.
> 
> --
> Shumon Huque
> 
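On the ECS point above: rather than inferring from Google's behaviour, a suspect server can be probed directly by sending it a query with and without a client-subnet option, roughly like this (the addresses and names are placeholders):

        # EDNS query carrying an ECS option (192.0.2.0/24 is a documentation prefix).
        dig +subnet=192.0.2.0/24 @ns1.example.net example.net SOA

        # The same query with no ECS option, for comparison.
        dig @ns1.example.net example.net SOA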

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742              INTERNET: marka at isc.org
