[dns-operations] Experiences with a post 2019 Flag Day Resolver

Mon Sep 16 18:21:20 UTC 2019

Hi folks,

I haven't seen too much discussion here about operational experiences with
post 2019 DNS Flag Day resolvers, so I thought I'd share ours. Would be
interesting to hear from others on this topic.

We recently upgraded some of our resolvers to BIND 9.14.x. Soon after, we
started getting complaints that numerous sites could not be resolved (the
response from the resolver to clients is SERVFAIL). We assumed this was
related to the post-flag-day removal of workarounds for broken authoritative
servers.
Hence, we expected these sites to also fail on software deployed by other
flag day participants, namely Google Public DNS, Cloudflare, Quad9,
OpenDNS, and recent versions of Unbound, PowerDNS, and Knot. But the sites
resolved fine on all these platforms (actually, I haven't gotten around to
testing Knot yet, since the few minutes I devoted to figuring out unfamiliar
build tools like Ninja/Meson weren't enough - will try to learn them later).

Some quick debugging revealed that this is because BIND sends outbound
queries with DNS cookies by default, and none of the other implementations
do. These non-resolving sites don't answer any queries with cookies. We
tested sending a variety of other EDNS options to them, and they likewise
do not respond. But all of them do respond to EDNS-enabled queries
containing no options.
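
For anyone who wants to reproduce this kind of probing without a resolver in
the way, here is a minimal sketch in Python (standard library only). It is
not the tooling we used. The function names, the 1232-byte UDP size, and the
choice of option code 65001 as an "unknown option" are all assumptions made
for illustration. It builds a non-recursive query carrying a single OPT
record with whatever EDNS options you pass in, and sends it over IPv4 UDP.

```python
import os
import socket
import struct

COOKIE = 10            # EDNS option code for DNS Cookies (RFC 7873)
UNKNOWN_OPTION = 65001  # assumption: a code from the local/experimental range


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype=1, options=(), qid=None, udp_size=1232):
    """Build a query (RD bit off) with one OPT record carrying `options`.

    `options` is a sequence of (code, payload) pairs. An empty sequence
    gives a plain EDNS query with no options.
    """
    if qid is None:
        qid = int.from_bytes(os.urandom(2), "big")
    # Header: ID, flags (all zero), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", qid, 0, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    rdata = b"".join(
        struct.pack("!HH", code, len(data)) + data for code, data in options
    )
    # OPT RR: root name, TYPE=41, CLASS=UDP size, TTL=0 (version 0), RDLEN
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, len(rdata)) + rdata
    return header + question + opt


def send_udp(wire, server, timeout=3.0):
    """Send one query over IPv4 UDP; return reply bytes, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(wire, (server, 53))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return None
        return reply


def probes(name):
    """Return the three query variants: no options, cookie, unknown option."""
    return {
        "plain-edns": build_query(name),
        "cookie": build_query(name, options=[(COOKIE, os.urandom(8))]),
        "unknown-option": build_query(name, options=[(UNKNOWN_OPTION, b"")]),
    }
```

Calling send_udp() with each of the probes() variants against one of a zone's
authoritative server addresses shows which variants get an answer and which
time out (None). A single timeout can of course also be plain packet loss,
so repeat before drawing conclusions.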

(BIND of course has been using cookies by default in earlier versions too,
but presumably those versions worked around this by retrying without
cookies when no response arrived.)
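
To make the presumed workaround concrete, here is a rough Python sketch of
"retry without the cookie on non-response". This is only an illustration of
the idea. It is not BIND's actual code or retry policy, and the function and
parameter names are hypothetical. The `send` argument is any callable that
takes a name and a list of (option code, payload) pairs and returns a reply,
or None when the server stays silent.

```python
COOKIE = 10  # EDNS option code for DNS Cookies (RFC 7873)


def query_with_fallback(send, name, client_cookie):
    """Try with a cookie first; on silence, retry with plain EDNS.

    Returns (reply, mode), where mode records which attempt succeeded,
    or (None, "failed") if the server answered neither attempt.
    """
    reply = send(name, [(COOKIE, client_cookie)])
    if reply is not None:
        return reply, "cookie"
    # Workaround path: same query, EDNS still on, but with no options.
    reply = send(name, [])
    if reply is not None:
        return reply, "plain-edns"
    return None, "failed"
```

A resolver with this fallback removed stops after the first attempt, which
matches the SERVFAILs we saw for servers that ignore queries carrying
options.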

The proportion of these sites, compared to the total population of zones
that our resolvers talk to, is small but not trivial. We have
attempted to contact the zone owners in question as we discover them,
pointing them to the various DNS compliance testing tools/sites. But this
was getting to be burdensome enough that we ended up turning off outbound
cookies ("send-cookie no;" in the global options).

Google Public DNS sends the EDNS Client Subnet option to authority servers
that we run, and presumably to those broken servers too. We cannot observe
the conversation between Google and the broken sites, but since the sites
resolve, we assume that Google at least has a workaround to retry such
sites without ECS (or maybe a dynamically maintained ECS blacklist is in
use). Perhaps a Google Public DNS operator can confirm or refute this.

--
Shumon Huque