Google Public DNS & DNS Flag Day 2020

Anthony Lieuallen alieuall at google.com
Mon Oct 11 18:21:05 UTC 2021


Google Public DNS is a large open resolver, available to the public.  DNS
Flag Day 2020 ( https://dnsflagday.net/2020/ ) was a coordinated,
internet-wide effort in which DNS operators agreed to set the EDNS buffer
size parameter in outgoing queries, with the goal of limiting IP
fragmentation and thereby improving the overall reliability and performance
of the global DNS.  This is a summary of how Google Public DNS participated
and what we learned through the experience.
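
For concreteness, here is a minimal sketch of what setting this parameter
on a query looks like, using the dnspython library (illustrative only, not
our production code; the name and record type are arbitrary):

    import dns.message

    # Build a query that advertises, via the EDNS OPT record, a
    # 1400-byte UDP payload size: "do not send me a UDP response
    # larger than this; truncate it instead."
    query = dns.message.make_query("example.com.", "TXT",
                                   use_edns=0, payload=1400)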

Participation

Being a large open resolver, we generate outgoing queries to many DNS
servers across the world, and across many heterogeneous networks.  Large
response packets can sometimes be fragmented, and delivery of fragments is
generally less reliable.  When a UDP response packet is never delivered,
the client can do nothing but wait out its timeout and give up.

Waiting for an outgoing UDP query to time out consumes a significant
portion of the overall deadline we set for the iterative resolution
process, and so contributes to outright failure of the query.  In other
words, for large responses the small penalty of receiving a truncated
response and retrying over TCP beats the large penalty of timing out
completely (and then retrying anyway).
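
As an illustration of this tradeoff, a hedged sketch using dnspython
(again illustrative only, not our resolver code; 192.0.2.1 is a
documentation placeholder address): query over UDP, and on truncation pay
the small penalty of a TCP retry rather than the large penalty of a UDP
timeout.

    import dns.flags
    import dns.message
    import dns.query

    def resolve(qname, rdtype, server, timeout=2.0):
        # Advertise a 1400-byte EDNS buffer size on the UDP query.
        q = dns.message.make_query(qname, rdtype, use_edns=0, payload=1400)
        resp = dns.query.udp(q, server, timeout=timeout)
        if resp.flags & dns.flags.TC:
            # The response didn't fit in 1400 bytes: the server
            # truncated instead of fragmenting, so retry over TCP.
            resp = dns.query.tcp(q, server, timeout=timeout)
        return resp

    # DNSKEY and TXT responses are often large enough to truncate:
    # resolve("example.com.", "DNSKEY", "192.0.2.1")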

While this is difficult to measure accurately (the only signal is the lack
of a response, and we can never know which missing responses were due to
dropped fragments), we had past anecdotal evidence of a higher-than-average
failure rate for queries with large responses.  These tended to involve
queries for the DNSKEY and TXT record types.

Having observed such failures, we were interested in participating in the
flag day as a path towards better service.  As previously discussed (
https://youtu.be/CHprGFJv_WE ), we evaluated both the suggested (
https://dnsflagday.net/2020/#message-size-considerations ) 1232-byte limit
and an alternative 1400-byte limit, and selected the latter.  (We may
reconsider these specific values over time, including configuring IPv4 and
IPv6 separately.)
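
For context, the arithmetic behind these candidate values: 1232 is derived
from the IPv6 minimum MTU of 1280 bytes minus 48 bytes of IPv6 (40) and
UDP (8) headers, a worst-case-safe bound.  1400 instead assumes the common
1500-byte Ethernet MTU, under which up to 1452 bytes (1500 - 40 - 8) fit
unfragmented over IPv6, leaving some headroom for tunnels and other
encapsulation.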

Release

In order to monitor impact as we released this change, and to provide a
path to roll back, we rolled it out as an experiment.  We defined an
experiment rate (a percentage) applied independently at random to each
outgoing query, set this rate very small, and increased it gradually over
time.  At each step we measured a variety of signals, including UDP
timeouts, TCP retries, and truncations.  We compared baseline queries to
those in the experiment, and generally found results that aligned with our
expectations: truncation rates rose and timeout rates fell.
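
A minimal sketch of this per-query selection (dnspython again; the names
and the 4096-byte baseline here are assumptions for illustration, not our
production configuration):

    import random

    import dns.message

    EXPERIMENT_RATE = 0.01  # e.g. the initial 1% step

    def make_outgoing_query(qname, rdtype):
        # Each outgoing query independently lands in the experiment
        # with probability EXPERIMENT_RATE and gets the reduced EDNS
        # buffer size; others keep the (assumed) old default.
        in_experiment = random.random() < EXPERIMENT_RATE
        payload = 1400 if in_experiment else 4096
        q = dns.message.make_query(qname, rdtype, use_edns=0,
                                   payload=payload)
        # Returning the arm lets timeout/truncation signals be
        # compared between baseline and experiment queries.
        return q, in_experiment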

Timeline

Date         Experiment %
2020-09-10   1%
2020-10-08   2%
2020-10-22   5%
2020-11-05   15%
2020-11-19   50%
2021-01-21   100%

Problem

We first set the experiment rate to 100% in January of 2021.  Until that
point all signals had looked good for months.  At a full 100%
bufsize-limited query rate we exacerbated a latent issue in the system.
Specifically: when UDP truncation forced a TCP retry and the TCP query
also failed, we could mishandle the result, caching the (empty, truncated)
UDP response for too long.  (A primary indicator: some Google systems
began to have persistent problems looking up certain large email-related
TXT records once these queries always followed the TCP retry path.)
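
To make the failure mode concrete, a hedged sketch (dnspython, purely
illustrative; not Google's resolver code) of the corrected handling: a
truncated UDP response carries no usable answer and must never enter the
cache, so if the TCP retry fails, the query should fail without caching
anything.

    import dns.exception
    import dns.flags
    import dns.message
    import dns.query

    def resolve_and_cache(cache, qname, rdtype, server, timeout=2.0):
        q = dns.message.make_query(qname, rdtype, use_edns=0, payload=1400)
        resp = dns.query.udp(q, server, timeout=timeout)
        if resp.flags & dns.flags.TC:
            try:
                # Only a complete TCP response may be cached.
                resp = dns.query.tcp(q, server, timeout=timeout)
            except (OSError, dns.exception.Timeout):
                # The bug was effectively caching the empty truncated
                # UDP response here; the right outcome is an uncached
                # failure, retried later.
                return None
        cache[(qname, rdtype)] = resp
        return resp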

At this point we reverted to a low experiment rate and started the process
of identifying and fixing the underlying problem in our service.  Once
that was done, we first ran a geographically targeted experiment, then
proceeded through another gradual increase in the experiment rate across
the globe.

Timeline 2

Date         Experiment %
2021-02-18   50%
2021-04-15   50% + 100% in one small geography
2021-04-29   50% + 100% in two small geographies
2021-05-13   75%
2021-05-27   90%
2021-06-10   100%

Results

We’ve been specifying a buffer size of 1400 for all outgoing queries (when
they include EDNS) since June 2021.  Though it took a while to get here,
owing to the need to roll back and fix an internal bug, we’ve been stable
in this configuration for several months.

We’ve observed no other problems caused by the change, and anecdotally we
have seen issue reports involving queries with large responses disappear.
Since only a small fraction of traffic (under half a percent) involves
responses large enough to be affected, the absolute numbers are small, but
as predicted we see an increase in truncated UDP responses and a decrease
in UDP timeouts.

We’re happy with the change and expect to keep it for the long term.
However, we are tracking the IETF DNS operations working group draft on
avoiding fragmentation (
https://datatracker.ietf.org/doc/html/draft-ietf-dnsop-avoid-fragmentation
).