[dns-operations] Full-service resolver - Pending Upstream Query behaviour

Mukund Sivaraman muks at mukund.org
Tue Oct 5 20:39:31 UTC 2021


On Tue, Oct 05, 2021 at 11:04:14AM -0700, Paul Vixie wrote:
> 
> 
> Frederico A C Neves wrote on 2021-10-05 09:01:
> > ...
> > 
> > Anyway I think that even though the incident was not DNS related "We",
> > as the DNS community, could probably do better in future events.
> > 
> > I would like to start a discussion or to hear from implementers and
> > operators of Full-service resolvers on what would be the best software
> > architecture or best current configuration practice to handle a
> > traffic pattern when a very popular name enters a scenario where all
> > the auth-servers are timing out or network unreachable.

Some BIND derivatives such as ours (I don't know if ISC BIND retained
such a patch) have a holddown timer feature which caches timeouts when
contacting upstream nameservers and backs off for a bit while the server
remains unreachable.
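
As a rough illustration of the holddown idea (this is not the actual
patch; the class name, the exponential backoff, and the 2s/60s values
are all assumptions made for the sketch), a resolver could keep a small
table of upstream servers that recently timed out and skip them until
their holddown period expires:

```python
import time


class HolddownTable:
    """Hypothetical sketch: remember upstream servers that timed out."""

    def __init__(self, base=2.0, cap=60.0, clock=time.monotonic):
        self.base = base    # first holddown period in seconds (assumed value)
        self.cap = cap      # upper bound on the holddown period (assumed value)
        self.clock = clock  # injectable clock, which makes testing easy
        self._state = {}    # server -> (consecutive timeouts, held until)

    def is_held_down(self, server):
        """True while the server should not be queried."""
        entry = self._state.get(server)
        return entry is not None and self.clock() < entry[1]

    def record_timeout(self, server):
        """Note a timeout; the holddown period doubles on each repeat."""
        failures = self._state.get(server, (0, 0.0))[0] + 1
        period = min(self.cap, self.base * 2 ** (failures - 1))
        self._state[server] = (failures, self.clock() + period)
        return period

    def record_success(self, server):
        """Any good response clears the holddown state."""
        self._state.pop(server, None)
```

The resolver would call is_held_down() before sending to a server, and
record_timeout() or record_success() once the outcome is known.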

There's also a downstream servfail cache in the nameserver path of newer
versions of BIND.

Some resolvers also have implementations of the serve-stale (TTL) RFC,
and the Unbound-like behavior of returning expired answers from cache
first, even before attempting resolution. We will be checking what the
effect of this was during the Facebook outage, at least for DNS answers
from cache.
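
To make the "answer from cache first" behavior concrete, here is a
minimal sketch. It does not model any particular resolver; the class
name, the max_stale limit, and the (answer, needs_refresh) return shape
are assumptions. An expired entry is still returned right away, and the
caller is told to refresh it in the background:

```python
import time


class StaleCache:
    """Hypothetical sketch of a cache that can serve expired answers."""

    def __init__(self, max_stale=86400.0, clock=time.monotonic):
        self.max_stale = max_stale  # how long past expiry we still answer (assumed)
        self.clock = clock          # injectable clock, which makes testing easy
        self._entries = {}          # (qname, qtype) -> (answer, expires_at)

    def store(self, key, answer, ttl):
        self._entries[key] = (answer, self.clock() + ttl)

    def lookup(self, key):
        """Return (answer, needs_refresh); answer is None on a true miss."""
        entry = self._entries.get(key)
        if entry is None:
            return None, True
        answer, expires_at = entry
        now = self.clock()
        if now < expires_at:
            return answer, False  # fresh: no upstream query needed
        if now < expires_at + self.max_stale:
            return answer, True   # stale: answer now, refresh in background
        del self._entries[key]    # too old to serve at all
        return None, True
```

When all the auth servers are unreachable, the refresh fails but
clients that hit a stale entry still get an answer.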

> was cache miss deduplication by q-tuple ever standardized? it is a nec'y
> part of kaminsky resistance and so ought to be part of whatever BCP corpus
> comes about. pending upstream query behaviour would be an expansion on cache
> miss dedup by q-tuple, such that a rising tide of timeouts would yield
> probabilistic prediction of servfail for cache misses aimed at the affected
> <zone,auth>.

Due to a transient bug in NIOS BIND that unfortunately entered
operations, we found out the hard way what the cost of not de-duplicating
cache miss -> resolutions was. It's not just that the resolver chokes,
but it also floods upstream nameservers and becomes a bad internet
citizen.  We fixed it and the following draft was written (section 3.1):

https://datatracker.ietf.org/doc/html/draft-muks-dnsop-dns-thundering-herd-00

IIRC, when this draft was published, a reviewer mentioned that the
de-duplication of upstream queries behavior is also recommended in some
other RFC already for different reasons.
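
For illustration only, de-duplication of cache-miss resolutions by
q-tuple can be sketched as below. This is neither the NIOS fix nor the
text of the draft; PendingQueries and resolve_upstream are hypothetical
names, and the q-tuple is assumed to be (qname, qtype, qclass). The
first client to miss starts the upstream resolution, and every later
client asking the same question waits on that one in-flight task:

```python
import asyncio


class PendingQueries:
    """Hypothetical sketch: share one upstream resolution per q-tuple."""

    def __init__(self, resolve_upstream):
        self._resolve_upstream = resolve_upstream  # assumed coroutine function
        self._pending = {}                         # q-tuple -> asyncio.Task

    async def resolve(self, qname, qtype, qclass="IN"):
        key = (qname.lower(), qtype, qclass)       # names compare case-insensitively
        task = self._pending.get(key)
        if task is None:
            # First miss for this q-tuple: start the only upstream resolution.
            task = asyncio.create_task(self._resolve_upstream(*key))
            self._pending[key] = task
            # Forget the entry once the resolution finishes, success or not.
            task.add_done_callback(lambda _t: self._pending.pop(key, None))
        # shield() keeps one cancelled client from cancelling the shared task.
        return await asyncio.shield(task)
```

With 100 concurrent clients asking for the same name, the upstream
nameservers see one resolution instead of 100.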

		Mukund

> 
> in 2003, i implemented this as a form of negative caching, where the
> negativity spectrum included timeouts, refuseds, and servfails -- not just
> nxdomain. this worked well but needed refinement and the implementation was
> not open-source. so, you and i with rodney joffe published "resimprove"
> containing some of these ideas, but it has taken some decades to get these
> accepted.
> 
> i hope you succeed in this rekindling, and i would join any such effort.
> when it comes to authority dns responses to cache miss transactions, recent
> nonperformance is an excellent reason to predictively fail rather than
> packing good on top of bad. distributed state can be treated as a mass-like
> quantity, so that its inertia can be conserved at design time.
> 
> vixie
