[dns-operations] resolver cache question
muks at mukund.org
Sat Nov 14 05:42:39 UTC 2020
On Fri, Nov 13, 2020 at 12:41:54PM -0500, Mark Allman wrote:
> I just finished reading a paper that basically tries to figure out
> if a hostname is worth caching or not. This isn't the first
> paper like this I have read. This sort of thing strikes me as a
> solution in search of a problem. The basic idea is that there are
> lots of hostnames that are automatically generated---for various
> reasons---and only ever looked up one time. Then there is an
> argument made that these obviously clog up resolver caches.
> Therefore, if we can train a fancy ML classifier well enough to
> predict these hostnames are ephemeral and will only be resolved the
> once---because they are automatically generated and so have some
> tells---then we can save cache space (and effort) by not caching
> - My first reaction to the notion of clogging the cache is always
> to think that surely some pretty simple LFU/LRU eviction policy
> could handle this pretty readily. But, that aside...
> - I wonder how much this notion of caches getting clogged up
> really happens. Could anyone help with a clue? How often do
> resolvers evict entries before the TTL expires? Or, how much
> over-provisioning of resolvers happens to accommodate such
> records? I know resolver caching helps, but I always feel
> like I really know nothing about it when I read papers like
> this. Can folks help? Or, point me at handy references?
A metric that large resolver operators usually watch is the
cache-hit-rate (CHR). A high CHR suggests a resolver's cache is helping
avoid work for the resolver. For non-attack traffic, the TTL-expiry and
LRU rdataset eviction strategies usually work well and it is unusual to
have to mitigate for non-attack traffic specially.
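As a rough illustration of the metric (a sketch only, not taken from any
particular resolver implementation), CHR can be derived from hit and miss
counters sampled over an interval. The function name and the counter
values are assumptions made up for the example.

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Return the fraction of queries answered from cache (0.0 to 1.0).

    `hits` and `misses` are hypothetical counters sampled over some
    interval; real resolvers expose similar statistics under their own
    names.
    """
    total = hits + misses
    if total == 0:
        # No queries in the interval: report 0.0 rather than dividing by zero.
        return 0.0
    return hits / total


if __name__ == "__main__":
    # Made-up sample: 900 answers from cache, 100 requiring upstream fetches.
    print(f"CHR = {cache_hit_rate(900, 100):.2%}")
```

A falling CHR at a steady query rate is the kind of signal that would
prompt a closer look at what is occupying the cache.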
In recent years, there has been a form of traffic called random
subdomain attack or water torture attack, which led to negative cache
entries polluting a resolver's cache, and there have been mitigations
put in place to deal with them.
If there is a proposal that accurately predicts which cache entries will
be unused and can be evicted in preference, it would certainly help in
pruning the cache optimally and keeping the CHR high. Managing cache
size and watching CHR typically are things done by larger resolver
operators - usually these aren't problems in a small LAN's resolver
where defaults work ok. At a large scale, determining cache entries that
can be evicted from public DNS traffic isn't easy. Resolvers mainly
still use TTL expiry and LRU-based eviction, with small hacks for some
cases. NSEC aggressive use (RFC 8198) also helps with some forms of random
subdomain attack traffic.
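To make the TTL-expiry plus LRU combination concrete, here is a minimal
sketch. It is a toy model under stated assumptions, not how any real
resolver is implemented: the class name, the entry-count limit (real
resolvers usually limit memory, not entry count), and the
`early_evictions` counter are all hypothetical. The counter tracks
entries pushed out while their TTL was still valid, which is the
"evicted before the TTL expires" case asked about.

```python
from collections import OrderedDict


class TtlLruCache:
    """Toy cache: entries expire by TTL, and LRU eviction applies when full."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        # key -> (value, expires_at); ordered from least to most recently used
        self._entries = OrderedDict()
        self.early_evictions = 0  # evicted while the TTL was still valid

    def get(self, key, now: float):
        item = self._entries.get(key)
        if item is None:
            return None
        value, expires_at = item
        if now >= expires_at:
            # TTL expiry: drop the stale entry and report a miss.
            del self._entries[key]
            return None
        # Mark as most recently used.
        self._entries.move_to_end(key)
        return value

    def put(self, key, value, ttl: float, now: float):
        # Re-inserting a key refreshes both its TTL and its LRU position.
        if key in self._entries:
            del self._entries[key]
        self._entries[key] = (value, now + ttl)
        while len(self._entries) > self.max_entries:
            # Evict the least recently used entry.
            _, (_, expires_at) = self._entries.popitem(last=False)
            if expires_at > now:
                self.early_evictions += 1


if __name__ == "__main__":
    cache = TtlLruCache(max_entries=2)
    cache.put("a.example.", "192.0.2.1", ttl=300, now=0)
    cache.put("b.example.", "192.0.2.2", ttl=300, now=0)
    cache.get("a.example.", now=1)            # touch a, so b becomes LRU
    cache.put("c.example.", "192.0.2.3", ttl=300, now=2)
    print(cache.get("b.example.", now=3))     # b was evicted before its TTL
    print(cache.early_evictions)
```

In this model, one-shot generated names simply age toward the LRU end
and are the first to go once the cache is full.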
How often resolvers evict entries before the TTL expires, and the
related questions, depend on the resolver site's traffic, and the
answers can be temporal/fleeting, i.e., they can change quickly at the
same resolver site as traffic patterns change. Some of our customers'
resolvers have caches at the maximum
memory limit for continuous periods (new cache entries would evict old
ones), and it is not necessarily a problem.
Large sites are provisioned to keep the CHR high and handle the expected
query rate. CHR and some other related metrics such as query rate,
average query response time, fetches in progress, receive queue lengths,
etc. are monitored, and the resolver is scaled accordingly. From my
experience, the problems due to traffic at different locations/times are
different and we diagnose problems by dumping various metrics and
comparing with traffic captures on a case-by-case basis. Legitimate
traffic, including lookups for names within generated sequences of
hostnames, does not commonly cause problems for a properly provisioned
resolver. It is usually an intended or unintended flood of some kind of
traffic pattern that causes problems. Random subdomain queries are one pattern of
problematic traffic, but there are other patterns, intended and
unintended, that can cause problems for an implementation.
If generated hostnames were a significant problem, it would be an
ongoing concern. If someone proposes a working attack traffic idea, it
would be observed very quickly at busy public resolvers. I haven't come
across a pcap from a busy public resolver that didn't contain what
looked like the pattern of an attempted attack - resolvers receive
abusive traffic patterns daily, all the time. Good new ideas on
improving what to keep and what to evict from cache are welcome though.