[dns-operations] resolver cache question

Robert Edmonds edmonds at mycre.ws
Fri Nov 13 22:55:21 UTC 2020


Mark Allman wrote:
> I just finished reading a paper that basically tries to figure out
> if a hostname is worth caching or not [1].  This isn't the first
> paper like this I have read.  This sort of thing strikes me as a
> solution in search of a problem.  The basic idea is that there are
> lots of hostnames that are automatically generated---for various
> reasons---and only ever looked up one time.  Then there is an
> argument made that these obviously clog up resolver caches.
> Therefore, if we can train a fancy ML classifier well enough to
> predict these hostnames are ephemeral and will only be resolved the
> once---because they are automatically generated and so have some
> tells---then we can save cache space (and effort) by not caching
> these.
> 
>   - My first reaction to the notion of clogging the cache is always
>     to think that surely some pretty simple LFU/LRU eviction policy
>     could handle this pretty readily.  But, that aside...
> 
>   - I wonder how much this notion of caches getting clogged up
>     really happens.  Could anyone help with a clue?  How often do
>     resolvers evict entries before the TTL expires?  Or, how much
>     over-provisioning of resolvers happens to accommodate such
>     records?  I know resolver caching helps [2], but I always feel
>     like I really know nothing about it when I read papers like
>     this.  Can folks help?  Or, point me at handy references?
> 
> [1] https://www.sciencedirect.com/science/article/abs/pii/S1389128620312627
> [2] https://www.icir.org/mallman/pubs/All20b/

This is a network security paper, not a systems engineering paper. The
authors are primarily concerned with the storage requirements of systems
that store DNS data well beyond the nominal TTL of the RRsets, which
they want to reduce by filtering out "disposable" records.

From the paper:

    … redundant disposable domain names will bring unnecessary storage
    overhead to the passive DNS (pDNS) database and affect its I/O
    speed.

    … and the storage space of pDNS database can be reduced by 34%.

    … disposable domains seriously affect the construction of pDNS
    database. … But the disposable domains are undoubtedly redundant for
    pDNS database. In particular, if we store the whole-day distinct
    domain records into a key–value-store database, then about 34% of
    the memory space will be wasted by disposable domains, which will be
    demonstrated in Section 6.4. In other words, disposable domains
    dramatically increase the storage burden of pDNS database and
    severely compromise its efficiency.

    … As the time span increases, the proportion of disposable domains
    in full traffic is gradually increasing. In the 5-minute-scenario,
    on average, only 14% of the entries in each pDNS database are
    related to disposable domains. In the hour-scenario, about 21% of
    the entries in each pDNS database are related to disposable domains.
    But in the day-scenario, this ratio has risen to 34%. In other
    words, with the growth of database size (especially in the time
    dimension), the proportion of disposable domains increases as well.

As far as detecting whether "clogging" of DNS resolver caches occurs, it
probably would make sense to have resolvers increment a "premature
expiration" counter when they expire cache entries prior to the
scheduled TTD of the cached RRset or cached message. This would allow
operators to provide feedback to resolver developers as to whether this
behavior is actually occurring in the real world. (Looking at the
unbound-control manpage I don't think unbound has such a metric, not to
pick on unbound.)
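
As a sketch of what such a counter might look like, here is a toy TTL
cache that counts capacity evictions occurring before the entry's
scheduled TTD. This is purely illustrative (the class and its eviction
policy are hypothetical, not modeled on unbound or any real resolver):

```python
import time


class CountingCache:
    """Toy TTL cache that counts "premature expirations": entries evicted
    for capacity reasons before their scheduled time-to-die (TTD).
    Hypothetical sketch, not based on any real resolver's internals."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> (value, ttd)
        self.premature_expirations = 0

    def put(self, key, value, ttl, now=None):
        now = time.monotonic() if now is None else now
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Cache is full: evict the entry with the nearest TTD, and
            # count it as premature if it had not yet expired.
            victim = min(self.entries, key=lambda k: self.entries[k][1])
            if self.entries[victim][1] > now:
                self.premature_expirations += 1
            del self.entries[victim]
        self.entries[key] = (value, now + ttl)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.entries.get(key)
        if entry is None or entry[1] <= now:
            self.entries.pop(key, None)  # lazily drop expired entries
            return None
        return entry[0]
```

An operator could then graph that counter over time: if it stays at
zero, the cache is never "clogged" badly enough to force early
evictions, and schemes for filtering disposable names buy nothing.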

If resolver caches are clogging due to cache entries that are only ever
used once, there is a really simple tweak to the LRU algorithm:

https://en.wikipedia.org/wiki/Cache_replacement_policies#Segmented_LRU_(SLRU)

    SLRU cache is divided into two segments, a probationary segment and
    a protected segment. Lines in each segment are ordered from the most
    to the least recently accessed. Data from misses is added to the
    cache at the most recently accessed end of the probationary segment.
    Hits are removed from wherever they currently reside and added to
    the most recently accessed end of the protected segment. Lines in
    the protected segment have thus been accessed at least twice. The
    protected segment is finite, so migration of a line from the
    probationary segment to the protected segment may force the
    migration of the LRU line in the protected segment to the most
    recently used (MRU) end of the probationary segment, giving this
    line another chance to be accessed before being replaced. The size
    limit on the protected segment is an SLRU parameter that varies
    according to the I/O workload patterns. Whenever data must be
    discarded from the cache, lines are obtained from the LRU end of the
    probationary segment.

Put a limit on the size of the probationary segment (e.g. the
probationary segment of the cache can only occupy 10% of the total
amount of space allocated to the cache) and you're done. This is
probably simpler to implement for existing DNS resolver implementations
than even a Bloom filter and certainly simpler than any schemes based on
detecting which domains are "disposable".
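
To make the shape of that concrete, here is a minimal SLRU sketch with a
capped probationary segment, following the description quoted above. The
class name, the 10% default, and the use of OrderedDicts are my own
illustrative choices, not anything a DNS resolver actually ships:

```python
from collections import OrderedDict


class SLRUCache:
    """Segmented LRU sketch: first-time entries live in a size-capped
    probationary segment; entries hit a second time are promoted to the
    protected segment. Evictions come from the probationary LRU end, so
    looked-up-once names can never displace repeatedly-used ones."""

    def __init__(self, capacity, probationary_fraction=0.1):
        # e.g. probationary segment limited to 10% of total cache space
        self.probationary_cap = max(1, int(capacity * probationary_fraction))
        self.protected_cap = max(1, capacity - self.probationary_cap)
        self.probationary = OrderedDict()  # MRU end is the dict's tail
        self.protected = OrderedDict()

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)  # refresh MRU position
            return self.protected[key]
        if key in self.probationary:
            # Second access: promote into the protected segment.
            value = self.probationary.pop(key)
            self.protected[key] = value
            if len(self.protected) > self.protected_cap:
                # Demote the protected LRU back to the probationary MRU
                # end, giving it one more chance before replacement.
                old_key, old_val = self.protected.popitem(last=False)
                self._insert_probationary(old_key, old_val)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        else:
            self.probationary.pop(key, None)
            self._insert_probationary(key, value)

    def _insert_probationary(self, key, value):
        self.probationary[key] = value
        if len(self.probationary) > self.probationary_cap:
            self.probationary.popitem(last=False)  # evict probationary LRU
```

With this structure, a flood of automatically generated, resolved-once
names churns only the small probationary segment; names that are ever
resolved twice sit safely in the protected segment.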

-- 
Robert Edmonds
