[dns-operations] anycast ops willing to review a IETF draft?
klaus.mailinglists at pernau.at
Sat Mar 23 22:18:41 UTC 2019
I am sending a second email as I accidentally sent the first email too early.
> 6. R5: Consider longer time-to-live values whenever possible
> In a DNS response, each resource record is accompanied by a time-to-
> live value (TTL), which "describes how long a RR can be cached before
> it should be discarded" [RFC1034]. The TTL values are set by zone
> owners in their zone files - either specifically per record or by
> Moura, et al. Expires September 12, 2019 [Page 9]
> Internet-Draft Recomm-Authoritative-Ops March 2019
> using default values for the entire zone. Sometimes the same
> resource record may have different TTL values - one from the parent
> and one from the child DNS server. In this case, resolvers are
> expected to prioritize the answer according to Section 5.4.1 in
> While set at authoritative servers, (ATn in Figure 1), the TTL value
> in fact influences the behavior of recursive resolvers (and their
> operators - "Re_n" in the same figure), by setting an upper limit on
> how long a record should be cached before discarded. In this sense,
> caching can be seen as a sort of "ephemeral replication", i.e., the
> contents of an authoritative server are placed at a recursive
> resolver cache for a period of time up to the TTL value. Caching
> improves response times by avoiding repeated queries between
> recursive resolvers and authoritative.
> Besides improving performance, it has been argued that caching plays
> a significant role in protecting users during DDoS attacks against
> authoritative servers. To investigate that, [Moura18b] evaluates the
> role of caching (and retries) in DNS resiliency to DDoS attacks. Two
> authoritative servers were configured for a newly registered domain
> and a series of experiments were carried out using various TTL values
> (60, 1800, 3600, 86400s) for records. Unique DNS queries were sent
> from roughly 15,000 vantage points, using RIPE Atlas.
> [Moura18b] found that, under normal operations, caching works as
> expected 70% of the time in the wild. It is believed that complex
> recursive infrastructure (such as anycast recursives with fragmented
> cache), besides cache flushing and hierarchy explains these other 30%
> of the non-cached records. The results from the experiments were
> confirmed by analyzing authoritative traffic for the .nl TLD, which
> showed similar figures.
> [Moura18b] also emulated DDoS attacks on authoritative servers by
> dropping all incoming packets for various TTL values. For
> experiments when all authoritative servers were completely
> unreachable, they found that the TTL value on the DNS records
> determined how long clients received responses, together with the
> status of the cache at the attack time. Given the TTL value
> decreases as time passes at the cache, it protected clients for up to
> its value in cache. Once the TTL expires, there was some evidence of
> some recursives serving stale content [I-D.ietf-dnsop-serve-stale].
> Serving stale is the only viable option when TTL values expire in
> recursive caches and authoritative servers became completely
> They also emulated partial-failure DDoS, i.e., DDoS attacks that leave
> authoritatives able to respond to only part of the queries
> (similar to Dyn 2016 [Perlroth16]). They emulated such a scenario by
> dropping incoming packets at rates of 50-90%, for various TTL values.
> They found that:
> o Caching was a key component in the success of queries. For
> example, with a 50% packet drop rate at the authoritatives, most
> clients eventually got an answer.
> o Recursive retries were also a key part of resilience: when caching
> could not help (for a scenario with TTL of 60s, and time in
> between probing of 10 minutes), recursive servers kept retrying
> queries to authoritatives. With 90% packet drop on both
> authoritatives (with TTL of 60s), 27% of clients still got an
> answer due to retries, at the price of increased response times.
> However, this came with a price for authoritative servers: an 8.1
> times increase in normal traffic during a 90% packet drop with TTL
> of 60s, as recursives attempt to resolve queries - thus
> effectively creating "friendly fire".
> Altogether, these results help to explain why previous attacks
> against the Roots were not noticed by most users [Moura18b] and why
> other attacks (such as Dyn 2016 [Perlroth16]) had significant impact
> on user experience: records in the Root zone have TTL values ranging
> from 1 to 6 days, while some of the unreachable Dyn clients had TTL
> values ranging from 120 to 300s, which limit how long records ought
> to be cached.
> Therefore, given the important role of the TTL on user's experience
> during a DDoS attack (and in reducing "friendly fire"), it is
> recommended that DNS zone owners set their TTL values carefully,
> using reasonable TTL values (at least 1 hour) whenever possible,
> given its role in DNS resilience against DDoS attacks. However, the
> choice of the value depends on the specifics of each operator (CDNs
> are known for using TTL values in the range of a few minutes). The
> drawback of setting larger TTL values is that changes on the
> authoritative system infrastructure (e.g.: adding a new authoritative
> server or changing IP address) will take at least as long as the TTL
> to propagate among clients.
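To make the quoted reasoning concrete, here is a minimal sketch of how long a
cached record can keep answering clients once all authoritatives become
unreachable. It assumes a resolver that honours the TTL exactly and does not
serve stale data; the function name and the 600 s record age (taken from the
10-minute probing interval mentioned in the draft) are only illustrative.

```python
def protection_window(ttl: int, age_at_attack: int) -> int:
    """Seconds a cached record can still be served after the attack starts.

    ttl            -- TTL set by the zone owner, in seconds
    age_at_attack  -- how long the record had been in cache when the
                      authoritatives became unreachable (hypothetical input)
    Assumes no serve-stale and no early cache flush.
    """
    return max(0, ttl - age_at_attack)


if __name__ == "__main__":
    # TTL values used in the experiments described in the draft.
    for ttl in (60, 1800, 3600, 86400):
        remaining = protection_window(ttl, age_at_attack=600)
        print(f"TTL {ttl:>6}s: cache still answers for {remaining}s")
```

With a 60 s TTL the record has already expired after 10 minutes, so the cache
gives no protection at all, which matches the case where only retries helped.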
I think it is also useful to avoid dependencies on other zones, i.e.,
using in-bailiwick name servers reduces dependencies on other zones, and
the parent zone serves glue records for them, avoiding additional lookups.
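
A rough way to check this for a zone is to compare the name server names
against the zone name label by label. The sketch below is only an
illustration: the helper names and the example.at / example.net names are
made up, and it looks at names only, without doing any DNS lookups.

```python
def labels(name: str) -> list[str]:
    """Split a domain name into lower-case labels, ignoring the trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_in_bailiwick(ns_name: str, zone: str) -> bool:
    """True if ns_name is the zone apex or a name below it."""
    ns, z = labels(ns_name), labels(zone)
    # Compare whole labels so that "notexample.at" does not match "example.at".
    return len(ns) >= len(z) and ns[len(ns) - len(z):] == z


def out_of_bailiwick(zone: str, ns_names: list[str]) -> list[str]:
    """Name servers whose names depend on some other zone."""
    return [ns for ns in ns_names if not is_in_bailiwick(ns, zone)]


if __name__ == "__main__":
    # Hypothetical NS set for a hypothetical zone.
    print(out_of_bailiwick("example.at", ["ns1.example.at.", "ns2.example.net."]))
```

Every name this returns is a name server that needs at least one other zone
to resolve before the zone itself can be reached.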
Feel free to ask for more comments or if my comments are confusing.