[dns-operations] DNS .com/.net resolution problems in the Asia/Pacific region
ietf-dane at dukhovni.org
Wed Jul 12 03:50:50 UTC 2023
On Tue, Jul 11, 2023 at 10:51:47PM -0400, Viktor Dukhovni wrote:
> In .COM CZDS zone file snapshot of .COM from ~midnight UTC 2023-07-11
> the range of non-apex RRSIG inception times was:
> 20230707025004 – 20230710225021
> With corresponding expiration times:
> 20230714040004 – 20230718000021
> With expiration of the oldest RRSIGS 3 days and 4 hours away, and the
> newest a full 7 days.
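The arithmetic behind the quoted figures can be checked with a short
sketch. It assumes the snapshot was taken at exactly midnight UTC (the
quoted text says "~midnight") and that the earliest inception pairs with
the earliest expiration, and likewise for the latest; the helper name
parse_rrsig_time is made up for illustration.

```python
from datetime import datetime, timezone

# RRSIG presentation format for inception/expiration: YYYYMMDDHHmmSS, in UTC.
RRSIG_TIME = "%Y%m%d%H%M%S"


def parse_rrsig_time(text):
    """Parse an RRSIG timestamp string into an aware UTC datetime."""
    return datetime.strptime(text, RRSIG_TIME).replace(tzinfo=timezone.utc)


# Assumption: the snapshot time is taken as exactly midnight UTC.
SNAPSHOT = datetime(2023, 7, 11, 0, 0, tzinfo=timezone.utc)

# Assumption: range endpoints are paired (earliest with earliest, etc.).
signatures = {
    "oldest": (parse_rrsig_time("20230707025004"),
               parse_rrsig_time("20230714040004")),
    "newest": (parse_rrsig_time("20230710225021"),
               parse_rrsig_time("20230718000021")),
}

# Validity window and time left until expiration, per signature.
validity = {k: exp - inc for k, (inc, exp) in signatures.items()}
remaining = {k: exp - SNAPSHOT for k, (inc, exp) in signatures.items()}

for label in signatures:
    print(label, "validity:", validity[label], "remaining:", remaining[label])
```

Under those assumptions both signatures have the same validity window,
which is consistent with a fixed signing policy.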
Apart from some records that are signed intra-day, the expiration times
of records in .COM are strongly clustered around once-a-day signing
events that each cover roughly 25% of the zone. For example, the CZDS
snapshot for the 11th has expiration times clustered near:
2023-07-14T04:00 ~3.4M RRsets
2023-07-15T04:00 ~3.4M RRsets
2023-07-16T04:00 ~3.4M RRsets
2023-07-17T04:00 ~3.4M RRsets
So the affected delegations would have been ~0%, ~25%, ~50%, ~75% or
~100% of the zone, depending on how many days the issue went unnoticed.
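As a rough illustration of that step function, here is a minimal sketch
that treats a stale copy of the zone as frozen at the snapshot above and
counts how much of it has expired at a given time. The cluster times and
counts are the approximate ones listed above; intra-day signed records
are ignored, and the name expired_fraction is made up for illustration.

```python
from datetime import datetime, timezone

# Approximate expiration clusters from the CZDS snapshot for 2023-07-11.
# Intra-day signed records are ignored in this sketch.
CLUSTERS = {
    datetime(2023, 7, 14, 4, 0, tzinfo=timezone.utc): 3_400_000,
    datetime(2023, 7, 15, 4, 0, tzinfo=timezone.utc): 3_400_000,
    datetime(2023, 7, 16, 4, 0, tzinfo=timezone.utc): 3_400_000,
    datetime(2023, 7, 17, 4, 0, tzinfo=timezone.utc): 3_400_000,
}


def expired_fraction(now, clusters=CLUSTERS):
    """Fraction of signed RRsets whose signatures have expired by `now`,
    assuming the zone copy has not been refreshed since the snapshot."""
    total = sum(clusters.values())
    expired = sum(count for expiry, count in clusters.items() if expiry <= now)
    return expired / total


# Sample the stale copy at noon UTC on successive days.
for day in range(13, 19):
    when = datetime(2023, 7, day, 12, 0, tzinfo=timezone.utc)
    print(when.date(), f"{expired_fraction(when):.0%}")
```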
Finally, if, e.g., Verisign were to support an automated error reporting
channel for .COM/.NET, would it look like the current error reporting
Or would it have to be something different?
The monitoring agent would ideally be different at each server cluster
sharing a common code base and zone database (salted with something
akin to an NSID), so that the report identifies the problem instance.
Just knowing that some .COM delegations somewhere look expired is not
nearly as useful as knowing exactly where.
Is it reasonable to include the error reporting channel signal with
every query response, or (law of large numbers) is it sufficient to
include it some small enough fraction of the time, to make it less
intrusive, and yet frequent enough that errors that happen often
enough will soon be noticed?
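To put rough numbers on the sampling question, here is a minimal sketch.
It assumes each response independently carries the reporting signal with
probability p, and that any resolver which sees a failure on a signalled
response does send a report. Both are simplifying assumptions, and the
function names are made up for illustration.

```python
import math


def detection_probability(p, n):
    """Chance that at least one of n failing responses carried the signal,
    if each response is signalled independently with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 - (1.0 - p) ** n


def responses_needed(p, target=0.99):
    """Smallest n for which detection_probability(p, n) >= target."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    if p == 1.0:
        return 1  # every response is signalled
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# Example: signal on 1% of responses, aim for 99% chance of a report.
print(responses_needed(0.01, 0.99))
```

With a signal on 1% of responses, a few hundred failing responses would
already make a report very likely under these assumptions, which is a
small number at .COM query volumes.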
What other tweaks should error reporting include? (Explicit transport
protocol hints? ...)