[dns-operations] DNS error reporting

Tue Feb 16 19:14:03 UTC 2016

On 2/16/16, 16:28, "dns-operations on behalf of Andrew Sullivan"
<dns-operations-bounces at dns-oarc.net on behalf of ajs at anvilwalrusden.com>
wrote:
>
>That depends on you reaching the server that gave you the error, and
>in anycast arrangements you don't have any way to ensure that.

This brings up a good point, ancillary to the main discussion.

As one can imagine, frustration over SERVFAIL is not new.  We've been
around the issue before.  When issues resurface, it's because they are
innately difficult.  Perhaps in the past the solutions have attacked the
wrong "face of the mountain" and perhaps no one has stumbled across the
simple fix, but anyone can bet that a lot of sharp minds have thought
about this.

The "good point" is that we are tempted to rely on "asking the server for
more information" in quite a few instances.  As DNS operations grow in
complexity, that tactic becomes less reliable, because a querier never
quite knows which server process answered.

The thought above is also why I suggested answering with "what is
recommended" rather than an explanation.  There's another angle to
consider - what "explanations" are useful to the querier and what are not.

As an operator of a node, I already have far more information about what
is happening in a process.  I have logs, I have the ability to debug, I
have knowledge of the architecture.  That's my business, not anyone
else's.  The usual problem in operations is not a lack of information;
it is, first, knowing what to investigate and, second, putting the pieces
together.

When a SERVFAIL is returned, I see three categories - what to do next,
what I'd tell the querier went wrong, and what I'd tell the operator went
wrong.  I mentioned before that I wrote an experimental validator early in
the development process, with about a dozen major categories and
somewhere around 70 or 80 minor codes.  One thought I had then was to
include a means to send SNMP traps to the NOC (at the time, SNMPv3 was
being developed in another part of the labs).  Those traps would alert the
NOC to things like possible time offset problems and other things that the
operator could work on.  That idea was dismissed long ago; we never tried
it.
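
To make the three categories concrete, here is a minimal sketch in Python.
Every name in it (NextStep, ServfailReport, for_querier) is hypothetical
and made up for illustration; it is not the experimental validator's code
and not any existing resolver's API.  The only point is that the operator's
detail stays local while the querier gets a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class NextStep(Enum):
    """Hypothetical recommendations a server could hand back to a querier."""
    RETRY_LATER = "retry the same query later"
    TRY_OTHER_SERVER = "try a different server"
    DO_NOT_RETRY = "retrying will not help"


@dataclass(frozen=True)
class ServfailReport:
    next_step: NextStep   # category 1: what to do next
    querier_note: str     # category 2: what I'd tell the querier went wrong
    operator_note: str    # category 3: what I'd tell the operator went wrong

    def for_querier(self) -> dict:
        # Only the recommendation and the querier-facing note leave the node.
        # The operator note is meant for local logs or the NOC.
        return {"next_step": self.next_step.name, "note": self.querier_note}


report = ServfailReport(
    next_step=NextStep.TRY_OTHER_SERVER,
    querier_note="validation failed",
    operator_note="signature inception is ahead of local clock; check time sync",
)
print(report.for_querier())
```

The split is deliberate: the operator note can be as detailed as the
operator likes without the querier ever depending on it.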

The SNMP fever never took hold.  And that is the reason I suggested
focusing on recommendations for retries and not isolating the fault.  I
should add - yes, you can tell something by "where in the code" the
SERVFAIL is generated but I bet that it isn't very conclusive.  E.g., when
a signature's inception date hasn't happened yet, is it that the signer's
clock is off or the validator's clock is off?  Same "if" statement, two
different solution paths.
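
A rough sketch of that ambiguity in Python.  This is a deliberately
simplified assumption of what such a check looks like, not code from any
real validator; inception_check and the one-hour offsets are invented for
the example, and a real validator would also look at expiration and allow
some clock skew.

```python
from datetime import datetime, timedelta, timezone


def inception_check(inception: datetime, validator_clock: datetime) -> str:
    """Hypothetical check: reject a signature whose inception is in the future."""
    if inception > validator_clock:
        # The one "if" statement.  It cannot tell whose clock is wrong.
        return "SERVFAIL"
    return "OK"


TRUE_NOW = datetime(2016, 2, 16, 19, 0, tzinfo=timezone.utc)
HOUR = timedelta(hours=1)

# Case A: the signer's clock ran an hour fast, the validator's clock is right.
result_a = inception_check(inception=TRUE_NOW + HOUR, validator_clock=TRUE_NOW)

# Case B: the signer was right, the validator's clock runs an hour slow.
result_b = inception_check(inception=TRUE_NOW, validator_clock=TRUE_NOW - HOUR)

print(result_a, result_b)
```

Both cases take the same branch and produce the same result, yet the fix
in case A belongs to the zone's signer and the fix in case B belongs to
the validator's operator.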
