[dns-operations] DNS error reporting (was: DNS at FOSDEM 2016)

Wed Feb 10 11:19:09 UTC 2016

Evan & Marek,

At 2016-02-09 17:21:42 +0000
Evan Hunt <each at isc.org> wrote:

> On Tue, Feb 09, 2016 at 11:21:35AM +0000, Marek Vavruša wrote:
> > I'd be happy to help as this is something I see as really important.
> > Here's what I think:
> > 
> > - The code system with predefined code numbers makes a lot of sense.
> > Personally, I would keep spaces between main error codes to make error
> > classification easier later. For example 1XX would be server-internal
> > failures (resource constraints, server-is-reloading state ...), 7XX
> > would be signature-related failures etc. Like HTTP error codes really.
> > This is what machines would like to see.  
> 
> The classification system in draft-hunt-dns-server-diagnostics-00 
> might be a starting place for this?  It's broken into internal server
> errors, general DNS errors, and DNSSEC errors.

Definitely a good start.

I like Marek's observation about HTTP codes. Using 1XX and the like
makes it easy for humans to spot the general sort of error, so probably
this should be adopted (rather than the power-of-two proposal in the
current draft, which would be awesome for people on the autism spectrum
but impossible to remember for most mortals). ;)
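
To make the range idea concrete, here is a minimal Python sketch of how
a tool might map a code to its general class. The only ranges taken
from this thread are Marek's 1XX (server-internal) and 7XX
(signature-related); the table, the classify() name and the
"unassigned" fallback are assumptions for illustration, not anything
from the draft.

```python
# Hypothetical class table: only 1XX and 7XX were suggested in the thread;
# every other range is deliberately left unassigned.
ERROR_CLASSES = {
    1: "server-internal failure",
    7: "signature-related failure",
}


def classify(code: int) -> str:
    """Return the general class of a three-digit diagnostic code."""
    if not 100 <= code <= 999:
        raise ValueError(f"code out of range: {code}")
    # The hundreds digit selects the class, as with HTTP status codes.
    return ERROR_CLASSES.get(code // 100, "unassigned")


print(classify(701))
print(classify(104))
```

The point of the hundreds-digit lookup is that a client which has never
heard of code 701 can still tell the user it is a signature problem.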

> > - Each code SHOULD have a free-form explanation on what actually
> > happened. This is what humans want to see.  
> 
> Optional supplemental text is part of the proposed ESD option as
> well.

Yes, absolutely.

Probably we should at least consider which language to use and how to
encode it. I'd be perfectly happy to define this as UTF-8 and say that
the operator can use whatever languages they want for the message.

Still, possibly some people would want multiple languages. In Holland
we'd likely use Dutch + English. In the US, some places would want
English + Spanish. In Belgium, French + Dutch + German + English (!!!).
It might make sense to (optionally?) include an ISO 639 language code
for each message so errors could be presented in a rational way to the
user.
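
As a rough sketch of what a client could do with language-tagged
messages, assuming each message arrives paired with a two-letter
ISO 639-1 code: pick_message(), the dictionary layout, and the Dutch
sample text are all made up for illustration, and nothing here is a
proposed wire format.

```python
def pick_message(messages, preferred, default="en"):
    """Choose which message to show the user.

    messages:  dict mapping an ISO 639-1 code to UTF-8 text
    preferred: list of language codes, most preferred first
    Returns a (language, text) tuple.
    """
    if not messages:
        raise ValueError("no messages supplied")
    for lang in preferred:
        if lang in messages:
            return lang, messages[lang]
    if default in messages:
        return default, messages[default]
    # Nothing matched: show the first message rather than nothing at all.
    lang = next(iter(messages))
    return lang, messages[lang]


# Illustrative data only (Dutch + English, as in the example above).
msgs = {
    "nl": "Handtekening van abcd.is is 30 dagen geleden verlopen.",
    "en": "Signature of abcd.is expired 30 days ago.",
}
print(pick_message(msgs, ["fr", "nl"]))
```

Without the language code the client can only show every message it
received and let the user sort it out.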

> > - EDNS is okay, but the drawback is that diagnostic tools like
> > kdig/drill/dig or anything similar won't be able to interpret them and
> > it probably won't survive forwarder hops. I'm inclined towards
> > RFC4892 TXT in reserved name reporting (chaos class or not), as it's
> > simple, existing tools will be able to parse and display it and it's
> > flexible enough. Something like:
> > 
> > error.server CH TXT "701"
> > error.server CH TXT "Signature of abcd.is expired 30 days ago."  
> 
> This is an interesting idea, but it would mean keeping state in
> the server for the errors that have been reported recently, rather
> than delivering the diagnostic information along with the SERVFAIL
> response itself.  "error.server" only works if there's only been
> a single error; you'd have to identify which error you're asking
> about (e.g. <qname>.<qtype>.<qclass>.<txid>.error.server/CH/TXT)
> and keep a rolling list of answers on hand.

My understanding is that these records would be returned in the
additional section at the time of query, so nobody would have to keep
any state. It's identical to the EDNS approach, just using TXT records.

If you really wanted to provide information about multiple errors, a
slight modification could fix this:

0.error.server CH TXT "code: 701"
0.error.server CH TXT "info: Signature of abcd.is expired 30 days ago."
1.error.server CH TXT "code: 666"
1.error.server CH TXT "info: NSEC3 iterations 200, 150 is maximum for 1024 bit ZSK"

I'm not really in favor of this approach, but it is reasonable.
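
For what it's worth, the numbered records above are easy for a tool to
regroup. A minimal Python sketch, assuming the records are available as
presentation-format text lines exactly like the ones shown, and that
"code:" and "info:" are the only keys (the regular expression and the
parse_errors() name are mine, not part of any proposal):

```python
import re
from collections import defaultdict

# Matches lines such as:  0.error.server CH TXT "code: 701"
RECORD = re.compile(
    r'^(\d+)\.error\.server\.?\s+CH\s+TXT\s+"(code|info):\s*(.*)"$'
)


def parse_errors(lines):
    """Group numbered error.server TXT records into one dict per error."""
    errors = defaultdict(dict)
    for line in lines:
        match = RECORD.match(line.strip())
        if not match:
            continue  # ignore anything that is not an error record
        index, key, value = match.groups()
        errors[int(index)][key] = int(value) if key == "code" else value
    # Return the errors ordered by their numeric label.
    return [errors[i] for i in sorted(errors)]


records = [
    '0.error.server CH TXT "code: 701"',
    '0.error.server CH TXT "info: Signature of abcd.is expired 30 days ago."',
    '1.error.server CH TXT "code: 666"',
    '1.error.server CH TXT "info: NSEC3 iterations 200, 150 is maximum for 1024 bit ZSK"',
]
for error in parse_errors(records):
    print(error["code"], error["info"])
```

The numeric label is the only thing tying a code to its text, so a
record that arrives without its partner simply yields a dict with one
key.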

A final point: Evan's draft is clearly focused on the interaction
between stub resolvers and validating resolvers, but I think extended
server diagnostics are of general utility. Getting SERVFAIL is always a
frustrating situation, whether from recursive or authoritative servers.

--
Shane