[dns-operations] Error codes or next steps - was Re: DNS at FOSDEM 2016
fw at deneb.enyo.de
Tue Feb 16 22:36:50 UTC 2016
* Edward Lewis:
> On 2/12/16, 9:27, "dns-operations on behalf of Florian Weimer"
> <dns-operations-bounces at dns-oarc.net on behalf of fw at deneb.enyo.de> wrote:
>>It's particularly bad for the stub resolver because it's not clear if
>>you should try another name server if you receive a SERVFAIL response
>>from the first one.
> Interesting thought - should the error return indicate "what happened" or
> should it indicate "what you should do next."
> The above comment though might turn the problem space around. How about
> defining codes that tell a querier to "try again" or "try later" or "try
> another server" or "try a different authority" or "give up and go home."
> That is, ultimately, what the DNS system really needs (even if it makes the
> GUI folks go begging for a reason to show the user).
It's certainly tricky. If your system uses two recursive resolvers,
both in the same data center, with reasonable monitoring, then it may
be preferable to return a SERVFAIL response to the application without
trying the other server (because it will likely face the same issue),
even if the immediate cause of the SERVFAIL is the inability to reach
any upstream server. The benefit is a better user experience because
the error is reported more quickly to the application, instead of
waiting for another timeout to happen.
On the other hand, if the system uses two servers in different data
centers, with somewhat different WAN routing, then one of the servers
might get an upstream response because it still has connectivity,
while the other one does not. Both cases have the same immediate
cause (an authoritative server timeout), yet they call for opposite
actions.
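The two deployment scenarios above can be sketched as a small policy
check. This is only an illustration: the Resolver type, its
failure_domain label, and the function name are hypothetical, and the
sketch assumes an operator can tag each configured recursive resolver
with something like a data center name. Nothing in the DNS protocol
carries this information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolver:
    address: str
    # Hypothetical operator-assigned label, e.g. a data center name.
    failure_domain: str


def worth_trying_on_servfail(failed: Resolver, candidate: Resolver) -> bool:
    """Return True if retrying at `candidate` after a SERVFAIL from `failed`
    has a reasonable chance of producing a different result.

    Assumption: resolvers in the same failure domain share upstream
    connectivity, so they will likely hit the same problem.
    """
    return failed.failure_domain != candidate.failure_domain


same_dc_a = Resolver("192.0.2.1", "dc-east")
same_dc_b = Resolver("192.0.2.2", "dc-east")
other_dc = Resolver("198.51.100.1", "dc-west")

retry_same_dc = worth_trying_on_servfail(same_dc_a, same_dc_b)
retry_other_dc = worth_trying_on_servfail(same_dc_a, other_dc)
```

Here retry_same_dc is False and retry_other_dc is True. The point is
that the decision depends on local deployment knowledge, not on
anything in the SERVFAIL response itself.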
For a load-induced SERVFAIL, switching servers might get you an
answer, but it might also make the load problem worse.
And you can get lucky: a different server might have the response in
its cache, even if a fresh resolution would fail for it as well.
So I doubt there is a clear protocol-level answer.
But all in all, I'm currently leaning towards not switching servers on
SERVFAIL in the (potentially rather stateless) stub resolver.
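A minimal sketch of such a stub resolver policy follows. It is not
taken from any real resolver library: the Outcome enum, the
stub_resolve function, and the send callback are hypothetical names,
and the network is simulated by a lookup table. The sketch also
assumes that a timeout still causes failover to the next server; only
SERVFAIL is returned to the application immediately.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Outcome(Enum):
    ANSWER = "answer"
    SERVFAIL = "servfail"
    TIMEOUT = "timeout"


def stub_resolve(
    name: str,
    servers: Sequence[str],
    send: Callable[[str, str], Outcome],
    switch_on_servfail: bool = False,
) -> Tuple[Outcome, List[str]]:
    """Query servers in order; return the final outcome and the servers tried.

    `send(server, name)` is a stand-in for the real network exchange.
    """
    tried: List[str] = []
    last = Outcome.TIMEOUT  # reported if no server is configured
    for server in servers:
        tried.append(server)
        last = send(server, name)
        if last is Outcome.ANSWER:
            break
        if last is Outcome.SERVFAIL and not switch_on_servfail:
            # Report the failure right away instead of asking a server
            # that will likely face the same issue.
            break
        # Otherwise (timeout, or SERVFAIL with switching enabled),
        # fall through to the next server.
    return last, tried


# Simulated responses: the first server fails, the second would answer.
responses = {"192.0.2.1": Outcome.SERVFAIL, "192.0.2.2": Outcome.ANSWER}


def fake_send(server: str, name: str) -> Outcome:
    return responses[server]


servers = ["192.0.2.1", "192.0.2.2"]
no_switch = stub_resolve("example.com", servers, fake_send)
with_switch = stub_resolve("example.com", servers, fake_send,
                           switch_on_servfail=True)
```

With the default policy, no_switch reports SERVFAIL after contacting
only the first server. With switch_on_servfail=True, the second server
is tried and with_switch reports an answer, at the cost of the extra
query (and, for a load-induced SERVFAIL, extra load).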