[dns-operations] Today's Problem: repo.fpki.gov

Thu Oct 20 00:49:47 UTC 2022

> On 19 Oct 2022, at 17:35, cjc+dns-oarc at pumpky.net wrote:
> 
> Thanks for the interesting discussion on the qa.ws.iqt.fiscal.treasury.gov problem. Nice to know I'm not the only one who doesn't quite understand why we're getting mixed results despite the obviously non-compliant behavior.
> 
> But I got a new one today. Different failure mode, but same symptom: sometimes it works, sometimes it returns SERVFAIL.
> 
> Noticed when we started to get some server-admin-heartburn when CRL downloads started to fail because of DNS errors. The servers for the zone fpki.gov are handing out different DNSKEYs (if the server responds at all). The DNSviz for this one pretty clearly catches the problem, but you may need a screen magnifier,
> 
> https://dnsviz.net/d/repo.fpki.gov/Y08v3Q/dnssec/
> 
> You can read all of the "Errors" on the left. (BTW, this zone was _completely_ broken for a while this evening. The auth servers appeared down. Thought they might have been trying to fix this, but looks like it's still there.)
> 
> I thought we might have caught it midway through in a bad rollover, but it's been this way for a while and the SOAs on all of the servers match.
> 
> So it's pretty easy to see how something could break. If a resolver gets the DNSKEYs from a server with ones that don't match the RRSIGs you've got, you can't validate.
> 
> But here's my question: are DNS resolvers, and specifically BIND, forgiving enough to try other authoritative servers for missing DNSKEYs in cases just like this? Will they query other authoritative servers in search of a matching DNSKEY?

BIND will look for a DNSKEY RRset that validates as secure.  It will then cache it.

> Or can they come at it from the other way? If the RRSIGs don't line up with the available DNSKEYs, the server doesn't cache these target RRsets and the resolver makes another try, possibly to a different server.

BIND will query other servers.  All recursive nameservers should do this,
because for CD=0 queries bogus answers are to be treated as if they had
never arrived.
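
To make the "treat bogus answers as if they had not arrived" rule concrete, here is a minimal sketch in Python. It is a toy model, not BIND's code: the names `resolve`, `is_secure`, `RESPONSES` and the key tags are all made up for illustration, and validation is reduced to "was the RRset signed by a key tag we trust" rather than real signature checking.

```python
from typing import Callable, Dict, Iterable, Optional, Set

# Hypothetical toy model: an answer is a dict that records which key tag
# signed the RRset.  Real validation verifies the RRSIG cryptographically.
Answer = Dict[str, object]


def is_secure(answer: Answer, trusted_key_tags: Set[int]) -> bool:
    """Stand-in for DNSSEC validation: the signing key must be trusted."""
    return answer["rrsig_key_tag"] in trusted_key_tags


def resolve(servers: Iterable[str],
            query: Callable[[str], Optional[Answer]],
            trusted_key_tags: Set[int]) -> Optional[Answer]:
    """Return the first answer that validates; None stands for SERVFAIL."""
    for server in servers:
        answer = query(server)      # None models a server that did not reply
        if answer is None:
            continue
        if is_secure(answer, trusted_key_tags):
            return answer           # only validated data is returned
        # Bogus: behave as if the answer never arrived, try the next server.
    return None


# Illustrative data only: ns1 is down, ns2 signs with a key we do not trust.
RESPONSES: Dict[str, Optional[Answer]] = {
    "ns1": None,
    "ns2": {"rrsig_key_tag": 11111, "data": "192.0.2.1"},
    "ns3": {"rrsig_key_tag": 22222, "data": "192.0.2.1"},
}
result = resolve(["ns1", "ns2", "ns3"], RESPONSES.get, {22222})
```

In this model the resolver only fails if none of the servers it can reach hands out data that validates.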

> But even if resolvers do this stuff, I think I still see how this could break things. If a recursive resolver is doing "forward-only" through another caching resolver, the end resolver will only get whatever the forwarder has in its cache. If the middle resolver has incompatible or incomplete DNSKEYs and RRSIGs, there isn't a way for the end resolver to force the intermediate resolver to go out and get DNSKEYs from the other authoritative servers for the zone.

This is why "Always Set the CD Bit on Queries" is stupid.  The intermediate
servers need to validate responses so that downstream validators get good
answers.

RFC 6840

5.9.  Always Set the CD Bit on Queries

   When processing a request with the Checking Disabled (CD) bit set, a
   resolver SHOULD attempt to return all response data, even data that
   has failed DNSSEC validation.  Section 3.2.2 of [RFC4035] requires a
   resolver processing a request with the CD bit set to set the CD bit
   on its upstream queries.

   This document further specifies that validating resolvers SHOULD set
   the CD bit on every upstream query.  This is regardless of whether
   the CD bit was set on the incoming query or whether it has a trust
   anchor at or above the QNAME.

   [RFC4035] is ambiguous about what to do when a cached response was
   obtained with the CD bit unset, a case that only arises when the
   resolver chooses not to set the CD bit on all upstream queries, as
   specified above.  In the typical case, no new query is required, nor
   does the cache need to track the state of the CD bit used to make a
   given query.  The problem arises when the cached response is a server
   failure (RCODE 2), which may indicate that the requested data failed
   DNSSEC validation at an upstream validating resolver.  ([RFC2308]
   permits caching of server failures for up to five minutes.)  In these
   cases, a new query with the CD bit set is required.

   Appendix B discusses more of the logic behind the recommendation
   presented in this section.

The problem is that the section describes the "sometimes set" model
incorrectly and then draws the wrong conclusions from that description.
DNSSEC was designed so that a resolver sends CD=0 unless the triggering
query had CD=1, and never returns previously cached CD=1 results without
validating them first.  What is missing from the DNSSEC RFCs is an
instruction to retry with CD=1 when you get SERVFAIL from the upstream
recursive server in response to a CD=0 query.  The retry with CD=1 lets
you work around incorrect clocks and bad trust anchors in upstream
servers.
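
The retry described above can be sketched as follows.  This is a hedged illustration in Python, not text from any RFC or code from any resolver: `Response`, `forward_query`, `upstream_with_bad_clock` and the `validate` callback are hypothetical names, and the upstream is modelled as a plain function taking a query name and a CD flag.

```python
from typing import Callable, NamedTuple, Tuple

NOERROR = 0
SERVFAIL = 2  # RCODE 2


class Response(NamedTuple):
    rcode: int
    rrset: Tuple[str, ...] = ()


def forward_query(upstream: Callable[[str, bool], Response],
                  validate: Callable[[Response], bool],
                  qname: str) -> Response:
    """Send CD=0 first; on SERVFAIL retry with CD=1 and validate locally."""
    first = upstream(qname, False)      # CD=0: upstream validates for us
    if first.rcode != SERVFAIL:
        # A validating forwarder still checks what it was given.
        return first if validate(first) else Response(SERVFAIL)
    retry = upstream(qname, True)       # CD=1: ask for unvalidated data
    if retry.rcode == NOERROR and validate(retry):
        return retry                    # our own validation succeeded
    return Response(SERVFAIL)


def upstream_with_bad_clock(qname: str, cd: bool) -> Response:
    """Toy upstream whose own validation always fails (e.g. wrong time)."""
    if not cd:
        return Response(SERVFAIL)
    return Response(NOERROR, ("192.0.2.1",))


# Local validation is stubbed out as "always succeeds" for the example.
answer = forward_query(upstream_with_bad_clock, lambda r: True,
                       "repo.fpki.gov.")
```

The point of the ordering is that CD=1 is the fallback, not the default, so the upstream normally does the work of finding an answer that validates.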

The desire to reduce the work performed by intermediate servers results in
a system that does not work when the servers are under attack or when there
are stuff ups with the administration of the zone.  The bad answers make it
through and the client has no way to recover.
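
A small simulation of that failure, under heavy simplification.  Everything here is an assumption made for illustration: `AUTH_ANSWERS`, `TRUSTED`, `intermediate` and `end_resolver` are invented names, validation is reduced to a key tag comparison, and the forwarder is assumed to take the first answer it sees when it does not validate.

```python
from typing import Dict, List, Optional

TRUSTED_KEY_TAGS = {22222}

# Illustrative answers in the order the forwarder happens to receive them.
AUTH_ANSWERS: List[Dict[str, object]] = [
    {"server": "ns1", "rrsig_key_tag": 11111},  # does not validate
    {"server": "ns2", "rrsig_key_tag": 22222},  # validates
]


def is_secure(answer: Dict[str, object]) -> bool:
    """Stand-in for DNSSEC validation of one answer."""
    return answer["rrsig_key_tag"] in TRUSTED_KEY_TAGS


def intermediate(always_cd: bool) -> Optional[Dict[str, object]]:
    """Return what the forwarder caches and hands downstream."""
    for answer in AUTH_ANSWERS:
        # With CD=1 upstream the first answer is accepted unchecked;
        # otherwise bogus answers are skipped and the next server is tried.
        if always_cd or is_secure(answer):
            return answer
    return None


def end_resolver(always_cd: bool) -> str:
    """The end resolver can only validate what the forwarder gives it."""
    answer = intermediate(always_cd)
    if answer is not None and is_secure(answer):
        return "NOERROR"
    return "SERVFAIL"


with_always_cd = end_resolver(always_cd=True)
with_validating_forwarder = end_resolver(always_cd=False)
```

In this toy model the forward-only client gets SERVFAIL when the forwarder always sets CD, and a usable answer when the forwarder validates and moves on to the next server.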

> Does that scenario make sense? I've been dumping caches and trying to see what the server is doing when things are working and when they are not, but thought I'd just try the people with the deep resolver knowledge.
> 
> But I /really/ just wish .gov orgs would fix their @*%$ DNSSEC!

Yep. 

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742              INTERNET: marka at isc.org