[dns-operations] post-mortem for ripe.net DNSSEC problem on 1 November 2023
Paul de Weerd
pdeweerd at ripe.net
Thu Nov 2 15:43:43 UTC 2023
Dear colleagues,
Please find below the post mortem for the DNSSEC problem that caused
most of RIPE NCC's services to become unavailable yesterday.
Please reach out if you have any questions or feedback.
Thanks,
Paul de Weerd
Manager Global Information Infrastructure team
RIPE NCC
Summary
On 1 November, from 10:45 to 12:15 UTC, most names in the ripe.net zone
were bogus due to expired DNSSEC signatures being served. This rendered
most of the RIPE NCC’s services unreachable. After investigating the
issue, we found a typo in a change to our zone where a record had a TTL
that was longer (864,000 seconds instead of 86,400) than the refresh
interval for RRSIGs (seven days). This caused our signer to stop
refreshing signatures and only sign changes to the zone. We are talking
to the vendor of our DNSSEC signing solution about this case to see what
can be improved on that end, have implemented a pre-commit check to
prevent TTLs longer than a day in the ripe.net zone and are looking at
improving monitoring for stale signatures to spot issues like this
before they cause problems.
Impact
DNSSEC signatures in the ripe.net zone are valid for 14 days, with our
signers configured to re-sign them after half that time (seven days). On
1 November at 10:45 UTC the signatures on most records in the ripe.net
zone expired. These records had last been signed on 18 October and were
due to be re-signed on the 25th. However, due to a problem with the TTL
on one record, our signer stopped re-signing records in the zone on 25
October. This resulted in the expiry of the signatures on 11,026 out of
11,389 records on 1 November. New or changed records were still properly
signed (363 of them), which meant that our monitoring, which checks the
signature validity of the SOA record at the zone apex, missed this issue.
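To illustrate why a check on the SOA alone can miss this, here is a minimal
sketch of a freshness check that looks at the remaining validity of every
signature it is given. It is not our actual monitoring. The record names,
the SOA expiration time, the GRACE margin and the function names are all
assumptions made up for the example; only the 14-day lifetime, the 7-day
refresh and the 1 November 10:45 UTC expiry come from the text above.

```python
from datetime import datetime, timedelta, timezone

LIFETIME = timedelta(days=14)  # signature validity, as described above
REFRESH = timedelta(days=7)    # re-sign when this much validity remains
GRACE = timedelta(hours=1)     # assumed slack so a healthy signer does not trip the check


def parse_rrsig_time(stamp: str) -> datetime:
    """Parse the YYYYMMDDHHmmSS form used for RRSIG timestamps in zone files."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def is_stale(expiration: datetime, now: datetime) -> bool:
    """True when the signature should already have been refreshed."""
    return expiration - now < REFRESH - GRACE


def stale_names(expirations: dict[str, str], now: datetime) -> list[str]:
    """Return the names whose signatures are overdue for a refresh."""
    return sorted(
        name
        for name, stamp in expirations.items()
        if is_stale(parse_rrsig_time(stamp), now)
    )


# Hypothetical snapshot taken on 26 October, one day after re-signing stopped.
now = datetime(2023, 10, 26, 12, 0, tzinfo=timezone.utc)
sample = {
    "example.net.": "20231108085300",      # apex SOA, re-signed with a zone change
    "www.example.net.": "20231101104500",  # unchanged record, last signed 18 October
}
print(stale_names(sample, now))  # ['www.example.net.']
```

In this sketch the apex looks healthy while the unchanged record is already
overdue, six days before it would actually expire. Sampling records that
rarely change, or walking the whole zone, would be one way to get that
early warning.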
Because our internal resolvers are configured for DNSSEC validation, the
impact was rather immediate for staff, as many internal services broke
due to this issue. After first dismissing some alternative causes, we
quickly found the problem was with expired signatures in the ripe.net
zone, so we turned our attention to our signers. At the same time, we
temporarily disabled DNSSEC validation on our internal resolvers so we
could more easily access our own systems while troubleshooting.
Resolution
While debugging, we found that the rrsig-refresh option, which we had
configured to seven days (half the value of the rrsig-lifetime option of
14 days), was likely involved. The logs showed:
info: [ripe.net.] DNSSEC, signing zone
error: [ripe.net.] DNSSEC, rrsig-refresh too low to prevent expired RRSIGs in resolver caches
info: [ripe.net.] DNSSEC, next signing at 2023-10-25T10:02:02+0000
error: [ripe.net.] zone event 're-sign' failed (invalid parameter)
At 12:14 UTC we removed that option from our configuration and we could
sign the zone again. The freshly signed zone was pushed out and went
live a little bit later, which meant that at 12:15 UTC our services were
available again for most users. Unfortunately, some users kept seeing
problems for several hours after we restored the signatures.
Root cause
After further investigation we found that the change that triggered this
problem introduced a record in the ripe.net zone with a TTL of 864,000
(ten days). A signature can have as little as seven days of validity
left when it is served, just before our rrsig-refresh interval triggers
a re-sign, so a resolver caching the record for the full ten-day TTL
could end up holding it with an expired signature. The signer software
rightly complained about this. We were surprised to find it then
stopped refreshing signatures for all records in the zone that didn’t
change.
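The arithmetic behind that complaint can be written out as a small sketch.
This is a simplified model of the condition as described here, not the
signer's actual check, which may take other factors into account. The
function name is made up for the example.

```python
DAY = 86_400  # seconds


def worst_case_overshoot(ttl: int, rrsig_refresh: int) -> int:
    """Seconds a cached record could outlive its signature in the worst case.

    A signature served just before it is refreshed has about `rrsig_refresh`
    seconds of validity left, while a resolver may keep the record for up to
    `ttl` seconds.
    """
    return max(0, ttl - rrsig_refresh)


print(worst_case_overshoot(864_000, 7 * DAY))  # 259200, i.e. three days
print(worst_case_overshoot(86_400, 7 * DAY))   # 0
```

With the intended TTL of 86,400 seconds there is no overshoot. With the
mistyped 864,000 seconds a cached copy could be held for up to three days
after its signature expired.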
Future steps
During the incident and the aftermath we identified a few changes that
we want to make to improve the resiliency of our setup and allow us to
find cases like these before they become problems. Our current RRSIG
freshness monitoring did not catch this case, because the records we
monitor still had valid and recent signatures, so we are considering
what we can do to cover this situation. We have also improved our
zone-editing pipeline to catch typos or misconfigurations for TTL values.
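As a rough idea of what such a pipeline check could look like, here is a
minimal sketch. It is not the check we deployed. It assumes every record is
written on one line as "owner ttl class type rdata" with the TTL in plain
seconds, which real zone files do not guarantee, so a production check
would want a proper zone parser. The function name and the sample zone are
made up for the example.

```python
MAX_TTL = 86_400  # one day, the limit mentioned in the summary


def find_long_ttls(zone_text: str, max_ttl: int = MAX_TTL) -> list[tuple[int, str, int]]:
    """Return (line number, owner, ttl) for each record whose TTL exceeds max_ttl."""
    problems = []
    for lineno, raw in enumerate(zone_text.splitlines(), start=1):
        line = raw.split(";", 1)[0].strip()   # drop trailing comments
        if not line or line.startswith("$"):  # skip blank lines and directives
            continue
        fields = line.split()
        if len(fields) < 4 or not fields[1].isdigit():
            continue  # not in the simplified form this sketch understands
        ttl = int(fields[1])
        if ttl > max_ttl:
            problems.append((lineno, fields[0], ttl))
    return problems


sample = """\
$ORIGIN example.net.
www    86400   IN A    192.0.2.1
typo   864000  IN A    192.0.2.2 ; one zero too many
"""
print(find_long_ttls(sample))  # [(3, 'typo', 864000)]
```

A pre-commit hook built on this would reject the change whenever the
returned list is not empty.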
The problem also affected our ability to communicate
internally, as our internal chat system was unresolvable too. We have
some means of out-of-band communication, but will review how we can
improve that.
Additionally, while the status.ripe.net website is hosted on separate
infrastructure, the fact that it is also in the ripe.net domain meant
that it was just as unreachable as our other services. We will evaluate
this approach and see how we can improve on it.
Timeline (times in UTC)
25 October
08:52 a record is added to the ripe.net zone with a TTL of 10 days
08:53 knot incrementally signs ripe.net successfully
09:02 knot fails to sign the ripe.net zone for the first time
1 November
10:45 ripe.net signatures expire and many records go bogus
11:27 DNSSEC validation on internal resolvers is disabled
12:14 configuration is changed and zone is manually re-signed
12:15 ripe.net zone has new valid signatures
12:38 DNSSEC validation on internal resolvers is re-enabled
2 November
08:39 typo in TTL fixed, bringing it back to 86,400 seconds as intended
08:39 added check in pipeline to detect too large TTL values