[dns-operations] post-mortem for ripe.net DNSSEC problem on 1 November 2023

Thu Nov 2 15:43:43 UTC 2023

Dear colleagues,

Please find below the post-mortem for the DNSSEC problem that caused 
most of RIPE NCC's services to become unavailable yesterday.

Please reach out if you have any questions or feedback.

Thanks,

Paul de Weerd
Manager Global Information Infrastructure team
RIPE NCC

Summary

On 1 November, from 10:45 to 12:15 UTC, most names in the ripe.net zone 
were bogus due to expired DNSSEC signatures being served. This rendered 
most of the RIPE NCC’s services unreachable. After investigating the 
issue, we found a typo in a change to our zone: a record had a TTL of 
864,000 seconds instead of the intended 86,400, which is longer than the 
refresh interval for RRSIGs (seven days). This caused our signer to stop 
refreshing signatures and only sign changes to the zone. We are talking 
to the vendor of our DNSSEC signing solution about this case to see what 
can be improved on that end. We have also implemented a pre-commit check 
to prevent TTLs longer than a day in the ripe.net zone and are looking 
at improving monitoring for stale signatures to spot issues like this 
before they cause problems.

Impact

DNSSEC signatures in the ripe.net zone are valid for 14 days, with our 
signers configured to re-sign them after half that time (seven days). On 
1 November at 10:45 UTC the signatures on most records in the ripe.net 
zone expired. These records had last been signed on 18 October and were 
due to be re-signed on the 25th. However, due to a problem with the TTL 
on one record, our signer stopped re-signing records in the zone on 25 
October. This resulted in the expiry of 11,026 out of 11,389 records on 
1 November. New or changed records were still properly signed (363 of 
them), which meant that our monitoring, which checks the signature 
validity of the SOA record at the zone apex, missed this issue.
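
The dates and counts above follow from the two signer settings. A minimal sketch of that arithmetic, using only the figures given in this report. The is_stale helper and the SOA signing date used in the usage note are hypothetical illustrations, not part of the actual monitoring:

```python
from datetime import date, timedelta

RRSIG_LIFETIME = timedelta(days=14)  # signature validity, per the text
RRSIG_REFRESH = timedelta(days=7)    # re-sign this long before expiry

signed_on = date(2023, 10, 18)
expires_on = signed_on + RRSIG_LIFETIME   # when the signature expires
resign_due = expires_on - RRSIG_REFRESH   # when re-signing should have happened


def is_stale(signed: date, today: date) -> bool:
    """True if a signature is older than it can get while re-signing works."""
    return today - signed > RRSIG_LIFETIME - RRSIG_REFRESH


expired, still_valid = 11_026, 363
total = expired + still_valid
share_expired = round(100 * expired / total, 1)  # percentage of records

print(expires_on, resign_due, total, share_expired)
```

Under these assumptions, a per-record check run on 26 October would flag a record signed on 18 October (is_stale returns True), while a record signed on 25 October, such as a recently re-signed SOA, would still look healthy.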

Because our internal resolvers are configured for DNSSEC validation, the 
impact on staff was immediate, as many internal services broke. After 
first dismissing some alternative causes, we 
quickly found the problem was with expired signatures in the ripe.net 
zone, so we turned our attention to our signers. At the same time, we 
temporarily disabled DNSSEC validation on our internal resolvers so we 
could more easily access our own systems while troubleshooting.

Resolution

While debugging, we found that the rrsig-refresh option, which we had 
configured to seven days (half the value of the rrsig-lifetime option of 
14 days), was likely involved. The logs showed:

info: [ripe.net.] DNSSEC, signing zone
error: [ripe.net.] DNSSEC, rrsig-refresh too low to prevent expired RRSIGs in resolver caches
info: [ripe.net.] DNSSEC, next signing at 2023-10-25T10:02:02+0000
error: [ripe.net.] zone event 're-sign' failed (invalid parameter)

At 12:14 UTC we removed that option from our configuration and we could 
sign the zone again. The freshly signed zone was pushed out and went 
live a little bit later, which meant that at 12:15 UTC our services were 
available again for most users. Unfortunately, some users kept seeing 
problems for several hours after we restored the signatures.

Root cause

After further investigation we found that the change that triggered this 
problem introduced a record in the ripe.net zone with a TTL of 864,000 
(ten days). Because this TTL is longer than our rrsig-refresh 
configuration, this could lead to cases where a resolver’s cache 
contains the record with an expired signature. The signer software 
rightly complained about this. We were surprised to find that it then 
stopped refreshing signatures for all unchanged records in the zone.
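
The reasoning behind the signer's complaint can be illustrated with a simplified worst-case model: a resolver caches the record just before the signature is refreshed and keeps it for the full TTL. This is a sketch of the idea only. The function name is hypothetical, and the signer's actual check may take additional factors into account:

```python
DAY = 86_400  # seconds in a day


def cache_overshoot(ttl: int, rrsig_refresh: int) -> int:
    """Worst-case seconds a cached RRSIG can outlive its expiry (0 if none).

    Simplified model: the record is cached rrsig_refresh seconds before
    the signature expires (just before re-signing) and kept for the
    full TTL.
    """
    return max(0, ttl - rrsig_refresh)


refresh = 7 * DAY  # rrsig-refresh of seven days, per the text

print(cache_overshoot(864_000, refresh) // DAY)  # mistyped TTL, in days
print(cache_overshoot(86_400, refresh))          # intended TTL, in seconds
```

In this model the mistyped TTL of 864,000 seconds leaves a window of up to three days in which a cache could serve an expired signature, while the intended TTL of 86,400 seconds leaves none.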

Future steps

During the incident and the aftermath we identified a few changes that 
we want to make to improve the resiliency of our setup and allow us to 
find cases like these before they become problems. Our current RRSIG 
freshness monitoring did not catch this case, because the records we 
monitor still had valid and recent signatures, so we are considering 
what we can do to cover this situation. We have also improved our 
zone-editing pipeline to catch typos or misconfigurations for TTL values.
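
As an illustration of what a TTL check in a zone-editing pipeline could look like, here is a minimal sketch. It is not the check that was actually deployed. It assumes one record per line in the form "name TTL class type rdata" with the TTL in plain seconds, and the function name and sample zone are hypothetical:

```python
MAX_TTL = 86_400  # one day, the limit mentioned in the summary


def oversized_ttls(zone_text: str, max_ttl: int = MAX_TTL) -> list[tuple[int, str, int]]:
    """Return (line number, owner name, TTL) for records whose TTL exceeds max_ttl.

    Comments, directives such as $ORIGIN, and lines without an explicit
    numeric TTL in the second field are skipped.
    """
    problems = []
    for lineno, line in enumerate(zone_text.splitlines(), start=1):
        fields = line.split(";", 1)[0].split()  # drop trailing comment
        if len(fields) < 4 or fields[0].startswith("$"):
            continue
        if fields[1].isdigit() and int(fields[1]) > max_ttl:
            problems.append((lineno, fields[0], int(fields[1])))
    return problems


sample = """\
$ORIGIN example.net.
www      86400  IN A    192.0.2.1
typo     864000 IN A    192.0.2.2   ; one zero too many
"""

for lineno, name, ttl in oversized_ttls(sample):
    print(f"line {lineno}: {name} has TTL {ttl}")
```

A pre-commit hook built on this idea would reject the change whenever the returned list is non-empty. A production check would more likely rely on a real zone parser to handle $TTL directives, TTL unit suffixes, and multi-line records.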

The problem also affected our ability to communicate internally, as our 
internal chat system was unresolvable too. We have some means of 
out-of-band communication, but will review how we can improve them.

Additionally, while the status.ripe.net website is hosted on separate 
infrastructure, the fact that it is also in the ripe.net domain meant 
that it was just as unreachable as our other services. We will evaluate 
this approach and see how we can improve on it.

Timeline (times in UTC)

25 October
08:52 a record was added to the ripe.net zone with a TTL of 10 days
08:53 knot incrementally signs ripe.net successfully
09:02 knot fails to sign the ripe.net zone for the first time

1 November

10:45 ripe.net signatures expire and many records go bogus
11:27 DNSSEC validation on internal resolvers was disabled
12:14 changed configuration and manually re-signed zone
12:15 ripe.net zone has new valid signatures
12:38 DNSSEC validation on internal resolvers is re-enabled

2 November

08:39 typo in TTL fixed, bringing it back to 86,400 seconds as intended
08:39 added check in pipeline to detect too large TTL values