[dns-operations] Report on recent signature expiry in IP6.ARPA
dave.knight at icann.org
Mon May 23 14:41:27 UTC 2011
IP6.ARPA is the domain used for the so-called "Reverse DNS" in IPv6, providing a namespace for mapping IPv6 addresses to names.
A failure in the signing infrastructure for IP6.ARPA resulted in a stale zone being published for a prolonged period. Deficiencies in the monitoring infrastructure prevented this situation from being noticed in a timely manner. Once signatures in the stale zone expired, DNSSEC validation of answers from the IP6.ARPA domain was not possible. No other zones published by ICANN were affected.
ICANN DNS Operations staff were made aware of the situation by e-mail sent from various people in the DNS technical community. Once the fault had been diagnosed and fixed, publication of IP6.ARPA proceeded normally and the ability to validate answers from the IP6.ARPA domain was restored.
The window in which validation failures would have occurred began at 2011-05-15 08:47 UTC and ended at 2011-05-16 01:20 UTC.
ICANN is sharing this information with the DNS technical community in order to promote awareness of operational aspects of DNSSEC. This is an interim report intended to provide timely information about this incident, and some details contained within it may change in the future as analysis and related development of ICANN's production signing and monitoring platforms continue.
ICANN's signing infrastructure for IP6.ARPA is based on a distributed set of signers running OpenDNSSEC version 1.0. At any time only one host is designated as the active signer. OpenDNSSEC state is replicated between machines in order to facilitate manual fail-over. Private key material is stored only on HSMs.
OpenDNSSEC stores state relating to the ongoing process of signing zones. At some point between 2011-05-08 03:26 UTC and 2011-05-08 06:25 UTC the active signer's retained state for the IP6.ARPA zone appears to have become corrupted. Due to the corrupted state, successive signer runs did not produce a signed zone, and hence no updated signatures were published following that time. The root cause of the corruption has not yet been precisely identified.
The stored state for the IP6.ARPA zone was archived and removed, and the signer process was restarted. A validatable IP6.ARPA zone was published to the IP6.ARPA nameservers within a few seconds.
ICANN periodically performs a large array of functional tests against many nameservers and zones, and generates regular e-mail containing a summary of any defects found. The results of all tests, including DNS query failures over various transports and signatures which appear not to have been refreshed, are combined in a single report. This report has proven effective in the past in providing timely notification and escalation of production problems.
At the time of this incident members of the L-Root Prague cluster were unreachable, because an earlier maintenance had been extended following a router failure. Since all other L-Root nodes were performing normally during this period, and since the L-Root constellation as a whole has substantial spare capacity, this was not a service-affecting problem for L-Root.
However, the regular periodic reports contained the results of many tests which were failing due to the prolonged maintenance in Prague. Operations staff became conditioned during this period to the report's contents relating to L-Root in Prague, which had a known cause and which was not service-impacting. When signature expiration warnings started to appear in the reports they were consequently overlooked.
1. ICANN will modify the existing notification process for its monitoring system to ensure that warnings of approaching signature expiration are communicated separately from (and in addition to) the existing reports containing results of the other various tests. Notification will be designed to be noisy and difficult to overlook, and distribution of the notifications will include management and staff from other departments.
2. ICANN's operational procedures relating to planned maintenance will be modified to ensure that monitoring is disabled for services known to be down for non-service-impacting reasons. This will reduce the amount of noise and better constrain the contents of regular reports to issues which require action.
3. ICANN has an existing project underway to perform captive regression testing of a more recent version of OpenDNSSEC for use in its DNSSEC signing infrastructure, and to upgrade the production platforms accordingly. This work continues.
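As an illustration of items 1 and 2 above, the following is a minimal sketch of how signature-expiry warnings could be routed separately from routine test failures, with failures for hosts under planned maintenance suppressed. It is not ICANN's monitoring code. The three-day warning threshold, the function names, the target labels and the structure of the findings list are all assumptions made for the example. The only external fact relied on is that RRSIG expiration times in presentation format are written as YYYYMMDDHHmmSS in UTC.

```python
from datetime import datetime, timedelta, timezone

# Assumed warning threshold; the report does not state one.
WARN_BEFORE = timedelta(days=3)


def parse_rrsig_time(text):
    """Parse an RRSIG timestamp in presentation format (YYYYMMDDHHmmSS, UTC)."""
    return datetime.strptime(text, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def expiry_status(expiration, now, warn_before=WARN_BEFORE):
    """Classify a signature as OK, WARNING (expiring soon) or EXPIRED."""
    remaining = expiration - now
    if remaining <= timedelta(0):
        return "EXPIRED"
    if remaining <= warn_before:
        return "WARNING"
    return "OK"


def route(findings, in_maintenance):
    """Split findings into (urgent, routine).

    Signature-expiry problems always go to the urgent list. Other failures
    go to the routine list unless the target is under planned maintenance.
    Each finding is a hypothetical (target, kind, status) tuple.
    """
    urgent, routine = [], []
    for target, kind, status in findings:
        if status == "OK":
            continue
        if kind == "sig-expiry":
            urgent.append((target, kind, status))
        elif target not in in_maintenance:
            routine.append((target, kind, status))
    return urgent, routine


# Expiry time taken from the timeline (seconds assumed to be 00);
# the "now" value and the target labels are made up for illustration.
expiration = parse_rrsig_time("20110515084700")
now = datetime(2011, 5, 13, 12, 0, tzinfo=timezone.utc)
findings = [
    ("ip6.arpa", "sig-expiry", expiry_status(expiration, now)),
    ("l-root-prague", "query", "FAIL"),
]
urgent, routine = route(findings, in_maintenance={"l-root-prague"})
print(urgent)   # [('ip6.arpa', 'sig-expiry', 'WARNING')]
print(routine)  # []
```

The point of the split is that a known, non-service-impacting failure never shares a channel with an approaching expiry, so the latter cannot be hidden by the former.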
Timeline (all dates and times UTC)
2011-05-08 03:26 last successful signing of the IP6.ARPA zone
2011-05-08 06:25 first failed signing of the IP6.ARPA zone
2011-05-15 08:47 signatures over the DNSKEY RRSet in the IP6.ARPA zone expire
2011-05-15 19:39 first report of validation failures in the IP6.ARPA domain
2011-05-16 01:20 correctly-signed zone distributed, validation now possible
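The intervals implied by the timeline can be worked out directly from the timestamps. The following sketch uses only the times listed above; the variable names and the helper function are illustrative.

```python
from datetime import datetime, timezone


def utc(text):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp from the timeline as UTC."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


last_signed = utc("2011-05-08 03:26")
expired = utc("2011-05-15 08:47")
first_report = utc("2011-05-15 19:39")
restored = utc("2011-05-16 01:20")

print("last successful signing to expiry:", expired - last_signed)  # 7 days, 5:21:00
print("expiry to first report:", first_report - expired)            # 10:52:00
print("first report to restoration:", restored - first_report)      # 5:41:00
print("validation failure window:", restored - expired)             # 16:33:00
```

In other words, the stale zone was published for a little over a week before its signatures expired, and validation failures were possible for about sixteen and a half hours.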