[dns-operations] DNSViz Service Restoration

Jim Popovitch jimpop at domainmail.org
Thu Mar 12 12:48:21 UTC 2020

On Thu, 2020-03-12 at 08:41 -0400, Matthew Pounsett wrote:
> > On Mar 12, 2020, at 07:04, Jim Popovitch via dns-operations <dns-operations at dns-oarc.net> wrote:
> > 
> > 
> > From: Jim Popovitch <jimpop at domainmail.org>
> > Subject: Re: [dns-operations] DNSViz Service Restoration
> > Date: March 12, 2020 at 07:04:23 EDT
> > To: dns-operations at lists.dns-oarc.net
> > 
> > 
> > On March 12, 2020 5:04:23 AM UTC, Casey Deccio <casey at deccio.net> wrote:
> > > Thanks for the perspective.  I believe there is value in being able to answer the question: "what did foo.example.net look like at time X?"
> > 
> > Sounds great.  I think the most important feature of DNSViz was the ability to link to a report to show a recent problem to others.  People haven't had that capability for over a year, because someone else saw greater value in being able to show very very very old data.
> While the snark may have sounded witty in your head, the decision-making
> was actually a lot less obvious than that.

I apologize. There was no snark intended.  DNSViz was a valuable
resource for many years; it suddenly went offline and reappeared with a
significant loss of functionality.  People were promised over and over,
over the course of a year, that the historical aspect and data were
going to reappear. In the end, the data never reappeared, but the part
that most people loved has.  Thank you for finally delivering it.

> Had we known it was going to be a year of hacking at a broken
> database, of course we’d have taken this route in the first place.
> But, when we first found that some corruption had been introduced, it
> wasn’t obvious that it would take very long to fix.  At all decision
> points along the way, it appeared as if we were no more than a month
> from having a functioning historical database.
> At the OARC workshop in October, we thought we were hours away from
> announcing that it was back up and running with all of its historical
> data, but the import script running at that time was interrupted by
> the DB running up against its transaction limit, and we had to start a
> vacuum of the db.  That ran for another six weeks before failing on a
> full disk.
> About six months in we started to consider the possibility of
> resetting the database and merging old data later, but that’s a much
> more complicated procedure, as it involves both repairing the
> corruption that broke the import in the first place AND massaging
> that data on import to avoid collisions with newly created rows that
> have unique constraints on them, all on top of the increased time it
> would take to do such an import while the service is active.  There’s
> also the risk that certain tests could never be imported as-is because
> of the potential of a new test’s reference name (the unique 6
> characters in a specific test’s URL) colliding with an old test’s
> name, causing any stored URLs out there to show the wrong test data.
> And Casey isn’t the only one who looks at—or links to—old tests; there
> are web sites out there with links to old tests used as a historical
> record or as case studies of the ways DNS can be broken, so it still
> seems useful to get those tests back online somehow.

Thank you for those details; they make for an interesting postmortem.

-Jim P.
