[dns-operations] Exactly simultaneous PowerDNS Recursor Crashes in a number of places

Thu Jan 28 20:58:55 UTC 2010

People,

On the 28th of January at 03:40:44 CET (28th of January 02:40:44 UTC, 27th of
January 21:40:44 EST, plus or minus 2 seconds), two very large PowerDNS
Recursor deployments, at completely unrelated companies on two different
continents, saw simultaneous crashes of a large part of their PowerDNS
Recursor server farms.

A second wave happened around 14:45 CET.

Since then, all has been quiet. The nearly simultaneous nature of these
crashes, the cause of which we've been unable to pin down, is intriguing. 
Both deployments are now tcpdumping packets, and are running an instrumented
version of the PowerDNS Recursor.

For one of the deployments, the PowerDNS servers are bound to regions, so a
single bad customer could not have caused all these crashes.

So either there was a well-coordinated transmission of "nastygrams" from
dozens of clients (or stubs) at the timestamps mentioned above, or a very
busy authoritative server emitted something that the PowerDNS Recursor
choked on.

Of course, the blame for these crashes falls squarely onto the shoulders of
PowerDNS. 

However, I'm curious whether other people saw odd things around the two
timestamps mentioned above.

The line to look for in the PowerDNS log is: 
> Jan 28 03:40:45 ns9 pdns_recursor[6244]: STL Exception: St9bad_alloc

Other resolvers may have logged other messages.
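
For those who would rather script the search than grep by hand, here is a
rough sketch of one way to do it. St9bad_alloc is the mangled name of
std::bad_alloc, i.e. a failed memory allocation. The function name
find_crashes, the 5 minute tolerance and the year are my own assumptions
(syslog lines carry no year), and it presumes your logs are in the same
timezone as the window you pass in and use the classic syslog layout shown
above. Adjust to taste.

```python
import re
from datetime import datetime, timedelta

# Assumption: syslog timestamps have no year, so one is supplied here.
ASSUMED_YEAR = 2010

# Matches lines shaped like:
#   Jan 28 03:40:45 ns9 pdns_recursor[6244]: STL Exception: St9bad_alloc
CRASH_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"pdns_recursor\[\d+\]: STL Exception: St9bad_alloc"
)


def find_crashes(lines, centers, tolerance=timedelta(minutes=5)):
    """Return (timestamp, host) for each bad_alloc line near a center time.

    `centers` is a list of naive datetimes in the log's own timezone.
    """
    hits = []
    for line in lines:
        m = CRASH_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(
            f"{ASSUMED_YEAR} {m.group('ts')}", "%Y %b %d %H:%M:%S"
        )
        # Keep the line only if it falls inside one of the windows.
        if any(abs(ts - c) <= tolerance for c in centers):
            hits.append((ts, m.group("host")))
    return hits
```

With logs kept in CET, the two windows would be datetime(2010, 1, 28, 3, 40, 44)
and datetime(2010, 1, 28, 14, 45), and lines can simply be an open log file.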

I hope to hear from you - not just to debug PowerDNS of course, but also to
rule out any interaction with the DURZ deployment (although I consider this
interaction to be INCREDIBLY unlikely).

	Bert