[dns-operations] dnsflow again (Re: DNS Traffic Archive Protocol)

Thu Dec 9 01:16:13 UTC 2010

in pursuit of the virtues of openness and transparency, robert and i have
decided to continue this discussion in public even though we are co-workers
and most of our discussions are simply 1x1.  i hope that we are entertaining.

> Date: Mon, 6 Dec 2010 21:29:40 -0500
> From: Robert Edmonds <edmonds at isc.org>
> 
> > now, that's "at its simplest" and i think it's easy to argue that it's so
> > simple as to be useless.  without compound buckets you don't know what
> > you'd need to know.  so we might like to see these additional dnsflows:
> > 
> > kind		thing				count
> > -----------------------------------------------------
> > qtuple-by-cli	isc.org/in/a-204.152.187.6	1
> > [...]
> 
> you're starting to lose me with your "kind" / "thing" examples but i
> think i get the gist of what you're saying.

i think it's possible you are not getting what i'm saying.  let's find out.

> ...
> 
> instead of developing a new message schema for "dnsflow", and a tool to
> map from the lossless "dnsqr" to the lossy "dnsflow" format, i believe
> we could simply delete fields from the dnsqr tuple.  e.g., supposing we
> don't care about the port numbers, the id, or the raw packets, we invoke
> a not-yet-written tool to delete those fields (this tool could also
> generate this particular subset of dnsqr from a live network source by
> simply performing the deletions prior to generating output):
> ...
> you would then have a bunch of messages with the smaller tuples:
> 
>     (type, query_ip, response_ip, proto, qname, qtype, qclass, rcode)

i don't want deletion or reduction, i'm looking for metrics made up of
key:value pairs where the "key" is a compound like "how many times did host
X perform action Y" and "value" is a counter.  this isn't just lossy, it's
transformative.  you could see a million dnsqr's per second on input and
generate a hundred dnsflow's per second on output, depending on how self-
similar the inputs were.  for that matter you could see a thousand dnsqr's
per second on input and get tens of thousands of dnsflow's per second on
output if the self-similarity of the input was low enough.
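
a minimal sketch of the kind of transformation meant here, in python.  the
field names come from the dnsqr tuple quoted above; the "qtuple" and "cli"
kinds and the key formatting are assumptions made for illustration, not a
proposed schema.  only "qtuple-by-cli" appears in the earlier example.

```python
from collections import Counter

def dnsflow_counts(dnsqrs):
    """Turn a batch of dnsqr-like records into (kind, thing) -> count."""
    flows = Counter()
    for m in dnsqrs:
        # compound key pieces; lowercased so "IN"/"A" match the "in/a" style
        qtuple = "{}/{}/{}".format(m["qname"], m["qclass"], m["qtype"]).lower()
        cli = m["query_ip"]
        # one input record increments several buckets (hypothetical kinds)
        flows[("qtuple", qtuple)] += 1
        flows[("cli", cli)] += 1
        flows[("qtuple-by-cli", qtuple + "-" + cli)] += 1
    return flows

sample = [
    {"query_ip": "204.152.187.6", "qname": "isc.org", "qclass": "IN", "qtype": "A"},
    {"query_ip": "204.152.187.6", "qname": "isc.org", "qclass": "IN", "qtype": "A"},
    {"query_ip": "192.0.2.1", "qname": "isc.org", "qclass": "IN", "qtype": "A"},
]
flows = dnsflow_counts(sample)
for (kind, thing), count in sorted(flows.items()):
    print(kind, thing, count)
```

identical inputs collapse into one bucket with a larger count, while
dissimilar inputs each open several new buckets, which is why the output
rate can be far below or far above the input rate.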

> in addition to simple deletions, we could also do reductions.  e.g., one
> could select a prefix length and reduce the query_ip and response_ip
> fields to new "query_net" and "response_net" fields that are filled with
> the network prefixes covering the original IPs.  the *_ip fields would
> then be deleted.
> ...
> (for extra credit, import a BGP table dump and dynamically aggregate on
> longest covering prefix, or reduce to *_asn fields instead of *_net
> fields.)

i love this idea as applied to my idea.  flow buckets whose key is a
network rather than a host would clearly be very worthwhile.  deriving the
network by BGP mapping and deriving it by bit stripping would both be useful.
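
both ways of deriving a network key can be sketched with python's standard
ipaddress module.  the function names and the two-entry prefix table below
are made up for illustration; a real table would come from a BGP dump, and
a linear scan like this would be far too slow for one.

```python
import ipaddress

def strip_to_net(ip, prefixlen):
    """Key by stripping: keep only the first prefixlen bits of the address."""
    return str(ipaddress.ip_network(f"{ip}/{prefixlen}", strict=False))

def longest_covering_prefix(ip, table):
    """Key by table lookup: the most specific prefix in table covering ip."""
    addr = ipaddress.ip_address(ip)
    covering = [
        net for net in map(ipaddress.ip_network, table)
        if net.version == addr.version and addr in net
    ]
    if not covering:
        return None
    return str(max(covering, key=lambda net: net.prefixlen))

# hypothetical stand-in for an imported BGP table
table = ["204.0.0.0/8", "204.152.184.0/21"]

print(strip_to_net("204.152.187.6", 24))
print(longest_covering_prefix("204.152.187.6", table))
```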

> now you have a stream of reduced tuples:
> 
>     (type, query_net, response_net, proto, qname, qtype, qclass, rcode)
> 
> then suppose that you wanted to aggregate these reduced-tuple messages
> -- messages that are identical except for timestamp are collapsed into a
> single message with additional (time_first, time_last, count) fields.
> (which can themselves be collapsed together, summing the counts and
> setting the timestamp pair to the earliest/latest values.)
> ...
> now you have a deduplicated set of messages with these fields:
> 
>     (count, time_first, time_last, type, query_net, response_net, proto,
>         qname, qtype, qclass, rcode)

i can see a use for this, as a reduction, but it's not what i want for flow
metrics.  perhaps both objectives should be pursued.  the above reduction
would lose too much information (looking at it as a possible middle end
between dnsqr and dnsflow) since it would fold traffic bursts together.
burstiness is an important aspect of traffic shape/character.
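
a small sketch of the information loss, in python.  two made-up timestamp
series collapse to the same (count, time_first, time_last) triple even
though one is steady and the other is a burst.  counting per fixed
interval is just one assumed way of keeping the shape visible; the
10-second width is arbitrary.

```python
from collections import Counter

def collapse(timestamps):
    """The (count, time_first, time_last) reduction described above."""
    return (len(timestamps), min(timestamps), max(timestamps))

def per_interval(timestamps, width):
    """Count events per fixed-width interval, keyed by interval start."""
    return Counter(int(t // width) * width for t in timestamps)

steady = [0, 10, 20, 30, 40, 50, 60]   # one query every ten seconds
bursty = [0, 1, 2, 3, 4, 5, 60]        # six queries at once, then one late

print(collapse(steady) == collapse(bursty))
print(sorted(per_interval(steady, 10).items()))
print(sorted(per_interval(bursty, 10).items()))
```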

> ...
> 
> i think a hypothetical tool like the one i've described would help solve
> CZ.NIC's problem upthread (except for the need for extreme bit-packing
> efficiency, but since NMSG has transparent zlib support and benefits
> from protobuf's varint encoding i tend not to worry about encoding
> efficiency too much) as well as a more general class of problems.

according to my read of cz.nic's original problem statement, they need both
an expanded dnsqr schema capable of representing the reductions you're
describing, and a new dnsflow schema and supporting tools.  (i know that
because nmsg uses google protocol buffers we could add nearly anything we
wanted into any schema, but dnsflow would share almost no fields with dnsqr.)

my examples were clearly too shoddy to show what i was talking about.  i'll
think about this overnight and produce something more meatful tomorrowish.

> (there is also the problem that dnsqr supports TCP packets but does not
> yet have TCP stream reassembly support, but that's just a Small Matter
> of Programming.)

indeed.  we've got ip reassembly, now we need the tcp reassembly and also
the tcp stream parsing (since a tcp stream can have multiple dns messages).
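
the stream parsing half can be sketched as follows, assuming reassembly has
already produced a contiguous byte buffer.  dns over tcp prefixes each
message with a two-byte big-endian length (RFC 1035 section 4.2.2); the
function name and return shape are illustrative, not anything in nmsg.

```python
import struct

def split_dns_tcp_stream(buf):
    """Split a reassembled TCP byte stream into DNS messages.

    Returns (messages, leftover), where leftover holds any trailing bytes
    that do not yet form a complete length-prefixed message.
    """
    messages = []
    offset = 0
    while len(buf) - offset >= 2:
        (length,) = struct.unpack_from("!H", buf, offset)
        if len(buf) - offset - 2 < length:
            break  # message is incomplete; wait for more stream data
        messages.append(buf[offset + 2 : offset + 2 + length])
        offset += 2 + length
    return messages, buf[offset:]

# two complete (dummy) messages followed by a truncated third
stream = b"\x00\x03abc" + b"\x00\x02de" + b"\x00\x05xy"
messages, leftover = split_dns_tcp_stream(stream)
print(messages, leftover)
```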

higher level view:

a TLD or root or significant-SLD server operator should be running nmsgtool
on their BPF or should run an instrumented dns server.  dnsqr's ability to
represent cache creation/eviction events does not enter into the authority
side, so unless there's a problem reassembling TCP via BPF, it doesn't
matter how an authority server produces a dnsqr stream.  the stream should
be split, and fed into the dnsflow filter as well as fed (raw) into the
dnsqr reduction filter.  the raw dnsqr feed would be stored in a circular
buffer (hourly disk files purged after 48 hours or whatever).  the reduced
dnsqr streams and the dnsflow streams should be forwarded to a NOC server
and aggregated with the streams from the other servers in the cluster, and
then stored and analyzed.
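
two pieces of that pipeline are simple enough to sketch in python: purging
the hourly raw files after a retention window, and merging per-server
dnsflow counters at the NOC.  the function names, the file-name-to-start-
time mapping, and the (kind, thing) key shape are all assumptions made for
illustration.

```python
from collections import Counter

def expired_hourly_files(file_start_times, now, keep_hours=48):
    """Names of hourly files that ended at or before the retention cutoff.

    file_start_times maps a file name to its start time in epoch seconds.
    """
    cutoff = now - keep_hours * 3600
    return sorted(
        name for name, start in file_start_times.items()
        if start + 3600 <= cutoff
    )

def merge_flows(per_server_flows):
    """Sum dnsflow counters from every server in the cluster."""
    total = Counter()
    for flows in per_server_flows:
        total.update(flows)
    return total

now = 100 * 3600
files = {"h50": 50 * 3600, "h51": 51 * 3600, "h52": 52 * 3600}
print(expired_hourly_files(files, now))

server_a = Counter({("cli", "192.0.2.1"): 3})
server_b = Counter({("cli", "192.0.2.1"): 2, ("cli", "192.0.2.9"): 1})
cluster = merge_flows([server_a, server_b])
print(cluster)
```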