[dns-operations] dnsflow again (Re: DNS Traffic Archive Protocol )

Tue Dec 7 19:51:26 UTC 2010

Bedrich Kosata wrote:
> On 12/07/2010 03:29 AM, Robert Edmonds wrote:
> >dnsqr, as constituted in its current version, basically provides a
> >stream of mostly immutable tuples.  (i say mostly because there's no
> >good way to modify an NMSG message from the command line without writing
> >a libnmsg program.)  there are more than just these fields in the real
> >implementation, but for a simple example:
> >
> >     (type, query_ip, response_ip, proto, query_port, response_port, id,
> >         qname, qtype, qclass, rcode, query_packet, response_packet)
> >
> >instead of developing a new message schema for "dnsflow", and a tool to
> >map from the lossless "dnsqr" to the lossy "dnsflow" format, i believe
> >we could simply delete fields from the dnsqr tuple.  e.g., supposing we
> >don't care about the port numbers, the id, or the raw packets, we invoke
> >a not-yet-written tool to delete those fields (this tool could also
> >generate this particular subset of dnsqr from a live network source by
> >simply performing the deletions prior to generating output):
> >
> >     # dnstool -r dnsqr_original.nmsg -w dnsflow.nmsg \
> >         -x id -x query_port -x response_port -x query_packet -x response_packet
> >
> >(this could obviously be simplified to a --iwantthiscombo command line
> >flag but i wanted to show the full generality of the approach.)
> >
> >you would then have a bunch of messages with the smaller tuples:
> >
> >     (type, query_ip, response_ip, proto, qname, qtype, qclass, rcode)
> >
> 
> What about other data that are now apparently either not present or
> present only in the raw packets, such as the time it took the server
> to respond to a query, request of DNSSEC, request of recursion,
> etc.?

we would add new fields to the ISC/dnsqr message schema as necessary.
for instance, we could add an optional "delay" field that is populated
with the difference between the response timestamp and the query
timestamp when the raw query/response packets and their timestamps are
removed from the message.

(in fact the nmsg message module API would make it easy to have a
"delay" field that is either calculated on-demand based on the
query/response packet timestamps or stored directly in a protobuf field
if the packets and their timestamps have been removed from the message.)
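
as a rough illustration of that idea (a sketch only, not the nmsg message
module API: the class and field names below are hypothetical stand-ins),
a "delay" accessor could compute the value while both timestamps are
present and fall back to a stored copy once they have been stripped:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DnsqrMessage:
    """Hypothetical stand-in for a dnsqr message; not the real nmsg schema."""
    query_time: Optional[float] = None     # seconds; present while the query packet is kept
    response_time: Optional[float] = None  # seconds; present while the response packet is kept
    stored_delay: Optional[float] = None   # populated only when the packets are stripped

    @property
    def delay(self) -> Optional[float]:
        # Calculate on demand while both timestamps are still in the message.
        if self.query_time is not None and self.response_time is not None:
            return self.response_time - self.query_time
        # Otherwise fall back to the value stored at stripping time (may be None).
        return self.stored_delay

    def strip_packets(self) -> None:
        # Freeze the delay before dropping the timestamps it is derived from.
        self.stored_delay = self.delay
        self.query_time = None
        self.response_time = None
```

callers read `msg.delay` the same way before and after `strip_packets()`,
which is the property the "optional field" approach is meant to provide.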

similarly, we could have new optional "qflags" and "rflags" fields that can
be populated with the query flags and the response flags.  and these
don't necessarily have to be copied verbatim from the DNS message
header, they could be synthesized from the message header flags and the
extended EDNS flags, which would take care of recording both the DO and
RD flags.  (similarly those fields can be synthesized on-demand if the
original packets are present in the message.)
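
a minimal sketch of synthesizing such a field, assuming a made-up bit
layout for "qflags" (the QFLAG_* values and the function name are
hypothetical; only the RD and DO bit positions come from the DNS header
and the EDNS OPT flags word):

```python
from typing import Optional

DNS_FLAG_RD = 0x0100   # recursion desired, in the 16-bit DNS header flags word
EDNS_FLAG_DO = 0x8000  # DNSSEC OK, in the 16-bit flags word of the EDNS OPT RR

# Hypothetical bit assignments for the synthesized "qflags" field.
QFLAG_RD = 0x1
QFLAG_DO = 0x2


def synthesize_qflags(header_flags: int, edns_flags: Optional[int]) -> int:
    """Combine selected header and EDNS flag bits into one small integer."""
    qflags = 0
    if header_flags & DNS_FLAG_RD:
        qflags |= QFLAG_RD
    # edns_flags is None when the query carried no OPT record at all.
    if edns_flags is not None and edns_flags & EDNS_FLAG_DO:
        qflags |= QFLAG_DO
    return qflags
```

an "rflags" field could be built the same way from the response header.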

> One might also want to store only the answer and authority sections
> of the reply, but not the additional section.
> Could such things be easily included or would a new schema or format
> make more sense?

yes, individual RRsets could be parsed out of the response and
represented by a group of optional fields in the message.

> >first you would sort your collected dnsqr logs, so that all the
> >identical messages are adjacent in the message stream:
> >
> >     # for i in `seq 0 23`; do dnstool --sort -r dnsflow-$i.nmsg \
> >         -w dnsflow-sorted-$i.nmsg && rm dnsflow-$i.nmsg; done
> >
> >then you would merge the identical, adjacent messages together:
> >
> >     # dnstool --merge -r dnsflow-sorted-*.nmsg -w dnsflow-merged.nmsg \
> >         && rm dnsflow-sorted-*.nmsg
> >
> >now you have a deduplicated set of messages with these fields:
> >
> >     (count, time_first, time_last, type, query_net, response_net, proto,
> >         qname, qtype, qclass, rcode)
> >
> 
> I agree that aggregation is the most powerful compression method in
> this case, but for many applications (such as anomaly detection,
> traffic peak analysis, etc.) it is necessary to have the stream of
> packets sorted by time, not by the content.

but we are not talking about a stream of packets any more, but a stream
of tuples that have been derived from packet contents.  if they are
aggregated together in time (representing multiple queries from a single
IP source with a (time_first, time_last, count) vector) or aggregated
together in space (reducing query_ip to query_net or query_asn) there is
not necessarily a well-defined chronological ordering to the tuples.

anyway, it's not hard to re-sort the message stream on a given field.
or if the application does not benefit from this type of semantic
aggregation, simply don't perform it :)  for traffic graphs, for example,
you wouldn't necessarily want to do temporal aggregation.
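
to make the two kinds of aggregation concrete, here is a sketch in plain
python rather than with the (not-yet-written) dnstool.  the reduced
record layout, the function names, and the /24 and /48 prefix lengths are
all assumptions chosen for illustration:

```python
import ipaddress
from itertools import groupby
from typing import Iterable, List, Tuple

# Hypothetical reduced tuple: (timestamp, query_ip, qname, qtype, rcode)
Record = Tuple[float, str, str, str, str]


def to_net(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Spatial aggregation: reduce an address to its enclosing network."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def aggregate(records: Iterable[Record]) -> List[tuple]:
    """Return (count, time_first, time_last, query_net, qname, qtype, rcode)."""
    # Pair each aggregation key with its timestamp, then sort so that
    # identical keys become adjacent (the "sort" step).
    keyed = sorted(
        ((to_net(ip), qname, qtype, rcode), ts)
        for ts, ip, qname, qtype, rcode in records
    )
    merged = []
    # Temporal aggregation: collapse adjacent identical keys (the "merge" step).
    for key, group in groupby(keyed, key=lambda pair: pair[0]):
        times = [ts for _, ts in group]
        merged.append((len(times), min(times), max(times)) + key)
    return merged
```

the output comes back ordered by key, not by time; an application that
wants a chronological view can re-sort it, e.g. on time_first with
`sorted(merged, key=lambda t: t[1])`.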

> Also, unlike for recursive servers, for authoritative servers,
> identical queries from one client get repeated only once per TTL of
> the corresponding RR, so the period for aggregation would have to be
> relatively large (at least when the aggregation is performed on the
> whole tuples as you described),

it is not quite the same thing, but i have done something similar with
the .com ZFA TLD dumps.  basically the NS and A/AAAA RRsets in the .com
zone are repeated once a day and aggregated together.  i have no
problems aggregating entire months of NMSG-formatted sorted .com data
together.  (the only issue is sorting the .com data which in my naïve
implementation loads the data for an entire zone version into memory.)
there aren't any network-level fields (query IPs, query ports, etc.) in
this application but the situation is roughly analogous to a single IP
asking for every record in the zone once a day :)

> thus making the whole process of compression and eventual later
> decompression more demanding.

you mention "decompression" but this is a lossy compression scheme.  how
do you decompress a tuple of elements into the original packet?  are you
instead talking about synthesizing a DNS message that resembles the
original message, e.g. for importing into tools that only process raw
messages?

-- 
Robert Edmonds
edmonds at isc.org