[Collisions] source prefix analysis of DITL traffic

Warren Kumari warren at kumari.net
Fri Aug 23 20:09:45 UTC 2013


On Aug 23, 2013, at 2:32 PM, Jim Reid <jim at rfc1035.com> wrote:

> On 23 Aug 2013, at 17:21, Gavin Brown <gavin.brown at centralnic.com> wrote:
> 
>> One metric which would be useful is also which networks queries come
>> from. If .TLD gets a bunch of junk queries, but all these queries come
>> from a single misbehaving resolver or network, that is a much easier
>> problem to fix than if they come from a large number of networks.
> 
> Indeed it is, Gavin. However, I doubt you'll find much sign of a single misbehaving resolver or network in the DITL datasets.
> 
> To the best of my recollection, almost none of the new gTLD traffic I looked at fitted that pattern. Maybe some of the more obscure gTLDs that ICANN has decided are "low risk" have traffic from just a few prefixes. Our report only listed the prefix counts for the top 100 new gTLDs, but I did gather that data for all of them. Let me see if I can find that info. From memory, there was ~1GB of prefix data from each DITL run: i.e. for each gTLD, identify which prefixes generated the traffic and how often each of those prefixes appeared.
> 
> BTW we sampled just a few of those prefix counts and found that they generally followed a power-law distribution. [There wasn't time to do them all.]

I just wanted to mention that Google (a DNS-OARC member) has a service called BigQuery. 

"Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure. Simply move your data into BigQuery and let us handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data."

This allows you to do simple, fast queries across multi-TB data stores.

For example, paste:

SELECT COUNT(contributor_ip) AS cnt, contributor_ip FROM [publicdata:samples.wikipedia] GROUP EACH BY contributor_ip ORDER BY cnt DESC LIMIT 1000

into https://bigquery.cloud.google.com/
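The same group-by-source-prefix counting can of course be done locally once per-prefix counts have been extracted from the pcaps. A minimal Python sketch of the concentration question Gavin raised (do a handful of prefixes account for nearly all of a gTLD's traffic?) -- the prefixes and counts below are invented for illustration, not real DITL data:

```python
from collections import Counter

# Hypothetical per-prefix query counts for one gTLD (prefix -> queries).
# In practice these would be extracted from the DITL pcaps.
counts = Counter({
    "192.0.2.0/24":     9500,
    "198.51.100.0/24":   300,
    "203.0.113.0/24":    120,
    "192.0.2.128/25":     50,
    "198.18.0.0/15":      30,
})

def top_n_share(counts, n):
    """Fraction of all queries contributed by the n busiest prefixes."""
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(n))
    return top / total

# A heavily skewed (power-law-like) distribution shows up as a large
# share concentrated in very few prefixes.
print(f"top-1 prefix accounts for {top_n_share(counts, 1):.1%} of queries")
```

If the top one or two prefixes dominate like this, the "single misbehaving resolver" explanation is plausible; a flat distribution across thousands of prefixes is the harder case.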

I realize that we cannot put the current DITL data into this (because of the data sharing agreements), but if this sounds like something that would be very useful I can look into getting some quota donated and working with folk to get data contributed. 


W

> So for the gTLDs that you're interested in, there just might be a small number of prefixes that account for the traffic. If these were discounted somehow (handwave!), perhaps that could be enough to shuffle a gTLD into ICANN's low risk category.
> 
> Bear in mind too that some of the source addresses in the DITL pcaps have been obfuscated. So prepare for disappointment just in case the traffic for your gTLD(s) turns out to mainly come from a /24 of RFC1918 space or a /32 of locally scoped IPv6.
> 
> I am also not sure what OARC's policy is about disclosing the actual prefix data once you've done that analysis, Gavin. This does appear to be a grey area. While it may be OK to share that info amongst OARC members, it may well not be OK to distribute the prefix data further. Even if it is OK, some RSOs could be uncomfortable or unhappy about that. Keith?
> _______________________________________________
> Collisions mailing list
> Collisions at lists.dns-oarc.net
> https://lists.dns-oarc.net/mailman/listinfo/collisions
> 

--
"When it comes to glittering objects, wizards have all the taste and self-control of a deranged magpie."
-- Terry Pratchett
