[dns-operations] Question for fellow users of RIPE ATLAS (broken/saturated probes or what?)

Giovane Moura giovane.moura at sidn.nl
Tue Dec 11 14:33:04 UTC 2018


Hi,

Since this is also important here for our Ops team, I decided to drill
down a bit more.

*TL;DR summary*: I could not reproduce the issue in either the b-root
data or the .nl data, so maybe it was something transient?  @Jake: could
you run this measurement again?

best,

/giovane

Analysis
============================

@Jake: is this the measurement you used for these graphs [0]?

I've parsed these results and added the probe version and other info,
including catchment (anycast site). See 17947276.csv.gz.
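
For reference, a minimal sketch of how that annotation can be done
(assuming the results of [0] and the decompressed probe archive [3] are
saved locally; the file names and the 'objects'/'tags' layout are my
assumptions about the usual Atlas JSON formats, so adjust as needed):

import json

# Probe archive [3]: map probe id -> probe metadata.
with open("20181210.json") as f:
    probes = {p["id"]: p for p in json.load(f)["objects"]}

def fw_version(probe):
    # Firmware shows up as a 'system-vN' tag (assumption about tag format).
    for tag in probe.get("tags", []):
        slug = tag.get("slug") if isinstance(tag, dict) else tag
        if slug and slug.startswith("system-v"):
            return slug
    return "unknown"

# Measurement results [0], downloaded as JSON.
with open("17947276.json") as f:
    results = json.load(f)

rows = []
for r in results:
    p = probes.get(r["prb_id"])
    if p and "result" in r:
        rows.append((r["prb_id"], p.get("country_code"),
                     fw_version(p), r["result"].get("rt")))  # rt in ms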

Anycast site mapping hypothesis:
  * Different probes could be routed to sites in different countries
(BGP mappings are sometimes a bit chaotic, see [1]). For example, probes
in the NL could wind up being answered by a site in Japan. However, this
is not the case here: all probes from [0] are answered by the London
sites.

The only correlation we can clearly see is this:
  * all system-version 1 (v1) probes in your measurement are slow
  * most version 2 (v2) probes are slow; some are fast
  * v3 and v4 probes are really fast

OK, so the question is: is this also observed in other datasets?

I did two things to verify this.

Step 1: Analyze b-root CHAOS queries (built-in Atlas measurement)
=================================================================

Atlas has built-in measurements, which means that all probes are used
and they are continuously monitored.

I downloaded one hour of measurements from Atlas for b-root using [2],
from today (GMT 00:00 to 01:00). This measurement, like [0], consists of
hostname.bind DNS CHAOS queries.
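
(This is essentially one HTTP GET; a sketch, assuming the 'requests'
library is available:)

import requests

# One hour of built-in b-root CHAOS results (measurement 10310), as in [2].
url = "https://atlas.ripe.net/api/v2/measurements/10310/results/"
params = {"start": 1544486400, "stop": 1544490000, "format": "json"}
results = requests.get(url, params=params, timeout=300).json()
print(len(results), "result objects")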

Then I parsed the results and proceeded with some filtering (a sketch
follows this list):

 - Consider only measurements going to one anycast site (LAX), so that
all probes hit the same site (the other b-root site is MIAMI).
 - Consider only probes within one country. I chose Germany because it
has many probes (good for coverage) and is far enough from LAX that, in
theory, the latency from anywhere in Germany to LAX should be similar.
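
A sketch of that filter, building on the annotated rows from the earlier
snippet and additionally assuming each row carries the decoded
hostname.bind answer; the 'lax' substring test is my assumption about
how b-root names its instances:

# Keep only German probes whose CHAOS answer points at the LAX site.
# rows: (prb_id, country_code, fw_tag, rtt, hostname_bind) tuples.
def keep(row):
    prb_id, cc, fw, rtt, hostname = row
    return cc == "DE" and rtt is not None and "lax" in (hostname or "").lower()

de_lax = [row for row in rows if keep(row)]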

Ultimately, we ended up with:
 -  82 version 1 probes (tag 'system-v1' from [3])
 - 231 version 2 probes (tag 'system-v2' from [3])
 - 987 version 3 probes (tag 'system-v3' from [3])
 -  24 version 4 probes (tag 'system-v4' from [3])

OK, so now we have only Germany-based probes reaching a single b-root
site (LAX). See 10310-OnlyDE.csv.gz.

Analyzing the RTT distribution for these probes per firmware version (a
sketch of the computation follows the list) gives the following median
RTTs:

   * V1:  167.9 ms
   * V2:  167.0 ms
   * V3:  164.4 ms
   * V4:  168.7 ms
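
The computation behind those numbers is straightforward; a sketch over
the filtered rows from the previous snippet:

from collections import defaultdict
from statistics import median

# Median RTT per firmware version over the German/LAX-only rows.
by_version = defaultdict(list)
for prb_id, cc, fw, rtt, hostname in de_lax:
    by_version[fw].append(rtt)

for fw in sorted(by_version):
    print(fw, round(median(by_version[fw]), 1), "ms")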

So the hypothesis that all V1 and V2 probes are much 'slower' does not
hold for today's b-root measurement.

So what now?
=================

Your .ca measurements show an issue correlated with probe version; the
b-root one does not, at least for Germany.

There are other hypotheses:
 - built-in measurements have higher priority, so b-root would not be
affected


To rule this out, let's analyze another measurement, on .nl.

Step 2: Analyzing .nl data
==========================

We use measurement [4] in this analysis, executed on 2018-12-07, with
~10k probes. Let's again filter German probes using the same procedure
as in step 1, but now keeping only probes answered by the Frankfurt site
(hostname.bind tld-nl-fra1 or tld-nl-fra2).
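
Extracting the site from a result works the same way as for b-root; a
sketch, assuming dnspython is installed and that the raw DNS reply is in
the base64-encoded 'abuf' field of each result (the usual Atlas layout):

import base64
import dns.message

def hostname_bind(result_obj):
    # Decode the hostname.bind TXT answer from an Atlas DNS result object.
    wire = base64.b64decode(result_obj["result"]["abuf"])
    msg = dns.message.from_wire(wire)
    for rrset in msg.answer:
        for rdata in rrset:
            return b"".join(rdata.strings).decode()  # e.g. 'tld-nl-fra1'
    return None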

And it turns out that, out of the ~10k probes I asked for, NONE had
version 1... Actually, the smallest probe ID I got was 2000. In the
b-root dataset, I found that V1 probes have probe IDs < 1000.

Which is good news, since Atlas seems to automatically filter those
out... But what about the other firmware versions?

We ended up with the following distribution of unique Germany-based
Atlas probes reaching the Frankfurt site:

   * 0 V1 probes
   * 155 V2 probes
   * 728 V3 probes
   * 18 V4 probes

See 17927373-OnlyDE.csv.gz

What about their median RTT?

   * V2:  17.6 ms
   * V3:  14.2 ms
   * V4:  15.7 ms


As with b-root, the .nl measurements also do not show a significant
latency difference across probe versions.



So the hypothesis of V1/V2 being far slower than V3/V4 again does not
hold for .nl.

So, IMHO, the difference in the medians is negligible, depending on your
use case. If you're measuring clock skew, hell, that'd matter... but if
you're engineering anycast, that's OK.

So how to move forward?
======================

@Jake: could you repeat your measurement to see if your results were due
to a transient issue -- like Atlas being overloaded around that time?
And do not run traceroute on the same probes while you measure DNS, just
in case.
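
If it helps, re-running it as a one-off is a single POST to the
measurement-creation API; a rough sketch (field names are from memory of
the v2 API, so double-check them, and YOUR_KEY is a placeholder):

import requests

spec = {
    "is_oneoff": True,
    "definitions": [{
        "type": "dns",
        "af": 4,
        "target": "any.ca-servers.ca",
        "query_class": "CHAOS",        # verify the exact class string the API expects
        "query_type": "TXT",
        "query_argument": "hostname.bind",
        "use_probe_resolver": False,
        "description": "hostname.bind re-run",
    }],
    "probes": [{"requested": 100, "type": "country", "value": "BE"}],
}
resp = requests.post("https://atlas.ripe.net/api/v2/measurements/",
                     params={"key": "YOUR_KEY"}, json=spec, timeout=60)
print(resp.status_code, resp.json())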

The authors of [5] have analyzed probe versions and performance, and
their results confirm what I found, not what Jake reports.

Any other thoughts?

/giovane


[0] https://atlas.ripe.net/measurements/17947276
[1] http://www.isi.edu/%7ejohnh/PAPERS/Schmidt17a.pdf
[2] https://atlas.ripe.net/api/v2/measurements/10310/results/?start=1544486400&stop=1544490000&format=json
[3] https://ftp.ripe.net/ripe/atlas/probes/archive/2018/12/20181210.json.bz2
[4] https://atlas.ripe.net/measurements/17927373/
[5] http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf

On 12/11/18 10:13 AM, Giovane Moura wrote:
> Hi Jake,
> 
> Thanks for pointing this out. We also monitor our authoritative name
> servers but I was not aware of this issue. This is particularly
> important when deciding where to deploy new anycast sites.
> 
> Say a country X observes 200ms latency to one of your NSes. You'd
> need to break it down per probe version, as you did, and consider only
> probes with ID > 6000 to be sure that this is not an artifact.
> 
> Monitoring systems that rely on the variation of RTT should be OK,
> at least in the aggregate.
> 
> thanks,
> 
> /giovane
> 
> 
> On 12/7/18 10:13 PM, Jake Zack wrote:
>> Hey all,
>> 
>> 
>> 
>> I often do DNS tests via RIPE ATLAS to confirm that changes and/or
>> new additions to our anycast network haven’t created any collateral
>> damage to our clouds and others.
>> 
>> 
>> 
>> I’ve noticed that the first several thousand probes (probe ID’s <
>> ~6000) consistently return inaccurate/terrible/useless results to
>> DNS tests.
>> 
>> 
>> 
>> Has anyone else noticed this?  Has anyone else reported it to RIPE 
>> ATLAS?  Any theories?
>> 
>> 
>> 
>> I’ll first use DNS queries coming out of Belgium as an example…
>> 
>> 
>> 
>> My average response time from Belgium to ANY.CA-SERVERS.CA is
>> 25.103 ms when you exclude probe ID’s < ~6000. With those probes,
>> it’s 83.944 ms – Greater than a 300% difference?!
>> 
>> 
>> 
>> The worst part is that the ‘traceroute’ functionality never seems
>> to show this latency…so I feel powerless in fixing this brokenness
>> from my end.
>> 
>> 
>> 
>> And it’s not just traffic from Belgium, to be clear…
>> 
>> 
>> 
>> My average response time from Ireland to ANY.CA-SERVERS.CA is
>> 21.296 ms when you exclude probe ID’s < ~6000. With those probes,
>> it’s 33.507 ms.
>> 
>> My average response time from Netherlands to ANY.CA-SERVERS.CA is
>> 22.187 ms when you exclude probe ID’s < ~6000. With those probes,
>> it’s 77.280ms.
>> 
>> My average response time from Poland to ANY.CA-SERVERS.CA is 43.910
>> ms when you exclude probe ID’s < ~6000. With those probes, it’s
>> 89.235 ms.
>> 
>> 
>> 
>> And it’s not just CIRA…
>> 
>> 
>> 
>> The average response time from Belgium to SNS-PB.ISC.ORG is 22.883
>> ms when you exclude probe ID’s < ~6000.  With those probes, it’s
>> 33.169 ms.
>> 
>> The average response time from Belgium to F.ROOT-SERVERS.NET is
>> 15.094 ms when you exclude probe ID’s < ~6000. With those probes,
>> it’s 23.076 ms.
>> 
>> The average response time from Belgium to AC1.NSTLD.COM is 36.333
>> ms when you exclude probe ID’s < ~6000.  With those probes, it’s
>> 44.900 ms.
>> 
>> The average response time from Belgium to A0.INFO.AFILIAS-NST.INFO
>> is 146.38 ms when you exclude probe ID’s < ~6000.  With those
>> probes, it’s 155.70 ms.
>> 
>> The average response time from Belgium to X.NS.DNS.BE is 67.214 ms
>> when you exclude probe ID’s < ~6000.  With those probes, it’s 
>> 83.836 ms.
>> 
>> 
>> 
>> I’m attaching some photos to visually show just how ineffective
>> these probes are at measuring anything related to DNS…
>> 
>> 
>> 
>> Confirmations on others seeing this, or reporting this, or moving
>> away from RIPE ATLAS for these measurements because of this?
>> Recommendations other than ThousandEyes (I’m not interested in
>> paying $100K/year for what costs $500/year in VM’s and perl
>> scripts).
>> 
>> 
>> 
>> Ideas on how to get this rectified so that these tests can be
>> useful again?
>> 
>> 
>> 
>> -Jacob Zack
>> 
>> DNS Architect – CIRA (.CA TLD)
>> 
>> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 17947276.csv.gz
Type: application/gzip
Size: 8146 bytes
Desc: 17947276.csv.gz
URL: <https://lists.dns-oarc.net/pipermail/dns-operations/attachments/20181211/2b190c3f/attachment.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 10310-OnlyDE.csv.gz
Type: application/gzip
Size: 359079 bytes
Desc: 10310-OnlyDE.csv.gz
URL: <https://lists.dns-oarc.net/pipermail/dns-operations/attachments/20181211/2b190c3f/attachment-0001.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 17927373-OnlyDE.csv.gz
Type: application/gzip
Size: 96779 bytes
Desc: 17927373-OnlyDE.csv.gz
URL: <https://lists.dns-oarc.net/pipermail/dns-operations/attachments/20181211/2b190c3f/attachment-0002.gz>

