[dns-operations] Question for fellow users of RIPE ATLAS (broken/saturated probes or what?)

Moritz Muller moritz.muller at sidn.nl
Wed Dec 12 10:38:45 UTC 2018


I noticed that Gio analyses measurements where probes send multiple queries over time, whereas Jake carries out “one-off” measurements.

I carried out a one-off measurement to ns1.sidn.nl, which is unicast, from 500 probes in the Netherlands and I could reproduce Jake's observations. [1]

To rule out resolver issues, I repeated the same measurement, querying the IP address of ns1.sidn.nl directly.
But the results are the same [2].
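
For reference, such a one-off can be created through the measurement API roughly as follows (a sketch; the API key and the query parameters are placeholders, not necessarily the exact ones I used):

    import requests

    API_KEY = "YOUR_ATLAS_API_KEY"  # placeholder

    spec = {
        "definitions": [{
            "type": "dns",
            "af": 4,
            "target": "ns1.sidn.nl",
            "query_class": "IN",            # illustrative query
            "query_type": "SOA",
            "query_argument": "sidn.nl",
            "use_probe_resolver": False,
            "description": "one-off DNS query to ns1.sidn.nl",
        }],
        "probes": [{"type": "country", "value": "NL", "requested": 500}],
        "is_oneoff": True,
    }

    r = requests.post("https://atlas.ripe.net/api/v2/measurements/",
                      params={"key": API_KEY}, json=spec)
    print(r.status_code, r.json())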

Our first hypothesis was that only the first query takes a long time.
So we looked at the response time of the first query in Gio's measurements to ns3.dns.nl, but even for the first query, old probes do not show higher response times. [3]
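
(For reference, a rough way to pull the first result per probe out of the raw results -- not necessarily exactly what we did; field names as in the v2 results format, and you would add start/stop parameters to limit the time window:)

    import requests

    MSM_ID = 17927373  # the measurement from [3]

    results = requests.get(
        "https://atlas.ripe.net/api/v2/measurements/%d/results/" % MSM_ID,
        params={"format": "json"},
    ).json()

    first_rtt = {}  # prb_id -> RTT (ms) of that probe's earliest result
    for res in sorted(results, key=lambda r: r["timestamp"]):
        rtt = res.get("result", {}).get("rt")
        if res["prb_id"] not in first_rtt and rtt is not None:
            first_rtt[res["prb_id"]] = rtt

    print(len(first_rtt), "probes with a first-query RTT")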


So my new theory is that the issue lies in the scheduler of the one-off queries.
According to a blog post by RIPE [6], also mentioned in [5], one-off measurements are started by the “eooqd” daemon, whereas continuous measurements are run by “eperd”.

So maybe older probes run into some issue with the eooqd daemon.

That means that the problems described in [4] and [5] are still valid, but at least the huge penalty seen by Jake is not a big problem for continuous measurements.


[1] https://atlas.ripe.net/measurements/18086197/#!probes
[2] https://atlas.ripe.net/measurements/18086767/#!probes
[3] https://atlas.ripe.net/measurements/17927373/#!probes
[4] https://dl.acm.org/citation.cfm?doid=2805789.2805796
[5] https://clarinet.u-strasbg.fr/~pelsser/publications/Holterbach-ripe-atlas-sharing-imc2015.pdf
[6] https://labs.ripe.net/Members/philip_homburg/ripe-atlas-measurements-source-code

Moritz




> On 11 Dec 2018, at 15:33, Giovane Moura <giovane.moura at sidn.nl> wrote:
> 
> Hi,
> 
> Since this is also important here for our Ops team, I decided to drill
> down a bit more.
> 
> *TL;DR summary*: I could not find the same issues in either the b-root
> data or the .nl data; maybe something transient? @Jake: could you run
> this measurement again?
> 
> best,
> 
> /giovane
> 
> Analysis
> ============================
> 
> @Jake: is this the measurement you used for these graphs [0]?
> 
> I've parsed these results and added probe version and other info,
> including catchment (anycast site). See 17947276.csv.gz.
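> 
> (If anyone wants to redo that step, a rough sketch of the parsing --
> assuming dnspython is available to decode the abuf; the catchment is
> taken from the hostname.bind TXT answer, and the input file name is
> just a placeholder for the raw results of [0]:)
> 
>     import base64, json
>     import dns.message  # dnspython
> 
>     def site_and_rtt(result):
>         """Return (hostname.bind answer, RTT in ms) of one result."""
>         r = result.get("result", {})
>         site = None
>         if "abuf" in r:
>             msg = dns.message.from_wire(base64.b64decode(r["abuf"]))
>             for rrset in msg.answer:
>                 for rdata in rrset:
>                     site = rdata.strings[0].decode()
>         return site, r.get("rt")
> 
>     with open("raw-results-17947276.json") as f:  # placeholder name
>         parsed = [(res["prb_id"],) + site_and_rtt(res)
>                   for res in json.load(f)]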
> 
> Anycast site mapping hypothesis:
>  * Different probes could be mapped to sites in different countries (BGP
> mappings are sometimes a bit chaotic, see [1]). For example, probes in the NL could
> wind up being answered by a site in Japan. However, this is not the
> case: all probes from [0] are answered by London sites.
> 
> The only correlation we can clearly see is this:
>  * all system-version 1 probes in your measurement are slow
>  * most of version 2 (v2) are slow; some are fast
>  * v3 and v4 are really fast
> 
> OK, so the question is: is this also observed in other datasets?
> 
> I did two things to verify this.
> 
> Step 1: Analyze chaos B-root queries (built-in atlas measurement)
> =================================================================
> 
> Atlas has built-in measurements, which means that all probes are used
> and they are continuously monitored.
> 
> I downloaded 1 hour of measurements from Atlas for b-root using [2],
> from today (GMT 00:00 to 01:00). This measurement, like [0], consists of
> hostname.bind DNS CHAOS queries.
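> 
> (That is, roughly:)
> 
>     import requests
> 
>     results = requests.get(
>         "https://atlas.ripe.net/api/v2/measurements/10310/results/",
>         params={"start": 1544486400, "stop": 1544490000,
>                 "format": "json"},
>     ).json()
>     print(len(results), "results in that hour")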
> 
> Then, I parsed the results and proceeded with some filtering (a sketch
> follows the list):
> 
> - Consider only measurements going to one anycast site (LAX) -- so all
> probes would hit the same site (the other b-root site is MIAMI)
> - Consider only probes within one country. I chose Germany because it
> has many probes (good for coverage), and it's far enough from LAX so
> most of the latency from anywhere in Germany to LAX should be, in
> theory, similar.
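> 
> (A sketch of that filtering, reusing the (probe, site, RTT) tuples from
> the parsing sketch above; the field names in the probe archive and the
> LAX hostname.bind strings are assumptions to be checked against the
> actual data:)
> 
>     import bz2, json
> 
>     # probe metadata archive [3]: probe ID -> probe object
>     # (top-level "objects" key and field names assumed)
>     with bz2.open("20181210.json.bz2") as f:
>         probes = {p["id"]: p for p in json.load(f)["objects"]}
> 
>     def version_tag(p):
>         # assumes tags are plain slugs such as 'system-v3'
>         tags = p.get("tags", [])
>         return next((t for t in tags if t.startswith("system-v")), None)
> 
>     kept = []
>     for prb_id, site, rtt in parsed:  # from the parsing sketch
>         p = probes.get(prb_id)
>         if (p and p.get("country_code") == "DE"
>                 and site and "lax" in site.lower()):  # assumed naming
>             kept.append((version_tag(p), rtt))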
> 
> Ultimately, we ended up with:
> -  82 probes version 1 (tag 'system-v1' from [3])
> - 231 probes version 2 (tag 'system-v2' from [3])
> - 987 probes version 3 (tag 'system-v3' from [3])
> -  24 probes version 4 (tag 'system-v4' from [3])
> 
> OK, so now we have only Germany-based probes reaching a single site
> of b-root (LAX); see 10310-OnlyDE.csv.gz.
> 
> By analyzing the RTT distribution for these probes and versions, we
> found the following median RTTs (a rough computation sketch follows
> the list):
> 
>   * V1:  167.9 ms
>   * V2:  167.0 ms
>   * V3:  164.4 ms
>   * V4:  168.7 ms
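> 
> (Computed from the filtered (version tag, RTT) pairs along these lines:)
> 
>     from statistics import median
>     from collections import defaultdict
> 
>     by_version = defaultdict(list)
>     for tag, rtt in kept:            # pairs from the filtering sketch
>         if tag and rtt is not None:
>             by_version[tag].append(rtt)
> 
>     for tag in sorted(by_version):
>         print(tag, round(median(by_version[tag]), 1), "ms")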
> 
> So the hypothesis of all V1 and V2 probes being way 'slower' does not
> hold for today's b-root measurement.
> 
> So what now?
> =================
> 
> Your .ca measurements show an issue with probe version; the b-root one
> does not, at least for Germany.
> 
> There are other hypotheses:
> - built-in measurements have higher priority, so b-root would not be
> affected
> 
> 
> To rule this out, let's analyze another measurement, on .nl.
> 
> Step 2: analyzing .nl data
> ==========================
> 
> We use measurement [4] in this analysis, executed on 2018-12-07, with
> ~10k probes. Let's again filter German probes using the same procedure
> used in step 1, but now we keep only probes whose queries land at the
> Frankfurt site (hostname.bind tld-nl-fra1 or tld-nl-fra2).
> 
> And it turns out that out of the ~10k probes asked, NONE had version 1...
> Actually, the smallest probeID value I got was 2000. In the b-root
> dataset, I found that V1 probes have probeID<1000.
> 
> Which is good news, since Atlas seems to automatically filter those
> out... But what about the other firmware versions?
> 
> We ended up with the following distribution of unique Germany-based
> Atlas probes reaching Frankfurt:
> 
>   * 0 V1 probes
>   * 155 V2 probes
>   * 728 V3 probes
>   * 18 V4 probes
> 
> See 17927373-OnlyDE.csv.gz
> 
> What about their median RTT?
> 
>   * V2:  17.6 ms
>   * V3:  14.2 ms
>   * V4:  15.7 ms
> 
> 
> As for b-root, the .nl measurements also do not show a significant
> latency difference across probe versions.
> 
> 
> 
> So the hypothesis of V1/V2 being far slower than V3/V4 again does not
> hold for .nl.
> 
> So, IMHO, the difference in the medians is negligible, depending on
> your interest. If you're measuring clock skew, that would matter, but
> if you're engineering anycast, that's OK.
> 
> So how to move forward?
> ======================
> 
> @Jake: could you repeat your measurement to see whether your results
> were due to a transient issue -- like Atlas being overloaded around
> that time? And do not run traceroutes on the same probes you use for
> DNS, just in case.
> 
> The authors of [5] have analyzed probe versions and performance, and
> their results confirm what I found, not what Jake reported.
> 
> Any other thoughts?
> 
> /giovane
> 
> 
> [0]  https://atlas.ripe.net/measurements/17947276
> [1] http://www.isi.edu/%7ejohnh/PAPERS/Schmidt17a.pdf
> [2]
> https://atlas.ripe.net/api/v2/measurements/10310/results/?start=1544486400&stop=1544490000&format=json
> [3]
> https://ftp.ripe.net/ripe/atlas/probes/archive/2018/12/20181210.json.bz2
> [4] https://atlas.ripe.net/measurements/17927373/
> [5]
> http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
> 
> On 12/11/18 10:13 AM, Giovane Moura wrote:
>> Hi Jake,
>> 
>> Thanks for pointing this out. We also monitor our authoritative name
>> servers, but I was not aware of this issue. This is particularly
>> important when deciding where to deploy new anycast sites.
>> 
>> Say a country X observes 200 ms latency to one of your NSes. You'd
>> need to break it down per probe version, as you did, and consider
>> only probe IDs > ~6000 to be sure that this is not an artifact.
>> 
>> Monitoring systems that rely on the variation of RTT should be OK,
>> at least in the aggregate.
>> 
>> thanks,
>> 
>> /giovane
>> 
>> 
>> On 12/7/18 10:13 PM, Jake Zack wrote:
>>> Hey all,
>>> 
>>> 
>>> 
>>> I often do DNS tests via RIPE ATLAS to confirm that changes and/or
>>> new additions to our anycast network haven’t created any collateral
>>> damage to our clouds and others.
>>> 
>>> 
>>> 
>>> I’ve noticed that the first several thousand probes (probe IDs <
>>> ~6000) consistently return inaccurate/terrible/useless results to
>>> DNS tests.
>>> 
>>> 
>>> 
>>> Has anyone else noticed this?  Has anyone else reported it to RIPE
>>> ATLAS?  Any theories?
>>> 
>>> 
>>> 
>>> I’ll first use DNS queries coming out of Belgium as an example…
>>> 
>>> 
>>> 
>>> My average response time from Belgium to ANY.CA-SERVERS.CA is
>>> 25.103 ms when you exclude probe IDs < ~6000. With those probes,
>>> it’s 83.944 ms – greater than a 300% difference?!
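>>> 
>>> (For reference, one way to reproduce this kind of split from a
>>> measurement’s raw results -- just a sketch, and the measurement ID
>>> below is a placeholder:)
>>> 
>>>     import requests
>>>     from statistics import mean
>>> 
>>>     MSM_ID = 17947276  # placeholder measurement ID
>>> 
>>>     results = requests.get(
>>>         "https://atlas.ripe.net/api/v2/measurements/%d/results/"
>>>         % MSM_ID,
>>>         params={"format": "json"},
>>>     ).json()
>>> 
>>>     old, new = [], []
>>>     for res in results:
>>>         rtt = res.get("result", {}).get("rt")
>>>         if rtt is not None:
>>>             (old if res["prb_id"] < 6000 else new).append(rtt)
>>> 
>>>     print("probe ID <  6000:", round(mean(old), 3), "ms")
>>>     print("probe ID >= 6000:", round(mean(new), 3), "ms")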
>>> 
>>> 
>>> 
>>> The worst part is that the ‘traceroute’ functionality never seems
>>> to show this latency… so I feel powerless to fix this brokenness
>>> from my end.
>>> 
>>> 
>>> 
>>> And it’s not just traffic from Belgium, to be clear…
>>> 
>>> 
>>> 
>>> My average response time from Ireland to ANY.CA-SERVERS.CA is
>>> 21.296 ms when you exclude probe IDs < ~6000. With those probes,
>>> it’s 33.507 ms.
>>> 
>>> My average response time from Netherlands to ANY.CA-SERVERS.CA is
>>> 22.187 ms when you exclude probe IDs < ~6000. With those probes,
>>> it’s 77.280 ms.
>>> 
>>> My average response time from Poland to ANY.CA-SERVERS.CA is 43.910
>>> ms when you exclude probe IDs < ~6000. With those probes, it’s
>>> 89.235 ms.
>>> 
>>> 
>>> 
>>> And it’s not just CIRA…
>>> 
>>> 
>>> 
>>> The average response time from Belgium to SNS-PB.ISC.ORG is 22.883
>>> ms when you exclude probe IDs < ~6000.  With those probes, it’s
>>> 33.169 ms.
>>> 
>>> The average response time from Belgium to F.ROOT-SERVERS.NET is
>>> 15.094 ms when you exclude probe IDs < ~6000. With those probes,
>>> it’s 23.076 ms.
>>> 
>>> The average response time from Belgium to AC1.NSTLD.COM is 36.333
>>> ms when you exclude probe IDs < ~6000.  With those probes, it’s
>>> 44.900 ms.
>>> 
>>> The average response time from Belgium to A0.INFO.AFILIAS-NST.INFO
>>> is 146.38 ms when you exclude probe IDs < ~6000.  With those
>>> probes, it’s 155.70 ms.
>>> 
>>> The average response time from Belgium to X.NS.DNS.BE is 67.214 ms
>>> when you exclude probe IDs < ~6000.  With those probes, it’s
>>> 83.836 ms.
>>> 
>>> 
>>> 
>>> I’m attaching some photos to visually show just how ineffective
>>> these probes are at measuring anything related to DNS…
>>> 
>>> 
>>> 
>>> Confirmations on others seeing this, or reporting this, or moving
>>> away from RIPE ATLAS for these measurements because of this?
>>> Recommendations other than ThousandEyes (I’m not interested in
>>> paying $100K/year for what costs $500/year in VMs and Perl
>>> scripts).
>>> 
>>> 
>>> 
>>> Ideas on how to get this rectified so that these tests can be
>>> useful again?
>>> 
>>> 
>>> 
>>> -Jacob Zack
>>> 
>>> DNS Architect – CIRA (.CA TLD)
>>> 
>>> 
>>> _______________________________________________ dns-operations
>>> mailing list dns-operations at lists.dns-oarc.net
>>> https://lists.dns-oarc.net/mailman/listinfo/dns-operations
>>> 
>> 
>> _______________________________________________ dns-operations
>> mailing list dns-operations at lists.dns-oarc.net
>> https://lists.dns-oarc.net/mailman/listinfo/dns-operations
>> 
> [Attachments: 17947276.csv.gz, 10310-OnlyDE.csv.gz, 17927373-OnlyDE.csv.gz]
> _______________________________________________
> dns-operations mailing list
> dns-operations at lists.dns-oarc.net
> https://lists.dns-oarc.net/mailman/listinfo/dns-operations
