Some notes on perceived DNS resolver performance

The perceived performance of a DNS resolver, defined as the average time taken to deliver an answer, can be modeled as follows.

We define \( t_{auth} \) as the average response time of a remote authoritative server, and \( t_{cached} \) as how fast our resolver implementation responds in the cached case.

If we set define \( P_{cached} \) as the chance of a cache hit, the average response time \( t_{resolve} \) becomes: $$ t_{resolve} = P_{cached} \cdot t_{cached} + (1-P_{cached}) \cdot t_{auth} $$ This in itself is not yet overly insightful, but this becomes more so if we model \( P_{cached} \). If we assume that an entry stays in the cache until the TTL has expired, and we take into account the average query rate, we can guess the amount of cache hits we will get after the initial miss. This assumes that the query rate is high enough that we will get a significant number of queries before the TTL expires.

A special case are domains for which the reciprocal query rate is longer than the TTL, but such cases are either rare (because of a low query rate), or pathological (because of an unwisely low choice of TTL). In this case, \(P_{cached} = 0\), leading to \(t_{resolve} = t_{auth}\).

For the normal case, if the Query rate is \( R \) and the average TTL is \( T \), it follows that per time period \( T \): $$ \begin {aligned} N_{hits} & = R \cdot T - 1 \\ N_{misses} & = 1 \end {aligned} $$ This leads to: $$ \begin{aligned} P_{cached} & = \frac{N_{hits}}{N_{hits} + N_{misses}} = \frac{N_{hits}}{R \cdot T} \\ P_{cached} & = \frac{R \cdot T - 1}{R \cdot T} \end{aligned} $$ Which is probably better understood as \(P_{miss}\): $$ P_{miss} = \frac{N_{misses}}{N_{hits}+N_{misses}} = \frac{1}{R \cdot T} = 1 - P_{cached} $$ $$ P_{cached} = 1 - \frac{1}{R \cdot T} $$ From this it follows that the average resolving response time is: $$ \begin {aligned} t_{resolve} & = P_{cached} \cdot t_{cached} + (1-P_{cached}) \cdot t_{auth} \\ & = (1 - \frac{1}{R \cdot T}) \cdot t_{cached} + ( \frac{1}{R \cdot T} ) \cdot t_{auth} \end {aligned} $$ Usually \( t_{cached} \ll t_{auth} \), allowing for: $$ t_{resolve} \approx \frac{t_{auth}}{R \cdot T} $$ From this, it can be seen that to improve response times without artificially refetching, we can work on lowering \(t_{auth}\), by picking remote servers that respond quickly, or by raising the effective query rate \(R\).

One avenue for doing that is 'query concentration', making sure that similar queries end up on the same node within a load balanced setup.

Another way of achieving higher effective \( R \) is by adding 'cache peeking' to other nodes within a resolver constellation.

Another cheap way is to artificially increase minimum \( T \) above a certain, non-pathological, value.

Far away resolvers

For resolvers which are further away from the end-user, like Google DNS or OpenDNS, a fixed new term enters the equation: $$ t_{resolve} \approx t_{travel} + \frac{t_{auth}}{R \cdot T} $$ Where \(t_{travel}\) stands for the amount of time a query takes to travel to the actual resolver, and back.
Because \(t_{travel}\) can be significant, it is possible for 'local' resolvers to beat the performance offered by very large resolver clusters that see a stellar effective query rate.

Generalizing this from individual domains to real traffic

The equations above all work for single domains, but it is the sum of these domains, each with their own \(R_{domain}\), \(t_{{auth}_{domain}}\) and \(T_{domain}\) that make up the total experience.
It would be tempting to simply average these three, and to the math, but that would be highly misleading - domains with low \( R \) form a negligable impact on the total experience. Meanwhile, popular domains are more likely than unpopular domains to be hosted on nameservers nearby.
To get real numbers therefore, we need to average the actual numbers over all domains individually: $$ t_{resolve} = \frac{\displaystyle \sum_{domains} R_{domain} \cdot \frac{t_{{auth}_{domain}}}{R_{domain} \cdot T_{domain}}}{R} $$ This weighs each domain based on its individual query rate. And this is where something interesting happens. The above can be simplified to: $$ t_{resolve} = \frac{\displaystyle \sum_{domains} \frac{t_{{auth}_{domain}}}{ T_{domain}}}{R} $$ The per-domain query rate has fallen out of the equation! If this formula is true, only the bulk query rate matters, and concentration would have no effect.