Posted by bert hubert Thu, 14 Dec 2006 21:21:00 GMT
After PowerDNS 3.1.4 turned out to be boringly stable, fixing all reported crashes, I decided it was time to do the massive speedup I’d been promising people for some time.
With some help from my friends over at #offtopic2, I was able to use the TSC register of my CPU to measure down to the nanosecond how much time things were taking within PowerDNS. Previously I’d concentrated on profiling macro performance, but nanosecond resolution allows one to study fully how much time is spent within each function.
Using this technique, it became apparent we take a whopping 60 microseconds to answer even the most basic of questions. We make up for this by being pretty fast at complicated questions. But 60 microseconds mean we are limited to about 15000 questions/second, max.
First I started shaving microseconds. It turns out
snprintf is truly slow, taking up to 5 microseconds for some strings. Additionally, we wasted a lot of time on needlessly copying
The unsurpassed Boost::Multi_Index container has a spectacular feature, called ‘compatible keys’, which means we can lookup answers using a question key that is a bare piece of memory instead of a proper
std::string. This again saved a few microseconds.
Put together, this brought down the 60 usec to perhaps 40, which is nice, but not stunning.
But the big savings only came when I did the only thing that actually makes code fast: do less.
So - when encoding the answer to a question, we no longer do the whole “DNS label compression”-routine, as we know the “label” of the answer to a question can always be encoded as the fixed bytes 0xc00c - we don’t need to calculate it.
Going beyond that, when generating a simple answer, don’t generate an answer packet, but simply tack on the answer to the original question, and update the ‘answer count’.
Also, if we see we have an ‘instant answer’ available for a question, don’t bother to launch a whole ‘MThread’ to generate it, but return synchronously.
The upshot of all this is that we can now answer most questions in… 4 microseconds, down from 60. 15-fold speedups are rather rare usually.
We didn’t speedup everything that much though, only the majority of queries. However, even the uncached queries will benefit from the microsecond shaving performed earlier, and run around twice as fast.
I can’t wait to do a live benchmark on all this. I’m estimating we should now be able to do over 50000 “real” queries/second on a 3GHz P4, which would put us an order of magnitude above the open source competition, and even beat, by a large factor, the numbers I hear quoted for commercial alternatives. These are hard to compare as their numbers are under NDA.
It might not even be easy to generate that much testing data..
Will keep you posted!