100% nerd day today
Posted by bert hubert Wed, 05 Apr 2006 15:53:00 GMT
This is not what we call a day you can tell your friends about. Woke up too early given the time I went to sleep and started hacking away and am still at it.
Today I solved a number of very major PowerDNS problems. Makes you wonder how the program ever worked.
Ok, I did do some other things: I ordered a Wifi directional antenna for my father, who wants to be able to access the internet from his holiday location.
On to the PowerDNS bugs.
“Phantom inability to resolve . records”
This happened every once in a while, immediately on startup. And the problem disappeared if you turned on --trace output, so it was nearly impossible to debug. I was about to consider a career in carpentry when, just for fun, I decided to let things run through valgrind.
Lo and behold, it pointed straight to the problem. For generating proper errors, PowerDNS passes an ‘error’ variable around, by reference. This means that the compiler is not sure if you are accessing it uninitialized, and emits no warnings.
Valgrind however paints the entire memory of your application and detects reads from unpainted areas. More power to Valgrind! See commit 658.
“Phantom debugging output on querying ipwhois.rfc-ignorant.org”
This one really had me hunting for ghosts. Just after I committed a heap of cleanups which seemed obvious enough, I ran my benchmarks, with debugging output turned off.
But output appeared… And only for ipwhois.rfc-ignorant.org. Everything about the output was wrong. Earlier in the day, I had tweaked debugging output a bit because of the bug that disappeared while tracing, so I investigated if something was left, but it wasn’t.
I tried lots of other domains, but it only happened with ipwhois.rfc-ignorant.org.
One thing that was very off was the packet counters being displayed, which provided the hint.
Feast your eyes on this:
QUESTION SECTION:
ipwhois.rfc-ignorant.org. IN NS
ANSWER SECTION:
ipwhois.rfc-ignorant.org. 31533451 IN NS localhost.rfc-ignorant.org.
ADDITIONAL SECTION:
localhost.rfc-ignorant.org. 1051 IN A 127.0.0.1
And in the background.. I had another PowerDNS recursor running on port 53, which duly received questions from the version being benchmarked on port 5300. And that version did have debugging output turned on.
Over 8000 SOA records for .COM
Ok, this is a big one. Various people had been reporting that under prolonged and heavy load, PowerDNS would slow down. To a certain extent this is normal: the cache grows, memory gets fragmented, but what people told me was far bigger than this.
On investigating another problem this morning, I dumped the cache of a machine that had been testing for 12 hours.. and found 8000 SOA records for .COM, all slightly different.
Turns out that the SOA serial of .COM gets raised many times during the day, and that PowerDNS did not let these new records supplant the old SOA record but simply added them.
Why would this slow down the recursor? On emitting an NXDOMAIN, PowerDNS will look up the SOA record of a domain.. and find thousands of records, and try to add all of them to the answer. Luckily it stops after 512 bytes, but still.
I fixed this in commit 657.
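To make the difference concrete, here is a minimal sketch of the two cache behaviours in Python. It is not the PowerDNS code (which is C++) and not what commit 657 actually looks like: the function names, the dict-based cache and the made-up SOA content are all assumptions for illustration.

```python
def store_appending(cache, name, rtype, records):
    """Buggy behaviour: new record contents are added next to the old ones."""
    bucket = cache.setdefault((name, rtype), [])
    for rec in records:
        if rec not in bucket:
            bucket.append(rec)


def store_replacing(cache, name, rtype, records):
    """Fixed behaviour: a fresh answer supplants whatever was cached before."""
    cache[(name, rtype)] = list(records)


def simulate(store, serials):
    """Feed one SOA per serial into an empty cache, return how many pile up."""
    cache = {}
    for serial in serials:
        # Hypothetical SOA content; only the serial changes between updates.
        soa = f"ns.example. hostmaster.example. {serial} 1800 900 604800 86400"
        store(cache, "com.", "SOA", [soa])
    return len(cache[("com.", "SOA")])


if __name__ == "__main__":
    print("appending:", simulate(store_appending, range(2006040500, 2006040600)))
    print("replacing:", simulate(store_replacing, range(2006040500, 2006040600)))
```

With the appending store every serial bump leaves one more record behind, and each NXDOMAIN then has to wade through all of them. With the replacing store there is only ever one.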
Massive numbers of PTR records
Ok, try to resolve 10.64.158.85.in-addr.arpa. It has 877 PTR records, for a total TCP packet of 22 kilobytes. PowerDNS valiantly tried to compress these 877 labels, but failed because you can’t refer to labels at offsets of 16384 bytes or more: a compression pointer only has 14 bits for the offset. Except that the recursor did not care for this limitation and happily inserted larger offsets.
I also removed some ‘magic’ code that tried to be way too smart wrt label compression, but I wonder if the code had some magic uses I can’t figure out right now. Removing it didn’t seem to break anything though.
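For the offset limit, a rough Python sketch of a name encoder that only ever points at names it wrote below the 14-bit boundary. This is an illustration under my own assumptions, not the recursor's C++ code: encode_name, the seen dictionary and MAX_POINTER_OFFSET are names invented here. The pointer format itself (two bytes, top two bits set, 14 bits of offset) is from RFC 1035.

```python
import struct

MAX_POINTER_OFFSET = 0x3FFF  # 14 bits: the largest offset a pointer can hold


def encode_name(name, offset, seen):
    """Encode `name` as it would appear at byte `offset` of a DNS message.

    `seen` maps already written name suffixes (tuples of lowercased labels)
    to the message offset where they start. Suffixes written beyond
    MAX_POINTER_OFFSET are never recorded, so no pointer can overflow.
    """
    labels = [label for label in name.rstrip(".").split(".") if label]
    out = bytearray()
    for i in range(len(labels)):
        suffix = tuple(label.lower() for label in labels[i:])
        target = seen.get(suffix)
        if target is not None:
            # Compression pointer: 0b11 followed by the 14-bit offset.
            out += struct.pack("!H", 0xC000 | target)
            return bytes(out)
        position = offset + len(out)
        if position <= MAX_POINTER_OFFSET:
            seen[suffix] = position
        raw = labels[i].encode("ascii")
        if len(raw) > 63:
            raise ValueError("DNS labels are limited to 63 bytes")
        out.append(len(raw))
        out += raw
    out.append(0)  # root label terminates an uncompressed name
    return bytes(out)


if __name__ == "__main__":
    seen = {}
    print(encode_name("example.org.", 12, seen))      # written in full
    print(encode_name("www.example.org.", 40, seen))  # ends in a pointer
    print(encode_name("example.org.", 20000, {}))     # too far out to be a target
```

Names that land past the boundary are still written correctly, they just can't be pointed at later, so a huge answer compresses a bit worse instead of coming out corrupt.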
Tweaking of CLOCK record eviction
Turned out we weren’t scanning enough of the cache to evict enough records to keep a steady cache size. Still an area to watch though. It does appear that the max-cache-entries setting works very well, allowing operators to set a maximum size measured in cache entries.
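Why the amount of scanning matters can be shown with a toy CLOCK cache in Python. Again this is a sketch and not how the recursor does it: the ClockCache class and its scan_limit knob are hypothetical, and max_entries only borrows its name from the max-cache-entries setting. New entries start out unreferenced here, which is a simplification.

```python
class ClockCache:
    """Toy CLOCK cache: a ring of keys, one 'referenced' bit per key."""

    def __init__(self, max_entries, scan_limit=None):
        self.max_entries = max_entries
        self.scan_limit = scan_limit  # hypothetical: max slots visited per prune
        self.keys = []                # ring of keys, in slot order
        self.values = {}
        self.referenced = {}
        self.hand = 0                 # index of the slot the clock hand points at

    def __len__(self):
        return len(self.keys)

    def get(self, key):
        if key in self.values:
            self.referenced[key] = True  # a hit buys the entry a second chance
            return self.values[key]
        return None

    def put(self, key, value):
        if key not in self.values:
            self.keys.append(key)
            self.referenced[key] = False
        self.values[key] = value
        self.prune()

    def prune(self):
        scanned = 0
        while len(self.keys) > self.max_entries:
            if self.scan_limit is not None and scanned >= self.scan_limit:
                break  # scan budget used up, cache stays over its limit
            key = self.keys[self.hand]
            if self.referenced[key]:
                self.referenced[key] = False  # clear the bit and move on
                self.hand += 1
            else:
                self.keys.pop(self.hand)      # unreferenced: evict
                del self.values[key]
                del self.referenced[key]
            scanned += 1
            self.hand = self.hand % len(self.keys) if self.keys else 0


if __name__ == "__main__":
    for limit in (None, 1):
        cache = ClockCache(max_entries=3, scan_limit=limit)
        for key in "abc":
            cache.put(key, key.upper())
        for key in "abc":
            cache.get(key)
        cache.put("d", "D")
        cache.put("e", "E")
        print("scan_limit", limit, "size", len(cache))
```

When every slot the hand visits has its bit set, a small scan budget only clears bits and evicts nothing, so the cache keeps growing past its limit. Scanning until the size is back under the limit keeps it steady.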