The ultimate SO_LINGER page, or: why is my tcp not reliable

Posted by bert hubert Sun, 18 Jan 2009 13:03:00 GMT

This post is about an obscure corner of TCP network programming, a corner where almost everybody doesn’t quite get what is going on. I used to think I understood it, but found out last week that I didn’t.

So I decided to trawl the web and consult the experts, promising them to write up their wisdom once and for all, in hopes that this subject can be put to rest.

The experts (H. Willstrand, Evgeniy Polyakov, Bill Fink, Ilpo Jarvinen, and Herbert Xu) responded, and here is my write-up.

Even though I refer a lot to the Linux TCP implementation, the issue described is not Linux-specific, and can occur on any operating system.

What is the issue?

Sometimes, we have to send an unknown amount of data from one location to another. TCP, the reliable Transmission Control Protocol, sounds like it is exactly what we need. From the Linux tcp(7) manpage:

“TCP provides a reliable, stream-oriented, full-duplex connection between two sockets on top of ip(7), for both v4 and v6 versions. TCP guarantees that the data arrives in order and retransmits lost packets. It generates and checks a per-packet checksum to catch transmission errors.”

However, when we naively use TCP to just send the data we need to transmit, it often fails to do what we want - with the final kilobytes or sometimes megabytes of data transmitted never arriving.

Let’s say we run the following two programs on two POSIX compliant operating systems, with the intention of sending 1 million bytes from program A to program B (programs can be found here):


      sock = socket(AF_INET, SOCK_STREAM, 0);  
      connect(sock, &remote, sizeof(remote));
      write(sock, buffer, 1000000);             // returns 1000000


    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, &local, sizeof(local));
    listen(sock, 128);
    int client=accept(sock, &local, locallen);
    write(client, "220 Welcome\r\n", 13);

    int bytesRead=0, res;
    for(;;) {
        res = read(client, buffer, 4096);
        if(res < 0)  {
        bytesRead += res;
    printf("%d\n", bytesRead);

Quiz question - what will program B print on completion?

A) 1000000
B) something less than 1000000
C) it will exit reporting an error
D) could be any of the above

The right answer, sadly, is ‘D’. But how could this happen? Program A reported that all data had been sent correctly!

What is going on

Sending data over a TCP socket really does not offer the same ‘it hit the disk’ semantics as writing to a normal file does (if you remember to call fsync()).

In fact, all a successful write() in the TCP world means is that the kernel has accepted your data, and will now try to transmit it in its own sweet time. Even when the kernel feels that the packets carrying your data have been sent, in reality, they’ve only been handed off to the network adapter, which might actually even send the packets when it feels like it.

From that point on, the data will traverse many such adapters and queues over the network, until it arrives at the remote host. The kernel there will acknowledge the data on receipt, and if the process that owns the socket is actually paying attention and trying to read from it, the data will finally have arrived at the application, and in filesystem speak, ‘hit the disk’.

Note that the acknowledgment sent out only means the kernel saw the data - it does not mean the application did!

OK, I get all that, but why didn’t all data arrive in the example above?

When we issue a close() on a TCP/IP socket, depending on the circumstances, the kernel may do exactly that: close down the socket, and with it the TCP/IP connection that goes with it.

And this does in fact happen - even though some of your data was still waiting to be sent, or had been sent but not acknowledged: the kernel can close the whole connection.

This issue has led to a large number of postings on mailing lists, Usenet and fora, and these all quickly zero in on the SO_LINGER socket option, which appears to have been written with just this issue in mind:

“When enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.”

So, we set this option, rerun our program. And it still does not work, not all our million bytes arrive.

How come?

It turns out that in this case, section of RFC 1122 tells us that a close() with any pending readable data could lead to an immediate reset being sent.

“A host MAY implement a ‘half-duplex’ TCP close sequence, so that an application that has called CLOSE cannot continue to read data from the connection. If such a host issues a CLOSE call while received data is still pending in TCP, or if new data is received after CLOSE is called, its TCP SHOULD send a RST to show that data was lost.”

And in our case, we have such data pending: the “220 Welcome\r\n” we transmitted in program B, but never read in program A!

If that line has not been sent by program B, it is most likely that all our data would have arrived correctly.

So, if we read that data first, and LINGER, are we good to go?

Not really. The close() call really does not convey what we are trying to tell the kernel: please close the connection after sending all the data I submitted through write().

Luckily, the system call shutdown() is available, which tells the kernel exactly this. However, it alone is not enough. When shutdown() returns, we still have no indication that everything was received by program B.

What we can do however is issue a shutdown(), which will lead to a FIN packet being sent to program B. Program B in turn will close down its socket, and we can detect this from program A: a subsequent read() will return 0.

Program A now becomes:

    sock = socket(AF_INET, SOCK_STREAM, 0);  
    connect(sock, &remote, sizeof(remote));
    write(sock, buffer, 1000000);             // returns 1000000
    shutdown(sock, SHUT_WR);
    for(;;) {
        res=read(sock, buffer, 4000);
        if(res < 0) {

So is this perfection?

Well.. If we look at the HTTP protocol, there data is usually sent with length information included, either at the beginning of an HTTP response, or in the course of transmitting information (so called ‘chunked’ mode).

And they do this for a reason. Only in this way can the receiving end be sure it received all information that it was sent.

Using the shutdown() technique above really only tells us that the remote closed the connection. It does not actually guarantee that all data was received correctly by program B.

The best advice is to send length information, and to have the remote program actively acknowledge that all data was received.

This only works if you have the ability to choose your own protocol, of course.

What else can be done?

If you need to deliver streaming data to a ‘stupid TCP/IP hole in the wall’, as I’ve had to do a number of times, it may be impossible to follow the sage advice above about sending length information, and getting acknowledgments.

In such cases, it may not be good enough to accept the closing of the receiving side of the socket as an indication that everything arrived.

Luckily, it turns out that Linux keeps track of the amount of unacknowledged data, which can be queried using the SIOCOUTQ ioctl(). Once we see this number hit 0, we can be reasonably sure our data reached at least the remote operating system.

Unlike the shutdown() technique described above, SIOCOUTQ appears to be Linux-specific. Updates for other operating systems are welcome.

The sample code contains an example of how to use SIOCOUTQ.

But how come it ‘just worked’ lots of times!

As long as you have no unread pending data, the star and moon are aligned correctly, your operating system is of a certain version, you may remain blissfully unimpacted by the story above, and things will quite often ‘just work’. But don’t count on it.

Some notes on non-blocking sockets

Volumes of communications have been devoted the the intricacies of SO_LINGER versus non-blocking (O_NONBLOCK) sockets. From what I can tell, the final word is: don’t do it. Rely on the shutdown()-followed-by-read()-eof technique instead. Using the appropriate calls to poll/epoll/select(), of course.

A few words on the Linux sendfile() and splice() system calls

It should also be noted that the Linux system calls sendfile() and splice() hit a spot in between - these usually manage to deliver the contents of the file to be sent, even if you immediately call close() after they return.

This has to do with the fact that splice() (on which sendfile() is based) can only safely return after all packets have hit the TCP stack since it is zero copy, and can’t very well change its behaviour if you modify a file after the call returns!

Please note that the functions do not wait until all the data has been acknowledged, it only waits until it has been sent.

Posted in ,  | 3 comments

The fourteen stages of any real software project

Posted by bert hubert Sun, 16 Nov 2008 21:21:00 GMT

  1. Idea - estimates for time to completion range from 3 days to 3 weeks
  2. Pretty convincing first stab ‘look how cool this would be’
  3. The Hard Slog to get something that actually works. Estimates now range from 3 months to 3 years.
  4. First real users pop up, discovery is made that all assumptions were off
  5. Starts to look good to the first real user
  6. Elation!
  7. Someone actually uses the code it for real, the bugs come out in droves
  8. A zillion bugs get addressed, harsh words are spoken
  9. Elation!
  10. The guy you had previously told that 100 million users would not ‘in principle’ be a problem actually took your word for it, and deployed it on said user base. Harsh words are spoken.
  11. Fundamentals are reviewed, large fractions of the code base reworked
  12. Product now actually does what everybody hoped it would do.
  13. Even very unlikely bugs have cropped up by now, and have been addressed. Even rare use cases are now taken into account.
  14. If a user complains of a crash at this stage, you can voice doubts about the quality of his hardware or operating system.

PowerDNS went through all these stages, and took around 5 years to do so. Not all parts are at ‘stage 14’ yet, but for the Recursor, I seriously ask people to run ‘memtest’ if they report a crash.

The above 14 points are never traversed without users that care. For PowerDNS, step ‘4’ was performed by Amaze Internet and step ‘7’ by ISP Services. 1&1 (called Schlund back then) was instrumental in step ‘10’ when they started using it on millions of domains.

For the PowerDNS Recursor, steps ‘4’ and ‘7’ not only happened over at XS4ALL, but they also paid for it all!

Step ‘10’ occurred over at AOL and Neuf Cegetel, who together connected the Recursor to 35 million subscribers or so.

Finally, the parts of PowerDNS that have reached the end of the list above have done so because of literally hundreds if not thousands of operators that have made the effort to report their issues, or voice their wishes.

Many thanks to everybody!

Hmm, the above does not sound very professional..

I’ve heard the theory that some people think they can plan software development more professionally. I used to believe them too. But any real project I’ve heard of went through the stages listed above. No schedule, no Microsoft Project sheet, no Gantt Chart I know about ever even came close to reality.

But I’d love to be wrong, because I agree fully that it would be great if software development was more predictable.

This is especially true since the aforementioned “process” necessarily involves several very committed users, who have to voice the harsh words, but do have to stick with the project.

So please comment away if your real life experiences are different - I’d love to hear!

Posted in ,  | no comments

Some debugging techniques, and "C++ introspection"

Posted by bert hubert Thu, 18 Sep 2008 19:44:00 GMT

After too much posting on IETF mailing lists, and not achieving anything, I’ve gone back to coding a bit more.

There are two things I want to share - the first because I had a devil of a time figuring out how to do something, and I hope that posting here will help fellow-sufferers find the solution via Google. The second thing I want to talk about because, and this is getting to be rare, I programmed something cool, and I just need to tell someone about it.

I pondered explaining it to my lovely son Maurits (4 months old today), but I don’t want to ruin his brain.

Debugging iterators

In most programming languages there are a lot of things that compile just fine, or generate no parsing errors at runtime, but are still accidents waiting to happen.

Tools abound to expose such silent errors, usually at a horrendous performance cost. But this is fine, as errors can be found by the developer, and fixed before release.

As part of our arsenal, we have the veritable Valgrind that detects things such as reading from memory that had not previously been written to. In addition, other tricks are available, such as changing functions that ‘mostly do X, and rarely Y’ so that they always to Y. This quickly finds programs that skipped dealing with Y (which might be a rare error condition, or realloc(2) returning a new memory address for your data).

Finally, many programming environments by default perform very little checking (in the interest of speed) - for example, they will gladly allow you to compare something that points to data in collection A to something that points to collection B - a comparison that never makes sense, classical apples and oranges.

My favorite C++ compiler, G++, comes with so called ‘debugging iterators’ that are very strict - anything that is not absolutely correct becomes an error, sometimes at compile time, sometimes at runtime.

Together with Valgrind, this is one of the techniques I like to whip out when the going gets tough.

Sadly, Debugging iterators (which are turned on by adding -DGLIBCXXDEBUG conflict with one of my favorite C++ libraries, Boost.

To make a long story short, to compile a version of Boost with debugging iterators, issue:

$ bjam define=_GLIBCXX_DEBUG

This single line of text may not look all that important, but it took me half a day of debugging to figure this out. So if you get this error:

(.rodata._ZTVN5boost15program_options11typed_valueISscEE[vtable for 
boost::program_options::typed_value, std::allocator >, char>]+0x18): 
undefined reference to 
std::allocator >, std::allocator, std::allocator > > > const&, bool) const'

Then compile your own version of boost as outlined above.

C++ Introspection & Statistics

C++ is an old-school language, perhaps the most modern language of the old school. This means that it sacrifices a lot of things to allow programs to run at stunning ‘near bare metal’ speeds. One of the things that C++ does not offer therefore is ‘introspection’

What this means is that if you have a class called “ImportantClass”, that class does not know its own name at runtime. When a program is running, it is not possible to ask by name for an “ImportantClass” to be instantiated.

If you need this ability, you need to register your ImportantClass manually by its name “ImportantClass”, and store a pointer to a function that creates an ImportantClass for you when you need it.

Doing so manually is usually not a problem, except of course when it is. In PowerDNS, I allocate a heap (or a stack even) of runtime statistics. Each of those statistics is a variable (or a function) with a certain name.

In more modern languages, it would probably be easy to group all these variables together (with names like numQueries, numAnswers, nomUDPQueries etc), and allow these statistics to be queried using their names. So, an external program might call ‘get stat numQueries’, and PowerDNS would look up the numQueries name, and return its value.

No such luck in C or C++!

So - can we figure out something smart, say, with a macro? Yes and no. The problem is that when we declare a variable in C which we want to be accessible from elsewhere in the program, it needs to happen either inside a struct or class, or at global scope. This in turn means that we can’t execute code there. So, what we’d like to do, but can’t is:

struct Statistics {
         uint64_t numPackets;
         registerName(&numPackets, "numPackets");
         uint64_t numAnswers;
         registerName(&numAnswers, "numAnswers");
} stats;

stats.numPackets is indeed available, but the line after its definition will generate an error. This is sad, since the above could easily be generated from a macro, so we could do:

DEFINESTAT(numPackets, “Number of packets received”);

Which would simultaneously define numPackets, as well as make it available as “numPackets”, and store a nice description of it somewhere.

But alas, this is all not possible because of the reasons outlined above.

So - how do we separate the data structure from the ‘registerName()’ calls, while retaining the cool ‘DEFINESTAT’ feature where everything is in one place?

In C++, files can be included using the #include statement. Most of the time, this is used to include so called ‘header’ files - but nothing is stopping us from using this feature for our own purposes.

The trick is to put all the ‘DEFINESTAT’ statements in an include file, and include it not once, but twice!

First we do:

#define DEFINESTAT(name, desc) uint64_t name;
struct Statistics {
#include "statistics.h"
} stats;

This defines the Statistics struct, containing all the variables we want to make available. These are nicely expanded using our DEFINESTAT definition.

Secondly, we #undefine DEFINESTAT again, and redefine it as:

#define DEFINESTAT(name, desc) registerName(&, #name, desc);

Then we insert this in a function:

void registerAllNames()
#include "statistics.h"

This will cause the same statistics.h file to be loaded again, with the same DEFINESTAT lines in there, but this time DEFINESTAT expands to a call that registers each variable, its name (#name expands to “name”), and its description.

The rest of our source can now call ‘stats.numPackets++’, and if someone wants to query the “numPackets” variable, it is available easily enough through its name since it has been registered using registerName.

The upshot of this all is that we have gained the ability to ‘introspect’ our Statistics structure, without any runtime overhead nor any further language support.

As stated above, more modern languages make this process easier.. but not as fast!

I hope you enjoyed this arcane coolness as much as I did. But I doubt it :-)

Posted in , ,  | no comments

Some thoughts on the recent DNS vulnerability

Posted by bert hubert Wed, 09 Jul 2008 19:31:00 GMT

Yesterday it was announced that there is an unspecified but major DNS vulnerability, and that Microsoft, Nominum and ISC had fixes available.

It is amusing to note that this has been hailed as a major feat of cooperation, with the vulnerable parties spinned as being part of secret industry cabal that has just saved the world from very bad things.

To say the least, I find this a funny way of presenting things! The vulnerability is still not public, but the secret cabal shared it with me. Perhaps it is fair to say I am part of the cabal - I nearly traveled to the secret meeting at the Microsoft campus, but the imminent birth of my son made me decide not to go.

The DNS vulnerability that has been presented yesterday is indeed a very serious problem, and I am glad steps are now taken to fix the broken software that was vulnerable. Dan Kaminksy is to be praised for discovering the issue and coordinating the release.

However - the parties involved aren’t to be lauded for their current fix. Far from it. It has been known since 1999 that all nameserver implementations were vulnerable for issues like the one we are facing now. In 1999, Dan J. Bernstein released his nameserver (djbdns), which already contained the countermeasures being rushed into service now. Let me repeat this. Wise people already saw this one coming 9 years ago, and had a fix in place.

In 2006 when my own resolving nameserver entered the scene, I decided to use the same security strategy as implemented in djbdns (it is always better to steal a great idea than to think up a mediocre one!). Some time after that, I realised none of the other nameservers had chosen to do so, and I embarked on an effort to move the IETF DNS-EXT working group to standardise and thus mandate this high security behaviour.

This didn’t really go anywhere, but some months ago I noticed particularly strenuous resistance in the standardisation of the so called ‘forgery resilience draft’, and after some prodding it became clear it was felt my draft was in danger of drawing attention to the then unannounced DNS vulnerability, and that it were best if we’d all shut up about it for a few months, perhaps until July 2008 until all the vendors would have had time to get their act together.

And now we’ve seen the release, and it is being hailed as great news. But it isn’t. Dan Bernstein has been ignored since 1999 when he said something should be done. I’ve been ignored since 2006. The IETF standardisation languished for two years.

This is not a success story. It has in fact been a remarkable failure.

To end on a positive note - I am very glad Dan Kaminsky’s work caused some collective eye opening, and I hope good things come from this. DNS has long lacked critical attention, and in the end this might bring about sorely needed improvements.

DNS very recently celebrated its 25th birthday - I look forward to seeing the venerable Domain Name System succeed in the coming 25 years!

Posted in ,  | 8 comments

DNS & Crypto Power Lunch

Posted by bert hubert Wed, 21 Feb 2007 22:13:00 GMT

Enjoyed a fun and stimulating “DNS & Crypto Power Lunch” with Dan Bernstein (left) and Tanja Lange (not in picture). As was to be expected, the intersection of cryptography and (secure) DNS was discussed, and some evil plans might ensue! If implemented in djbdns and PowerDNS, we might actually achieve something..

Posted in , ,  | no comments

(a)synchronous programming

Posted by bert hubert Sun, 04 Feb 2007 12:14:00 GMT

Ok, I’m going to lecture a bit, a bad habit of mine. The summary is that an important enhancement of the Linux kernel has been proposed, but in order to understand the significance of this enhancement, you need a lot of theory, which follows below.

I use the word “computer” sometimes when I properly mean “the operating system”. This exposes a problem of this post, I’m trying to explain something deeply theoretical to a general audience. Perhaps it didn’t work. See for yourself.

Doing many things at once

People generally tend not to be very good at doing many thing at once, and surprisingly, computers are not much different in this respect.

First about human beings. We can do one thing at a time, reasonably well. There are people that claim they can multi-task, but if you look into it, that generally means doing one thing that is really simple, while simultaneously talking on the phone.

This is exemplified by how we answer a second phone call, ie, by saying “The other line is ringing, I’ll call you back”, or conversely, telling the other line they’ll have to wait.

We emphatically don’t try to have two conversations at once, and even if we had two mouths, we still wouldn’t attempt it.

Let’s take a look at a web server, the program that makes web pages available to internet browsers. The basic steps are:

  1. Wait for new connections from the internet
  2. Once a new connection is in, read from it which page it wants to see (for example, ‘GET HTTP/1.1’).
  3. Find that page in the computer
  4. Send it to the web browser that connected to us
  5. Go to 1.

Compare this to answering a phone call, step 1 is the part where you wait for the phone to ring, and answering it when it does. Step 2 is hearing what the caller wants, step 3 is figuring out the answer to the query, 4 is sharing that answer.

This all seems natural to us, as it is the way we think. And programmers, contrary to what people think, are human beings, too.

Where this simple process breaks down is that, much like a regular phone call, we can only serve a new web page once the old one is done sending.

And here is where things get interesting - although we people have a hard time doing multiple things at once, we can give the problem to the computer.

What is the easiest way of doing so? Well, if we want to increase the capacity of a telephone service we do so.. by adding people. So on the programming side of things, we do the same thing, only virtually: we order the computer (or more exactly, the operating system) to split itself in two!

The new list of steps now becomes:

  1. Wait for new connections from the internet
  2. Once a new connection is in, split the computer in two.
  3. One half of the computer goes back to step 1, the other half continues this list
  4. (2) Read from it which page it wants to see (for example, ‘GET HTTP/1.1’).
  5. (2) Find that page
  6. (2) Send it to the web browser
  7. (2) Done - remove this “half” of the computer

I’ve prefixed the things the second computer does with ”(2)” . This looks like the best of both worlds. We can “serve” many web pages at the same time, and we didn’t need to do complicated things. In other words, we could continue thinking like human beings, and use our intuition, by thinking of the analogies with answering phone calls.

So, are we done now? Sadly no. What basically has happened is that we have invoked a piece of magic: let’s split the computer in two. That is all fine, but somebody has to do the splitting. This job is farmed out to the CPU (the processor) and the operating system (Windows, Linux etc), and they have to deal with making sure it appears the computer can do two things at the same time.

Because the truth is.. people can’t do it, and neither can computers. They fake it.

This faking comes at a cost, incurred both while splitting the computer (“forking”), and by making the computer juggle all its separate parts. Finally, it turns out that practically speaking, you can divide a computer up into only a limited number of parts before the charade falls down.

Busy websites have tens of millions of visitors, we’d need to be able to split the computer into at least that many parts, while in practice the limit lies at perhaps 100,000 slices, if not less.

Now what

Several solutions to this problem have been invented. Some involve not quite splitting up the entire computer and making split parts share more of the resources (like for example, memory). This is called ‘threading’. Perhaps this could be compared with not hiring more people to answer the telephone, but instead giving the people you have more heads, so as to save money.

In the end, all these solutions run into a brick wall: it is hard to maintain the illusion that the computer can do multiple things at the same time, AND have it actually do a million things at the same time.

So in the end, we have to bite the bullet, and just make sure the program itself can handle many many things at once, without needing the magic of pretending the computer can do it for us.

“Asynchronous programming”

This is where things get hard, and this is to be expected, as it was our basic premise that people can’t do multiple things at the same time, and what’s worse, they have a hard time even thinking about what it would be like.

The new algorithm looks like this:

  1. Instruct the computer to tell us when “something has happened”
  2. Figure out what happened:
    • If there is a new connection, instruct the computer that from now on, it should tell us if new data arrived on that connection
    • If something has happened to one of those connections we’ve told the computer about, read the data sent to us on that connection. Then find the information requested on that connection, and instruct the computer to tell us when there is “room” to send that data
    • If the computer told us there was “room”, send the data that was previously requested on that connection. If we are done sending all the data, tell the computer to disconnect, and no longer inform us of the state of the connection.
  3. Go back to 1.

If this feels complicated, you’d be right. However, this is how all very high performance computer applications work, because the “faking” described above doesn’t really “scale” to tens of thousands of connections.

How does this translate to the telephone situation? It would be like we have lots of small answering machines, that lots of callers can talk to at the same time. Whenever someone has finished a question, the operator would listen to that answering machine, and leave the answer on the machine, and go on to the next machine that has a finished message.

From this description, it is clear it would not work faster that way if you’d try it for real. However, in many countries, if you call a directory service to find a telephone number, you’ll get half of this. Your call is answered by a real human being, who asks you questions to figure out which phone number you are looking for. But once it has been found, the operator presses a button, and the result of your query is sent to a computer, which then reads it to you, allowing the operator to already start answering a new call. Rather smart.

Something in between

If the previous bit was hard to understand, I make no apologies, this is just how complicated things are in the world of computing. However, we programmers also hate to deal with complicated things, so we try to avoid stuff like this.

People have invented many ways of allowing programmers to think ‘linearly’, as if only a single thing is happening at the same time, without having to split the entire computer.

One way of doing this is having a facade that makes things go linearly, until the program has to wait for something (a new connection, “room” to send data etc), and then switch over to processing another connection. Once that connection has to wait for something, chances are that what our earlier ‘wait’ was waiting for has happened, and that program can continue.

This truly offers us the best of both worlds: we can program as if only a single thing is happening at the same time, something we are used to, but the moment the computer has to wait for something, we are switched automatically to another part of the program, that is also written as if it is the only thing happening at the same time.

Actually making this happen is pretty hard however, because traditional computer programming environments don’t clearly separate actions that could lead to “waiting” from actions that should happen instantly.

A prime example of the first kind of action is “waiting for a new connection” - this might in theory take forever, especially if your website is really unpopular.

Things that should happen instantly include for example asking the computer what time it thinks it is.

Traditional operating systems can be instructed to be mindful of new incoming connections, and not keep the program waiting for them. This is what we described in the complicated “if X happened, if Y happened” scenario above.

They can also do the same for reading from the network and writing to the network, both things that might take time. This means you can ask the operating system ‘let me know when I can read so I don’t have to wait for it, and I can process other connections in the meantime’.

Furthermore, there are some limited tricks to do the same for reading a file. The problem is that back in the 1970s when most operating system theory was being invented, disks were considered so fast, nobody thought it possible you’d ever need to meaningfully wait for one. Of course disks weren’t faster back then, but computers were slower, and massively so. So by comparison, disks were really fast.

The upshot is that in most operating systems, disk reads are grouped with “stuff that should happen instantly”, whereas every computer user by now has experienced this is emphatically not the case.

Modern operating systems offer only a limited solution to this problem, called ‘asynchronous input/output’, which allows one to more or less tell the computer to notify us when it has read a certain piece of data from disk.

However, it doesn’t offer the same facility for doing a lot of other things that might take time, like finding the file in the first place, or opening it. Things that in the real world take a lot of time.

So, we can’t truly enjoy the best of both worlds as sketched above, which would mean the programmer could write simple programs, which would be switched every time his program has to wait for something.

Enter ‘Generic AIO’

Zach Brown, who is employed by Oracle to work on Linux, has now dreamed up something that appears to never have been done before: everything can now be considered something that “might take time”.

This means that you can ask Linux to find a certain file for you, and immediately allows you to process other connections that need attention. Once the operating system has found the file for you, it is available for you without waiting.

Although almost every advance in operating system design has at one point been researched already, this approach appears to be rather revolutionary.

It has ignited vigorous discussion within the Linux community about the feasibility of this approach, and if it truly is the dreamt of “best of both worlds”, but to this author, it surely looks like a breakthrough.

Especially since it unites the worlds of “waiting on a read/write from the network” with “waiting for a file to be read from disk”.

Time will tell if “Generic AIO” will become part of Linux. In the meantime, you can read more about it on LWN.

Posted in , ,  | 2 comments

This draft is a work item of the DNS Extensions Working Group of the IETF!

Posted by bert hubert Fri, 12 Jan 2007 21:16:00 GMT

The workings of the Internet are described, or even proscribed, by the so called ‘Requests For Comments’, or RFCs. These are the laws of the internet.

Today the IETF DNS Extensions working group accepted an “Internet-Draft” Remco van Mook and I have been working on. And the cool bit is that over time, many such accepted “Internet-Drafts” turn into RFCs!

Read about it what our draft does here and here.

The actual Internet-Draft can be found over at the IETF, or over here as pretty HTML.

In short, this RFC documents and standardises some of the stuff DJBDNS and PowerDNS have been doing to make the DNS a safer place.

Besides the fact that it is important to update the DNS standards to reflect this practice, it is also rather a cool thought to actually be writing an RFC, especially one that has the magic stanzas “Standards Track” and “Updates 1035” in it.

So we are well pleased! Over the coming months we’ll have to tune the draft so it confirms with the consensus of the DNSEXT working group, and hopefull somewhere around March, it will head towards the IESG, after which an actual RFC should be issued.


Posted in , , ,  | 10 comments

Wishing you a good 2007!

Posted by bert hubert Mon, 01 Jan 2007 15:58:00 GMT

I wish everybody a very good 2007! For PowerDNS, it certainly has been a very good year.

In some (large) places, the Recursor now commands a 40% market share, while the authoritative server is also expanding its user base around the world, with multi-million domain deployments now no longer as newsworthy as they once were.

The Chaos Computer Club held its annual congress last week, and they chose the PowerDNS Recursor to provide the DNS service to go with their 10 gigabit connection. I’m pleased to report that the PowerDNS process was fired up only once, and that it held steady for the entire congress, with no complaints. This would usually not be that strange, but the CCC clientèle are among the most critical internet users to be found on the planet.

Many thanks to Stefan Schmidt and other CCC admins for their vote of confidence!


I’m working on understanding ‘Ruby on Rails’, which will probably end up as a HOWTO aimed at seasoned programmers. The internet abounds with “you won’t believe how easy Ruby on Rails is” demonstrations, but the hard truth is that below the surface, a lot of magic is happening. The kind of magic the discerning programmer wants to grasp so as to make the most of it.

A very small start to this HOWTO can be found here.

It may also allow experience programmers to teach themselves Ruby in less time than it would take them to read a 750 page book.

Posted in , , ,  | 7 comments

PowerDNS speedups

Posted by bert hubert Thu, 14 Dec 2006 21:21:00 GMT

After PowerDNS 3.1.4 turned out to be boringly stable, fixing all reported crashes, I decided it was time to do the massive speedup I’d been promising people for some time.

With some help from my friends over at #offtopic2, I was able to use the TSC register of my CPU to measure down to the nanosecond how much time things were taking within PowerDNS. Previously I’d concentrated on profiling macro performance, but nanosecond resolution allows one to study fully how much time is spent within each function.

Using this technique, it became apparent we take a whopping 60 microseconds to answer even the most basic of questions. We make up for this by being pretty fast at complicated questions. But 60 microseconds mean we are limited to about 15000 questions/second, max.

First I started shaving microseconds. It turns out snprintf is truly slow, taking up to 5 microseconds for some strings. Additionally, we wasted a lot of time on needlessly copying std::strings.

The unsurpassed Boost::Multi_Index container has a spectacular feature, called ‘compatible keys’, which means we can lookup answers using a question key that is a bare piece of memory instead of a proper std::string. This again saved a few microseconds.

Put together, this brought down the 60 usec to perhaps 40, which is nice, but not stunning.

But the big savings only came when I did the only thing that actually makes code fast: do less.

So - when encoding the answer to a question, we no longer do the whole “DNS label compression”-routine, as we know the “label” of the answer to a question can always be encoded as the fixed bytes 0xc00c - we don’t need to calculate it.

Going beyond that, when generating a simple answer, don’t generate an answer packet, but simply tack on the answer to the original question, and update the ‘answer count’.

Also, if we see we have an ‘instant answer’ available for a question, don’t bother to launch a whole ‘MThread’ to generate it, but return synchronously.

The upshot of all this is that we can now answer most questions in… 4 microseconds, down from 60. 15-fold speedups are rather rare usually.

We didn’t speedup everything that much though, only the majority of queries. However, even the uncached queries will benefit from the microsecond shaving performed earlier, and run around twice as fast.

I can’t wait to do a live benchmark on all this. I’m estimating we should now be able to do over 50000 “real” queries/second on a 3GHz P4, which would put us an order of magnitude above the open source competition, and even beat, by a large factor, the numbers I hear quoted for commercial alternatives. These are hard to compare as their numbers are under NDA.

It might not even be easy to generate that much testing data..

Will keep you posted!

Posted in , ,  | 1 comment

Visited ASML yesterday, wow

Posted by bert hubert Fri, 24 Nov 2006 21:55:00 GMT

Yesterday I visited a “software development seminar” of ASML, a rather well disguised recruiting event of this Dutch manufacturer of the world’s most advanced lithography machines.

When I studied physics, I organized the Delftse Bedrijvendagen, the then largest carreer fair for university students of The Netherlands. As part of that, I was exposed to almost all recruiters of large Dutch companies, including ASML. And the ASML people never failed to leave me light headed.

In brief, lithography is a major piece of the process of actually making chips. It is the part where you actually put the chip on the substrate, using high energy photons. Current 65nm chips consist of many layers, each of these layers needs to be overlaid with the previous one to a precision of a few nanometres.

To achieve this precision, the individual positioning tolerances of the wafer need to be exact within a nanometre. This is a stunning achievement in itself. For those of you in the non-metric world, there are around 25 million nanometres to an inch. So you should be impressed.

However, this is nothing yet. The lithography machines (‘wafer steppers’) are very expensive, as is the facility that hosts them. And, as there are many layers in a chip, the actual speed of the wafer stepper is of utmost importance.

The machines ASML builds actually illuminate the ‘reticle’ at speeds exceeding 5 metres a second. This is 11 miles/hour. At nanometre precision.

You should have progressed beyond “impressed” to “stunned” by now.

But this is nothing yet. As in microscopy, where water is used to improve resolution, it makes sense to immerse your chip in water while it is being exposed. So the ASML people do that. At nanometre precisions, at those stunning speeds.

To put things in perspective, the wafer is NOT flat to within a nanometre, it bends a bit. So to achieve the precision desired, the wafer is first scanned, so all its imprecisions can be compensated for.

Extreme stuff. I’m sure they don’t have this in “Star Trek”.

I left the event deeply confused - I’m already completely busy with everything I do, and PowerDNS is getting to be quite the empire. The rest of my business is doing great as well.

But my physics background makes me appreciate the incredible things happening over at ASML. Oh well. Like any job, I’m sure it would have downsides. Also, I’m not the kind of person to hold a regular job. But if you want to do stuff on the leading edge of technology, you should at least consider working there. I hear they have 300 vacancies planned for software engineers. They also have some blogs, by the way.

Their current challenge is to move their 15 million lines of C to a new platform that will control their next generation of devices, some of which need to move terabyte amounts of data in under a second.

Anyhow, the seminar was interesting. Tom Gilb presented his “Evolutionary Project Management” concepts, which match rather well with how I tend to manage my projects. One of his main points is that when people start to apply “waterfall” diagrams to software projects, you are lost anyhow. I thought so all along, but it is nice to hear a “guru” confirm it.

Inspired by the breakthrough technologies over at ASML, I’ve picked up my own speech recognition research again, after an 18 month hiatus. The initial results bode well. I get very good frequency and time definition on real speech, with code totalling 750 lines. I hope to get some actual recognition going in the coming week.

Posted in , ,  | 9 comments

Older posts: 1 2 3 4