Reusing UNIX semantics for fun and profit

Posted by bert hubert Thu, 20 Sep 2007 19:57:00 GMT

I’ve long been a fan of some of the techniques Dan Bernstein uses to leverage the power of UNIX to achieve complicated goals with little effort. For example, he uses a technique called Chain Loading to clearly separate and insulate several programs from each other by loading a new program *in place* of the current one, once a critical task has been performed, like checking a user’s credentials.

This guarantees that the outer program, which might actually be exposed to the internet, can restrict itself to very basic functionality, and only launch an inner, more useful program once authentication has completed.

Another trick is to leverage UNIX user names to insulate various programs from each other, leaving the task of getting the access control details right to the very well tested operating system (which we need to rely on anyhow).

While sometimes unconventional, techniques such as those described above can simultaneously reduce code complexity AND increase security, by more or less hitching a ride on top of existing functionality.

Some time ago, I was involved in the development of a computer program with a classic ‘producer/consumer’ problem. We were inserting events in the database, and wanted to scale by getting a dedicated and very fast database server. To our surprise, getting an additional, far more powerful system did not improve our performance, and in fact made things far worse.

What happened? It turns out we were doing a lot of small inserts into the database, and even though we were using a transaction, each of these inserts incurred a slight latency penalty, caused by the query & answer packets having to travel over the network. And when doing hundreds of thousands of queries, even half a millisecond is a lot of time. Add in operating system and TCP overhead, and the end-to-end latency is probably even higher. The obvious solution is to no longer actually wait for the inserts to complete, but to transmit them to the database asynchronously, and continue to do useful work while the packets are in flight and being processed. This way, no time is wasted waiting.

Since most database APIs are synchronous, a separate helper thread of execution needs to be spawned to create the fiction of asynchrony, and this is where things get interesting.

In the PowerDNS nameserver, a complicated ‘Distributor’ abstraction is used to send queries to database threads, and this Distributor contains locks, semaphores and a zoo of other concurrent programming techniques to make things work well. For example, we need to perform checks to see if we aren’t building up an unacceptable backlog of queries, and block if we find we are. This comes with additional choices as to when to unblock etc. I was not looking forward to reimplementing such a thing.

Additionally, our database interface needed to offer an extra feature: every once in a while a query comes along that we DO need to wait for, and because of coherency issues, such a query can only be executed once all queries ‘in flight’ have finished.

So we spent some time pondering this, and suddenly it dawned on me that many of the features we needed exactly match the semantics of the venerable UNIX ‘pipe’.

A pipe is normally used to communicate between two processes, as exemplified by this sample shell script command, which shows us the largest directories on a disk:

$ du | sort -n

The program ‘du’ generates a list of directories and their sizes, which is then fed to sort, which outputs it in ascending order. However, nothing prohibits us from using a pipe to communicate with ourselves - and as such it might be a mighty fine conduit to pass database queries through to our database worker thread.

This has some very nice benefits. Pipes are incredibly efficient, since a lot of UNIX performance depends on them. Additionally, they implement sane blocking behaviour: if too much data is stuck in the pipe, because the other process does not take it out again quickly enough, the sending process automatically blocks. The operating system implements high and low water marks to make this (un)blocking happen efficiently.

Furthermore, pipes guarantee that data up to a certain size can either be written as a whole, or not written at all - making sure we don’t have to deal with partial messages.

Finally, pipes automatically detect when the process on the other end of them has gone away, or has closed its end of the pipe.

However, not all is good. In order to transmit something over a pipe, it must be serialised into bytes - we can’t transmit ready-to-use objects over them. Additionally, because pipes implement ‘stream’ behaviour, we need to delineate one message from the next, because the pipe itself does not say where a message begins and ends - unlike datagram sockets, for example.

And this is the clever bit of our idea. As stated above, pipes are usually employed to transmit data from one process to the other. In our case, the pipe goes from one thread of execution to the other - within the same process, and thus within the same memory space. So we don’t need to send serialised objects at all, and can get away with transmitting pointers to objects. And the nice thing is, pointers all have the same (known) length - so we can do away with both delineation and serialisation.

Additionally, pointers are a lot smaller than most messages, which means we can stuff more messages in the same (fixed) size of the pipe buffer.

So, are we done now? Sadly no - we have the additional need to be able to ‘flush the pipe’ in order to perform synchronous queries that we do need to wait for.

This is where things get complicated, but for those who really want to know, I’ll explain it here. It took almost a day of hacking to get it right however, and I’m explaining it for my own benefit as much as for that of the reader, since I’m bound to forget the details otherwise.

If a synchronous query comes along, we need to flush the pipe, but UNIX offers no such ability. Once we’ve written something to a pipe, all the kernel guarantees us is that it will endeavour to deliver it, but there is no system call that allows us to wait for all data to actually be delivered.

So we need to find a way to signal a ‘write barrier’, and the obvious way to do so is to send a NULL pointer over the pipe, which tells the other end we want to perform a synchronous query. Once the worker thread has seen the NULL pointer, it unlocks the single controlling mutex (which is the return signal that says “got you - the pipe is empty”), and then waits for further pointers to arrive.

Meanwhile, the sending thread tries to lock that same mutex immediately after sending the NULL pointer, which blocks since the receiving thread normally holds the lock. Once the lock succeeds, this tells us the worker thread has indeed exhausted all queries that were in flight.

The sending thread now performs its synchronous database work, knowing the database is fully coherent with all queries it sent out previously, and also knowing the worker thread is not simultaneously accessing the connection - since it is instead waiting for a new pointer to arrive.

If our program now wants to perform further asynchronous queries it can simply transmit further pointers to the worker thread - which oddly enough does not need to retake the mutex. This is what caused us many hours of delay, because intuitively it seems obvious that once the sending thread is done, it must release the mutex so the worker thread can retake it.

As it turns out, doing so opens a whole world of nasty race conditions which allow synchronous queries to ‘jump the queue’ of asynchronous queries that are in flight and have not yet arrived.

So, the sequence is that the worker thread only unlocks the mutex, while the sending thread only locks it.

And this basically is it! So how many lines of code did we save by using the magic of UNIX pipes? The pipe handling code takes all of 90 lines, whereas the Distributor code of PowerDNS takes a round 300, even though it does not offer synchronous queries, does not automatically block if too many queries are outstanding, and most certainly couldn’t implement the sensible wakeup ability that UNIX pipes do offer.

Oh, and you might be wondering by now, did it help? Indeed it did - our program is now at least 20 times faster than it used to be, and there was much rejoicing.