<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>bert hubert finally blogs: Reusing UNIX semantics for fun and profit</title>
    <link>http://blog.netherlabs.nl/articles/2007/09/20/reusing-unix-semantics-for-fun-and-profit</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>code, musings and more</description>
    <item>
      <title>Reusing UNIX semantics for fun and profit</title>
      <description>&lt;p&gt;I&amp;#8217;ve long been a fan of some of the techniques &lt;a href="http://en.wikipedia.org/wiki/Daniel_J._Bernstein"&gt;Dan
Bernstein&lt;/a&gt; uses to
leverage the power of UNIX to achieve complicated goals with little effort.
For example, he uses a technique called &lt;a href="http://en.wikipedia.org/wiki/Chain_loading"&gt;Chain
Loading&lt;/a&gt; to clearly separate and
insulate several programs from each other by loading a new program &lt;em&gt;in
place&lt;/em&gt; of the current one, once a critical task has been performed, like
checking a user&amp;#8217;s credentials.&lt;/p&gt;

&lt;p&gt;This guarantees that the outer program, which might actually be exposed to
the internet, can restrict itself to very basic functionality, and only
launch an inner, more useful program once authentication has completed.&lt;/p&gt;

&lt;p&gt;Other tricks are to leverage UNIX user names to insulate various programs
from each other, leaving the task of getting the access control details
right to the very well tested operating system (which we need to rely on
anyhow).&lt;/p&gt;

&lt;p&gt;While sometimes unconventional, techniques such as those described above can
simultaneously reduce code complexity AND increase security, by more or less
hitching a ride on top of existing functionality.&lt;/p&gt;

&lt;p&gt;Some time ago, I was involved in the development of a computer program with
a classic &amp;#8216;producer/consumer&amp;#8217; problem. We were inserting events in the
database, and wanted to scale by getting a dedicated and very fast database
server. To our surprise, getting an additional, and far more powerful system
did not improve our performance, and in fact made things far worse.&lt;/p&gt;

&lt;p&gt;What happened? It turns out we were doing a lot of small inserts into the
database, and even though we were using a transaction, each of these inserts
incurred a slight latency penalty, caused by the query &amp;amp; answer packets
having to travel over the network. And when doing hundreds of thousands of
queries, even half a millisecond is a lot of time: 200,000 round trips at
0.5 ms each add up to 100 seconds of waiting. Add in operating system
and TCP overhead, and the end-to-end latency is probably even higher.
The obvious solution is to no longer actually wait for the inserts to
complete, but to transmit them to the database asynchronously, and continue
to do useful work while the packets are in flight and being processed. This
way, no time is wasted waiting.&lt;/p&gt;

&lt;p&gt;Since most database APIs are synchronous, a separate helper thread of
execution needs to be spawned to create the fiction of asynchrony, and this
is where things get interesting.&lt;/p&gt;

&lt;p&gt;In the PowerDNS nameserver, a complicated &amp;#8216;Distributor&amp;#8217; abstraction is used
to send queries to database threads, and this Distributor contains locks,
semaphores and a zoo of other concurrent programming techniques to make
things work well. For example, we need to perform checks to see if we aren&amp;#8217;t
building up an unacceptable backlog of queries, and block if we find we are.
This comes with additional choices as to when to unblock etc. I was not
looking forward to reimplementing such a thing.&lt;/p&gt;

&lt;p&gt;Additionally, our database interface needed to offer an extra feature:
every once in a while a query comes along that we DO need to wait for, and
because of coherency issues, such a query can only be executed once all
queries &amp;#8216;in flight&amp;#8217; have finished.&lt;/p&gt;

&lt;p&gt;So we spent some time pondering this, and suddenly it dawned on me that many
of the features we needed exactly match the semantics of the venerable UNIX
&amp;#8216;pipe&amp;#8217;.&lt;/p&gt;

&lt;p&gt;A pipe is normally used to communicate between two processes, as exemplified
by this sample shell script command, which shows us the largest directories
on a disk:&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;$ du | sort -n&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The program &amp;#8216;du&amp;#8217; generates a list of directories and their sizes, which is
then fed to sort which outputs this in ascending order. 
However, nothing prohibits us from using a pipe to communicate with
ourselves - and as such it might be a mighty fine conduit to pass database
queries through to our database worker thread.&lt;/p&gt;

&lt;p&gt;This has some very nice benefits. Pipes are incredibly efficient, since a
lot of UNIX performance depends on them. Additionally, they implement sane
blocking behaviour: if too much data is stuck in the pipe, because the other
process does not take it out again quickly enough, the sending process
automatically blocks. The operating system implements high and low water
marks to make this (un)blocking happen efficiently.&lt;/p&gt;
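
&lt;p&gt;The finite buffer behind this blocking behaviour can be observed directly. The following is a minimal Python sketch, not code from our program: it switches the write end to non-blocking mode so that a full pipe raises an error where a normal writer would simply go to sleep. The function name pipe_capacity is made up for illustration, and the reported size depends on the operating system.&lt;/p&gt;

&lt;pre&gt;
```python
import os


def pipe_capacity():
    """Fill a pipe in non-blocking mode and report how many bytes fit."""
    read_fd, write_fd = os.pipe()
    # In non-blocking mode a full pipe raises BlockingIOError instead of
    # putting the writer to sleep, which lets us count what was accepted.
    os.set_blocking(write_fd, False)
    total = 0
    try:
        while True:
            total += os.write(write_fd, b"x" * 4096)
    except BlockingIOError:
        pass  # the pipe is full; a blocking writer would wait here
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return total


if __name__ == "__main__":
    print("pipe buffer holds", pipe_capacity(), "bytes")
```
&lt;/pre&gt;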

&lt;p&gt;Furthermore, pipes guarantee that a write of up to PIPE_BUF bytes (at least
512 under POSIX, 4096 on Linux) is either written as a whole, or not written
at all - making sure we don&amp;#8217;t have to deal with partial or interleaved messages.&lt;/p&gt;
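
&lt;p&gt;The size limit for such all-or-nothing writes can be queried at run time. A small Python sketch, assuming a UNIX-like system where the &#8216;PC_PIPE_BUF&#8217; configuration name is available; the helper name atomic_write_limit is hypothetical:&lt;/p&gt;

&lt;pre&gt;
```python
import os


def atomic_write_limit():
    """Return the largest write size the OS promises to keep in one piece."""
    read_fd, write_fd = os.pipe()
    try:
        # PC_PIPE_BUF is the per-pipe limit for atomic writes.
        return os.fpathconf(write_fd, "PC_PIPE_BUF")
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("writes up to", atomic_write_limit(), "bytes are atomic")
```
&lt;/pre&gt;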

&lt;p&gt;Finally, pipes automatically detect when the process on the other end of
them has gone away, or has closed its end of the pipe.&lt;/p&gt;
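
&lt;p&gt;For the reading side, this detection shows up as an empty read once every write end has been closed. A minimal Python sketch of that behaviour, with a made-up function name:&lt;/p&gt;

&lt;pre&gt;
```python
import os


def read_until_eof():
    """Read from a pipe until the writer has gone away."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"last message")
    os.close(write_fd)  # the writer closes its end of the pipe
    chunks = []
    while True:
        data = os.read(read_fd, 4096)
        if not data:  # an empty read means all write ends are closed
            break
        chunks.append(data)
    os.close(read_fd)
    return b"".join(chunks)


if __name__ == "__main__":
    print(read_until_eof())
```
&lt;/pre&gt;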

&lt;p&gt;However, not all is good. In order to transmit something over a pipe, it
must be serialised into bytes - we can&amp;#8217;t transmit ready-to-use objects over
them. Additionally, because pipes implement &amp;#8216;stream&amp;#8217; behaviour, we need to
delineate one message from the other, because the pipe itself does not say
where a message begins and ends - unlike datagram sockets for example.&lt;/p&gt;

&lt;p&gt;And this is the clever bit of our idea. As stated above, pipes are usually
employed to transmit data from one process to the other. In our case, the
pipe goes from one thread of execution to the other - within the same
process, and thus within the same memory space. So we don&amp;#8217;t need to send
serialised objects at all, and can get away with transmitting &lt;em&gt;pointers&lt;/em&gt; to
objects. And the nice thing is, pointers all have the same (known) length -
so we can do away with both delineation and serialisation.&lt;/p&gt;

&lt;p&gt;Additionally, pointers are a lot smaller than most messages, which means we
can stuff more messages in the same (fixed) size of the pipe buffer.&lt;/p&gt;
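
&lt;p&gt;The idea can be sketched in Python, with one loud caveat: Python has no raw pointers, so this sketch sends fixed-size 8-byte tokens that index a table shared by both threads. The token plays the role of the pointer; the names TOKEN, pending and run_demo are assumptions made for this illustration and do not come from our actual code.&lt;/p&gt;

&lt;pre&gt;
```python
import os
import struct
import threading

# Fixed 8-byte token standing in for a pointer (an assumption of this sketch).
TOKEN = struct.Struct("=Q")


def run_demo(queries):
    """Pass objects between two threads by sending only tokens over a pipe."""
    read_fd, write_fd = os.pipe()
    pending = {}   # token -> object, visible to both threads
    executed = []  # what the worker has processed, in order

    def worker():
        while True:
            # Every write is exactly TOKEN.size bytes and far below PIPE_BUF,
            # so the pipe always holds whole tokens and no framing is needed.
            raw = os.read(read_fd, TOKEN.size)
            if not raw:  # write end closed: no more work
                break
            (token,) = TOKEN.unpack(raw)
            executed.append(pending.pop(token))

    thread = threading.Thread(target=worker)
    thread.start()
    for token, query in enumerate(queries, start=1):
        pending[token] = query  # store the object before announcing it
        os.write(write_fd, TOKEN.pack(token))
    os.close(write_fd)
    thread.join()
    os.close(read_fd)
    return executed


if __name__ == "__main__":
    print(run_demo(["INSERT 1", "INSERT 2", "INSERT 3"]))
```
&lt;/pre&gt;

&lt;p&gt;Because every message has the same size, the reader never has to search for message boundaries, and the objects themselves never leave the shared memory space.&lt;/p&gt;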

&lt;p&gt;So, are we done now? Sadly no - we have the additional need to be able to
&amp;#8216;flush the pipe&amp;#8217; in order to perform synchronous queries that we do need to
wait for. &lt;/p&gt;

&lt;p&gt;This is where things get complicated, but for those who really want to know,
I&amp;#8217;ll explain it here. It took almost a day of hacking to get it right,
however, and I&amp;#8217;m explaining it for my own benefit as much as for that of the
reader, since I&amp;#8217;m bound to forget the details otherwise.&lt;/p&gt;

&lt;p&gt;If a synchronous query comes along, we need to flush the pipe, but UNIX
offers no such ability. Once we&amp;#8217;ve written something to a pipe, all the
kernel guarantees us is that it will endeavour to deliver it, but there is
no system call that allows us to wait for all data to actually be delivered.&lt;/p&gt;

&lt;p&gt;So we need to find a way to signal a &amp;#8216;write barrier&amp;#8217;, and the obvious way to
do so is to send a NULL pointer over the pipe, which tells the other end we
want to perform a synchronous query. Once the worker thread has seen the
NULL pointer, it unlocks the single controlling mutex (which is the return
signal that says &amp;#8220;got you - the pipe is empty&amp;#8221;), and then waits for further
pointers to arrive.&lt;/p&gt;

&lt;p&gt;Meanwhile, the sending thread tries to lock that same mutex immediately
after sending the NULL pointer, which blocks since the receiving thread
normally holds the lock. Once the lock succeeds, this tells us the worker
thread has indeed exhausted all queries that were in flight.&lt;/p&gt;

&lt;p&gt;The sending thread now performs its synchronous database work, knowing the
database is fully coherent with all queries it sent out previously, and also
knowing the worker thread is not simultaneously accessing the connection -
since it is instead waiting for a new pointer to arrive.&lt;/p&gt;

&lt;p&gt;If our program now wants to perform further asynchronous queries it can
simply transmit further pointers to the worker thread - which oddly enough
does not need to retake the mutex. This is what caused us many hours of
delay, because intuitively it seems obvious that once the sending thread is
done, it must release the mutex so the worker thread can retake it.&lt;/p&gt;

&lt;p&gt;As it turns out, doing so opens a whole world of nasty race conditions which
allow synchronous queries to &amp;#8216;jump the queue&amp;#8217; of asynchronous queries that
are in flight and have not yet arrived.&lt;/p&gt;

&lt;p&gt;So, the sequence is that the worker thread only unlocks the mutex, while the
sending thread only locks it.&lt;/p&gt;
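
&lt;p&gt;The whole sequence can be sketched in Python under a few stated assumptions: a single sending thread, a token of 0 standing in for the NULL pointer, and a plain list standing in for the database. The class and method names are hypothetical. Python&#8217;s threading.Lock may be released by a thread other than the one that acquired it, which this hand-off relies on; POSIX does not define unlocking a default mutex from a non-owning thread, so in C or C++ a semaphore may be the safer primitive for the same pattern.&lt;/p&gt;

&lt;pre&gt;
```python
import os
import struct
import threading

TOKEN = struct.Struct("=Q")  # fixed-size token standing in for a pointer
BARRIER = 0                  # plays the role of the NULL pointer


class AsyncRunner:
    """Single-sender sketch of the pipe plus one-way mutex protocol."""

    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()
        self.pending = {}    # token -> query, shared between the threads
        self.log = []        # stands in for the database
        self.next_token = 1
        self.drained = threading.Lock()
        self.drained.acquire()  # starts locked, as if held by the worker
        self.thread = threading.Thread(target=self._worker)
        self.thread.start()

    def _worker(self):
        while True:
            raw = os.read(self.read_fd, TOKEN.size)
            if not raw:  # write end closed
                break
            (token,) = TOKEN.unpack(raw)
            if token == BARRIER:
                self.drained.release()  # the worker only ever unlocks
            else:
                self.log.append(self.pending.pop(token))

    def submit(self, query):
        """Queue an asynchronous query and return immediately."""
        token = self.next_token
        self.next_token += 1
        self.pending[token] = query
        os.write(self.write_fd, TOKEN.pack(token))

    def run_sync(self, query):
        """Wait for everything in flight, then run the query ourselves."""
        os.write(self.write_fd, TOKEN.pack(BARRIER))
        self.drained.acquire()  # the sender only ever locks
        # The worker is now waiting for the next token, so the
        # connection is ours; the lock stays held until the next barrier.
        self.log.append(query)

    def close(self):
        os.close(self.write_fd)
        self.thread.join()
        os.close(self.read_fd)


if __name__ == "__main__":
    runner = AsyncRunner()
    runner.submit("INSERT a")
    runner.submit("INSERT b")
    runner.run_sync("SELECT count")
    runner.close()
    print(runner.log)
```
&lt;/pre&gt;

&lt;p&gt;Note that run_sync never releases the lock and the worker never acquires it, mirroring the rule above.&lt;/p&gt;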

&lt;p&gt;And this basically is it! So how many lines of code did we save by using the
magic of UNIX pipes? The pipe handling code takes all of 90 lines, whereas
the Distributor code of PowerDNS takes around 300, even though it does not
offer synchronous queries, does not automatically block if too many
queries are outstanding, and most certainly couldn&amp;#8217;t implement the sensible
wakeup ability that UNIX pipes do offer.&lt;/p&gt;

&lt;p&gt;Oh, and you might be wondering by now, did it help? Indeed it did - our
program is now at least 20 times faster than it used to be, and there was
much rejoicing.&lt;/p&gt;</description>
      <pubDate>Thu, 20 Sep 2007 21:57:00 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:83ebe649-e3df-404b-8d15-c6fee369bafa</guid>
      <author>bert.hubert@netherlabs.nl (bert hubert)</author>
      <link>http://blog.netherlabs.nl/articles/2007/09/20/reusing-unix-semantics-for-fun-and-profit</link>
    </item>
    <item>
      <title>"Reusing UNIX semantics for fun and profit" by Augie</title>
      <description>So does that mean the PowerDNS Distributor code is going to get a re-work? :)</description>
      <pubDate>Sun, 23 Sep 2007 19:28:05 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:4058748e-74a1-4655-bdf1-964762457db6</guid>
      <link>http://blog.netherlabs.nl/articles/2007/09/20/reusing-unix-semantics-for-fun-and-profit#comment-142572</link>
    </item>
    <item>
      <title>"Reusing UNIX semantics for fun and profit" by piotr</title>
      <description>&lt;a href="http://highscalability.com/paper-end-architectural-era-it-s-time-complete-rewrite" rel="nofollow"&gt;http://highscalability.com/paper-end-architectural-era-it-s-time-complete-rewrite&lt;/a&gt;</description>
      <pubDate>Fri, 21 Sep 2007 22:02:09 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:b58c19fe-8570-4c9f-9e83-fe9f3470f48a</guid>
      <link>http://blog.netherlabs.nl/articles/2007/09/20/reusing-unix-semantics-for-fun-and-profit#comment-142571</link>
    </item>
    <item>
      <title>"Reusing UNIX semantics for fun and profit" by Stéphane Bortzmeyer</title>
      <description>Your shell example of the use of a pipe may not be the best one since "sort" cannot start work until "du" have completed.
</description>
      <pubDate>Fri, 21 Sep 2007 16:02:14 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:98da4efb-13ae-4d35-b980-7a1a1483b7c1</guid>
      <link>http://blog.netherlabs.nl/articles/2007/09/20/reusing-unix-semantics-for-fun-and-profit#comment-142570</link>
    </item>
  </channel>
</rss>
