Posted by bert hubert
Sun, 16 Nov 2008 21:21:00 GMT
- Idea - estimates for time to completion range from 3 days to 3 weeks
- Pretty convincing first stab ‘look how cool this would be’
- The Hard Slog to get something that actually works. Estimates now range from 3 months to 3 years.
- First real users pop up, discovery is made that all assumptions were off
- Starts to look good to the first real user
- Elation!
- Someone actually uses the code it for real, the bugs come out in droves
- A zillion bugs get addressed, harsh words are spoken
- Elation!
- The guy you had previously told that 100 million users would not ‘in principle’ be a problem actually took your word for it, and deployed it on said user base. Harsh words are spoken.
- Fundamentals are reviewed, large fractions of the code base reworked
- Product now actually does what everybody hoped it would do.
- Even very unlikely bugs have cropped up by now, and have been addressed. Even rare use cases are now taken into account.
- If a user complains of a crash at this stage, you can voice doubts about the quality of his hardware or operating system.
PowerDNS went through all these stages, and took around 5 years to do so.
Not all parts are at ‘stage 14’ yet, but for the Recursor, I seriously ask people to run ‘memtest’ if they report a crash.
The above 14 points are never traversed without users that care. For PowerDNS, step ‘4’ was performed by Amaze Internet and step ‘7’ by ISP Services. 1&1 (called Schlund back then) was instrumental in step ‘10’ when they started using it on millions of domains.
For the PowerDNS Recursor, steps ‘4’ and ‘7’ not only happened over at XS4ALL, but they also paid for it all!
Step ‘10’ occurred over at AOL and Neuf Cegetel, who together connected the Recursor to 35 million subscribers or so.
Finally, the parts of PowerDNS that have reached the end of the list above have done so because of literally hundreds if not thousands of operators that have made the effort to report their issues, or voice their wishes.
Many thanks to everybody!
Hmm, the above does not sound very professional..
I’ve heard the theory that some people think they can plan software development more professionally. I used to believe them too. But any real project I’ve heard of went through the stages listed above. No schedule, no Microsoft Project sheet, no Gantt Chart I know about ever even came close to reality.
But I’d love to be wrong, because I agree fully that it would be great if software development was more predictable.
This is especially true since the aforementioned “process” necessarily involves several very committed users, who have to voice the harsh words, but do have to stick with the project.
So please comment away if your real life experiences are different - I’d love to hear!
Posted in PowerDNS, Netherlabs | no comments
Posted by bert hubert
Thu, 18 Sep 2008 19:44:00 GMT
After too much posting on IETF mailing lists, and not achieving anything, I’ve gone back to coding a bit more.
There are two things I want to share - the first because I had a devil of a time figuring out how to do something, and I hope that posting here will help fellow-sufferers find the solution via Google. The second thing I want to talk about because, and this is getting to be rare, I programmed something cool, and I just need to tell someone about it.
I pondered explaining it to my lovely son Maurits (4 months old today), but I don’t want to ruin his brain.
Debugging iterators
In most programming languages there are a lot of things that compile just fine, or generate no parsing errors at runtime, but are still accidents waiting to happen.
Tools abound to expose such silent errors, usually at a horrendous performance cost. But this is fine, as errors can be found by the developer, and fixed before release.
As part of our arsenal, we have the veritable Valgrind that detects things such as reading from memory that had not previously been written to. In addition, other tricks are available, such as changing functions that ‘mostly do X, and rarely Y’ so that they always to Y. This quickly finds programs that skipped dealing with Y (which might be a rare error condition, or realloc(2) returning a new memory address for your data).
Finally, many programming environments by default perform very little checking (in the interest of speed) - for example, they will gladly allow you to compare something that points to data in collection A to something that points to collection B - a comparison that never makes sense, classical apples and oranges.
My favorite C++ compiler, G++, comes with so called ‘debugging iterators’ that are very strict - anything that is not absolutely correct becomes an error, sometimes at compile time, sometimes at runtime.
Together with Valgrind, this is one of the techniques I like to whip out when the going gets tough.
Sadly, Debugging iterators (which are turned on by adding -DGLIBCXXDEBUG conflict with one of my favorite C++ libraries, Boost.
To make a long story short, to compile a version of Boost with debugging iterators, issue:
$ bjam define=_GLIBCXX_DEBUG
This single line of text may not look all that important, but it took me half a day of debugging to figure this out. So if you get this error:
dnsgram.o:
(.rodata._ZTVN5boost15program_options11typed_valueISscEE[vtable for
boost::program_options::typed_value, std::allocator >, char>]+0x18):
undefined reference to
`boost::program_options::value_semantic_codecvt_helper::parse(boost::any&,
std::__debug::vector,
std::allocator >, std::allocator, std::allocator > > > const&, bool) const'
Then compile your own version of boost as outlined above.
C++ Introspection & Statistics
C++ is an old-school language, perhaps the most modern language of the old school. This means that it sacrifices a lot of things to allow programs to run at stunning ‘near bare metal’ speeds. One of the things that C++ does not offer therefore is ‘introspection’
What this means is that if you have a class called “ImportantClass”, that class does not know its own name at runtime. When a program is running, it is not possible to ask by name for an “ImportantClass” to be instantiated.
If you need this ability, you need to register your ImportantClass manually by its name “ImportantClass”, and store a pointer to a function that creates an ImportantClass for you when you need it.
Doing so manually is usually not a problem, except of course when it is. In PowerDNS, I allocate a heap (or a stack even) of runtime statistics. Each of those statistics is a variable (or a function) with a certain name.
In more modern languages, it would probably be easy to group all these variables together (with names like numQueries, numAnswers, nomUDPQueries etc), and allow these statistics to be queried using their names. So, an external program might call ‘get stat numQueries’, and PowerDNS would look up the numQueries name, and return its value.
No such luck in C or C++!
So - can we figure out something smart, say, with a macro? Yes and no. The problem is that when we declare a variable in C which we want to be accessible from elsewhere in the program, it needs to happen either inside a struct or class, or at global scope. This in turn means that we can’t execute code there. So, what we’d like to do, but can’t is:
struct Statistics {
uint64_t numPackets;
registerName(&numPackets, "numPackets");
uint64_t numAnswers;
registerName(&numAnswers, "numAnswers");
} stats;
stats.numPackets is indeed available, but the line after its definition will generate an error. This is sad, since the above could easily be generated from a macro, so we could do:
DEFINESTAT(numPackets, “Number of packets received”);
Which would simultaneously define numPackets, as well as make it available as “numPackets”, and store a nice description of it somewhere.
But alas, this is all not possible because of the reasons outlined above.
So - how do we separate the data structure from the ‘registerName()’ calls, while retaining the cool ‘DEFINESTAT’ feature where everything is in one place?
In C++, files can be included using the #include statement. Most of the time, this is used to include so called ‘header’ files - but nothing is stopping us from using this feature for our own purposes.
The trick is to put all the ‘DEFINESTAT’ statements in an include file, and include it not once, but twice!
First we do:
#define DEFINESTAT(name, desc) uint64_t name;
struct Statistics {
#include "statistics.h"
} stats;
This defines the Statistics struct, containing all the variables we want to make available. These are nicely expanded using our DEFINESTAT definition.
Secondly, we #undefine DEFINESTAT again, and redefine it as:
#define DEFINESTAT(name, desc) registerName(&stats.name, #name, desc);
Then we insert this in a function:
void registerAllNames()
{
#include "statistics.h"
}
This will cause the same statistics.h file to be loaded again, with the same DEFINESTAT lines in there, but this time DEFINESTAT expands to a call that registers each variable, its name (#name expands to “name”), and its description.
The rest of our source can now call ‘stats.numPackets++’, and if someone wants to query the “numPackets” variable, it is available easily enough through its name since it has been registered using registerName.
The upshot of this all is that we have gained the ability to ‘introspect’ our Statistics structure, without any runtime overhead nor any further language support.
As stated above, more modern languages make this process easier.. but not as fast!
I hope you enjoyed this arcane coolness as much as I did. But I doubt it :-)
Posted in Linux, PowerDNS, Netherlabs | no comments
Posted by bert hubert
Mon, 04 Aug 2008 22:15:00 GMT
This post sets out to calculate how hard it is to spoof a resolver that
takes simple, unilateral, steps to prevent such spoofing.
Unilateral in this case means that any resolver can implement these steps,
without changing the DNS protocol or authoritative server behaviour.
Everybody that implements the ideas below immediately improves the general
security of the DNS.
To save you all the reading, the simple unilateral measures can bring down
the chance to be spoofed to under 1% after a year of non-stop 50 megabit/s
packet blasting.
Note however that my math or my ideas may be wrong. Please read carefully!
Work so far
Recapping, calculations so far show that a fully source port randomized
nameserver can be spoofed with a 64% chance of success within 24 hours,
requiring around 0.4TB of packets to be generated. If 2 hours are available,
the chance of success is 8.1%.
This assumes that the attacker is able to generate around 50 megabits/s, and
also important, that the resolver is able to process 50k incoming responses/s.
Details are on
http://www.ops.ietf.org/lists/namedroppers/namedroppers.2008/msg01194.html
with the caveat that where it says “1.5 - 2 GB”, it should say “36GB”.
Since that posting, quite a number of people have studied the calculations,
and they appear to hold up.
Note that the present post does not address the dangers created by those
able to actively intercept and modify traffic - people with such abilities
have little need to spoof DNS anyhow.
Situation
The current situation is not acceptable - the resources needed to perform a
successful spoofing attack are available, if not generally, then to an
relevant subset of Internet users.
It turns out that it takes a lot of effort to get the world to apply even a
minor patch that has received tremendous front-page coverage on all the
security websites - the source port randomisation patches in this case.
So if we want to move quickly, we need a solution that can be rolled out
without having to upgrade large parts of the Internet.
Agile countermeasures
There are a number of strategies a resolver could employ to make itself
effectively unspoofeable, some of these include:
A) Sending out questions over TCP/IP
B) Repeating questions a number of times, and requiring the answers to be
equivalent
The problems with these two techniques is that they imply a certain overhead
and increase the general CPU utilization on the Internet.
Furthermore there are strategies that make spoofing harder, but not
impracticable:
C) Case games (‘dns-0x20’)
D) Use multiple source addresses
A very comprehensive enumeration of techniques can be found on
http://www.ops.ietf.org/lists/namedroppers/namedroppers.2008/msg01131.html
Sadly, it appears that on this impressive list there is nothing without more
or less unacceptable overhead that really gets us out of the woods, where
this is arbitrarily defined as reducing the spoofing success rate to below
1% after a year of non-stop trying at 50 megabit/s.
Detecting a spoofing attempt in progress
Since any spoofed response has a chance of around 2^-32 of being accepted,
it stands to reason around 2^31 bogus responses will arrive at the resolver
before the attacker manages to achieve his goals.
Since we know we have effective countermeasures available, like A and B
mentioned above, we could deploy these in case a spoofing attempt is
detected. Remember that A and B are generally available, but that we don’t
want to resort to them all the time, for all domains, because of their
overhead.
Occasionally, port numbers get modified in transit. Additionally, responses
to queries sometimes arrive late enough that a new equivalent query has
since been sent. This means we should not consider a single response
mismatch to be a sign of a spoofing attempt in progress.
If we allow X mismatches before falling back on A or B, the chance of a
single query being spoofed is:
X
--------- ~= 2^-32 * X
N * P * I
Assuming each individual attack lasts W seconds (being latency between the
authentic authoritative server and the resolver), the combined spoofing
chance after T seconds becomes:
T/W
1 - (1 - 2^-32 * X) ~= 2^-32 * X * T/W
Putting in 20 for X and 0.1s for W, this gets us a combined spoofing chance
of 0.4% for a full day. Interesting, but not good enough, especially since
the attacker might well send only X packets per attempt, and launch far more
attempts.
However, if the attacker has a defined goal about what to spoof, each
successive attempt might be for a different domain name, or differ in other
respects, but all those attempts will share some characteristics.
Two things that will be identical, or at least reasonably unique are the
source address of the spoofed packet (aka, the network address of the
authentic authoritative server), plus the ‘zone cut apex’. This last bit
requires some understanding of the way a resolver works.
(I made up this phrase, ‘zone cut apex’, but I think there are people with
better knowledge of DNS verbiage than I, I’d love to hear of a better name)
When a resolver asks NS1.EXAMPLE.COM for ‘this.is.a.long.example.com’, the
resolver knows it is asking NS1.EXAMPLE.COM in its role as ‘EXAMPLE.COM’
nameserver - this is how it selected that server.
This means that an attacker might try many variations on
‘this.is.a.long.example.com’, what will remain identical is the ‘zone cut
apex’, which is ‘EXAMPLE.COM’. What will also remain identical is the
(small) set of example.com servers available.
I’ll get into more detail after Dan Kaminksy has held his presentation. The
upshot however is that multiple different attempts can be correlated, and
thus be counted together in the spoofing-detection counts.
If we conservatively decide to impose a 5 minute ‘fallback to A or B’-regime
for a {source IP, zone cut apex}-tuple, and leave the detection limit at 20,
this means an attacker will have one chance every 5 minutes of getting in an
attempt.
This is equivalent to setting W equal to 300 seconds above, yielding us a
combined chance of spoofing a domain after a year of trying of 0.05%.
Well within our goal.
Reality intrudes
Sadly, the reality is that we won’t recognize all spoofed packets that
guessed wrong, so to speak. Typical operating systems will only let a
nameserver know about packets that arrived on a socket open for that
purpose.
In the very worst case, the server is only listening on a single port, and
by the time a single mismatch is received by the nameserver process, on
average 32000 will have arrived on the network interface.
This means that in the calculation above, if we don’t take additional
measures, we need to set X to 32000, leading to a combined monthly
spoofing chance of 6.4% (and a yearly near-certainty).
Fine tuning things
By raising the fallback period to an hour, the yearly spoofing chance
becomes 6.5%, assuming we only see 1 in every 32000 spoof attempts.
If in addition a small number of sockets is kept open, say 10, to function as
‘canary ports’, X reduces to 3200, and the yearly spoofing chance is back at
a low 0.65%.
Canary ports serve no function except to detect spoofing attempts. For
efficiency reasons, these ports may simply be ports that had previously been
used for source port randomisation, but kept around somewhat longer.
The number of canary ports, and the fallback period can be tuned at will to
achieve desired spoofing resilience levels.
Remaining Problems
Countermeasure ‘A’ does not work for domains not offering TCP service.
Countermeasure ‘B’ does not work for domains where single authoritative
servers generate differing answers to repetitive questions. This might
happen in case of (hardware) load balancers, or load balanced desynchronised
nameservers.
Best results might be achieved by alternately trying countermeasure A and B - any server that does not support TCP and sends out inconsistent replies
is in for some tough love if someone attempts to spoof the domains it hosts.
Conclusion
If the calculations and ideas elaborated above hold up, it appears feasible
to achieve arbitrarily low spoofing chances, without doing any further
protocol work.
Importantly, such changes would allow individual resolvers to protect their
users without having to wait for changes to all authoritative servers.
In other words, everybody who participates receives an immediate benefit.
1 comment
Posted by bert hubert
Wed, 09 Jul 2008 19:31:00 GMT
Yesterday it was announced that there is an unspecified but major DNS vulnerability, and that Microsoft, Nominum and ISC had fixes available.
It is amusing to note that this has been hailed as a major feat of cooperation, with the vulnerable parties spinned as being part of secret industry cabal that has just saved the world from very bad things.
To say the least, I find this a funny way of presenting things! The vulnerability is still not public, but the secret cabal shared it with me. Perhaps it is fair to say I am part of the cabal - I nearly traveled to the secret meeting at the Microsoft campus, but the imminent birth of my son made me decide not to go.
The DNS vulnerability that has been presented yesterday is indeed a very serious problem, and I am glad steps are now taken to fix the broken software that was vulnerable. Dan Kaminksy is to be praised for discovering the issue and coordinating the release.
However - the parties involved aren’t to be lauded for their current fix. Far from it. It has been known since 1999 that all nameserver implementations were vulnerable for issues like the one we are facing now. In 1999, Dan J. Bernstein released his nameserver (djbdns), which already contained the countermeasures being rushed into service now. Let me repeat this. Wise people already saw this one coming 9 years ago, and had a fix in place.
In 2006 when my own resolving nameserver entered the scene, I decided to use the same security strategy as implemented in djbdns (it is always better to steal a great idea than to think up a mediocre one!). Some time after that, I realised none of the other nameservers had chosen to do so, and I embarked on an effort to move the IETF DNS-EXT working group to standardise and thus mandate this high security behaviour.
This didn’t really go anywhere, but some months ago I noticed particularly strenuous resistance in the standardisation of the so called ‘forgery resilience draft’, and after some prodding it became clear it was felt my draft was in danger of drawing attention to the then unannounced DNS vulnerability, and that it were best if we’d all shut up about it for a few months, perhaps until July 2008 until all the vendors would have had time to get their act together.
And now we’ve seen the release, and it is being hailed as great news. But it isn’t. Dan Bernstein has been ignored since 1999 when he said something should be done. I’ve been ignored since 2006. The IETF standardisation languished for two years.
This is not a success story. It has in fact been a remarkable failure.
To end on a positive note - I am very glad Dan Kaminsky’s work caused some collective eye opening, and I hope good things come from this. DNS has long lacked critical attention, and in the end this might bring about sorely needed improvements.
DNS very recently celebrated its 25th birthday - I look forward to seeing the venerable Domain Name System succeed in the coming 25 years!
Posted in PowerDNS, Netherlabs | 8 comments
Posted by bert hubert
Sat, 24 May 2008 08:37:00 GMT
18th of May, Delft, The Netherlands
Mirjam & Bert are proud to announce the birth of their son Maurits Hubert! Mother, son & father are doing very well.
Feel free to email the little guy on maurits@hubertnet.nl!
Picture when Maurits was only an hour old:
And a slightly geeky Droste Effect photo:

Posted in Life | no comments
Posted by bert hubert
Tue, 13 Nov 2007 12:42:00 GMT
Exactly one year ago today, my father passed away, less than a year after my mother did.
Here you can see them in happier times, together with the other subject of this post:
While we mourn their passing today, not all news is bad. I’m happy to announce Mirjam and I are expecting a baby!
We’re very happy, but sad we won’t be able to share the good news with my parents. But: life goes on - which is literally true in this case.
Bert & Mirjam
Posted in Life | 3 comments
Posted by bert hubert
Sun, 11 Nov 2007 10:36:00 GMT
While running the risk of turning this blog into a lecture series, I can’t
resist. This post will dive into cryptography, and I hope to be able to
transfer the sense of wonder that caught me when I first read about Diffie-Hellman
key exchange many years ago.
Let’s assume you are in a room with two other people, and that you want to
share a secret with one of them, but not with the other. In the tradition of
cryptography, we’ll call these three people Alice (you), Bob (your friend)
and Eve (who wants to ‘Eavesdrop’ on your secrets).
Let’s also assume that the room is very quiet, so you can’t whisper, and
everybody can hear what everybody else is saying. Furthermore, you are far enough away that you can’t pass paper messages.
So how could you (Alice) share a secret with Bob? Anything you want to tell
Bob, will be overheard by Eve. You might try to think up a code, but you’ll
still have to explain the code, and both Bob and Eve will hear it.
It turns out that using the magic of public key cryptography, this is
possible - sharing a secret while people are listening in.
The room with Alice, Bob and Eve is not a very relevant example, but replace
Alice by ‘The allied forces’, ‘Bob’ by a resistance fighter equipped with a
radio, and ‘Eve’ by the occupying enemy, and things start to make sense.
Or, in today’s terms, replace Bob by Amazon.com, and Eve by a hacker
interested in getting your credit card number.
So how does it work?
To send a secret, two things are needed: an ‘algorithm’ and a ‘key’. A famous
algorithm is the ‘Caesar cypher’, which consists of shifting all letters by
a fixed amount. So an A might become a B, a B would become a C etc etc.
The key in this case is how much you want to shift the letters, in the
sample above the key is ‘1’. If the key had been ‘2’, an A would’ve become a
C, a B would’ve become a D etc.
Typically, you can discuss the algorithm in public, but you need to keep the
key secret. In terms of Alice and Bob, they will be able to communicate in
secret once they’ve been able to establish a key that Eve does not know
about.
Once everybody has agreed to use the Caesar cypher, the problem shifts to
exchanging how many letters we will shift. We can’t just say this out loud,
since both Bob and Eve will hear it.
Diffie-Hellman
Way back in 1976, Whitfield Diffie and Martin Hellman published the details
of what has become known as the Diffie-Hellman key exchange algorithm,
although they both credit Ralph Merkle with some of the key ideas.
The process basically works as follows. Alice and Bob each think of a random
number, that they keep a secret. Then they both do some calculations based
on this number, and say the result of those calculations out loud.
Then both Alice and Bob combine the results of the calculations with their own
secret random number, and out pops a shared random secret number. This
shared random secret number is now known by Alice and Bob, but not by Eve.
And it is this secret that now becomes the key.
How is this possible?
Eve heard both Alice and Bob say a random number, exactly the same numbers
that Alice and Bob heard. Yet only Alice and Bob now know the shared secret.
How is this possible?
The trick lies in the calculation, by which means Alice and Bob only shared
parts of the numbers they chose initially. Then both Alice and Bob combined
those parts with their full random numbers.
It is this trick of revealing only parts of random numbers, and then
combining the part of the other party with your full number, that leads to a
shared secret.
Show me
On this page I wrote a very simple Diffie-Hellman example program that runs entirely within your web browser. You can either use it alone, or with a friend - which is the most fun. It works over the phone, or over an instant messenger (IRC, MSN etc). Follow the instructions, encode a message, paste it to your friend, and if your friend followed the instructions, and he pastes the encoded message into the decoder, he should see your secret message!
This is even more fun in a chat room with actual Eve’s present.
Please be aware that the sample is a joke - don’t use it to share real secrets! However, the technology it employs is real, and this truly is how people exchange keys - only the numbers are far larger (300 digits), and the actual encryption is not a Caesar cypher.
So how does it really work?
More information can be found on the wikipedia page about Diffie-Hellman, especially in the ‘external links’ section.
Posted in PowerDNS | 1 comment
Posted by bert hubert
Thu, 20 Sep 2007 19:57:00 GMT
I’ve long been a fan of some of the techniques Dan
Bernstein uses to
leverage the power of UNIX to achieve complicated goals with little effort.
For example, he uses a technique called Chain
Loading to clearly separate and
insulate several programs from each other by loading a new program *in
place* of the current one, once a critical task has been performed, like
checking a user’s credentials.
This guarantees that the outer program, that might actually be exposed to
the internet, can restrict itself to very basic functionality, and only
launch an inner, more useful program once authentication has completed.
Other tricks are to leverage UNIX user names to insulate various programs
from each other, leaving the task of getting the access control details
right to the very well tested operating system (which we need to rely on
anyhow).
While sometimes unconventional, techniques such as those described above can
simultaneously reduce code complexity AND increase security, by more or less
hitching a ride on top of existing functionality.
Some time ago, I was involved in the development of a computer program with
a classic ‘producer/consumer’ problem. We were inserting events in the
database, and wanted to scale by getting a dedicated and very fast database
server. To our surprise, getting an additional, and far more powerful system
did not improve our performance, and in fact made things far worse.
What happened? It turns out we were doing a lot of small inserts into the
database, and even while we were using a transaction, each of these inserts
incurred a slight latency penalty, caused by the query & answer packets
having to travel over the network. And when doing hundreds of thousands of
queries, even half a millisecond is a lot of time. Add in operating system
and TCP overhead, and the end to end latency is probably even higher.
The obvious solution is to no longer actually wait for the inserts to
complete, but to transmit them to the database asynchronously, and continue
to do useful work while the packets are in flight and being processed. This
way, no time is wasted waiting.
Since most database APIs are synchronous, a separate helper thread of
execution needs to be spawned to create the fiction of asynchrony, and this
is where things get interesting.
In the PowerDNS nameserver, a complicated ‘Distributor’ abstraction is used
to send queries to database threads, and this Distributor contains locks,
semaphores and a zoo of other concurrent programming techniques to make
things work well. For example, we need to perform checks to see if we aren’t
building up an unacceptable backlog of queries, and block if we find we are.
This comes with additional choices as to when to unblock etc. I was not
looking forward to reimplementing such a thing.
Additionally, our database interface needed to offer an extra feature:
every once in a while a query comes along that we DO need to wait for, and
because of coherency issues, such a query can only be executed once all
queries ‘in flight’ have finished.
So we spent some time pondering this, and suddenly it dawned on me that many
of the features we needed exactly match the semantics of the venerable UNIX
‘pipe’.
A pipe is normally used to communicate between two processes, as exemplified
by this sample shell script command, which shows us the largest directories
on a disk:
$ du | sort -n
The program ‘du’ generates a list of directories and their sizes, which is
then fed to sort which outputs this in ascending order.
However, nothing prohibits us from using a pipe to communicate with
ourselves - and as such it might be a might fine conduit to pass database
queries through to our database worker thread.
This has some very nice benefits. Pipes are incredibly efficient, since a
lot of UNIX performance depends on them. Additionally, they implement sane
blocking behaviour: if too much data is stuck in the pipe, because the other
process does not take it out again quickly enough, the sending process
automatically blocks. The operating system implements high and low water
marks to make this (un)blocking happen efficiently.
Furthermore, pipes guarantee that data up to a certain size can either be
written as a whole, or not written at all - making sure we don’t have to
deal with partial messages.
Finally, pipes automatically detect when the process on the other end of
them has gone away, or has closed its end of the pipe.
However, not all is good. In order to transmit something over a pipe, it
must be serialised into bytes - we can’t transmit ready to use objects over
them. Additionally, because pipes implement ‘stream’ behaviour, we need to
delineate one message from the other, because the pipe itself does not say
where a message begins and ends - unlike datagram sockets for example.
And this is the clever bit of our idea. As stated above, pipes are usually
employed to transmit data from one process to the other. In our case, the
pipe goes from one thread of execution to the other - within the same
process, and thus within the same memory space. So we don’t need to send
serialized objects at all, and can get away with transmitting pointers to
objects. And the nice thing is, pointers all have the same (known) length -
so we can do away with both delineation and serialisation.
Additionally, pointers are a lot smaller than most messages, which means we
can stuff more messages in the same (fixed) size of the pipe buffer.
So, are we done now? Sadly no - we have the additional need to be able to
‘flush the pipe’ in order to perform synchronous queries that we do need to
wait for.
This is where things get complicated, but for those who really want to know,
I’ll explain it here. It took almost a day of hacking to get it right
however, and I’m explaining it for my own benefit as much as for that of the
reader, since I’m bound to forget the details otherwise.
If a synchronous query comes along, we need to flush the pipe, but UNIX
offers no such ability. Once we’ve written something to a pipe, all the
kernel guarantees us is that it will endeavour to deliver it, but there is
no system call that allows us to wait for all data to actually be delivered.
So we need to find a way to signal a ‘write barrier’, and the obvious way to
do so is to send a NULL pointer over the pipe, which tells the other end we
want to perform a synchronous query. Once the worker thread has seen the
NULL pointer, it unlocks the single controlling mutex (which is the return
signal that says “got you -the pipe is empty”), and then waits for further
pointers to arrive.
Meanwhile, the sending thread tries to lock that same mutex immediately
after sending the NULL pointer, which blocks since the receiving thread
normally holds the lock. Once the lock succeeds, this tells us the worker
thread has indeed exhausted all queries that were in flight.
The sending thread now performs its synchronous database work, knowing the
database is fully coherent with all queries it sent out previously, and also
knowing the worker thread is not simultaneously accessing the connection -
since it is instead waiting for a new pointer to arrive.
If our program now wants to perform further asynchronous queries it can
simply transmit further pointers to the worker thread - which oddly enough
does not need to retake the mutex. This is what caused us many hours of
delay, because intuitively it seems obvious that once the sending thread is
done, it must release the mutex so the worker thread can retake it.
As it turns out, doing so opens a whole world of nasty race conditions which
allow synchronous queries to ‘jump the queue’ of asynchronous queries that
are in flight and have not yet arrived.
So, the sequence is that the worker thread only unlocks the mutex, while the
sending thread only locks it.
And this basically is it! So how much lines of code did we save by using the
magic of UNIX pipes? The pipe handling code takes all of 90 lines, whereas
the Distributor code of PowerDNS takes a round 300, even though it does not
offer synchronous queries, does not automatically block if too many
queries are outstanding, and most certainly couldn’t implement the sensible
wakeup ability that UNIX pipes do offer.
Oh, and you might be wondering by now, did it help? Indeed it did - our
program is now at least 20 times faster than it used to be, and there was
much rejoicing.
3 comments
Posted by bert hubert
Sun, 26 Aug 2007 16:18:00 GMT
Ok - Steorn is quieting down for now, and it got enough attention anyhow, so it is time to look a bit into the things behind the appeal of alternative energy sources.
Many readers will recall that in the past, there was debate as to when the ‘oil would run out’, and that this date was supposed to be somewhere in 2045 or so, which was more or less far enough away not to worry about it.
At least I remember thinking about it like that back in school. It is amazing how this sentiment fooled us for so long. Modern tubes of toothpaste are easy to empty down to the last bit, but in the past this wasn’t so. This should’ve told us something.
Oil is not like modern toothpaste, it is like ketchup. Far before it has run out, it becomes hard to extract. And oil is remarkably worse than ketchup.
Back in 1956, one of Shell Oil’s scientists noticed that wells started to become less productive once 50% of their contents had been extracted. He then proceeded to predict US oil production based on this assumption, and correctly calculated it would peak somewhere in the late 1960s, and decline from that point onwards. And so it did.
Additionally, he extrapolated this result to the whole world, and determined global oil production would go into decline somewhere after the year 2000.
Controversy
Nobody much liked this prediction, and it was widely ridiculed. New wells would continue to be found, and importantly, new techniques would enable us to extract more and more oil from existing wells.
As it turned out, especially this last prediction was correct, which is why the world production of oil hasn’t declined already.
However, no major new fields have been found over the past decade.
Many players in the oil industry now believe the predictions, and agree that oil production might decline from 2010 onwards, or perhaps a bit later.
Production is peaking, demand is increasing
Controversy aside, the International Energy Agency has produced graphs of oil production and demand since 1974, and it is clear that production will one day be overtaken by demand.
It is easy to see why - as it comes out of the ground, oil is not immediately suitable for all kinds of use. For many purposes, it first needs to be ‘refined’. Building a refinery is hard work, and typically takes up to a decade. Additionally, environmental rules mean that it is easily possible to spend a similar amount of time just getting permission to build.
No major refineries have been built over the past years, and no major refineries are nearing completion. The existing refineries are running at or near peak production.
On the demand side, the world economy is growing at an unprecedented clip.
Will demand exceed supply?
The few graphs that plot oil production and demand in one plot (readers, if you know of any, please comment!) typically show a ‘and then a miracle occurs’ event when demand is about to overtake supply.
This reflects the usual market behaviour that once oil becomes scarce enough, prices will rise, and oil that was hitherto uneconomical to produce becomes economically viable. In other words, exploding prices make more oil available.
But as remarked previously, refineries are already running flat out. This means that no miracle will occur in the immediate future, and oil might very well run out temporarily.
To reiterate, this does not mean the oil is gone, just that it isn’t available at the rate we need it.
And then what?
This is the scary bit, and the main reason I worry. Already we see posturing by the big oil suppliers and consumers. China is pouring money into Africa, and has even deployed part of its army in certain countries to make oil production possible.
Russia is throwing its weight around in a frightening way as well, and making it clear not all of its customers are equal. It plays geopolitics both with hydrocarbon availability and pricing.
The various armies in the Middle East speak for themselves. A peaceful Middle East produces more oil, and it might very well sell it preferably to its occupiers or sponsors.
Here in Europe, we appear to believe oil might become mighty expensive, but that we’ll weather it.
But if oil becomes truly scarce, will market prices influence who will get access to it? Or will it be supplied to those countries with the ability to project power, and back up their monetary offers with military encouragement?
Or might suppliers become king-makers, with the power to determine which economy lives or dies?
Our European belief that our ability to pay steep prices will allow us to continue as normal might be seen as exceedingly silly by then, possibly comparable to Neville Chamberlain’s appeasement policy in the 1930s.
So when will all this happen?
It is happening already, but crunch time is not yet upon us. Some countries have already had problem getting access to enough energy, mostly those who (like Europe) depend on Russian oil and gas.
The crunch might be postponed if the economy stops growing at this rate, it might be advanced if any of the major refineries is damaged by terrorism, weather or bad luck.
At any rate, the issue should start making more headlines in the near future.
What about coal, nuclear energy, wind and solar energy? Tar sands?
Some countries have already accepted that we should start building more nuclear power plants because the energy is running out. However, building such installations also takes decades, and it has been argued we’d need to open a new power plant each month or so to make up lost ground.
Coal is currently environmentally harmful or expensive, but might save part of the industry for some time.
Wind and solar, although interesting, struggle to generate an appreciable fraction of our world energy need.
Tar sands, sand that contains oil, are interesting but not for the near future. They might make Canada extremely rich though.
Further reading
Google on ‘Hubbert Peak’, and head on from there. ‘Peak oil’ is also a nice phrase to search on. The International Energy Agency has long published honest and truthful graphs that presaged the issue, but up till recently the IEA did not put this into words. Recently they’ve begone to describe the near future oil situation as ‘extremely tight’.
1 comment
Posted by bert hubert
Wed, 04 Jul 2007 22:29:00 GMT
Well, it might’ve been too good to be true.
First the announcement that internet streaming of the ‘Steorn’ device over at Kinetica Museum would start at 6PM, which was later “clarified” to mean 6PM US eastern time.
And when that time passed, nothing happened. After a while, a notice appeared that due to technical difficulties, streaming would start on July 5th.
Perhaps this is the beginning of the end for Steorn.
Update: Steorn has confirmed the device is not operating as it should, but they say they are working on it, and intend to turn on the streams tomorrow, even if the device is still not working, so we can see “stressed engineers” trying to fix it.
In one of my first posts on this enigmatic company, I mentioned the possibility of them deluding themselves, and I’m afraid the things that have happened over the past few days point in that direction.
I’d still be very happy if Steorn turned out the be on to something, but the signs are not good..
The websites of Steorn and Kinetica still promise a demo, so perhaps they simply are having problems streaming. Will keep you posted.
no comments