Posted by bert hubert
Tue, 27 Jan 2009 21:56:00 GMT
Hi everybody!
What a day! Remco van Mook and I received a message today that our RFC Draft (full text here) has entered the ‘AUTH48’ stage. This means that it has been assigned a number (RFC 5452!), and that barring meteor strikes or similar things, we are now finally done. Yay!
We spent 2 years and 9 months on this. It felt like even more. I’ve been told the draft has already made a difference in some places - from now on, DNS implementations that have certain bad spoofing behaviour MUST
clean up their act :-)
In short, had this RFC been followed, the whole Kaminsky DNS scare could have been prevented. Do note that the draft is 2 years older than Kaminsky’s discovery. The DNS community should have listened to Dan Bernstein *10* years ago.
Some more thoughts on this subject can be found here. I’m slightly bitter.
As if the RFC weren’t enough excitement for one day, I also released PowerDNS Authoritative Server 2.9.22, the first release of the authoritative server in almost 20 months. Because of this long delay, a lot of effort was spent field testing this release before it ‘went gold’ (to use an expression I really despise).
I sincerely hope we shook out most of the bugs. The PowerDNS community really delivered, and many of our enthusiastic users deployed pre-release code on their significant installations, to make sure everybody else would be able to upgrade with confidence.
Read the whole story here.
Posted in PowerDNS | 3 comments
Posted by bert hubert
Sun, 18 Jan 2009 13:03:00 GMT
This post is about an obscure corner of TCP network programming, a corner
where almost nobody quite gets what is going on. I used to think I
understood it, but found out last week that I didn’t.
So I decided to trawl the web and consult the experts, promising them to
write up their wisdom once and for all, in hopes that this subject can be
put to rest.
The experts (H. Willstrand, Evgeniy Polyakov, Bill Fink, Ilpo Jarvinen,
and Herbert Xu) responded, and here is my write-up.
Even though I refer a lot to the Linux TCP implementation, the issue
described is not Linux-specific, and can occur on any operating system.
What is the issue?
Sometimes, we have to send an unknown amount of data from one location to
another. TCP, the reliable Transmission Control Protocol, sounds like it is
exactly what we need. From the Linux tcp(7) manpage:
“TCP provides a reliable, stream-oriented, full-duplex
connection between two sockets on top of ip(7), for both v4 and v6
versions. TCP guarantees that the data arrives in order and
retransmits lost packets. It generates and checks a per-packet
checksum to catch transmission errors.”
However, when we naively use TCP to just send the data we need to transmit,
it often fails to do what we want - with the final kilobytes or sometimes
megabytes of data transmitted never arriving.
Let’s say we run the following two programs on two POSIX compliant operating
systems, with the intention of sending 1 million bytes from program A to
program B (programs can be found here):
A:
sock = socket(AF_INET, SOCK_STREAM, 0);
connect(sock, &remote, sizeof(remote));
write(sock, buffer, 1000000); // returns 1000000
close(sock);
B:
int sock = socket(AF_INET, SOCK_STREAM, 0);
bind(sock, &local, sizeof(local));
listen(sock, 128);
int client = accept(sock, &local, &locallen);
write(client, "220 Welcome\r\n", 13);
int bytesRead = 0, res;
for(;;) {
    res = read(client, buffer, 4096);
    if(res < 0) {
        perror("read");
        exit(1);
    }
    if(!res)
        break;
    bytesRead += res;
}
printf("%d\n", bytesRead);
Quiz question - what will program B print on completion?
A) 1000000
B) something less than 1000000
C) it will exit reporting an error
D) could be any of the above
The right answer, sadly, is ‘D’. But how could this happen? Program A
reported that all data had been sent correctly!
What is going on
Sending data over a TCP socket really does not offer the same ‘it hit the
disk’ semantics as writing to a normal file does (if you remember to call
fsync()).
In fact, all a successful write() in the TCP world means is that the kernel
has accepted your data, and will now try to transmit it in its own sweet
time. Even when the kernel considers the packets carrying your data to have been sent, in reality they have only been handed off to the network adapter, which in turn may transmit them whenever it sees fit.
From that point on, the data will traverse many such adapters and queues
over the network, until it arrives at the remote host. The kernel there will
acknowledge the data on receipt, and if the process that owns the socket is
actually paying attention and trying to read from it, the data will finally
have arrived at the application, and in filesystem speak, ‘hit the disk’.
Note that the acknowledgment sent out only means the kernel saw the data -
it does not mean the application did!
OK, I get all that, but why didn’t all data arrive in the example above?
When we issue a close() on a TCP/IP socket, depending on the circumstances, the kernel may do exactly that: close down the socket, and with it the TCP/IP connection.
And this does in fact happen - even though some of your data was still
waiting to be sent, or had been sent but not acknowledged: the kernel can
close the whole connection.
This issue has led to a large number of postings on mailing lists, Usenet and
fora, and these all quickly zero in on the SO_LINGER socket option, which
appears to have been written with just this issue in mind:
“When enabled, a close(2) or shutdown(2) will not return until all
queued messages for the socket have been successfully sent or the
linger timeout has been reached. Otherwise, the call returns
immediately and the closing is done in the background. When the
socket is closed as part of exit(2), it always lingers in the
background.”
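Setting the option is straightforward enough - a minimal sketch, where the 10 second timeout is an arbitrary choice of mine:

struct linger lin;
lin.l_onoff = 1;      /* linger on close() */
lin.l_linger = 10;    /* for at most 10 seconds */
setsockopt(sock, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));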
So we set this option and rerun our program. It still does not work: not all of our million bytes arrive.
How come?
It turns out that in this case, section 4.2.2.13 of RFC 1122 tells us that a
close() with any pending readable data could lead to an immediate reset
being sent.
“A host MAY implement a ‘half-duplex’ TCP close sequence, so that an
application that has called CLOSE cannot continue to read data from
the connection. If such a host issues a CLOSE call while received
data is still pending in TCP, or if new data is received after
CLOSE is called, its TCP SHOULD send a RST to show that data was
lost.”
And in our case, we have such data pending: the “220 Welcome\r\n” we
transmitted in program B, but never read in program A!
If that line had not been sent by program B, all our data would most likely have arrived correctly.
So, if we read that data first, and LINGER, are we good to go?
Not really. The close() call really does not convey what we are trying to
tell the kernel: please close the connection after sending all the data I
submitted through write().
Luckily, the system call shutdown() is available, which tells the kernel
exactly this. However, it alone is not enough. When shutdown() returns, we
still have no indication that everything was received by program B.
What we can do however is issue a shutdown(), which will lead to a FIN
packet being sent to program B. Program B in turn will close down its
socket, and we can detect this from program A: a subsequent read() will
return 0.
Program A now becomes:
sock = socket(AF_INET, SOCK_STREAM, 0);
connect(sock, &remote, sizeof(remote));
write(sock, buffer, 1000000); // returns 1000000
shutdown(sock, SHUT_WR);
for(;;) {
    res = read(sock, buffer, 4000);
    if(res < 0) {
        perror("reading");
        exit(1);
    }
    if(!res)
        break;
}
close(sock);
So is this perfection?
Well.. If we look at the HTTP protocol, data is usually sent with length information included, either at the beginning of an HTTP response, or in the course of transmitting it (so called ‘chunked’ mode).
And they do this for a reason. Only in this way can the receiving end be
sure it received all information that it was sent.
Using the shutdown() technique above really only tells us that the remote
closed the connection. It does not actually guarantee that all data was
received correctly by program B.
The best advice is to send length information, and to have the remote
program actively acknowledge that all data was received.
This only works if you have the ability to choose your own protocol, of
course.
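As an illustration only, a sketch of what such a protocol might look like - the 4-byte network-order length prefix and the single ‘K’ acknowledgment byte are arbitrary choices of mine, and short reads and writes are not handled:

#include <arpa/inet.h>   /* htonl/ntohl */
#include <stdint.h>

/* sender (program A): length prefix, payload, then wait for the ack */
uint32_t len = htonl(1000000);
write(sock, &len, sizeof(len));
write(sock, buffer, 1000000);
char ack;
if(read(sock, &ack, 1) != 1)
    fprintf(stderr, "receiver never confirmed receipt\n");

/* receiver (program B): read the length, then exactly that many bytes, then ack */
uint32_t netlen;
read(client, &netlen, sizeof(netlen));
uint32_t left = ntohl(netlen);
while(left > 0) {
    int res = read(client, buffer, left < 4096 ? left : 4096);
    if(res <= 0)
        break;                /* error or premature EOF */
    left -= res;
}
if(!left)
    write(client, "K", 1);    /* everything arrived - tell the sender so */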
What else can be done?
If you need to deliver streaming data to a ‘stupid TCP/IP hole in the wall’,
as I’ve had to do a number of times, it may be impossible to follow the sage
advice above about sending length information, and getting acknowledgments.
In such cases, it may not be good enough to accept the closing of the
receiving side of the socket as an indication that everything arrived.
Luckily, it turns out that Linux keeps track of the amount of unacknowledged
data, which can be queried using the SIOCOUTQ ioctl(). Once we see this
number hit 0, we can be reasonably sure our data reached at least the remote
operating system.
Unlike the shutdown() technique described above, SIOCOUTQ appears to be
Linux-specific. Updates for other operating systems are welcome.
The sample code contains an example of how to use SIOCOUTQ.
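For those who don’t want to dig through the sample code, the core of the technique looks roughly like this - a sketch, with a 10 millisecond polling interval that is an arbitrary choice of mine:

#include <sys/ioctl.h>
#include <linux/sockios.h>    /* defines SIOCOUTQ (Linux only) */

/* ... after write()ing everything and calling shutdown(sock, SHUT_WR) ... */
int outstanding;
for(;;) {
    if(ioctl(sock, SIOCOUTQ, &outstanding) < 0) {
        perror("ioctl(SIOCOUTQ)");
        exit(1);
    }
    if(!outstanding)
        break;          /* the remote kernel has acknowledged everything we wrote */
    usleep(10000);      /* check again in 10 ms */
}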
But how come it ‘just worked’ lots of times!
As long as you have no unread pending data, the stars and moon are aligned correctly, and your operating system is of a certain version, you may remain blissfully unaffected by the story above, and things will quite often ‘just work’. But don’t count on it.
Some notes on non-blocking sockets
Volumes of communications have been devoted to the intricacies of SO_LINGER
versus non-blocking (O_NONBLOCK) sockets. From what I can tell, the final
word is: don’t do it. Rely on the shutdown()-followed-by-read()-eof
technique instead. Using the appropriate calls to poll/epoll/select(), of
course.
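To illustrate, the tail end of program A might look roughly like this on a non-blocking socket - a sketch that assumes all data has already been written, with error handling kept to a minimum:

#include <poll.h>
#include <errno.h>

shutdown(sock, SHUT_WR);                  /* we are done writing */
for(;;) {
    struct pollfd pfd = { sock, POLLIN, 0 };
    if(poll(&pfd, 1, -1) < 0) {           /* wait until the socket becomes readable */
        perror("poll");
        exit(1);
    }
    int res = read(sock, buffer, 4096);
    if(res < 0 && errno == EAGAIN)
        continue;                         /* not actually readable yet, wait some more */
    if(res < 0) {
        perror("read");
        exit(1);
    }
    if(!res)
        break;                            /* EOF: the other side has closed its end */
}
close(sock);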
A few words on the Linux sendfile() and splice() system calls
It should also be noted that the Linux system calls sendfile() and splice()
hit a spot in between - these usually manage to deliver the contents of the
file to be sent, even if you immediately call close() after they return.
This has to do with the fact that splice() (on which sendfile() is based)
can only safely return after all packets have hit the TCP stack since it is
zero copy, and can’t very well change its behaviour if you modify a file
after the call returns!
Please note that these functions do not wait until all the data has been acknowledged; they only wait until it has been sent.
Posted in Linux, Netherlabs | 3 comments
Posted by bert hubert
Mon, 08 Dec 2008 20:51:00 GMT
Ok, I like to think a lot, and I think I know a lot. Sometimes, this leads me to conclusions. Conclusions are only interesting if they are unexpected, but “afterwards”, nothing is ever unexpected. So, to turn a conclusion into an interesting conclusion, it has to become.. a prediction.
I read an interesting book some time ago, What We Believe but Cannot Prove; read my predictions below in the spirit of that book.
I’m not 100% convinced about the four beliefs outed below - and not all of them might be very novel - but I’ve been pondering this enough that I felt I needed to write it down now, rather than having to say “I’ve been saying that all along!” later.
DNA: I believe 3 gigabases will turn out not to be enough
I no longer believe 3 gigabases of DNA are enough to generate a human being. This boils down to 750 megabytes (to read more about DNA through the eyes of a computer person, head here).
Furthermore, and there is debate about what it means, 97% of our DNA or so is considered to be ‘non-coding’. If this means the contents are not relevant for producing a homo sapiens, that means we would be down to around 22 megabytes of code.
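Spelled out, the arithmetic is simple: each base encodes one of four letters and thus 2 bits, so 3 * 10^9 bases * 2 bits = 6 * 10^9 bits, or 750 megabytes; 3% of 750 megabytes is roughly 22 megabytes.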
This is less than is involved in a simple program like the PowerDNS Nameserver - less by a long margin if we include the libraries included in this program.
So either there is more to it than DNA (other code hiding elsewhere, for example), or DNA is astoundingly concise as a language to encode biological organisms.
The latter looks exceedingly unlikely since DNA, like the programming language Python, uses “whitespace” or repetition to encode structure - and this is not very efficient.
So much so that even if all 97% of the non-coding DNA would be relevant, I still don’t believe 750 megabytes will cut it.
I believe (almost know) that cancer will turn out to be related to the “halting problem”
I wrote about this before way back in 2002. One of the central theories of computer science (if we can call it a science!) is the so called ‘Halting problem’.
Tracing its roots to the Entscheidungsproblem defined in 1928, the Halting Problem can be stated as follows: given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.
Alan Turing, to whom we owe so much (possibly even our freedom), proved that it is not possible to inspect an arbitrary algorithm and state confidently whether it will ever finish running.
This has very important implications for programmers, who would often like to know this very thing: will this program hang? Will it hog my resources? Valiant attempts have been made, but thanks to Turing, we know that we don’t even need to try: it can’t be done, except by running the program, and finding out the hard way.
Organisms face very similar problems. As outlined in the DNA for computer programmers page mentioned above, each cell can (without stretching the truth) be regarded as a computer running a host of computer programs, all simultaneously.
One of the functions a cell can perform is to clone itself. This is a vital procedure, since this is the only way to get from one fertilized egg to a whole organism. Interestingly, most human cells have divided only a few dozen times in their entire life. Since each division doubles the number of cells, this quickly leads to enough material to form a viable organism.
But this is where the problem lies. If a cell keeps on dividing, it exhausts all resources, and forms a tumor - which endangers the rest of the organism. To prevent this, the body polices its cells aggressively. A host of mechanisms are at work which inspect cells for damage to their DNA, and in turn, a damaged cell broadcasts this fact, effectively calling for its own cleanup. There is even a crude ‘division counter’ at work, which attempts to make cells burn out when they divide too much.
I believe, but cannot prove, that organisms are in the same boat as computer programmers. A cell that keeps dividing is like a runaway computer program, one that will never cease running. And no matter how hard the body tries, it is not possible to write a cast-iron solution that will allow us to detect which cells will do this and which won’t. And thus cancer is effectively the biological equivalent of the halting problem.
Obesity: I believe being hungry every once in a while is relevant
One of the big questions of our time. How come lots of people are so fat these days? Some of the obvious answers don’t appear to hold up, even though they sound plausible (that we would be eating more or moving less). Some quite new things are being investigated right now, like early exposure to certain PCBs, or Bisphenol-A.
When something big changes, and you don’t know why, it is often good to look at other things that might have happened to cause that change.
I believe one big change has been that affluent people in the “West” never experience hunger anymore. Lots of other things changed too of course, like the kind of food we eat, and how much we do by car etc.
But I think the interesting thing is that nowadays, whenever we are hungry, we immediately do something about it. In the past, breakfast, lunch and dinner were served at set times, and there was no such thing as a ‘snack’.
So I believe, but cannot prove, that it might just be that the body needs a reminder that the supply is finite (by experiencing hunger before a meal), and that the fact that this rarely happens anymore is a major reason why obesity is on the rise.
I believe that daylight will turn out to be more important than vitamin D, and that in a few years we’ll see health advisories about ‘being outdoors’.
Over the past few years, not a month has passed without some major study describing how higher serum levels of vitamin D are associated with good health. I have to choose my wording precisely, since the typical headline reads ‘Vitamin D prevents cancer’ - which is not what most researchers have been saying.
In 2008, I’ve noticed that more studies have cropped up that report that ingesting additional vitamin D does not have such stellar health benefits.
I believe, but cannot prove, that it will ultimately turn out that ‘being outdoors’ has tremendous health benefits. One side effect of being outdoors is raised serum levels of vitamin D - hence the results of earlier studies.
If this belief is correct, please start making windows that are more transparent, and take some of these health benefits indoors. Regular glass blocks large parts of the solar spectrum. For additional points, develop a solution that blocks most UV light in summer, but passes it through in winter.
And please don’t patent this. And since this idea is now online, you can’t. Ha.
Summarising
Thanks for bearing with me during this long post. If the above seems controversial, or obvious, consider what has been called “Bernal’s Ladder”, describing the four stages of any theory in the scientific community.
- It can’t be right
- It might be right, but it is not important
- It might be important, but it is not original
- It’s what I always thought myself
I would love for any of the beliefs outlined above to be at any point on this ladder.
no comments
Posted by bert hubert
Sun, 16 Nov 2008 21:21:00 GMT
1. Idea - estimates for time to completion range from 3 days to 3 weeks
2. Pretty convincing first stab: ‘look how cool this would be’
3. The Hard Slog to get something that actually works. Estimates now range from 3 months to 3 years.
4. First real users pop up, discovery is made that all assumptions were off
5. Starts to look good to the first real user
6. Elation!
7. Someone actually uses the code for real, the bugs come out in droves
8. A zillion bugs get addressed, harsh words are spoken
9. Elation!
10. The guy you had previously told that 100 million users would not ‘in principle’ be a problem actually took your word for it, and deployed it on said user base. Harsh words are spoken.
11. Fundamentals are reviewed, large fractions of the code base reworked
12. Product now actually does what everybody hoped it would do.
13. Even very unlikely bugs have cropped up by now, and have been addressed. Even rare use cases are now taken into account.
14. If a user complains of a crash at this stage, you can voice doubts about the quality of his hardware or operating system.
PowerDNS went through all these stages, and took around 5 years to do so.
Not all parts are at ‘stage 14’ yet, but for the Recursor, I seriously ask people to run ‘memtest’ if they report a crash.
The above 14 points are never traversed without users that care. For PowerDNS, step ‘4’ was performed by Amaze Internet and step ‘7’ by ISP Services. 1&1 (called Schlund back then) was instrumental in step ‘10’ when they started using it on millions of domains.
For the PowerDNS Recursor, steps ‘4’ and ‘7’ not only happened over at XS4ALL, but they also paid for it all!
Step ‘10’ occurred over at AOL and Neuf Cegetel, who together connected the Recursor to 35 million subscribers or so.
Finally, the parts of PowerDNS that have reached the end of the list above have done so because of literally hundreds if not thousands of operators that have made the effort to report their issues, or voice their wishes.
Many thanks to everybody!
Hmm, the above does not sound very professional..
I’ve heard the theory that some people think they can plan software development more professionally. I used to believe them too. But any real project I’ve heard of went through the stages listed above. No schedule, no Microsoft Project sheet, no Gantt Chart I know about ever even came close to reality.
But I’d love to be wrong, because I agree fully that it would be great if software development was more predictable.
This is especially true since the aforementioned “process” necessarily involves several very committed users, who have to voice the harsh words, but do have to stick with the project.
So please comment away if your real life experiences are different - I’d love to hear!
Posted in PowerDNS, Netherlabs | no comments
Posted by bert hubert
Thu, 18 Sep 2008 19:44:00 GMT
After too much posting on IETF mailing lists, and not achieving anything, I’ve gone back to coding a bit more.
There are two things I want to share - the first because I had a devil of a time figuring out how to do something, and I hope that posting here will help fellow-sufferers find the solution via Google. I want to talk about the second because, and this is getting to be rare, I programmed something cool, and I just need to tell someone about it.
I pondered explaining it to my lovely son Maurits (4 months old today), but I don’t want to ruin his brain.
Debugging iterators
In most programming languages there are a lot of things that compile just fine, or generate no parsing errors at runtime, but are still accidents waiting to happen.
Tools abound to expose such silent errors, usually at a horrendous performance cost. But this is fine, as errors can be found by the developer, and fixed before release.
As part of our arsenal, we have the veritable Valgrind that detects things such as reading from memory that had not previously been written to. In addition, other tricks are available, such as changing functions that ‘mostly do X, and rarely Y’ so that they always do Y. This quickly finds programs that skipped dealing with Y (which might be a rare error condition, or realloc(3) returning a new memory address for your data).
Finally, many programming environments by default perform very little checking (in the interest of speed) - for example, they will gladly allow you to compare something that points to data in collection A to something that points to collection B - a comparison that never makes sense, classical apples and oranges.
My favorite C++ compiler, G++, comes with so called ‘debugging iterators’ that are very strict - anything that is not absolutely correct becomes an error, sometimes at compile time, sometimes at runtime.
Together with Valgrind, this is one of the techniques I like to whip out when the going gets tough.
Sadly, debugging iterators (which are turned on by adding -D_GLIBCXX_DEBUG) conflict with one of my favorite C++ libraries, Boost.
To make a long story short, to compile a version of Boost with debugging iterators, issue:
$ bjam define=_GLIBCXX_DEBUG
This single line of text may not look all that important, but it took me half a day of debugging to figure this out. So if you get this error:
dnsgram.o:
(.rodata._ZTVN5boost15program_options11typed_valueISscEE[vtable for
boost::program_options::typed_value<std::basic_string<char, std::char_traits<char>,
std::allocator<char> >, char>]+0x18):
undefined reference to
`boost::program_options::value_semantic_codecvt_helper<char>::parse(boost::any&,
std::__debug::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >,
std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool) const'
Then compile your own version of boost as outlined above.
C++ Introspection & Statistics
C++ is an old-school language, perhaps the most modern language of the old school. This means that it sacrifices a lot of things to allow programs to run at stunning ‘near bare metal’ speeds. One of the things that C++ therefore does not offer is ‘introspection’.
What this means is that if you have a class called “ImportantClass”, that class does not know its own name at runtime. When a program is running, it is not possible to ask by name for an “ImportantClass” to be instantiated.
If you need this ability, you need to register your ImportantClass manually by its name “ImportantClass”, and store a pointer to a function that creates an ImportantClass for you when you need it.
Doing so manually is usually not a problem, except of course when it is. In PowerDNS, I allocate a heap (or a stack even) of runtime statistics. Each of those statistics is a variable (or a function) with a certain name.
In more modern languages, it would probably be easy to group all these variables together (with names like numQueries, numAnswers, numUDPQueries etc), and allow these statistics to be queried using their names. So, an external program might call ‘get stat numQueries’, and PowerDNS would look up the numQueries name, and return its value.
No such luck in C or C++!
So - can we figure out something smart, say, with a macro? Yes and no. The problem is that when we declare a variable in C which we want to be accessible from elsewhere in the program, it needs to happen either inside a struct or class, or at global scope. This in turn means that we can’t execute code there. So, what we’d like to do, but can’t is:
struct Statistics {
    uint64_t numPackets;
    registerName(&numPackets, "numPackets");
    uint64_t numAnswers;
    registerName(&numAnswers, "numAnswers");
} stats;
stats.numPackets is indeed available, but the line after its definition will generate an error. This is sad, since the above could easily be generated from a macro, so we could do:
DEFINESTAT(numPackets, "Number of packets received");
Which would simultaneously define numPackets, as well as make it available as “numPackets”, and store a nice description of it somewhere.
But alas, this is all not possible because of the reasons outlined above.
So - how do we separate the data structure from the ‘registerName()’ calls, while retaining the cool ‘DEFINESTAT’ feature where everything is in one place?
In C++, files can be included using the #include statement. Most of the time, this is used to include so called ‘header’ files - but nothing is stopping us from using this feature for our own purposes.
The trick is to put all the ‘DEFINESTAT’ statements in an include file, and include it not once, but twice!
First we do:
#define DEFINESTAT(name, desc) uint64_t name;
struct Statistics {
#include "statistics.h"
} stats;
This defines the Statistics struct, containing all the variables we want to make available. These are nicely expanded using our DEFINESTAT definition.
Secondly, we #undef DEFINESTAT again, and redefine it as:
#define DEFINESTAT(name, desc) registerName(&stats.name, #name, desc);
Then we insert this in a function:
void registerAllNames()
{
#include "statistics.h"
}
This will cause the same statistics.h file to be loaded again, with the same DEFINESTAT lines in there, but this time DEFINESTAT expands to a call that registers each variable, its name (#name expands to “name”), and its description.
The rest of our source can now call ‘stats.numPackets++’, and if someone wants to query the “numPackets” variable, it is available easily enough through its name since it has been registered using registerName.
The upshot of this all is that we have gained the ability to ‘introspect’ our Statistics structure, without any runtime overhead nor any further language support.
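To see the whole trick in one place, here is a minimal self-contained sketch of how it could fit together - the std::map registry, the statistics chosen and the file layout are illustrative, not the actual PowerDNS code:

// statistics.h - nothing but DEFINESTAT lines (illustrative contents)
DEFINESTAT(numPackets, "Number of packets received")
DEFINESTAT(numAnswers, "Number of answers sent")

// stats.cc - the two-pass inclusion trick
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

std::map<std::string, uint64_t*> g_registry;   // name -> address of the counter

void registerName(uint64_t* var, const std::string& name, const std::string& desc)
{
    g_registry[name] = var;                    // the description could be stored too
}

#define DEFINESTAT(name, desc) uint64_t name;  // first pass: struct members
struct Statistics {
#include "statistics.h"
} stats;
#undef DEFINESTAT

#define DEFINESTAT(name, desc) registerName(&stats.name, #name, desc);  // second pass
void registerAllNames()
{
#include "statistics.h"
}
#undef DEFINESTAT

int main()
{
    registerAllNames();
    stats.numPackets++;                        // fast, normal member access
    printf("numPackets = %llu\n",
           (unsigned long long)*g_registry["numPackets"]);   // lookup by name
}

Compile it with the two-line statistics.h next to it, and every counter is reachable both as a plain struct member and by its name.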
As stated above, more modern languages make this process easier.. but not as fast!
I hope you enjoyed this arcane coolness as much as I did. But I doubt it :-)
Posted in Linux, PowerDNS, Netherlabs | no comments
Posted by bert hubert
Mon, 04 Aug 2008 22:15:00 GMT
This post sets out to calculate how hard it is to spoof a resolver that
takes simple, unilateral, steps to prevent such spoofing.
Unilateral in this case means that any resolver can implement these steps,
without changing the DNS protocol or authoritative server behaviour.
Everybody that implements the ideas below immediately improves the general
security of the DNS.
To save you all the reading: the simple unilateral measures can bring the chance of
being spoofed down to under 1% after a year of non-stop 50 megabit/s
packet blasting.
Note however that my math or my ideas may be wrong. Please read carefully!
Work so far
Recapping, calculations so far show that a fully source port randomized
nameserver can be spoofed with a 64% chance of success within 24 hours,
requiring around 0.4TB of packets to be generated. If 2 hours are available,
the chance of success is 8.1%.
This assumes that the attacker is able to generate around 50 megabits/s, and
also important, that the resolver is able to process 50k incoming responses/s.
Details are on
http://www.ops.ietf.org/lists/namedroppers/namedroppers.2008/msg01194.html
with the caveat that where it says “1.5 - 2 GB”, it should say “36GB”.
Since that posting, quite a number of people have studied the calculations,
and they appear to hold up.
Note that the present post does not address the dangers created by those
able to actively intercept and modify traffic - people with such abilities
have little need to spoof DNS anyhow.
Situation
The current situation is not acceptable - the resources needed to perform a
successful spoofing attack are available, if not generally, then to a
relevant subset of Internet users.
It turns out that it takes a lot of effort to get the world to apply even a
minor patch that has received tremendous front-page coverage on all the
security websites - the source port randomisation patches in this case.
So if we want to move quickly, we need a solution that can be rolled out
without having to upgrade large parts of the Internet.
Agile countermeasures
There are a number of strategies a resolver could employ to make itself
effectively unspoofable; some of these include:
A) Sending out questions over TCP/IP
B) Repeating questions a number of times, and requiring the answers to be
equivalent
The problem with these two techniques is that they imply a certain overhead
and increase the general CPU utilization on the Internet.
Furthermore there are strategies that make spoofing harder, but not
impracticable:
C) Case games (‘dns-0x20’)
D) Use multiple source addresses
A very comprehensive enumeration of techniques can be found on
http://www.ops.ietf.org/lists/namedroppers/namedroppers.2008/msg01131.html
Sadly, it appears that this impressive list contains nothing without more or
less unacceptable overhead that really gets us out of the woods - where ‘out
of the woods’ is arbitrarily defined as reducing the spoofing success rate to
below 1% after a year of non-stop trying at 50 megabit/s.
Detecting a spoofing attempt in progress
Since any spoofed response has a chance of around 2^-32 of being accepted,
it stands to reason around 2^31 bogus responses will arrive at the resolver
before the attacker manages to achieve his goals.
Since we know we have effective countermeasures available, like A and B
mentioned above, we could deploy these in case a spoofing attempt is
detected. Remember that A and B are generally available, but that we don’t
want to resort to them all the time, for all domains, because of their
overhead.
Occasionally, port numbers get modified in transit. Additionally, responses
to queries sometimes arrive late enough that a new equivalent query has
since been sent. This means we should not consider a single response
mismatch to be a sign of a spoofing attempt in progress.
If we allow X mismatches before falling back on A or B, the chance of a
single query being spoofed is:
X / (N * P * I) ~= 2^-32 * X

(here N, P and I are, as in the forgery resilience calculations, the number of
authoritative nameservers, source ports and transaction IDs in play; their
product is roughly 2^32)
Assuming each individual attack lasts W seconds (W being the latency between the
authentic authoritative server and the resolver), the combined spoofing
chance after T seconds becomes:

1 - (1 - 2^-32 * X)^(T/W) ~= 2^-32 * X * T/W
Putting in 20 for X and 0.1s for W, this gets us a combined spoofing chance
of 0.4% for a full day. Interesting, but not good enough, especially since
the attacker might well send only X packets per attempt, and launch far more
attempts.
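(For the record, the arithmetic behind that 0.4%: a day contains 86400 seconds, so T/W = 86400 / 0.1 = 864000 attack windows, and 864000 * 20 / 2^32 ~= 0.004, or 0.4%.)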
However, if the attacker has a defined goal about what to spoof, each
successive attempt might be for a different domain name, or differ in other
respects, but all those attempts will share some characteristics.
Two things will be identical, or at least reasonably unique: the source
address of the spoofed packet (that is, the network address of the authentic
authoritative server), and the ‘zone cut apex’. This last bit requires some
understanding of the way a resolver works.
(I made up this phrase, ‘zone cut apex’; I think there are people with better
knowledge of DNS verbiage than I, and I’d love to hear of a better name.)
When a resolver asks NS1.EXAMPLE.COM for ‘this.is.a.long.example.com’, the
resolver knows it is asking NS1.EXAMPLE.COM in its role as ‘EXAMPLE.COM’
nameserver - this is how it selected that server.
This means that an attacker might try many variations on
‘this.is.a.long.example.com’, what will remain identical is the ‘zone cut
apex’, which is ‘EXAMPLE.COM’. What will also remain identical is the
(small) set of example.com servers available.
I’ll get into more detail after Dan Kaminsky has held his presentation. The
upshot however is that multiple different attempts can be correlated, and
thus be counted together in the spoofing-detection counts.
If we conservatively decide to impose a 5 minute ‘fallback to A or B’-regime
for a {source IP, zone cut apex}-tuple, and leave the detection limit at 20,
this means an attacker will have one chance every 5 minutes of getting in an
attempt.
This is equivalent to setting W equal to 300 seconds above, yielding us a
combined chance of spoofing a domain after a year of trying of 0.05%.
Well within our goal.
Reality intrudes
Sadly, the reality is that we won’t recognize all spoofed packets that
guessed wrong, so to speak. Typical operating systems will only let a
nameserver know about packets that arrived on a socket open for that
purpose.
In the very worst case, the server is only listening on a single port, and
by the time a single mismatch is received by the nameserver process, on
average 32000 will have arrived on the network interface.
This means that in the calculation above, if we don’t take additional
measures, we need to set X to 32000, leading to a combined monthly
spoofing chance of 6.4% (and a yearly near-certainty).
Fine tuning things
By raising the fallback period to an hour, the yearly spoofing chance
becomes 6.5%, assuming we only see 1 in every 32000 spoof attempts.
If in addition a small number of sockets is kept open, say 10, to function as
‘canary ports’, X reduces to 3200, and the yearly spoofing chance is back at
a low 0.65%.
Canary ports serve no function except to detect spoofing attempts. For
efficiency reasons, these ports may simply be ports that had previously been
used for source port randomisation, but kept around somewhat longer.
The number of canary ports, and the fallback period can be tuned at will to
achieve desired spoofing resilience levels.
Remaining Problems
Countermeasure ‘A’ does not work for domains not offering TCP service.
Countermeasure ‘B’ does not work for domains where single authoritative
servers generate differing answers to repetitive questions. This might
happen in case of (hardware) load balancers, or load balanced desynchronised
nameservers.
Best results might be achieved by alternately trying countermeasures A and B - any server that does not support TCP and sends out inconsistent replies
is in for some tough love if someone attempts to spoof the domains it hosts.
Conclusion
If the calculations and ideas elaborated above hold up, it appears feasible
to achieve arbitrarily low spoofing chances, without doing any further
protocol work.
Importantly, such changes would allow individual resolvers to protect their
users without having to wait for changes to all authoritative servers.
In other words, everybody who participates receives an immediate benefit.
1 comment
Posted by bert hubert
Wed, 09 Jul 2008 19:31:00 GMT
Yesterday it was announced that there is an unspecified but major DNS vulnerability, and that Microsoft, Nominum and ISC had fixes available.
It is amusing to note that this has been hailed as a major feat of cooperation, with the vulnerable parties spun as being part of a secret industry cabal that has just saved the world from very bad things.
To say the least, I find this a funny way of presenting things! The vulnerability is still not public, but the secret cabal shared it with me. Perhaps it is fair to say I am part of the cabal - I nearly traveled to the secret meeting at the Microsoft campus, but the imminent birth of my son made me decide not to go.
The DNS vulnerability that has been presented yesterday is indeed a very serious problem, and I am glad steps are now taken to fix the broken software that was vulnerable. Dan Kaminsky is to be praised for discovering the issue and coordinating the release.
However - the parties involved aren’t to be lauded for their current fix. Far from it. It has been known since 1999 that all nameserver implementations were vulnerable to issues like the one we are facing now. In 1999, Dan J. Bernstein released his nameserver (djbdns), which already contained the countermeasures being rushed into service now. Let me repeat this. Wise people already saw this one coming 9 years ago, and had a fix in place.
In 2006 when my own resolving nameserver entered the scene, I decided to use the same security strategy as implemented in djbdns (it is always better to steal a great idea than to think up a mediocre one!). Some time after that, I realised none of the other nameservers had chosen to do so, and I embarked on an effort to move the IETF DNS-EXT working group to standardise and thus mandate this high security behaviour.
This didn’t really go anywhere, but some months ago I noticed particularly strenuous resistance in the standardisation of the so called ‘forgery resilience draft’, and after some prodding it became clear it was felt my draft was in danger of drawing attention to the then unannounced DNS vulnerability, and that it would be best if we all shut up about it for a few months, perhaps until July 2008, so all the vendors would have had time to get their act together.
And now we’ve seen the release, and it is being hailed as great news. But it isn’t. Dan Bernstein has been ignored since 1999 when he said something should be done. I’ve been ignored since 2006. The IETF standardisation languished for two years.
This is not a success story. It has in fact been a remarkable failure.
To end on a positive note - I am very glad Dan Kaminsky’s work caused some collective eye opening, and I hope good things come from this. DNS has long lacked critical attention, and in the end this might bring about sorely needed improvements.
DNS very recently celebrated its 25th birthday - I look forward to seeing the venerable Domain Name System succeed in the coming 25 years!
Posted in PowerDNS, Netherlabs | 8 comments
Posted by bert hubert
Sat, 24 May 2008 08:37:00 GMT
18th of May, Delft, The Netherlands
Mirjam & Bert are proud to announce the birth of their son Maurits Hubert! Mother, son & father are doing very well.
Feel free to email the little guy on maurits@hubertnet.nl!
Picture when Maurits was only an hour old:
And a slightly geeky Droste Effect photo:
Posted in Life | no comments
Posted by bert hubert
Tue, 13 Nov 2007 12:42:00 GMT
Exactly one year ago today, my father passed away, less than a year after my mother did.
Here you can see them in happier times, together with the other subject of this post:
While we mourn their passing today, not all news is bad. I’m happy to announce Mirjam and I are expecting a baby!
We’re very happy, but sad we won’t be able to share the good news with my parents. But: life goes on - which is literally true in this case.
Bert & Mirjam
Posted in Life | 3 comments
Posted by bert hubert
Sun, 11 Nov 2007 10:36:00 GMT
While running the risk of turning this blog into a lecture series, I can’t
resist. This post will dive into cryptography, and I hope to be able to
transfer the sense of wonder that caught me when I first read about Diffie-Hellman
key exchange many years ago.
Let’s assume you are in a room with two other people, and that you want to
share a secret with one of them, but not with the other. In the tradition of
cryptography, we’ll call these three people Alice (you), Bob (your friend)
and Eve (who wants to ‘Eavesdrop’ on your secrets).
Let’s also assume that the room is very quiet, so you can’t whisper, and
everybody can hear what everybody else is saying. Furthermore, you are far enough away that you can’t pass paper messages.
So how could you (Alice) share a secret with Bob? Anything you want to tell
Bob, will be overheard by Eve. You might try to think up a code, but you’ll
still have to explain the code, and both Bob and Eve will hear it.
It turns out that using the magic of public key cryptography, this is
possible - sharing a secret while people are listening in.
The room with Alice, Bob and Eve is not a very relevant example, but replace
Alice by ‘The allied forces’, ‘Bob’ by a resistance fighter equipped with a
radio, and ‘Eve’ by the occupying enemy, and things start to make sense.
Or, in today’s terms, replace Bob by Amazon.com, and Eve by a hacker
interested in getting your credit card number.
So how does it work?
To send a secret, two things are needed: an ‘algorithm’ and a ‘key’. A famous
algorithm is the ‘Caesar cypher’, which consists of shifting all letters by
a fixed amount. So an A might become a B, a B would become a C etc etc.
The key in this case is how much you want to shift the letters, in the
sample above the key is ‘1’. If the key had been ‘2’, an A would’ve become a
C, a B would’ve become a D etc.
Typically, you can discuss the algorithm in public, but you need to keep the
key secret. In terms of Alice and Bob, they will be able to communicate in
secret once they’ve been able to establish a key that Eve does not know
about.
Once everybody has agreed to use the Caesar cypher, the problem shifts to
exchanging how many letters we will shift. We can’t just say this out loud,
since both Bob and Eve will hear it.
Diffie-Hellman
Way back in 1976, Whitfield Diffie and Martin Hellman published the details
of what has become known as the Diffie-Hellman key exchange algorithm,
although they both credit Ralph Merkle with some of the key ideas.
The process basically works as follows. Alice and Bob each think of a random
number, that they keep a secret. Then they both do some calculations based
on this number, and say the result of those calculations out loud.
Then both Alice and Bob combine the results of the calculations with their own
secret random number, and out pops a shared random secret number. This
shared random secret number is now known by Alice and Bob, but not by Eve.
And it is this secret that now becomes the key.
How is this possible?
Eve heard both Alice and Bob say a random number, exactly the same numbers
that Alice and Bob heard. Yet only Alice and Bob now know the shared secret.
How is this possible?
The trick lies in the calculation, by means of which Alice and Bob only shared
parts of the numbers they chose initially. Then both Alice and Bob combined
the part revealed by the other with their own full random number.
It is this trick of revealing only parts of random numbers, and then
combining the part of the other party with your full number, that leads to a
shared secret.
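For the curious, here is a toy version of that calculation in C++, using a tiny prime (23) instead of the 300-digit numbers used in real life - an illustration of the principle, certainly not something to protect secrets with:

#include <cstdint>
#include <cstdio>

// modular exponentiation: (base ^ exp) mod m, small numbers only
uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1;
    base %= m;
    while(exp) {
        if(exp & 1)
            result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

int main()
{
    const uint64_t p = 23, g = 5;         // public: everybody, including Eve, knows these

    uint64_t a = 6;                       // Alice's secret number
    uint64_t b = 15;                      // Bob's secret number

    uint64_t A = powmod(g, a, p);         // Alice says this out loud: 8
    uint64_t B = powmod(g, b, p);         // Bob says this out loud: 19

    uint64_t aliceKey = powmod(B, a, p);  // Alice combines Bob's number with her secret
    uint64_t bobKey   = powmod(A, b, p);  // Bob combines Alice's number with his secret

    printf("Alice's key: %llu, Bob's key: %llu\n",
           (unsigned long long)aliceKey, (unsigned long long)bobKey);  // both print 2
}

Eve has heard p, g, 8 and 19 go by, but without knowing a or b she cannot easily arrive at the shared 2 - and with realistically sized numbers, ‘not easily’ becomes ‘not in any reasonable amount of time’.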
Show me
On this page I wrote a very simple Diffie-Hellman example program that runs entirely within your web browser. You can either use it alone, or with a friend - which is the most fun. It works over the phone, or over an instant messenger (IRC, MSN etc). Follow the instructions, encode a message, paste it to your friend, and if your friend follows the instructions and pastes the encoded message into the decoder, he should see your secret message!
This is even more fun in a chat room with actual Eves present.
Please be aware that the sample is a joke - don’t use it to share real secrets! However, the technology it employs is real, and this truly is how people exchange keys - only the numbers are far larger (300 digits), and the actual encryption is not a Caesar cypher.
So how does it really work?
More information can be found on the wikipedia page about Diffie-Hellman, especially in the ‘external links’ section.
Posted in PowerDNS | 1 comment