The perfect error message

Posted by bert hubert Fri, 08 Sep 2006 13:36:00 GMT

Programming is a lot of things. One of them is writing good error messages. We tend to think that errors are rare and this should of course be so. However, sometimes they are not, and in that case, good reporting is vital to quickly resolve the problem.

So, even though we should make sure errors do not happen, if they do, our error messages should be top notch.

About error messages

Purpose

For operators, they are vital aids in configuring software.
For system administrators, they show which external problems need to be resolved (disk full, network down, etc).
Should a program crash, the authors often only have error messages as clues to why this happened. Many crashes are preceded by errors that are reported. A good error can help generate a bug fix.

Taxonomy of error messages

Configuration problems, for example, looking for a file in directory A while it resides in directory B;
Unavailable resources (disk full, out of memory), connectivity problems;
Should Never Happen.

Configuration problems

These commonly occur while software is being installed and set up. Good error reporting is of utmost importance here, as it serves two purposes:

1. Educating the operator about how the program functions;
2. Explaining what needs to be fixed.

Ad 1, an error like “Can’t start frobnicator because the discombulator is not running” teaches the operator that a frobnicator needs a discombulator. This knowledge can of course also be gleaned from the documentation, but in this case, repetition is a good thing.

Compare this error to “Can’t start process: Connection refused”, for example, and think about how helpful that is.

Ad 2, a report like “Can’t connect to product database using connection string ‘dbuser=john, dbname=changeme’: No such database” immediately tells the operator what he needs to know.
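As a rough sketch of how such a report can be produced, the Python below wraps a low-level failure with the action and its subject. The function name `open_product_database` and the `connect` callable are made up for this illustration; they stand in for whatever database driver a program really uses.

```python
def open_product_database(connection_string, connect):
    """Call `connect` and, on failure, say which database and which settings.

    `connect` is a hypothetical stand-in for a real database driver call;
    it is an assumption of this sketch, not an existing library function.
    """
    try:
        return connect(connection_string)
    except Exception as exc:
        # Keep the low-level reason, but add the action and its subject.
        raise RuntimeError(
            "Can't connect to product database using connection string "
            f"'{connection_string}': {exc}"
        ) from exc
```

The `from exc` part keeps the original exception attached, so a developer reading a traceback still sees the underlying failure, while the operator gets the readable message.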

Unavailable resources

These typically occur once a program is installed and running, but are nonetheless important. Do not log ‘Disk full’, but report ‘Disk full writing to ….’ so the operator knows which disk filled up. If a server could not be reached, log the IP address and possibly the name of the server. Any discrepancy between the two may point to a DNS configuration error.
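A minimal sketch of both pieces of advice in Python, using only the standard library. All four function names are invented for this example, and the message wording is just one possibility. The point is that the path, the server name and the resolved IP address all end up in the message.

```python
import socket


def describe_write_failure(path, exc):
    """Name the file being written, so the operator can tell which disk it is."""
    reason = exc.strerror or str(exc)
    return f"Error writing to '{path}': {reason}"


def append_line(path, line):
    """Append a line to `path`; on failure, raise an error that names the path."""
    try:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(line + "\n")
    except OSError as exc:
        raise RuntimeError(describe_write_failure(path, exc)) from exc


def describe_connect_failure(name, ip, port, exc):
    """Report both the configured name and the IP address actually tried."""
    reason = exc.strerror or str(exc)
    return f"Can't connect to server '{name}' ({ip}), port {port}: {reason}"


def connect_to_server(name, port, timeout=5.0):
    """Resolve `name` first, so the address can be logged if connecting fails."""
    try:
        infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise RuntimeError(
            f"Can't resolve server name '{name}': {exc.strerror}"
        ) from exc
    # Only the first resolved address is tried in this sketch.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except OSError as exc:
        sock.close()
        raise RuntimeError(
            describe_connect_failure(name, sockaddr[0], port, exc)
        ) from exc
    return sock
```

Resolving the name separately from connecting is what makes it possible to print the IP address next to the name, which is exactly where a DNS mix-up becomes visible.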

Out of memory is typically very hard to deal with, except when something really odd was going on. A typical example is trying to allocate a ridiculous amount of memory because of an earlier error, in which case logging what memory was being allocated for might help debug the problem. It typically will not help the user of a program.

Should never happen

These are errors of the program itself. Programmers quite often test for impossible conditions, especially if they sense these might conceivably happen one day. An example might be a module that can only connect to IPv4 servers that gets confronted by an IPv6 socket, which in turn causes calls that determine the remote IPv4 address to fail.

It is tempting to be quite rude in these messages, or say stuff like “should never happen!!”, but resist these urges. One day a ‘should not happen’ error is going to be a vital debugging clue. These errors are rare, but it pays to go, well, the extra few yards to perhaps report “unexpected address family accepting connection!”.
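To make the IPv4 example concrete, here is a small Python sketch of such a check. The helper name `remote_ipv4_address` is hypothetical; the idea is only that the ‘impossible’ branch reports what was actually seen.

```python
import socket


def remote_ipv4_address(sock):
    """Return the peer's IPv4 address for a connected or accepted socket."""
    if sock.family != socket.AF_INET:
        # This "should never happen" in an IPv4-only module, so report what
        # was actually seen instead of only that something went wrong.
        family = getattr(sock.family, "name", sock.family)
        raise RuntimeError(
            "Unexpected address family accepting connection: "
            f"got {family}, expected AF_INET"
        )
    host, _port = sock.getpeername()
    return host
```

Including the family that was actually encountered costs one extra line, and it is the first thing anyone debugging the report will want to know.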

Guidelines

An error message should contain:

1. Who is reporting the error (which subsystem, which program, which module)
2. The action that failed
3. The subject of the action (a directory, a server, a port number, an IP address)
4. As good an indication of the actual error as possible.
5. Optional - what the program is doing about it

An excellent error message is:

Webserver can't serve page, error opening file '/var/www/index.html': Permission denied, reporting HTTP 404 error
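The five parts listed above can be sketched as a small Python data structure. The class `ErrorReport` and its field names are invented for this illustration, and the exact wording of the template is only one way to glue the parts together.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorReport:
    who: str                        # subsystem, program or module reporting
    action: str                     # the action that failed
    subject: str                    # what the action was applied to
    reason: str                     # the actual error, as text
    response: Optional[str] = None  # optional: what the program does about it

    def __str__(self):
        text = f"{self.who} can't {self.action}, error {self.subject}: {self.reason}"
        if self.response:
            text += f", {self.response}"
        return text


report = ErrorReport(
    who="Webserver",
    action="serve page",
    subject="opening file '/var/www/index.html'",
    reason="Permission denied",
    response="reporting HTTP 404 error",
)
print(report)
```

Making the first four fields mandatory means a message without a subject or a reason cannot be constructed by accident, while the response stays optional.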

Ad 1, it is tempting to include filenames, function names, and line numbers here. OpenSSL does this a lot, for example. However, almost none of your intended audience will be able to extract useful information from the fact that the error occurred on line 123 of ‘multiplexer.c’. Make sure the module means something to the operator. It may be as simple as the name of your program.

Ad 2, this helps the operator determine if this error might be the explanation of observed problems. An error like “Webserver failed to increase TCP buffer size, continuing with default” can immediately be ruled out as an explanation for why people can’t log in to their mail.

Ad 3, an error like “Can’t open file” on its own can mean many things. One possibility is that the program is not reading the configuration file you think it is, and is trying to open your index.html in ‘/usr/local/www’ rather than in ‘/var/www’, where you thought you had configured it.

Ad 4, self-explanatory. Take the trouble to convert error codes into strings. Many programmers may know that ‘errno = 111’ means ‘Connection refused’, but don’t count on your users knowing this.
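In Python, for instance, the standard library already does the conversion. A minimal sketch, where `describe_errno` is a made-up helper name; note that the numeric value of a given error differs between platforms (111 is the Linux value for ‘Connection refused’), which is one more reason not to show only the number.

```python
import errno
import os


def describe_errno(code):
    """Turn a numeric errno into readable text, keeping the number for reference."""
    name = errno.errorcode.get(code, "unknown errno")
    return f"{os.strerror(code)} ({name}, errno = {code})"


print(describe_errno(errno.ECONNREFUSED))
```

The symbolic name and the number are kept in parentheses because they are still handy for searching, but the text comes first.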

Ad 5, this is a fine counterpoint to item 4. “Giving a 404” is very clear for any operator of a web server. Not all errors need a followup, so reporting what the program is doing about the error is not mandatory.

Conclusion

Good error messages can save your users many days of problems. And, surprisingly, you might yourself gain even more time - how well do you know the internals of your program after a few months?

So please, please, both as a user and a programmer, I ask you, spend time on error messages!

Posted in Linux, PowerDNS, Netherlabs | 12 comments
