If you are reading this ...

The magic internet fairies have updated DNS entries, rsync is syncing, Apache is re-writing and CGI is gatewaying, and with some shoelaces and string everything is holding together.

In case anyone reading has not been calling me because they fear for my health should I have to pick up my phone -- rest assured that thanks to a new NICTA initiative my handset can now double as an operating theatre.

Telephone disinfectant, front and back

Investigating Byzantine Failure

I was at the airshow watching a C-17 Globemaster do its thing and the guy giving the spiel says how it is a completely fly-by-wire system, with quadruple redundancy. I thought to myself "that's a lot of redundancy" as it zoomed overhead.

As it happened, we were talking about the Byzantine generals on Thursday, and it finally sunk in.

The Byzantine Generals problem describes reaching consensus in a distributed system. The original paper gets pretty hairy, but the results are worth remembering.

The problem states there are a number of generals deciding if they should attack the enemy. Communication between the generals is considered secure (in the simplest case), but the allegiance of each general is unknown. Loyal generals obviously tell the truth, but a disloyal general can either lie or tell the truth. A majority of the generals must concur; divided they cannot defeat the enemy.

The implications for a distributed system are fairly clear. Reliable communication might be as simple as TCP/IP, and although computers aren't going to lie, they may be faulty and not tell you the right thing. If that's the difference between flying above a mountain and flying into it, you want to be sure you get it right.

If there are just two of us, and one of us might be a traitor, we cannot decide on a course of action. If I say the height is 1000 feet and you say it is 2000 feet, there is no real way to decide who is right.

Byzantine Generals

From the picture above, we can also see that if there are three of us, with one of us possibly disloyal, we cannot ensure consistency. In the first case, the loyal general knows that one of the others might be disloyal, but can't rely on the other lieutenant being loyal, because he may have seen conflicting messages and might go either way. He can't be sure whether the result will be an attack or a retreat, so the situation is hopeless.

In the other case, the general is trying to confuse matters. He knows he is disloyal, so the other two are loyal. But this means he simply puts the system into an inconsistent state, where one lieutenant might go the other way. Sounds a bit like something out of an episode of 24!

The paper shows that you can reduce all problems like this to some simple rules. To handle n faulty nodes, it says you require the following (a toy simulation of the simplest case follows the list):

  1. 3n+1 nodes in total
  2. 2n+1 independent communication channels
  3. n+1 rounds of communication
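
To make the n = 1 case concrete, here's a toy simulation of one "oral messages" round structure. It is entirely my own sketch, not code from the paper: four nodes (a commander plus three lieutenants), the commander is the single traitor, and communication is reliable.

    /* Toy Byzantine agreement for n = 1: 3n+1 = 4 nodes, n+1 = 2 rounds.
     * The traitorous commander sends conflicting orders; the loyal
     * lieutenants relay honestly and take a majority vote. */
    #include <stdio.h>

    #define LIEUTENANTS 3
    enum { RETREAT = 0, ATTACK = 1 };

    int main(void)
    {
        int received[LIEUTENANTS];

        /* Round 1: the traitorous commander sends conflicting orders. */
        for (int i = 0; i < LIEUTENANTS; i++)
            received[i] = (i % 2) ? ATTACK : RETREAT;

        /* Round 2: loyal lieutenants relay what they heard, so each one
         * ends up seeing the same three values; majority vote decides. */
        int votes[2] = { 0, 0 };
        for (int i = 0; i < LIEUTENANTS; i++)
            votes[received[i]]++;
        int decision = votes[ATTACK] > votes[RETREAT] ? ATTACK : RETREAT;

        for (int i = 0; i < LIEUTENANTS; i++)
            printf("lieutenant %d decides: %s\n", i,
                   decision ? "attack" : "retreat");
        return 0;
    }

Despite the traitor, all three loyal lieutenants settle on the same order; with only three nodes there is no such vote that works, as the picture above showed.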

It turns out this is a real problem in real life and in real aeroplanes; the best paper I found was "The Real Byzantine Generals", Driscoll, Hall, Paulitsch, Zumsteg, Digital Avionics Systems Conference, Salt Lake City, October 2004 (via IEEE Xplore if you have it). It mentions how the "anthropomorphic" basis for the problem may make many think they are immune to it, which is an interesting point.

So although 4 redundant systems sounds like a lot, it's just barely enough if you want the systems to have any sort of consensus.

Software Engineering via Balkanisation

We were talking today about the relative ease the Darbat guys have experienced moving Darwin (the core of OS X) to L4. Chuck noted that major parts of IOKit, Mach and BSD pull apart more easily than one might expect.

Luckily we had an ex-Apple IOKit architect there, who noted that one large factor behind the good software engineering was that none of those three groups ever talked to each other, sometimes even to the point of "I'm a BSD guy, I don't talk to the IOKit guys".

It got me thinking that high impedance between groups working on tightly integrated components is probably a good thing. A good example might be how none of the projects to port Linux on top of L4 have ever really stuck; they fall to bit-rot from non-maintenance (of course Wombat does not fall into this category! :).

Maybe the problem is that the Linux community talks too much, and this leads developers to become too familiar and too willing to cross domain boundaries (this has been studied and shown to be a problem in papers such as "Maintainability of the Linux Kernel", Schach et al.).

Of course prominent kernel developers are not keen on committing to APIs, for various (good) reasons. Clearly a stable ABI is unworkable, and perhaps the impedance for making core changes is already quite high, since you've got to get a change past a raft of maintainers before it goes anywhere. But alternatively, perhaps everyone giving their 2c worth and trying to couple their work as closely as possible causes more boundary crossings than might otherwise happen: "don't do that there, we'll tweak this subsystem to do that and move that bit up a level behind a new interface", etc. You want systems to be close, but not so close that they can never be untangled again.

So, is making your teams hate each other a valid software engineering approach?

IA32 segments and fast system calls

I've previously mentioned the gate page on Linux, but I thought a little more information might be useful.

The reason why system calls are so slow on x86 goes back to the ingenious but complicated segmentation scheme used by the processor. The original reason for segmentation was to be able to use more than the 16 bits available in a register for an address, as illustrated below.

Segmentation with 64KiB segments
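
The arithmetic behind that figure is simple: the 16-bit segment value is shifted left four bits and added to the 16-bit offset, yielding a 20-bit physical address. A quick sketch of the calculation:

    /* Real-mode x86 address formation: 16-bit segment and 16-bit offset
     * combine into a 20-bit physical address. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t segment = 0x1234, offset = 0x0010;
        uint32_t physical = ((uint32_t)segment << 4) + offset;

        printf("%04x:%04x -> %05x\n", segment, offset, physical); /* 12350 */
        return 0;
    }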

When x86 moved to 32-bit registers, the segmentation scheme remained, but in a different format.

Rather than fixed segment sizes, segments are allowed to be any size. This means the processor needs to keep track of all these different segments and their sizes, which it does with descriptors. The segment descriptors available to everyone are kept in the global descriptor table, or GDT for short. Each process has a number of segment registers which point to entries in the GDT; these are the segments the process can access (there are also local descriptor tables, and it all interacts with task state segments, but that's not important now).

The situation is illustrated below.

Segmentation in a 32 bit x86 system

Since the processor knows what segments of memory the currently running process can access, it can enforce protection and ensure the process doesn't touch anything it is not supposed to. If it does go out of bounds, you receive a segmentation fault, which most programmers are familiar with.

The interesting bit comes when you want to make calls into code that resides in another segment. To implement a secure system, we can give segments a certain permission value. x86 does this with rings, where ring 0 is the highest permission, ring 3 is the lowest, and inner rings can access outer rings but not vice-versa.
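
You can actually see the ring you're running in from userspace: the low two bits of a segment selector are the requested privilege level (RPL), the next bit selects GDT versus LDT, and the rest is the table index. A quick sketch using plain GCC inline assembly:

    /* Decode the current code segment selector from %cs. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t cs;

        asm("mov %%cs, %0" : "=r"(cs)); /* current code segment selector */

        printf("selector:  0x%04x\n", cs);
        printf("  index:   %d\n", cs >> 3);
        printf("  table:   %s\n", (cs & 4) ? "LDT" : "GDT");
        printf("  RPL:     %d\n", cs & 3); /* 3 == userspace */
        return 0;
    }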

Like any good nightclub, once you're inside "club ring 0" you can do anything you want. Consequently there's a bouncer on the door, in the form of a call gate. When ring 3 code wants to jump into ring 0 code, you have to go through the call gate. If you're on the door list, the processor gets bounced to a certain offset of code within the ring 0 segment.

This allows a whole hierarchy of segments and permissions between them. You might have noticed that a cross-segment call sounds exactly like a system call. If you've ever looked at Linux x86 assembly, you'll know the standard way to make a system call is int 0x80, which raises interrupt 0x80. An interrupt stops the processor and goes to an interrupt gate, which then works the same as a call gate -- it bounces you off to some other area of code, assuming you are allowed.
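
For the flavour of it, here's a minimal sketch of making a write() system call via int 0x80 by hand. It assumes a 32-bit x86 Linux target (build with gcc -m32 on a modern box):

    /* write(1, msg, len) via int 0x80: eax holds the system call number,
     * ebx/ecx/edx the arguments; the result comes back in eax. */
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "hello via int 0x80\n";
        long ret;

        asm volatile("int $0x80"
                     : "=a"(ret)                 /* return value in eax */
                     : "a"(4),                   /* __NR_write on i386 */
                       "b"(1),                   /* fd = stdout */
                       "c"(msg),
                       "d"(sizeof(msg) - 1)
                     : "memory");
        return ret < 0;
    }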

The problem with this scheme is that it is slow. It takes a lot of effort to do all this checking, and many registers need to be saved to get into the new code. And on the way back out, it all needs to be restored again.

On a modern system, the only thing that really happens with all this fancy segmentation switching is system calls, which essentially switch from ring 3 (userspace) to ring 0 and jump to the system call handler code inside the kernel. Thus sysenter (and sysexit to get back) speeds up the whole process over an int 0x80 call by removing the general nature of a far call (i.e. the possibility of going to any segment, at any ring level) and restricting you to ring 0 code at a specific segment and offset, as stored in registers.

Because there are so many fewer options, the whole process can be sped up, and hence we have a fast system call. The other thing to note is that state is not preserved when the kernel gets control. The kernel has to be careful not to destroy state, but it also means it is free to save only as little state as is required to do the job, so it can be much more efficient about it. If you're thinking "wow, that sounds like insert favourite RISC processor", you're probably right.
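
You can peek at the plumbing from a 32-bit process: rather than hard-coding int 0x80 or sysenter, the kernel passes userspace the address of its __kernel_vsyscall entry point (which uses sysenter where the CPU supports it) in the AT_SYSINFO auxiliary-vector entry. A little sketch, assuming a 32-bit build and a glibc new enough to provide getauxval():

    /* Print where the kernel's preferred system call entry lives. */
    #include <sys/auxv.h>
    #include <elf.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long vsyscall = getauxval(AT_SYSINFO);

        if (!vsyscall) {
            puts("no AT_SYSINFO entry (64-bit process, perhaps?)");
            return 1;
        }
        printf("__kernel_vsyscall is at %#lx\n", vsyscall);
        return 0;
    }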

The x86 truly is a weird and wonderful machine!

Commodore 64 and BASIC

I just came across an article on Salon lamenting the death of BASIC called "Why Johnny can't code". It reminded me of the following photo, which clearly shows why I can code.

Ian and a Commodore 64

I don't know the exact date, but the fluro shoelaces give a pretty good time-frame. Thanks to my Dad for firstly buying the Commodore 64, and then scanning in a bunch of old photos 15 years later!

Kernel Process Walker

If you're like me, you probably say things like "I wonder how the page tables of every process in the system are populated at the moment" a lot each day. Generally it comes down to "do something for each process running, and print it out".

I think my best effort yet is pview.c (with a Makefile to help build the module). It should also act as a good template for anyone wanting to use either debugfs or the seq_file interface with a queue, something which I have previously found lacking. I have found debugfs and seq_file a good combination; much easier to use than relayfs for non-performance critical medium to large output volumes from the kernel.

Just remember to take the right locks before you go poking around inside processes. Obviously this will perturb the system, but it is often handy for those "what if" questions.
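
For a flavour of the shape of such a module, here's a minimal sketch with invented names (nothing here is from the original pview.c). It uses the simpler single_open() seq_file variant rather than the full iterator-with-a-queue approach, and assumes a reasonably recent kernel for the header locations:

    /* A debugfs file that walks every process and prints pid and comm. */
    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/debugfs.h>
    #include <linux/seq_file.h>
    #include <linux/sched/signal.h>

    static int tasks_show(struct seq_file *m, void *unused)
    {
            struct task_struct *task;

            rcu_read_lock();        /* the "right lock" for a task walk */
            for_each_process(task)
                    seq_printf(m, "%d\t%s\n", task->pid, task->comm);
            rcu_read_unlock();
            return 0;
    }

    static int tasks_open(struct inode *inode, struct file *file)
    {
            return single_open(file, tasks_show, NULL);
    }

    static const struct file_operations tasks_fops = {
            .owner   = THIS_MODULE,
            .open    = tasks_open,
            .read    = seq_read,
            .llseek  = seq_lseek,
            .release = single_release,
    };

    static struct dentry *dir;

    static int __init tasks_init(void)
    {
            dir = debugfs_create_dir("taskwalk", NULL);
            debugfs_create_file("tasks", 0444, dir, NULL, &tasks_fops);
            return 0;
    }

    static void __exit tasks_exit(void)
    {
            debugfs_remove_recursive(dir);
    }

    module_init(tasks_init);
    module_exit(tasks_exit);
    MODULE_LICENSE("GPL");

After insmod, the output appears under /sys/kernel/debug/taskwalk/tasks, assuming debugfs is mounted in the usual place.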

Rusty Russell's "Hard to Misuse" rules

A helpful comment pointed me to Rusty Russell's Hard to Misuse Interface Levels rules (a play on "easy to use").

I've reproduced them here in one place, since I always seem to want to refer to them. See the original for fun examples. Higher in the list is better; a toy illustration of level 2 follows the list.

  1. Impossible to get wrong
  2. Compiler/linker won't let you get it wrong
  3. Compiler/linker warns if you get it wrong
  4. Simplest use is correct
  5. The name tells you how to use it
  6. Do it right or breaks at runtime
  7. Follow the convention and you will get it right
  8. Read the documentation and you will get it right
  9. Read the implementation and you will get it right
  10. Read a mail list thread and you will get it right
  11. Read the documentation and you will get it wrong
  12. Follow the convention and you will get it wrong
  13. Do it right and it will break at runtime
  14. The name tells you how not to use it
  15. The obvious use is wrong
  16. Compiler/linker will warn you if you get it right
  17. Compiler/linker won't let you get it right
  18. Impossible to get right
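
As that toy illustration of level 2 ("compiler/linker won't let you get it wrong") -- entirely my own invention, not one of Rusty's examples -- wrap raw numbers in distinct struct types and the compiler simply refuses the wrong call:

    /* Distinct wrapper types turn a unit mix-up into a compile error. */
    #include <stdio.h>

    struct seconds { long val; };
    struct metres  { long val; };

    static struct seconds seconds(long s) { return (struct seconds){ s }; }
    static struct metres  metres(long m)  { return (struct metres){ m };  }

    static void wait_for(struct seconds s)
    {
        printf("waiting %ld s\n", s.val);
    }

    int main(void)
    {
        wait_for(seconds(5));      /* fine */
        /* wait_for(metres(5)); */ /* refuses to compile: wrong type */
        return 0;
    }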

Getting a tick in LaTeX

So you need to show something is good in your LaTeX document, and want a tick (or a cross)? The pifont package gives you access to ZapfDingbats, which has lots of neat characters as shown in the documentation (along with dinglists and other nifty features).

So, after \usepackage{pifont} a quick \newcommand{\tick}{\ding{52}} gives you \tick for all your ticking needs!
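
Putting it together, a minimal complete document might look like this; \ding{55} is one common choice for a matching cross:

    \documentclass{article}
    \usepackage{pifont}
    \newcommand{\tick}{\ding{52}}
    \newcommand{\cross}{\ding{55}}
    \begin{document}
    Works \tick \quad Broken \cross
    \end{document}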

The MarvoSym package also has some Wingdings-type things, like laundry instructions (?).

File this under the "easy when you know how, but otherwise takes you at least half an hour" category...