Spot the bug!

See if you can spot the bug in this code!

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    union {
        unsigned char raw[8];
        struct {
            char one;
            int two;
            char three;
            char four;
            char five;
        } formatted __attribute__((packed));
    } test;

    printf("one   : %d\n", (int)&test.formatted.one - (int)&test);
    printf("two   : %d\n", (int)&test.formatted.two - (int)&test);
    printf("three : %d\n", (int)&test.formatted.three - (int)&test);
    printf("four  : %d\n", (int)&test.formatted.four - (int)&test);
    printf("five  : %d\n", (int)&test.formatted.five - (int)&test);
    return 0;
}
$ gcc -Wall  -o packing packing.c
$ ./packing
one   : 0
two   : 4
three : 8
four  : 9
five  : 10

Here's the relevant bit from the gcc manual:

For an enum, struct or union type, you may specify attributes either between the enum, struct or union tag and the name of the type, or just past the closing curly brace of the definition. The former syntax is preferred.

By getting the packed attribute in the wrong place (if it's not clear, it should be before formatted), it is applied to the variable rather than the type. The compiler (both gcc and icc do this) has already laid out the structure, so misses the packing directive, and unfortunately doesn't warn (that may be a bug?).

I can tell you from experience this can be hard to track down!

Non-thread safe C

Recently it has come up on the LKML and other blogs about how C is not thread safe. The nice piece of example code given is

int trylock()
{
    int res;

    res = pthread_mutex_trylock(&mutex);
    if (res == 0)
            ++acquires_count;

    return res;
}

At first glance it seems OK, and I wasn't exactly sure what the problem was until I pulled it apart. GCC is well within its rights to produce the following code (annotated for clarity).

push   %ebp                     ; setting up stack frame
mov    %esp,%ebp                ; save esp to ebp
sub    $0x8,%esp                ; make some room on stack
movl   <mutex>,(%esp)           ; copy mutex to stack
call   <pthread_mutex_trylock>  ; call pthread_mutex_trylock
mov    <acquires_count>,%edx    ; copy acquires_count to %edx
cmp    $0x1,%eax                ; set carry flag if result == 0
adc    $0x0,%edx                ; add with carry
mov    %edx,<acquires_count>    ; write back acquires_count
leave                           ; release stack frame
ret                             ; done!

This code uses a bit of a trick to avoid a branch around code. It uses the carry flag, which is simple enough to understand when adding. Consider if you are adding 0001 + 1111 you end up with 11111, where the top bit is the carry bit. You can then feed this carry bit into another part of the addition operation if you were, say, using two 32 bit numbers to represent a 64 bit number on a 32 bit system.

The trick uses the dual nature of the carry flag as a "borrow" bit; if a < b when a - b then the bit is set. Consider the simplest case of 0 - 1 -- the carry bit is set to indicate to the next operation that a bit needs to be borrowed. Imagine for a moment we have a four bit machine and want to do the 8-bit unsigned operation

1111 0000 -   (240) -
0000 0001     (  1)
==== ====

First we do 0000 - 0001 which leaves us 1111 with the borrow bit set. Then we do 1111 - 0000 but take note that the borrow bit is set, so get rid of the borrowed bit, leaving us 1110. Putting the two together gives us 1110 1111 or 239.

Therefore the cmp $0x1,<result> instruction (which is really just result - 1 with no output but setting flags) can only set the carry flag (i.e. borrow a bit) if the value is 0; any other value in result means there will be no need to borrow and the carry flag is not set.

Consequently the add with carry (adc) will add one to the result only if the carry bit was set. A neat trick which avoids a jump, but not so much thread safe: if you are interrupted after the cmp but before the mov store you can write back the wrong value. Most people (including me till 5 minutes ago) looking at the code would assume that it was safe on the theory that acquires_count is not touched if you don't have the lock because it is inside the if statement.

Interestingly, the Intel compiler does the simpler jump method

push   <mutex>                     ; push argument
call   <pthread_mutex_trylock>
pop    %ecx                        ; get return addr on top
test   %eax,%eax                   ; AND %eax,%eax
jne    <trylock+0x16>  --------+   ; if that wasn't 0, jump out
addl   $0x1,<acquires_count>   |
ret                    <-------+   ; done!

Frankly I have no idea which one is faster/better (I would guess the second one because you don't always dirty a cache line, but you'd have to measure that against the cost of the branch) but obviously GCC thinks one thing and Intel thinks another and I bet if you put them in a room together you would end up with 3 opinions! I think the major point is that once upon a time it was mostly people writing kernels doing things like accessing I/O ports or other memory mapped areas that needed to be very careful the compiler didn't optimise out (or in) something incorrectly. With multi-threaded C code that problem moves up to the application developer too.

A different instruction set can help here. The Itanium version is possibly less prone to this sort of thing because predication makes it faster to implement conditional branches -- any instruction with a predicate that is not set is skipped. The load and add is safe because it will only be done with the lock held. However, if anyone is wondering why Itanium come with such big caches, the difference between the 6 line version and the Itanium version below should answer that.

[MII]       alloc r32=ar.pfs,4,3,0               ; alloc stack frame
            mov r33=b0                           ; return pointer
            mov r34=r1
[MMI]       addl r14=72,r1;;                     ; r11 = &mutex from GOT
            ld8 r11=[r14]                        ;
            nop.i 0x0;;
[MIB]       mov r35=r11                          ; put into argument 1
            nop.i 0x0
            br.call.sptk.few b0=<pthread_mutex_trylock>
[MII]       cmp4.eq.unc p0,p6=r8,r0              ; if result !0, set predicate p6
            mov.i ar.pfs=r32
            mov r1=r34;;                         ; load acquires_count from GOT
[MIB]       addl r10=144,r1                      ;
            nop.i 0x0
      (p06) br.cond.dptk.few <trylock+0x80>;;    ; jump to [1] if p6 set
[MII]       ld4 r9=[r10]
            mov b0=r33
            addl r3=144,r1;;
[MMI]       adds r2=1,r9;;                       ; acquires_count++
            st4 [r3]=r2                          ; save updated value
            nop.i 0x0
[MIB]       nop.m 0x0                            ; [1]
            nop.i 0x0
            br.few <trylock+0x90>;;              ; jump to [2]
[MII]       nop.m 0x0
            mov b0=r33
            nop.i 0x0;;
[MIB]       nop.m 0x0                            ; [2]
            nop.i 0x0
            br.ret.sptk.many b0;;

I guess the moral of the story is that multi-threading is hard, but we knew that anyway.

On incorrectly utilising "utilising"

My supervisor pointed out a very annoying habit I had formed when writing technical documents, best summed up below:

"It would be difficult to find a page of governmental, military, or academic writing that doesn't have on it the word utilize. It must be one of the most over-utilized words in the world. It seems as though people out to impress people with the significance of what they're doing use utilize when they should use use.

Utilize is not an elegant variation of the word use; it has its own distinct meaning. When you utilize something, you make do with something not normally used for the purpose, e.g., you utilize a dime when the bloody screwdriver is nowhere to be found. If the screwdriver were there, you'd use it, not utilize a stupid dime for the purpose. Use use when you mean use, and utilize only when it's properly used to mean--to use something not normally used. The computer went off-line, so they utilized Mr. Wang's abacus, the one he liked to use. Despite the temporary breakdown, the computer's use-rate was up (not its utilization-rate)."

Cheney, "Getting the words right" (1983)

You will now probably notice this subtle verbiage in most everything you read!

Software Engineering via Balkanisation

We were talking today about the relative ease of moving Darwin (the core of OS X) to L4 the Darbat guys have experienced. Chuck noted that major parts of IOKit, Mach and BSD pull apart more easily than one might expect.

Luckily we had an ex Apple IOKit architect there, who noted that one large factor behind the good software engineering arose because none of those three groups ever talked to each other, sometimes even to the point of "I'm a BSD guy, I don't talk to the IOKit guys".

It got me to thinking that high impedance between groups working on tightly integrated components is probably a good thing. A good example might be how none of the projects to port Linux on top of L4 have ever really stuck and fallen to bit-rot from non-maintenance (of course Wombat does not fall into this category! :).

Maybe the problem is that the Linux community talks too much, and this leads developers to becoming too familiar and too willing to cross domain boundaries (this has been studied and shown to be a problem in papers such as Maintainability of the Linux Kernel, Scach et al.).

Of course prominent kernel developers are not keen on committing to APIs for various (good) reasons. Clearly a stable ABI is unworkable, and perhaps the impedance levels for making core changes are already quite high because you've got to get it past a raft of maintainers before it goes anywhere. But alternatively perhaps everyone giving their 2c's worth and trying to work together to couple their work as closely as possible causes more boundary crossings than otherwise might happen. "Don't do that there, we'll tweak this subsystem to do that and move that bit up a level behind a new interface" etc. -- you want systems to be close, but not so close that they can never be untangled again.

So, is making your teams hate each other a valid software engineering approach?

Rusty Russell's "Hard to Misuse" rules

A helpful comment pointed me to Rusty Russell's Hard to Misuse Interface Levels rules (a play on "easy to use").

I've reproduced them here in one place, since I always seem to want to refer to it. See the original for fun examples. Higher is better.

  1. Impossible to get wrong
  2. Compiler/linker won't let you get it wrong
  3. Compiler/linker warns if you get it wrong
  4. Simplest use is correct
  5. The name tells you how to use it
  6. Do it right or breaks at runtime
  7. Follow the convention and you will get it right
  8. Read the documentation and you will get it right
  9. Read the implementation and you will get it right
  10. Read a mail list thread and you will get it right
  11. Read the documentation and you will get it wrong
  12. Follow the convention and you will get it wrong
  13. Do it right and it will break at runtime
  14. The name tells you how not to use it
  15. The obvious use is wrong
  16. Compiler/linker will warn you if you get it right
  17. Compiler/linker won't let you get it right
  18. Impossible to get right

The Science of Scientific Writing

The Science of Scientific Writing (Gopen and Swan, 1990) is a nice paper for anyone who has to write anything technical.

Since we read English left to right, we come to expect context at the start of our informational blocks (sentence) and data at the end. Messing with this scheme tends to confuse the reader, making them work very hard to understand your point. Some general rules from the paper I thought I might write down:

  • Subjects and verbs should be together. Don't stretch out the sentence with extraneous information between what you are talking about and what you are saying about it. e.g.

    The cache, a 128 entry, dual ported, 4-way set associative design with on core tags but off core data, was found to be the major bottleneck of the system.

    The cache is the subject, and was found to be is the verb. They should not be separated by all the other stuff.

    The cache was a 128 entry, dual ported, 4-way set associative design, with on core tags but off core data. It was found to be the major bottleneck of the system.

  • Similarly, discard material which does not discern significant connections. As in the above example, it is quite easy to get into a habit of putting largely irrelevant information right in the middle of a sentence.

  • The stress position is the area around the end of a sentence. When reading, we tend to want the opening of the sentence to lead us in gently, reaching the big conclusion at the end. We then pause briefly, think about it, and move onto the next sentence. By making sure the stress position stresses the right information, we help understanding.

    10% of time was spent handling TLB refills in our flablah benchmark, which simulates a transaction processing workload.

    The above sentence hits us with the important information first, leaving us to finish the sentence and then wonder "what was that figure again". It could become

    Our flablah benchmark simulates a transaction processing workload. For this benchmark, we found that 10% of time was spent handling TLB refills.

    By keeping worthy material in stress positions we can make things much easier to read.

  • The first part of the sentence, the setup component, is the topic position. One option is to have the topic position linking backwards, so we know what the upcoming stress position is referring to. This is more or less what we did in the example above ("For this benchmark..."). Otherwise we want the topic to introduce the person, thing or concept which we are about to talk about; to quote the paper "Bees disperse pollen" and "pollen is dispersed by bees" are both very similar, but one suggests we are talking about bees and one pollen. You need to make sure you are setting up the sentence so that the emphasis of the stress position relates the foremost concept in the readers mind.

  • Be careful that topic positions keep telling a consistent story. Seems logical, but it is easy to get into situations where you introduce too many ideas without relating them to what has come before.

I'm interested in any other guides to writing people have come across.

The Datapoint 2200 - why we're stuck with little endian?

nfd made an interesting point in an architecture seminar recently, which suggested that the reason we are stuck with little endian on most of the world's desktop processors is due to an obscure computer (calculator?) called the Datapoint 2200. It seems Intel was contracted to replace the TTL parts in the Datapoint with a microprocessor, which turned into the 4004 processor. That turned into the 8008 (twice as good), which turned into the 8086, which pretty much brings us to today.

So why little endian? Well, the Datapoint used a shift-register memory, because it was cheap (<grandpa_simpson>as was the style at the time</grandpa_simpson>). AFAICS a shift register memory is just a big FIFO queue; you chain lots of bits together and you've got "memory". You put stuff in one end, take it out the other, and spend lots of time waiting in between!

So the Datapoint can use up to a 16 bit (2 byte) address. Therefore very often you'd want to do an add on a two byte value, and you need to make sure that the carry bit propagated. Now, imagine you use big endian, you have a situation that looks something like

cycling in a shift-register memory

You waste all those cycles shifting through the memory to propagate the carry bit. If you used little endian in this case, you're carry bit would propagate naturally upwards after your first shift out.

So for this reason little endian was the default for the Datapoint; Intel saw fit to copy it for the 4004, and the rest is history. Little endian has some other advantages, but I like this version of the story!

How to annoy customers looking to spend multiple thousands of dollars

Allow them to complete the entire order process, and then present them with the following screen at the last step. Continue doing this for a few hours.

We're sorry - We are experiencing some technical problem

Clearly you shouldn't run any business critical systems on Dell and certainly don't employ anyone who lists "Dell Website Developer" on their resume.