Checking users of typedefs

If you poke around the Linux VM layers you might wonder why there is a STRICT_MM_TYPECHECKS option. It is there to stop programmers reaching inside opaque types. Linus has said many times that a typedef should only be made for a completely opaque type; this implies that rather than operating on the type directly, there is some sort of "API" to get at its values.

A simple example is below. my_long_t is opaque; the only way you should use its value is via the my_val accessor.

But how do we enforce this, when the type is really just an unsigned long? We wrap it up in something else, in this case a struct.

#ifndef STRICT
  /* Relaxed mode: my_long_t is just an unsigned long, and the
   * accessor is a no-op; code that pokes at the value directly
   * will compile without complaint. */
  typedef unsigned long my_long_t;
# define my_val(x) (x)

#else
  /* Strict mode: wrap the value in a struct, so the only way to
   * get at it is through the my_val() accessor. */
  typedef struct { unsigned long t; } my_long_t;
# define my_val(x) ((x).t)

#endif

/* Naughty: does arithmetic on the opaque type directly, rather
 * than going through my_val(). */
my_long_t my_naughty_function(my_long_t input)
{
        return input * 1000;
}

We can now see the outcome:

$ gcc -c strict.c
$ gcc -DSTRICT -c strict.c
strict.c: In function 'my_naughty_function':
strict.c:15: error: invalid operands to binary *

So now we get an error when we do the wrong thing, rather than no notification at all! This niftily moves you up the scale from "bug via visual inspection" to "won't compile" (Rusty Russell talked about this scale at linux.conf.au once -- does anyone have a reference to that talk?). Why not just leave it like this, then? Well, it might confuse the compiler; if the ABI has rules about always passing structures on the stack, for example, the above will suck. In general, it's best to hide as little from the compiler as possible, to give it the best chance of doing a good job for you.
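
For completeness, here is a sketch of how the function could be written to pass the strict check. Note the __my_long() constructor macro is my own invention, mirroring the kernel's __pte() style; it is not part of the example above:

/* Hypothetical constructor macro, in the style of the kernel's
 * __pte(): builds a my_long_t back up from a raw unsigned long. */
#ifdef STRICT
# define __my_long(x) ((my_long_t) { (x) })
#else
# define __my_long(x) (x)
#endif

my_long_t my_nice_function(my_long_t input)
{
        /* read via the accessor, write via the constructor */
        return __my_long(my_val(input) * 1000);
}

This compiles cleanly both with and without -DSTRICT.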

Work for Gelato@UNSW

We are currently looking for a person to join the Gelato@UNSW team. Read the job description; but if you are reading this and you are physically not too far from UNSW, there is a high probability you'd be interested in the job.

There will be much hacking at userspace and kernel level, though knowing your way around C and how to build kernels is sufficient background. There will be papers involved, but it is not overly academic. You'll have time to learn, and the freedom to take things in whatever directions you see fit. You'll collaborate with some very smart people, both here and abroad; I can say from personal experience that the positive push of such surroundings will help you learn more than you ever could in other environments.

If you have questions, please email me at ianw@gelato.unsw.edu.au.

Exploring cmp8xchg16

The latest version of Itanium comes with a new instruction, cmp8xchg16. It compares 8 bytes (64 bits) but exchanges 16 bytes (128 bits). Seems like a strange instruction?

Firstly, a bit of an overview of compare-and-exchange (CAS). As a function, it would be expressed as compare_and_exchange(memory, new_value, expected_value). If the memory location contains the expected value, the new value is swapped in, atomically. This means nothing can interrupt us between the compare and the exchange.
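
Written out as plain C, the semantics look like this (just a sketch; in hardware the whole body executes as one atomic operation):

int compare_and_exchange(long *memory, long new_value, long expected_value)
{
        if (*memory != expected_value)
                return 0;            /* someone beat us to it; no swap */
        *memory = new_value;         /* expected value still there; swap */
        return 1;
}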

This is useful for implementing lock-free algorithms, particularly the case of updating a linked-list stack. For example, below we show how you can safely insert an element into the linked list using compare and exchange.

[Figure: compare and exchange push]

Note that nothing can interrupt the update of the head of list pointer; if the value is incorrect we just loop around, update and try again. Eventually we hope we can do it uninterrupted.
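
As a concrete sketch, here is the push in C using the C11 atomics (my own rendering of the figure, not code taken from it):

#include <stdatomic.h>

struct node {
        struct node *next;
        int value;
};

void push(struct node * _Atomic *head, struct node *new_node)
{
        struct node *old = atomic_load(head);
        do {
                new_node->next = old;   /* point at what we think is the head */
                /* on failure, 'old' is reloaded with the current head */
        } while (!atomic_compare_exchange_weak(head, &old, new_node));
}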

However, removing an element from the list is not as simple.

[Figure: compare and exchange pop]

We could potentially be interrupted, and have the list change underneath us. As long as the elements end up at different addresses, the CAS will fail and we will be OK. But if the next element ends up at the same address as the "old" element (the one we are trying to pop), we have a problem. This is known as the "ABA" problem (or the "A-B-A" problem, or something else; if this still doesn't make sense, try this page).

To avoid this, one common scheme is to have a version number on the head-of-list pointer. You do the swap, comparing the version number as well as the next pointer in memory. If they both match, you know that nothing has been updated and you are safe to pop the entry. This requires either a double-wide compare and exchange instruction (so you can swap both a pointer and a version number at once) or tricky schemes like aligning the variables and using the spare low bits as version numbers (see An Almost Non-Blocking Stack, Boehm, 2004).
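
A sketch of the version-number scheme (reusing struct node from the push sketch above), assuming the toolchain can provide a 16-byte compare and exchange via gcc's __atomic builtins; on x86-64 that needs cmpxchg16b and -mcx16, and may otherwise fall back to a library lock:

struct head {
        struct node *ptr;            /* head of the stack */
        unsigned long version;       /* bumped on every update */
} __attribute__((aligned(16)));

int pop(struct head *h, struct node **out)
{
        struct head old, new;

        __atomic_load(h, &old, __ATOMIC_RELAXED);
        do {
                if (old.ptr == NULL)
                        return 0;    /* empty stack */
                new.ptr = old.ptr->next;
                new.version = old.version + 1;
                /* both pointer and version must match for the swap;
                 * on failure, 'old' is refreshed with current contents */
        } while (!__atomic_compare_exchange(h, &old, &new, 0,
                                            __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
        *out = old.ptr;
        return 1;
}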

Now, we can't just check the version with a normal-width compare and swap and then update the pointer later, because we've just re-created our race condition. But we don't actually need to compare the pointer, just swap it in if the version is OK. Hence (I think) the cmp8xchg16 instruction: it allows you to compare a version number and swap in a new pointer if all is OK, without (what I presume is) the significant overhead of a full double-wide compare and exchange implementation. Just when to update the version I'm still thinking about; I think you could still use a blacklist scheme like the Boehm approach, but it needs further thought.
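
As pseudocode, here is my reading of what the instruction does atomically (check the Itanium manuals for the authoritative details):

/* the whole body is one atomic operation */
int cmp8xchg16(unsigned long mem[2], unsigned long expected,
               const unsigned long new[2])
{
        if (mem[0] != expected)      /* compare only the first 8 bytes... */
                return 0;
        mem[0] = new[0];             /* ...but exchange all 16 */
        mem[1] = new[1];
        return 1;
}

So the compared half (the version) guards the other half (the pointer), without the hardware having to compare both.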

The Science of Scientific Writing

The Science of Scientific Writing (Gopen and Swan, 1990) is a nice paper for anyone who has to write anything technical.

Since we read English left to right, we come to expect context at the start of our informational blocks (sentences) and data at the end. Messing with this scheme tends to confuse the reader, making them work very hard to understand your point. Some general rules from the paper that I thought I might write down:

  • Subjects and verbs should be together. Don't stretch out the sentence with extraneous information between what you are talking about and what you are saying about it. e.g.

    The cache, a 128 entry, dual ported, 4-way set associative design with on core tags but off core data, was found to be the major bottleneck of the system.

    The cache is the subject, and was found to be is the verb. They should not be separated by all the other stuff.

    The cache was a 128 entry, dual ported, 4-way set associative design, with on core tags but off core data. It was found to be the major bottleneck of the system.

  • Similarly, discard material which does not convey significant connections. As the above example shows, it is quite easy to get into the habit of putting largely irrelevant information right in the middle of a sentence.

  • The stress position is the area around the end of a sentence. When reading, we tend to want the opening of the sentence to lead us in gently, reaching the big conclusion at the end. We then pause briefly, think about it, and move onto the next sentence. By making sure the stress position stresses the right information, we help understanding.

    10% of time was spent handling TLB refills in our flablah benchmark, which simulates a transaction processing workload.

    The above sentence hits us with the important information first, leaving us to finish the sentence and then wonder "what was that figure again?". It could become:

    Our flablah benchmark simulates a transaction processing workload. For this benchmark, we found that 10% of time was spent handling TLB refills.

    By keeping worthy material in stress positions we can make things much easier to read.

  • The first part of the sentence, the setup component, is the topic position. One option is to have the topic position link backwards, so we know what the upcoming stress position is referring to; this is more or less what we did in the example above ("For this benchmark..."). Otherwise we want the topic to introduce the person, thing or concept we are about to talk about. To quote the paper, "Bees disperse pollen" and "pollen is dispersed by bees" are very similar, but one suggests we are talking about bees and the other pollen. You need to make sure you are setting up the sentence so that the emphasis of the stress position relates to the foremost concept in the reader's mind.

  • Be careful that topic positions keep telling a consistent story. Seems logical, but it is easy to get into situations where you introduce too many ideas without relating them to what has come before.

I'm interested in any other guides to writing people have come across.

Virtual Linear Page Table

Several people asked me to re-explain virtual linear page tables after my SLUG talk. So here goes!

[Figure: 3-level page table]

First, let's review the standard translation mechanism. We have a large address space, which we divide up into pages. We map pages to frames of physical memory.

The processor has a translation look-aside buffer, or TLB, which caches virtual-page to physical-frame mappings. It is small, but fast. Like any cache it runs out of entries and needs to be refilled; the operating system does this from the page tables, which keep the full virtual-to-physical mapping.

On Linux, we use a multi-level hierarchical page table, illustrated above. First you take some of the top bits of the virtual address and use them as an index into the page global directory (PGD). The PGD entry points to a page middle directory (PMD); you take more bits as an index into that. The PMD entry in turn points to a page of page table entries (PTEs), and a final slice of bits indexes the PTE, which points to the physical page the given virtual address is mapped to.

So far, so good. But this walking process is very slow: it requires a lot of memory references (and hence cache traffic), and it has to be done on every TLB miss.
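
To make the walk concrete, here is a sketch in C. The bit widths (9 index bits per level, 12-bit page offset) and the entry format (plain pointers in 64-bit words) are made up for illustration:

#define PAGE_SHIFT     12
#define LEVEL_BITS     9
#define LEVEL_MASK     ((1UL << LEVEL_BITS) - 1)
#define OFFSET_MASK    ((1UL << PAGE_SHIFT) - 1)

unsigned long translate(unsigned long **pgd, unsigned long vaddr)
{
        /* each level consumes another slice of the virtual page number */
        unsigned long **pmd = (unsigned long **)
                pgd[(vaddr >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK];
        unsigned long *pte_page = (unsigned long *)
                pmd[(vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK];
        unsigned long pte = pte_page[(vaddr >> PAGE_SHIFT) & LEVEL_MASK];

        /* frame number from the PTE, offset from the virtual address */
        return (pte & ~OFFSET_MASK) | (vaddr & OFFSET_MASK);
}

Three dependent memory loads just to service one TLB miss -- this is the cost the virtual linear scheme below tries to avoid.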

[Figure: linear page table]

What would be good is if we could dispense with the upper levels, as per the above diagram: just keep the PTEs (which point directly to allocated physical pages) in one long array, and simply use the virtual page number (VPN) as an index into it. This would require much less effort.

The problem is that we cannot find that many contiguous (next to each other) physical pages; and even if we could, the amount of physical memory we would have to permanently allocate to page tables would be huge. So this scheme won't work.

But we do have a large virtual address space. Why not dedicate the top portion of that to a linear array of PTEs, and use the TLB to map the virtual pages full of PTEs to the actual physical pages that hold them?

[Figure: virtual linear page table]

So when we ask for a virtual page and it is not mapped by the TLB, we can use the virtual page number as an index into our virtually linear page table (the figure above talks about hashing the virtual page number; for a linear table this equates to the same thing as just using it as an index). We can now go directly to the PTE, and as long as the virtual page that the PTE lives in has a TLB mapping to the underlying physical page full of PTEs, there is no need to walk the tree -- the "virtual" PTE is now a pointer to the real thing! From there, it is a simple matter to read the PTE and find the physical frame for the virtual page we are looking for.
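
In code, the lookup collapses to a single index calculation (PAGE_SHIFT as in the walk sketch above; VLPT_BASE and the 8-byte PTE size are illustrative values of my own, not Itanium's):

#define VLPT_BASE 0xe000000000000000UL   /* made-up base of the linear table */

unsigned long *vlpt_entry(unsigned long vaddr)
{
        unsigned long vpn = vaddr >> PAGE_SHIFT;
        /* one load from here replaces the whole tree walk, provided
         * the page of PTEs it lands in is itself mapped by the TLB */
        return (unsigned long *)(VLPT_BASE + vpn * sizeof(unsigned long));
}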

Of course, as with any engineering effort, there are trade-offs. Firstly, we have to sacrifice some of our virtual address space to the virtually linear table; we have lots of address space, so this is not too severe. The other trade-off is that mapping the table itself now takes up TLB entries; as we mentioned, the TLB is small, so using up more entries can be penalising. However, if you are walking linearly through pages, as a program usually does, a single extra TLB entry saves you a full page table walk for a whole page worth of PTEs: a significant win.

Itanium attempts to find the hashed PTE address in hardware, and calls this the virtually hashed page table (VHPT) walker. If it cannot resolve the situation, it returns control to the operating system, which then has to do a full walk of the page tables. There are a few more details, but those are the broad strokes.

I hope that clears things up for a few people ...

... however if it does not, please feel free to modify the figures to make them clearer!

Al Gore on his new movie

And you inject some humor into your presentation.
It's hard to believe - I benefit from low expectations.

The Datapoint 2200 - why are we stuck with little endian?

nfd made an interesting point in an architecture seminar recently, which suggested that the reason we are stuck with little endian on most of the world's desktop processors is due to an obscure computer (really a programmable terminal) called the Datapoint 2200. Intel was contracted to replace the TTL parts in the Datapoint with a microprocessor, and that design became the 8008 (the 4004, despite the smaller number, came from a separate calculator project). The 8008 turned into the 8080, which turned into the 8086, which pretty much brings us to today.

So why little endian? Well, the Datapoint used a shift-register memory, because it was cheap (<grandpa_simpson>as was the style at the time</grandpa_simpson>). AFAICS a shift-register memory is just a big FIFO queue: you chain lots of bits together and you've got "memory". You put bits in one end, take them out the other, and spend a lot of time waiting in between!

The Datapoint could use up to a 16-bit (2-byte) address, so very often you'd want to do an add on a two-byte value, and you need to make sure the carry bit propagates from the low byte to the high byte. Now, imagine you store it big endian; you have a situation that looks something like the figure below.

[Figure: cycling in a shift-register memory]

You waste all those cycles shifting through the memory just to propagate the carry bit. If you used little endian, your carry bit would propagate naturally upwards after your first shift out.
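
You can see the same effect in C: adding two multi-byte numbers stored least-significant byte first, the way a serial machine would like them to stream past (a sketch of the idea, not actual Datapoint code):

#include <stdint.h>

void add_le(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
        unsigned carry = 0;

        /* bytes arrive low-order first, so the carry simply flows
         * forward into the next byte we see */
        for (int i = 0; i < n; i++) {
                unsigned sum = a[i] + b[i] + carry;
                dst[i] = sum & 0xff;
                carry = sum >> 8;
        }
}

With big endian the loop would have to run backwards -- or, in a shift register, wait for the whole word to cycle past again.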

So for this reason little endian was the default for the Datapoint; Intel saw fit to keep it for the 8008, and the rest is history. Little endian has some other advantages, but I like this version of the story!

GCC Trampolines

Paul Wayper writes about trampolines. This is something I've come across before, when I was learning about function pointers with IA64.

Nested functions are a strange contraption, and probably best avoided (people who should know agree). For example, here is a completely obfuscated way to double a number.

#include <stdio.h>

int function(int arg) {

        /* nested_function can see 'arg' in the enclosing frame */
        int nested_function(int nested_arg) {
                return arg + nested_arg;
        }

        /* arg + arg: doubles the input */
        return nested_function(arg);
}


int main(void)
{
        printf("%d\n", function(10));
}

Let's disassemble that on IA64 (because it is so much easier to understand).

0000000000000000 <function>:
   0:   00 10 15 08 80 05       [MII]       alloc r34=ar.pfs,5,4,0
   6:   c0 80 33 7e 46 60                   adds r12=-16,r12
   c:   04 08 00 84                         mov r35=r1
  10:   01 00 00 00 01 00       [MII]       nop.m 0x0
  16:   10 02 00 62 00 80                   mov r33=b0
  1c:   04 00 01 84                         mov r36=r32;;
  20:   0a 70 40 18 00 21       [MMI]       adds r14=16,r12;;
  26:   00 00 39 20 23 e0                   st4 [r14]=r32
  2c:   01 70 00 84                         mov r15=r14
  30:   10 00 00 00 01 00       [MIB]       nop.m 0x0
  36:   00 00 00 02 00 00                   nop.i 0x0
  3c:   48 00 00 50                         br.call.sptk.many b0=70
  40:   0a 08 00 46 00 21       [MMI]       mov r1=r35;;
  46:   00 00 00 02 00 00                   nop.m 0x0
  4c:   20 02 aa 00                         mov.i ar.pfs=r34
  50:   00 00 00 00 01 00       [MII]       nop.m 0x0
  56:   00 08 05 80 03 80                   mov b0=r33
  5c:   01 61 00 84                         adds r12=16,r12
  60:   11 00 00 00 01 00       [MIB]       nop.m 0x0
  66:   00 00 00 02 00 80                   nop.i 0x0
  6c:   08 00 84 00                         br.ret.sptk.many b0;;

0000000000000070 <nested_function.1814>:
  70:   0a 40 00 1e 10 10       [MMI]       ld4 r8=[r15];;
  76:   80 00 21 00 40 00                   add r8=r32,r8
  7c:   00 00 04 00                         nop.i 0x0
  80:   11 00 00 00 01 00       [MIB]       nop.m 0x0
  86:   00 00 00 02 00 80                   nop.i 0x0
  8c:   08 00 84 00                         br.ret.sptk.many b0;;

Note that nothing funny happens there at all. The nested function looks exactly like a real function, except that it doesn't do an alloc (or anything else) to get itself a new stack frame -- it simply reuses the stack frame of the outer function. No trampoline involved, just a straight branch.

So why do we need a trampoline? Because a nested function has an extra property over a normal function: its stack pointer must be set to the stack pointer of the function it is nested within, so that it can see the enclosing function's variables (arg, above). This means that a pointer to a nested function can't be just any old function pointer; it is a pointer to a special function whose stack context needs setting up first. C doesn't work like this -- there is only one type, "pointer to function".
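
One way to see the problem is to do the compiler's job by hand and pass the enclosing frame around explicitly. All the names here are mine; this is just the classic closure-conversion trick, not what gcc actually emits:

#include <stdio.h>

struct closure {
        int (*code)(void *env, int nested_arg);  /* the code pointer */
        void *env;                               /* the enclosing frame */
};

static int nested_code(void *env, int nested_arg)
{
        return *(int *)env + nested_arg;         /* reach 'arg' via env */
}

static int function(int arg)
{
        struct closure c = { nested_code, &arg };
        return c.code(c.env, arg);
}

int main(void)
{
        printf("%d\n", function(10));            /* prints 20 */
        return 0;
}

A plain C function pointer has nowhere to stash env, and that gap is exactly what the trampoline papers over.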

The easiest way out is therefore to create a little function that sets the stack to the right place and then calls the real code. This "trampoline" is a normal function, so no trickery is needed at the call site. For example, we can modify the program to call the nested function through a function pointer:

#include <stdio.h>

int function(int arg) {

        int nested_function(int nested_arg) {
                return arg + nested_arg;
        }

        /* taking the address of the nested function is what forces
         * gcc to emit a trampoline */
        int (*function_pointer)(int arg) = nested_function;

        return function_pointer(arg);
}


int main(void)
{
        printf("%d\n", function(10));
}

We can see that, now we are calling through a function pointer, we go via __ia64_trampoline, a little code stub in libgcc.a that loads up the right stack pointer and then calls the "nested" function.

function:
    .prologue 12, 33
    .mmb
    .save ar.pfs, r34
    alloc r34 = ar.pfs, 1, 3, 1, 0
    .fframe 48
    adds r12 = -48, r12
    nop 0
    .mmi
    addl r15 = @ltoffx(__ia64_trampoline#), r1
    mov r35 = r1
    .save rp, r33
    mov r33 = b0
    .body
    ;;
    .mmi
    nop 0
    adds r8 = 24, r12
    adds r14 = 16, r12
    .mmi
    ld8.mov r15 = [r15], __ia64_trampoline#
    ;;
    st4 [r14] = r32
    nop 0
    .mmi
    mov r14 = r8
    ;;
    st8 [r14] = r15, 8
    adds r15 = 40, r12
    ;;
    .mfi
    st8 [r14] = r15, 8
    nop 0
    addl r15 = @ltoff(@fptr(nested_function.1814#)), gp
    ;;
    .mmi
    ld8 r15 = [r15]
    ;;
    st8 [r14] = r15, 8
    adds r15 = 16, r12
    ;;
    .mmb
    st8 [r14] = r15
    ld8 r14 = [r8], 8
    nop 0
    .mmi
    ld4 r36 = [r15]
    ;;
    nop 0
    mov b6 = r14
    .mbb
    ld8 r1 = [r8]
    nop 0
    br.call.sptk.many b0 = b6
    ;;
    .mii
    mov r1 = r35
    mov ar.pfs = r34
    mov b0 = r33
    .mib
    nop 0
    .restore sp
    adds r12 = 48, r12
    br.ret.sptk.many b0
    .endp function#
    .section    .rodata.str1.8,"aMS",@progbits,1
    .align 8

So, in summary: when you take the address of a nested function, gcc inserts the "trampoline" stub so that it doesn't need different types of function pointers for nested and non-nested functions.