Getting a tick in LaTeX

So you need to show something is good in your LaTeX document, and want a tick (or a cross)? The pifont package gives you access to ZapfDingbats, which has lots of neat characters as shown in the documentation (along with dinglists and other nifty features).

So, after \usepackage{pifont} a quick \newcommand{\tick}{\ding{52}} gives you \tick for all your ticking needs!
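If you want the cross as well, a minimal complete sketch looks like the following (the \cross name is my own choice; \ding{52} and \ding{56} are the heavy tick and cross in ZapfDingbats, with \ding{51} and \ding{55} as lighter variants):

```latex
\documentclass{article}
\usepackage{pifont}
% \ding{52} is a heavy tick, \ding{56} a heavy cross in ZapfDingbats
\newcommand{\tick}{\ding{52}}
\newcommand{\cross}{\ding{56}}
\begin{document}
Builds: \tick \quad Tests: \cross
\end{document}
```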

The MarvoSym package also has some Wingdings-style symbols, like laundry instructions (?).

File this under the "easy when you know how, but otherwise takes you at least half an hour" category...

Checking users of typedefs

If you poke around the Linux VM layers you might wonder why there is a STRICT_MM_TYPECHECKS option. It is there to stop programmers poking inside opaque types. Linus has said many times that a typedef should only be made for a completely opaque type; this implies that rather than operating on it directly, there is some sort of "API" to get at the values.

A simple example is below. my_long_t is opaque; the only way you should use its value is via the my_val accessor.

But how do we enforce this, when the type is really just an unsigned long? We wrap it up in something else, in this case a struct.

#ifndef STRICT
  typedef unsigned long my_long_t;
# define my_val(x) (x)

#else
  typedef struct { unsigned long t; } my_long_t;
# define my_val(x) ((x).t)

#endif

my_long_t my_naughty_function(my_long_t input)
{
        return input * 1000;
}

We can now see the outcome:

$ gcc -c strict.c
$ gcc -DSTRICT -c strict.c
strict.c: In function 'my_naughty_function':
strict.c:15: error: invalid operands to binary *

So now we have an error where we do the wrong thing, rather than no notification at all! This niftily moves you up from "bug via visual inspection" to "won't compile" (Rusty Russell talked about this scale at linux.conf.au once -- does anyone have a reference to that talk?). Why not just leave it like this, then? Well, it might confuse the compiler; if the ABI has rules about always passing structures on the stack, for example, the above will suck. In general, it's best to hide as little from the compiler as possible, to give it the best chance of doing a good job for you.

Work for Gelato@UNSW

We are currently looking for a person to join the Gelato@UNSW team. Read the job description; if you are reading this and you are not too far from UNSW, there is a high probability you'd be interested in the job.

There will be much hacking at userspace and kernel level, though knowing your way around C and how to build kernels is sufficient to start. There will be papers involved, but it is not overly academic. You'll have time to learn, and the freedom to move things in the directions you see fit. You'll collaborate with some very smart people both here and abroad; I can say from personal experience that the positive push of your surroundings will help you learn more than you ever could in other environments.

If you have questions, please email me at ianw@gelato.unsw.edu.au.

Exploring cmp8xchg16

The latest version of Itanium comes with a new instruction, cmp8xchg16. This does a compare and exchange on 8 bytes (64 bits) but exchanges 16 bytes (128 bits). Seems like a strange instruction?

Firstly, a bit of an overview of compare-and-exchange, also known as compare-and-swap (CAS). As a function, it would be expressed compare_and_exchange(memory, new_value, expected_value): if the memory location contains the expected value, the new value is swapped in, atomically. Nothing can interrupt us between the compare and the exchange.

This is useful for implementing lock-free algorithms, particularly the case of updating a linked-list stack. For example, below we show how you can safely insert an element into the linked list using compare and exchange.

compare and exchange push

Note that nothing can interrupt the update of the head of list pointer; if the value is incorrect we just loop around, update and try again. Eventually we hope we can do it uninterrupted.

However, removing an element from the list is not as simple.

compare and exchange pop

We could potentially be interrupted, and have the list change underneath us. As long as the new head ends up at a different address, the CAS will fail and we will be OK. But if memory is recycled and the "old" element (the one we are trying to pop) ends up back at the head at the same address, the CAS succeeds even though its next pointer may be stale. This is known as the "ABA" problem (or the "A-B-A" problem, or something else; if this still doesn't make sense try this page).

To avoid this, one common scheme is to have a version number on the head of list pointer. You do the swap, and compare the version number as well as the next pointer in memory. If they both match, you know that nothing has been updated and you are safe to pop the entry. This requires either a double-wide compare and exchange instruction (so you can swap both a pointer and a version number) or tricky schemes like aligning the variables and using the spare low bits as version numbers (see An Almost Non-Blocking Stack, Boehm 2004).

Now, we can't just check the version with a normal-width compare and swap and then update the pointer later, because we've just re-created our race condition. But we don't actually need to compare the pointer, just swap it in if the version is OK. Hence (I think) a cmp8xchg16 instruction: it allows you to check a version number and swap in a new pointer if all is OK, but without the (what I presume is significant) overhead of a full double-wide compare and exchange implementation. Just when to update the version I'm still thinking about; I think you could still use a blacklist scheme like the Boehm approach, but it needs further thought.

The Science of Scientific Writing

The Science of Scientific Writing (Gopen and Swan, 1990) is a nice paper for anyone who has to write anything technical.

Since we read English left to right, we come to expect context at the start of our informational blocks (sentences) and data at the end. Messing with this scheme tends to confuse the reader, making them work very hard to understand your point. Here are some general rules from the paper that I thought I might write down:

  • Subjects and verbs should be together. Don't stretch out the sentence with extraneous information between what you are talking about and what you are saying about it. e.g.

    The cache, a 128 entry, dual ported, 4-way set associative design with on core tags but off core data, was found to be the major bottleneck of the system.

    The cache is the subject, and was found to be is the verb. They should not be separated by all the other stuff.

    The cache was a 128 entry, dual ported, 4-way set associative design, with on core tags but off core data. It was found to be the major bottleneck of the system.

  • Similarly, discard material which does not discern significant connections. As in the above example, it is quite easy to get into a habit of putting largely irrelevant information right in the middle of a sentence.

  • The stress position is the area around the end of a sentence. When reading, we tend to want the opening of the sentence to lead us in gently, reaching the big conclusion at the end. We then pause briefly, think about it, and move on to the next sentence. By making sure the stress position stresses the right information, we help understanding.

    10% of time was spent handling TLB refills in our flablah benchmark, which simulates a transaction processing workload.

    The above sentence hits us with the important information first, leaving us to finish the sentence and then wonder "what was that figure again?". It could become

    Our flablah benchmark simulates a transaction processing workload. For this benchmark, we found that 10% of time was spent handling TLB refills.

    By keeping worthy material in stress positions we can make things much easier to read.

  • The first part of the sentence, the setup component, is the topic position. One option is to have the topic position linking backwards, so we know what the upcoming stress position is referring to. This is more or less what we did in the example above ("For this benchmark..."). Otherwise we want the topic to introduce the person, thing or concept which we are about to talk about; to quote the paper, "Bees disperse pollen" and "pollen is dispersed by bees" are very similar, but one suggests we are talking about bees and the other about pollen. You need to make sure you are setting up the sentence so that the emphasis of the stress position relates to the foremost concept in the reader's mind.

  • Be careful that topic positions keep telling a consistent story. Seems logical, but it is easy to get into situations where you introduce too many ideas without relating them to what has come before.

I'm interested in any other guides to writing people have come across.

Virtual Linear Page Table

Several people asked me to re-explain virtual linear page tables after my SLUG talk. So here goes!

3 level page table

First, let's review the standard translation mechanism. We have a large address space, which we divide up into pages. We map pages to frames of physical memory.

The processor has a translation look-aside buffer or TLB which stores a cache of virtual page to physical frame mappings. It is small, but fast. Like any cache, it runs out and needs to be filled. The operating system does this from the page tables, which keep a virtual to physical mapping.

On Linux, we use a multi-level hierarchical page table, illustrated above. First you take some of the top bits and use them as an index into the page global directory (PGD). The PGD entry points to a page middle directory (PMD), into which you take more bits as an index. The PMD entry in turn points to a page full of page table entries (PTEs), and a final set of bits indexes the PTE, which points to the physical page the given virtual address is mapped to.

So far, so good. But this walking process is very slow: it requires a lot of memory references, and hence cache traffic, and it has to be performed on every TLB miss.

linear page table

What would be good is if we dispensed with the upper levels, as per the above diagram. So we just keep the PTE entries (which point directly to allocated physical pages) in one long array, and then simply use the VPN as an index into it. This would require much less effort.

The problem is that we cannot find that many contiguous (next to each other) physical pages. Even if we could, the amount of physical memory we would have to dedicate to page tables would be huge! So this scheme won't work.

But we have a large virtual address space. Why not dedicate the top portion of that to be a linear array of PTEs, and use the TLB to map the virtual pages full of PTEs to the actual physical pages holding them?

virtual linear page table

So when we ask for a virtual page and it is not mapped by the TLB, we can use the virtual page number as an index into our virtually linear page table (the figure above talks about hashing the virtual page number, which here equates to simply using it as an index). We can now go directly to the PTE, and as long as the virtual page that PTE lives in has a TLB mapping to the underlying physical page full of PTEs, there is no need to walk the tree - the "virtual" PTE is now a pointer to the real thing! From there, it is a simple matter to read the PTE and find the physical frame for the virtual page we are looking for.

Of course, as with any engineering effort, there are trade-offs. Firstly, we have to sacrifice some of our virtual address space to the virtual linear table. We have lots of address space, so this is not too severe. The other trade-off is that each mapped chunk of the virtual linear table now takes up a TLB entry of its own; as we mentioned, the TLB is small, so using more entries can have a penalising effect. However, if you are walking linearly through pages, as a program usually does, you can reach a whole page full of PTEs without a full page table walk, at the cost of only one TLB entry; a significant win.

Itanium attempts to find the hashed PTE address in hardware, and calls this the virtual hash page table walker, or VHPT walker. If it cannot resolve the translation, it returns control to the operating system, which will then have to do a full walk of the page tables. There's a few more details, but that's the broad strokes.

I hope that clears things up for a few people ...

... however if it does not, please feel free to modify the figures to make them clearer!

Al Gore on his new movie

And you inject some humor into your presentation.
It's hard to believe - I benefit from low expectations.

The Datapoint 2200 - why are we stuck with little endian?

nfd made an interesting point in an architecture seminar recently: the reason we are stuck with little endian on most of the world's desktop processors may be due to an obscure computer (calculator?) called the Datapoint 2200. It seems Intel was contracted to replace the TTL parts in the Datapoint with a microprocessor, which turned into the 4004 processor. That turned into the 8008 (twice as good), which turned into the 8086, which pretty much brings us to today.

So why little endian? Well, the Datapoint used a shift-register memory, because it was cheap (<grandpa_simpson>as was the style at the time</grandpa_simpson>). AFAICS a shift register memory is just a big FIFO queue; you chain lots of bits together and you've got "memory". You put stuff in one end, take it out the other, and spend lots of time waiting in between!

So the Datapoint could use up to a 16-bit (2-byte) address. Therefore you would very often want to do an add on a two-byte value, and you need to make sure that the carry bit propagates between the bytes. Now, imagine you use big endian; you have a situation that looks something like this:

cycling in a shift-register memory

You waste all those cycles shifting through the memory to propagate the carry bit. If you used little endian in this case, your carry bit would propagate naturally upwards after your first shift out.

So for this reason little endian was the default for the Datapoint; Intel saw fit to copy it for the 4004, and the rest is history. Little endian has some other advantages, but I like this version of the story!