Scrambled text

Fri 12 October 2007

There's this thing going around on Facebook suggesting you are smart if you can read a paragaph where the middle letters of the words are scrambled.

Everything old is new again, since I remember reading about this in 2003. As far as I remember being able to read the scrambled text was not related to intelligence in any way.

Anyway, as a Friday afternoon distraction I wrote a little javascript to scramble text.

Plain	Scrambled
Hello This is a paragraph that you will still be able to read even after it is scrambled, the only rule being the first and last letters remain in place. This javascript will try and be fairly smart about not moving around non-alphabetic characters too.

Scramble It

Ahh, the good-olde Fisher-Yates shuffle algorithm, friend of the first year tutorial!

POWER5 Graffiti

Thu 04 October 2007

On the bus yesterday I noticed a graffiti etching clearly left by someone employed by IBM. In fact the guy in the row in front had an IBM tee-shirt on, and I suspect he is the culprit, or possibly one of the OzLabs people.

The photo is crap (it's hard to take a photo of a blukhead in the dark from the middle of the back seat on a crowded bus with a phone camera without arousing suspicion) but it clearly says

   I
(heart)
 POWER5

Check out my CSI style image enhancement:

If you're ever on Sydney Buses bus number 2660 it can probably still be seen on the driver side window on the back seat. Maybe what Itanium needs isn't higher clock speeds, lower power, faster FSB, etc., but a concerted tagging effort on public transport around the world!

The stamp idiom with make

Thu 27 September 2007

Sometimes you need direct make to progress through various phases which may not have a particular file target associated with them. For example the first step might be to extract a compressed tarball of data and run a script over it in preparation for compiling that data into something else. Equally you might need to patch some files or maybe download some data. The general idiom to accomplish this sort of thing is the stamp file.

The general principle is to make a phony target that depends on a file called target-stamp. The target-stamp target can then do what it needs to do and simply create a dummy file via touch. This file indicates to make that the phase is complete, and targets that depend on this phase being complete can now depend on the stamp file. A small example follows

all: final

.PHONY: prepare clean

final: prepare
        # run build process

prepare: prepare-phase1-stamp prepare-phase2-stamp

prepare-phase1-stamp:
        # untar data or something
        touch $@

prepare-phase2-stamp: prepare-phase1-stamp
        # run a script or do something else
        touch $@

clean:
        rm -f *-stamp

One place you may notice extensive use of this is the -stamp files in the /debian directory of packages built with CDBS.

Short review of guarded page-tables

Wed 05 September 2007

Anyone who has done an operating systems 101 course is familiar with the standard hierarchical page-table structure. This is essentially a radix-tree where bits of the virtual address are taken as indexes into an array of pointers, each pointing to the next level down or, at the leaves, to a physical entry.

One potential issue with this style of page-table is with a very large, very sparse address space (such as in a single address-space operating system) there will be many intermediate steps with possibly only a single translation at the end (i.e. lots of single pages scattered all around your address-space).

Thus we can add a bit-field guard to each of the entries in the page table (Liedtke, 1994) to help indicate only those paths through the tree which are valid (similar to a Patricia tree).

Translation first proceeds as usual; the top bits are used as an index into the top level table. However, each entry now contains extra information about further valid addresses. The guard is masked against the next bits in the virtual address; if it matches (e.g. in our example is 0x2) the next step can proceed directly to the leaf node, bypassing the middle look-up. The entry can be marked as either final (i.e. pointing to a leaf node) or intermediate, in which case the process continues with the next level (which also has its own guard, and so forth). Any entry that does not match the guard is known to be false, and therefore a fault can be raised.

To examine this further, consider that we know that each page in our 24-bit address space above is 4KiB, thanks to the 2¹² = 4KiB offset. Therefore each of the 16 possible values of the top four bits selects 1MiB of the 2²⁴ = 16MiB address space; e.g. 0x0 selects the first megabyte, 0x1 the second and so forth. The second 4 bits selects a 64KiB region of this 1MiB, and the last 4 bits selects a 4KiB region of this 64KiB region.

This scheme could work quite well if we were using only 16 pages each separated by 1MiB. Each of these pages would only require an entry in the top level with an appropriate guard which could then be marked as final and point directly to a physical frame.

Once things become less sparse however, we need to start filling in middle levels. If you have 2 pages within a 16MiB region, you need at least one extra bit of address to tell them apart. If you have 3 or 4 pages, you need at least 2 bits to tell them apart, and so forth.

It would therefore be beneficial to have variable level sizes to dynamically adapt and create the most efficient lookup possible given the address space layout of a process. For example, if you knew your process was only going map pages 0x123 (100100011) and 0x124 (100100100) a good option would be to make your top-level only two entries (i.e. check the top bit) with a guard on the entry at index 0x1 of 00100. You then need a second level with 8 pointers to differentiate the remaining 3 bits (each of these would be final and hence point directly to a physical page).

The more pages you start mapping, the less bits you can discard with the guard. If you follow this through to a guard of 0 bits you end up with a linear page-table.

It has been shown (Elphinstone, 1999) that fixed level sizes with 16 entries tends to work well. For a 32-bit address-space with 4KiB pages (e.g. 12 bits offset) this leaves 20 bits to be mapped by the page-tables; with each level mapping 2⁴ bits this means a 5 level tree. The guard values provide enough path-compression that the table is not constantly walked. The disadvantage is that it may "over-allocate" meaning more space is dedicated to page-tables than strictly required.

Deciding where and when to split levels, what size indexes to put at each level and updating all the guard values dynamically to make the smallest, most efficient page-table would create a variable-radix page-table (Szmajda and Heiser, 2003).

In summary, a guarded page-table is more useful the more sparse your address space is. A variable-radix guarded page-table is complex, but could offer advantages for implementing variable page-size support and thus having positive effects on TLB coverage.

Fun with `-funroll-loops`

Wed 22 August 2007

I was playing with loop unrolling, and decided to investigate it a little. The basis is this very simple program which looks like it could do with some loop unrolling.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define SIZE (1500 * 1024 * 1024)

int fn(int *buf) {
        int i;

    for(i=0; i < (SIZE/sizeof(int)); i+=1)
        buf[i] = i;

    return 0;
}

int main(void)
{
        int *buf = malloc(SIZE);
        assert(buf);
    return fn(buf);
}

Firstly, lets have a look at what happens to the function when -funroll-loops is applied. Below is the de-compilation with some comments on what each line is doing.

Non-optimised

                         | nop.m 0x0
r14=0                    | mov r14=r0
save the "loop counter"  | mov.i r2=ar.lc
                         | nop.m 0x0
one less than iterations | movl r15=0x176fffff;;
                         | nop.m 0x0
initalise loop counter   | mov.i ar.lc=r15
                         | nop.i 0x0;;
                         | loop:
*buf = i, buf++          | st4 [r32]=r14,4
i++                      | adds r14=1,r14
dec lc, branch if > 0    | br.cloop.sptk.few .loop
                         | nop.m 0x0
restore loop counter     | mov.i ar.lc=r2

Loop unrolled

                         | nop.m 0x0
r26 = 0                  | mov r26=r0
save lc                  | mov.i r2=ar.lc
                         | nop.m 0x0
iter / 32                | movl r14=0x2edffff;;
                         | nop.m 0x0
setup loop counter       | mov.i ar.lc=r14
                         | nop.i 0x0
                         | loop:
r3 = buf                 | mov r3=r32
r8 = 1                   | adds r8=1,r26
r25 = 3                  | adds r25=3,r26
r23 = buf[3]             | adds r23=12,r32
r24 = 4                  | adds r24=4,r26
r22 = buf[4]             | adds r22=16,r32;;
buf[0] = 0; r3=buf+1     | st4 [r3]=r26,4
r21 = 5                  | adds r21=5,r26
r19 = buf[5]             | adds r19=20,r32
r20 = 6                  | adds r20=6,r26
r18 = buf[6]             | adds r18=24,r32
r16 = 7                  | adds r16=7,r26;;
                         | nop.m 0x0
buf[1] = 1, r3=buf+2     | st4 [r3]=r8,4
r15 = buf[7]             | adds r15=28,r32
r17 = 2                  | adds r17=1,r8
r26 += 8                 | adds r26=8,r26
buf += 8 (for next iter) | adds r32=32,r32;;
buf[2] = 2               | st4 [r3]=r17
buf[3] = 3               | st4 [r23]=r25
                         | nop.i 0x0
buf[4] = 4               | st4 [r22]=r24
buf[5] = 5               | st4 [r19]=r21
                         | nop.i 0x0
buf[6] = 6               | st4 [r18]=r20
buf[7] = 7               | st4 [r15]=r16
loop, dec branch         | br.cloop.sptk.few .loop
                         | nop.m 0x0
restore lc               | mov.i ar.lc=r2
                         | br.ret.sptk.many b0;;

We can see that in the first case the loop counter is setup to run 393216000 times (e.g. 1500*1024*1024/sizeof(int)) and a very simple loop of storing the value and incrementing is undertaken. In the second version, the code becomes more complex. We can see the loop now executes 49152000 times (e.g. 1500*1024*1024/sizeof(int)/8) so we can infer that for each loop iteration, 8 values have been unrolled to be written into our buf array. Indeed, when pulled apart we can see the 8 stores (numbers just reflect the first time through).

So, is it faster?

$ time ./loop

real    0m1.209s
user    0m0.388s
sys     0m0.820s

$ timme ./loop-unroll

real    0m1.056s
user    0m0.244s
sys     0m0.812s

Yes! Great. We can suggest a number of reasons why, but it would be interesting to get some hard statistics on why this is so. Thanks to the Itanium performance counters, we can. Firstly, lets start by seeing how cycles are being spent.

$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop
 547923609 CPU_CYCLES
1179995250 IA64_INST_RETIRED_THIS
    104431 NOPS_RETIRED
 154285534 BACK_END_BUBBLE_ALL

$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop-unroll
 311570590 CPU_CYCLES
1327451286 IA64_INST_RETIRED_THIS
 147560435 NOPS_RETIRED
  16088889 BACK_END_BUBBLE_ALL

Now, the Itanium can issue up to six instructions (IA64_INST_RETIRED_THIS) in two bundles of three instructions per cycle (CPU_CYCLES). However, you can't have dependencies between registers within a 3-instruction bundle (and branches and interruptions play havoc) so sometimes you need to pad with no-ops (this is why you need a good compiler). We can work out some figures from this, namely the useful instructions per cycle (e.g. instructions retired - nop instructions / cycles).

Test	Useful instructions/cycle
loop	1179995250 - 104431 / 547923609 = 2.15
unroll	1327451286 - 147560435 / 311570590 = 3.78

This tells us that for each given cycle, the unrolled-loop version is doing more work (ideally we'd like that to be a the theoretical maximum of 6). So what is the unrolled version doing for all this time? Bubbles are where the CPU is waiting for something; a prime candidate is moving data from caches to registers. Bubbles make up (154285534/547823609) 28% of the cycles on the loop version, but only (16088889/311570590) 5% of the time for the unrolled version. This means the processor is spending a lot less time waiting for things to happen with the unrolled loop version.

We can drill down further to see what bubbles are occurring.

$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop
 14229581 BE_L1D_FPU_BUBBLE_ALL
    97669 BE_EXE_BUBBLE_ALL
     1260 BE_RSE_BUBBLE_ALL
131354866 BACK_END_BUBBLE_FE

$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop-unroll
22676169 BE_L1D_FPU_BUBBLE_ALL
   83064 BE_EXE_BUBBLE_ALL
    1263 BE_RSE_BUBBLE_ALL
  718850 BACK_END_BUBBLE_FE

We are now breaking the stalls down to see where they are occuring. L1D_FPU_BUBBLES are essentially related to micro-architectural issues with the caches; things like the L1D cache being overwhelmend and interactions with the hardware page-table walker. BE_EXE bubbles are more simply described a waiting for data to come from caches (or even main memory) to registers. BE_RSE bubbles are related to the register stack engine, and "front end" bubbles (BACK_END_BUBBLE_FE) are where there are not enough instructions coming from the "front-end" for the "back-end" to issue out to functional units in the processor.

In this case, the major cause of problems for the simple loop seems to be not being able to get enough instructions through to keep the processor busy. We can try to pin down what is happening in the front-end:

$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop
   27727 FE_BUBBLE_IMISS
    4910 FE_BUBBLE_TLBMISS
98094189 FE_BUBBLE_BRANCH
52615907 FE_BUBBLE_BUBBLE

$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop-unroll
780729 FE_BUBBLE_IMISS
  4858 FE_BUBBLE_TLBMISS
291620 FE_BUBBLE_BRANCH
111089 FE_BUBBLE_BUBBLE

What we can see here is that the increased code size of the unrolled version causes us a lot more i-cache misses, as expected. However, the problem for the unrolled version seems to be the time spent by the front-end dealing with the branch. This is not surprising since there is a rumor that there is a front-end buffer which fills up after 3 repeated cycles with branch instructions; apparently this is a problem addressed in the new Montecito chips (if anyone has one, I'll update this if you send me the numbers). Since we are doing 393216000 branches all within a single bundle its not surprising we've hit it!

Just to confirm, we can break-down the BE_EXE bubbles. Remember, BE_EXE bubbles are caused by stalls in the execution phase primarily due to cache latencies or inter-register dependencies. Looking at our code, we don't re-use the same register over-and-over so we would not expect to spend a long time waiting for registers, and indeed this is what we see.

$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop
79601 BE_EXE_BUBBLE_GRALL
  919 BE_EXE_BUBBLE_GRGR

$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop-unroll
76627 BE_EXE_BUBBLE_GRALL
  918 BE_EXE_BUBBLE_GRGR

Therefore in this case, the speed increase isn't coming from better cache behaviour as we might have expected, but greater ability to keep the processor busy with instructions between branches.

We can also examine the major events happening with the unrolled case. Here we see BE_L1D_FPU bubbles being the major overhead. On Itanium floating-point avoids the L1 cache (it is too big, and not re-used enough to make it useful) hence the concatenation of the event acronym. These bubbles can be broken down into the following events (measured in two runs, since you can only do 4 at a time).

$ pfmon -e be_l1d_fpu_bubble_l1d,be_l1d_fpu_bubble_l1d_dcurecir,\
  be_l1d_fpu_bubble_l1d_tlb,be_l1d_fpu_bubble_l1d_stbufrecir ./loop-unroll
 9623193 BE_L1D_FPU_BUBBLE_L1D_DCURECIR
     414 BE_L1D_FPU_BUBBLE_L1D_TLB
     282 BE_L1D_FPU_BUBBLE_L1D_STBUFRECIR
$ pfmon -e be_l1d_fpu_bubble_l1d_fullstbuf,\
    be_l1d_fpu_bubble_l1d_l2bpress,be_l1d_fpu_bubble_l1d_tlb ./loop-unroll
      23 BE_L1D_FPU_BUBBLE_L1D_FULLSTBUF
17220177 BE_L1D_FPU_BUBBLE_L1D_L2BPRESS
     425 BE_L1D_FPU_BUBBLE_L1D_TLB

These events expose micro-architectural details, the full extent of which is only gleaned by reading the processor manuals (a PhD in computer architecture probably helps too!). The store buffers are not an issue here -- the caches are more than able to keep up with our few writes. The TLB is also not involved; we have pretty good TLB coverage thanks to the virtual-linear page table and hardware page-table walker.

The two clues are DCURECIR and L2BPRESS. Essentially DCURECIR happens when references have to "re-circulate"; this is a side-effect of the data not being available. The L2BPRESS figure suggests why this is so; this event happens when the L2 cache can not handle any more requests. The L2 cache keeps a queue of outstanding requests; when this queue is full (for example, full of long latency requests to get data from main memory) the L2 cache causes "back-pressure" which stops the L1 cache from issuing new requests. This suggests the program is chewing through a lot of data and not using the cache very effectively, which is exactly what we are doing. Pre-fetching hints can help (but as described in a previous entry on hacking Enigma may not always help), but in this case there probably isn't any spare bandwidth to pre-fetch in because essentially no processing is happening.

What does all this mean? I don't know, but playing with performance monitors is fun!

On incorrectly utilising "utilising"

Tue 21 August 2007

My supervisor pointed out a very annoying habit I had formed when writing technical documents, best summed up below:

"It would be difficult to find a page of governmental, military, or academic writing that doesn't have on it the word utilize. It must be one of the most over-utilized words in the world. It seems as though people out to impress people with the significance of what they're doing use utilize when they should use use.

Utilize is not an elegant variation of the word use; it has its own distinct meaning. When you utilize something, you make do with something not normally used for the purpose, e.g., you utilize a dime when the bloody screwdriver is nowhere to be found. If the screwdriver were there, you'd use it, not utilize a stupid dime for the purpose. Use use when you mean use, and utilize only when it's properly used to mean--to use something not normally used. The computer went off-line, so they utilized Mr. Wang's abacus, the one he liked to use. Despite the temporary breakdown, the computer's use-rate was up (not its utilization-rate)."

Cheney, "Getting the words right" (1983)

You will now probably notice this subtle verbiage in most everything you read!

Function Pointers and Addresses of Functions

Wed 15 August 2007

On many architectures, calling a function does not just mean jumping straight to an address. For example, on Itanium calling a function involves setting a global pointer register before jumping to the function code. The global pointer is used as a known fixed point which can be offset against very quickly. Thus a function is described by two things; the code address and a global pointer value. Unsurprisingly, this information is kept in a function descriptor.

Now, the general process when calling a function from a dynamically linked library is as follows:

Your code calls a stub entry in the Procedure Lookup Table (PLT).
The PLT stub reads the function descriptor (address and GP) to call from the Global Offset Table (GOT).
At first, this will refer to a function in the dynamic linker which will find the library and function you actually want to call, and then patch the GOT function descriptor entry so next time the PLT stub looks at it, you go there directly.

This all works fine. The problem occurs when you try to take the address of a function. On x86 you can just read the function address from the GOT and return it. However, with function descriptors you can not just return the code address -- it is invalid to call the function without the GP set correctly. You can't return the address of your copy of function descriptor, because then what should be the same function doesn't have the same address, which is obviously incorrect.

The Itanium solution is an "official" function descriptor. This is kept by the dynamic linker in private memory, and whenever a function address is requested it will be returned. Consider pretty much the smallest possible example:

$ cat libtest.c
void* function(void)
{
    return function;
}

$ cat test.c
extern void* function(void);

int main(void)
{
        void*(*ptr)(void) = function;
}

$ gcc -shared -o libtest.so ./libtest.c
$ gcc -o test test.c -Wl,-L. -ltest

If we have a look at the relocations in the library we see the following

$ readelf --relocs ./libtest.so

Relocation section '.rela.dyn' at offset 0x440 contains 10 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
[...]
000000010cc0  000b00000047 R_IA64_FPTR64LSB  00000000000007c0 function + 0

hen the dynamic linker is going through the relocations, it responds to any R_IA64_FPTR64LSB relocations by creating a new official function descriptor.

Now we can have a look at the binary

$ readelf --relocs ./test

Relocation section '.rela.dyn' at offset 0x458 contains 3 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
[...]
6000000000000e80  000400000047 R_IA64_FPTR64LSB  0000000000000000 function + 0

This is a bit different, and requires some more investigation. Firstly lets look at the disassembly:

40000000000007c0 <main>:
40000000000007c0:       02 10 00 18 00 21       [MII]       mov r2=r12
40000000000007c6:       e0 40 05 00 48 00                   addl r14=40,r1;;
40000000000007cc:       00 00 04 00                         nop.i 0x0
40000000000007d0:       0a 70 00 1c 18 10       [MMI]       ld8 r14=[r14];;
40000000000007d6:       00 70 08 30 23 80                   st8 [r2]=r14
40000000000007dc:       01 10 00 84                         mov r12=r2
40000000000007e0:       11 00 00 00 01 00       [MIB]       nop.m 0x0
40000000000007e6:       00 00 00 02 00 80                   nop.i 0x0
40000000000007ec:       08 00 84 00                         br.ret.sptk.many b0;;

We know that r1 points to the GOT (in fact, that is the GP value). We see that r14 is being loaded with a value from the GOT which is then de-referenced. Notice the offset of the address, correlating with the section output

There are 41 section headers, starting at offset 0x1690:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
[...]
  [25] .got              PROGBITS         6000000000000e58  00000e58
       0000000000000050  0000000000000000 WAp       0     0     8

Now, 0x6000000000000e80 - 0x6000000000000e58 = 0x28 = 40 which is the offset we are loading from the GOT! So we can deduce that a R_IA64_FPTR64LSB relocation with a symbol value of 0 gets the dynamic linker to setup a pointer to the official function descriptor in the GOT. This is the value returned when you take the address of a function, so now function comparisions work!

Power and PA-RISC are two other examples of architectures that use function descriptors. I think they are generally a feature of architectures who are not register-starved, because you generally pass along a pointer value to global data, which makes offsetting into that global data much faster than looking it up from memory.

What actually happens when you plug in a USB device?

Fri 20 July 2007

Today I started wondering what actually happens when you plug in a USB device. The little tour below goes from starting the USB subsystem to plugging something in, and I hope it is reasonably accurate. This entry is probably best read with a copy of the Linux kernel handy.

Linux USB Core, lower-layers

We can start our tour right at the very bottom in the heart of the USB core.

Things really start at the USB initialisation function in drivers/usb/core/usb.c:usb_init(). The first interesting call is to drivers/base/bus.c:bus_register(). We see that it passes as struct bus_type which looks like:

struct bus_type usb_bus_type = {
        .name =         "usb",
        .match =        usb_device_match,
        .uevent =       usb_uevent,
        .suspend =      usb_suspend,
        .resume =       usb_resume,
};

This is registering a new type of bus with the Linux driver core framework. The bus doesn't have much yet, just a name and some helper functions, but registering a bus sets up the kobject hierarchy that gets exported through /sys/bus/ (/sys/bus/usb in this case) and will allow the further hierarchical building of devices underneath by attaching them as the system runs. This is like the root directory of the USB system.

Your desktop/laptop/palmtop etc has a host controller which directly interfaces to USB devices; common types are UHCI, OHCI and EHCI. The drivers for these various types of controllers live in drivers/usb/host. These controllers are similar but different, so to minimise code duplication Linux has a Host Controller Driver framework (drivers/usb/core/hcd.c) which abstracts most of the common operations from the host controller driver.

The HCD layer does this by keeping a struct usb_hcd (drivers/usb/core/hcd.h) with all common information in it for a host controller. Each of host controller drivers fills out a struct hc_driver for its hardware dependent operations, as per below (taken from the UHCI driver)

static const struct hc_driver uhci_driver = {
        .description =          hcd_name,
        .product_desc =         "UHCI Host Controller",
        .hcd_priv_size =        sizeof(struct uhci_hcd),

        /* Generic hardware linkage */
        .irq =                  uhci_irq,
        .flags =                HCD_USB11,

        /* Basic lifecycle operations */
        .reset =                uhci_init,
        .start =                uhci_start,
#ifdef CONFIG_PM
        .suspend =              uhci_suspend,
        .resume =               uhci_resume,
        .bus_suspend =          uhci_rh_suspend,
        .bus_resume =           uhci_rh_resume,
#endif
        .stop =                 uhci_stop,

        .urb_enqueue =          uhci_urb_enqueue,
        .urb_dequeue =          uhci_urb_dequeue,

        .endpoint_disable =     uhci_hcd_endpoint_disable,
        .get_frame_number =     uhci_hcd_get_frame_number,

        .hub_status_data =      uhci_hub_status_data,
        .hub_control =          uhci_hub_control,
};

USB overview

It might be helpful to clarify a few USB concepts now. A USB device defines a group of end-points, where are grouped together into an interface. An end-point can be either "IN" or "OUT", and sends data in one direction only. End-points can have a number of different types:

Control end-points are for configuring the device, etc.
Interrupt end-points are for transferring small amounts of data. They have higher priority than ...
Bulk end-points, who can transfer more data but do not get guaranteed time constraints.
Isochronous transfers are high-priority real-time transfers, but if they are missed they are not re-tried. This is for streaming data like video or audio where there is no point sending data again.

There can be many interfaces (made of multiple end-points) and interfaces are grouped into "configurations". Most devices only have a single configuration.

You can see how this works at the host controller level with the above diagram clagged from the Intel UHCI documents. The controller has a "frame" register which is incremented every millisecond. Each frame pointer points to a queue of "transfer descriptors". The driver needs to schedule this work so that 90% of the time is given to isochronous data, and 10% left for control data. Should any time remain, the bulk data is transferred. You can see that any transfer descriptor for isochronous data will not be retried, but other data sits in a queue so it is never lost.

The USB layer communicates through USB request blocks, or URBs. A URB contains information about what end-point this request relates to, data, any related information or attributes and a call-back function to be called when the URB is complete. Drivers submit URBs to the USB core, which manages them in co-ordination with the USB host (see the urb_enqueue functions provided by the host driver). Your data gets sent off to the USB device by the USB core, and when its done your call-back is triggered.

Root Hub

There is one more element quite fundamental to the USB core, which is the hub -- all USB devices plug into a hub. The USB controller implements a root hub; you can have multiple root hubs in a machine, and other hubs can then connect to root hubs. The hub driver lives in drivers/usb/core/hub.c. The USB initialisation function starts up the khubd thread (drivers/usb/core/hub.c:usb_hub_init()) which waits for and handles USB events, but we will return to that later.

The hub driver is the first USB driver to be setup, so by examining how that works we can get a feel for how other drivers work.

The hub driver setup starts in drivers/usb/core/usb.c:usb_init() where the drivers/usb/core/hub.c:usb_hub_init() function is called. This calls drivers/usb/core/driver.c:usb_register_driver() which adds itself to the USB bus we mentioned previously. This sets up usb_probe_device() to handle any probe events from the Linux driver core. At this point the hub driver is ready to claim anything that looks like a USB hub.

The root hub setup phase comes out of the HCD setup phase, which proceeds something like this. The Linux driver core goes through all devices on the PCI bus (including the USB host controller of course) and calls the probe() function the device's driver has registered (the host controller registered itself in its initialisation function drivers/usb/core/uhci-hcd.c:uhci_hcd_init()). The USB host controller driver wires up the HCD layer function drivers/usb/core/hcd-pci.c:usb_hcd_pci_probe() to handle this probe (see struct pci_driver uhci_pci_driver in uhci-hcd.c; usb_hcd_pci_probe() does some generic setup, but then calls back into the host driver start() function to do any device specific setup).

usb_hcd_pci_probe() ends up calling drivers/usb/core/hcd.c:usb_add_hcd() which does some generic HCD setup and ends up calling register_root_hub().

register_root_hub() creates a new USB device and registers it with drivers/usb/core/hub.c:usb_new_device(). usb_new_device() first calls drivers/usb/core/config.c:usb_get_configuration which sets up the interface (all hubs only have one interface; the interrupt interface to notify of events on the hub) for the device before registering with the Linux driver core via drivers/base/core/device_add(). device_add() then causes the USB bus to be rescanned.

Binding root hub to a driver

Now we can examine how a USB device gets associated with a driver. To summarise, what needs to happen is the hub driver needs to bind to the host controllers root hub. This illustrates the general concept of a new device binding to a driver.

There is, believe it or not, more layers that come into play now. There is a "generic" USB driver that handles the setup of interfaces in the USB core. As we mentioned earlier a device has a series of end-points grouped together into an interface, and then may have multiple interfaces for different things. Drivers really only care about communicating with the device at the interface level, so the USB core takes care of getting things to this stage for you.

In drivers/usb/core/usb.c:usb_init() the final call is to drivers/usb/core/driver.c:usb_register_device_driver(). This does some simple wrapping of the driver, most importantly setting up usb_probe_device() to handle any probes. It then registers this with Linux driver core with a call to driver_register.

Remembering that drivers/usb/hub.c:usb_new_device() has called device_add(), the driver core will now see this new device and start probing for a driver to handle it. The USB generic driver is going to be called, which has registered drivers/usb/core/driver.c:usb_probe_device() as its probe function. This converts the Linux driver core device back to a USB device (i.e. the USB device that was registered by register_root_hub()) and calls calls the drivers probe function drivers/usb/core/generic.c:generic_probe().

The role of the generic driver is to get the interfaces on the device up and running. Firstly it calls drivers/usb/generic.c:choose_configuration() which simply reads through the device data and chooses a sane configuration. But wait, how does it know what is a sane configuration for the root hub? All the information has been "faked" for the root hub in drivers/usb/core/hcd.c:usb2_rh_dev_descriptor and the like. The root hub details are defined by the USB specification, so these can be kept statically.

Assuming everything is OK, drivers/usb/core/message.c:usb_set_configuration() is called to set up the chosen configuration. It uses the helper function usb_control_message() to send a message to the device about what configuration mode to go into. It then goes through the available interfaces setting up the kernel structures, and adds them with device_add().

Inch by inch, we are getting closer to having the USB system up and running. When a new interface is added, the driver core will now try and find a driver to handle it. When an interface driver is registered with drivers/usb/core/driver.c:usb_register_driver() it sets the probe function to drivers/usb/core/driver.c:usb_probe_interface(). So the driver core calls this function, and the first thing it does is checks the ID against the IDs the driver is happy to handle from the drivers id_table. It uses usb_match_one_id() for this, which can match the class, subclass and protocol to make sure the it will work with this driver.

The root hub is part of the hub class, so can be handled by the hub driver. Therefore drivers/usb/core/hub.c:hub_probe() will be called to get the hub driver to bind with the new hub. This does some sanity checking and calls into hub_configure(). This then does some more general setup, but things get interesting when the interrupt end-point is setup. The interrupt end-point on a hub will send an event whenever something happens on the hub, such as a device being plugged in or unplugged. This is done by creating a URB and binding it to the end-point, asking that hub_irq be called whenever this URB is complete (e.g. when an event is received).

New events on the hub

At this point, the system is waiting for something to happen. The root hub is setup and listening for new events - we are ready to plug in our device.

When this happens the host controller will raise an interrupt signalling that one of its ports has changed state. This will be handled by drivers/usb/host/uhci-hcd.c:uhci_irq(). This checks that the interrupt wasn't due to an error and then calls drivers/usb/host/uhci-q.c:uhci_scan_schedule(). This function implements the interface between the UHCI "transfer data" messages and the Linux URB scheme. It goes through the queues as illustrated in the figure above and finds any complete URBs and calls their completion function.

You may remember that the interrupt end-point of the hub was associated with a URB that would call drivers/usb/core/hub.c:hub_irq(). The UHCI code will go through it's queues, find that this URB is complete and therfore call the completion function.

We previously mentioned that the kernel starts the khubd daemon to handle hub events. We can see that hub_irq does some simple error checking, but its real job is to notify khubd that there is a new event on this hub.

hub_events() is the khubd daemon thread doing all the work. Once notified the hub has new event, it goes and reads the hub status to see what to do. A new device appearing will trigger a port change event, which is handled by hub_port_connect_change(). This does some initial setup of the device, but most importantly now calls usb_new_device().

At this point, if the USB driver for this device is loaded, the device is essentially ready to go. As with the root hub, the device will be probed and its interfaces found with the generic driver, and then those interfaces will be registered and any driver asked if they wish to bind to them.

Loading drivers

The question arises, however, about how modules are dynamically loaded. Keeping every single module for every possible USB device that may plug into the system is clearly sub-optimal, so we wish to load a device driver module only when required.

Most everyone is aware of udev which handles /dev device nodes these days. udev sits around listening for uevents which get sent to it via the kernel. These uevents get sent via netlink, which is like Unix sockets but different. To get uevent messages all you need to do is open a PF_NETLINK socket to the NETLINK_KOBJECT_UEVENT "port", e.g. as the code below extract from udev does.

memset(&snl, 0x00, sizeof(struct sockaddr_nl));
snl.nl_family = AF_NETLINK;
snl.nl_pid = getpid();
snl.nl_groups = 1;

uevent_netlink_sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
if (uevent_netlink_sock == -1) {
   err("error getting socket: %s", strerror(errno));
   return -1;
}

So we have udev sitting around up in userspace waiting for messages that come from the kernel (as a side note, if /sys/kernel/uevent_helper is filled in with a path that will be run and receive the events with environment variables set; this is useful in early boot before udev has started.

Thus we want to get a message out to udev that there is a new USB interface around, and it should try and load a module to bind to it.

We previously identified drivers/usb/core/message.c:usb_set_configuration() as reading the interfaces for a device and registering them with the driver core. When this registers the interface, it also registers a uevent helper drivers/usb/core/message.c:usb_if_uevent(). This function gets called by the Linux driver core when the driver is added into the driver hierarchy. It adds some information to the uevent:

if (add_uevent_var(envp, num_envp, &i,
       buffer, buffer_size, &length,
       "INTERFACE=%d/%d/%d",
       alt->desc.bInterfaceClass,
       alt->desc.bInterfaceSubClass,
       alt->desc.bInterfaceProtocol))
    return -ENOMEM;

if (add_uevent_var(envp, num_envp, &i,
       buffer, buffer_size, &length,
       "MODALIAS=usb:v%04Xp%04Xd%04Xdc%02Xdsc%02Xdp%02Xic%02Xisc%02Xip%02X",
       le16_to_cpu(usb_dev->descriptor.idVendor),
       le16_to_cpu(usb_dev->descriptor.idProduct),
       le16_to_cpu(usb_dev->descriptor.bcdDevice),
       usb_dev->descriptor.bDeviceClass,
       usb_dev->descriptor.bDeviceSubClass,
       usb_dev->descriptor.bDeviceProtocol,
       alt->desc.bInterfaceClass,
       alt->desc.bInterfaceSubClass,
       alt->desc.bInterfaceProtocol))
    return -ENOMEM;

udev now has all the information required ask modprobe to load a module, if it can. The rule on my system looks something like:

# load the drivers
ENV{MODALIAS}=="?*",    RUN+="/sbin/modprobe --use-blacklist $env{MODALIAS}"

Now, all going well, modprobe loads an appropriate module based on the vendor id, etc. Loading the driver causes a re-scan of the USB bus and the driver will be probed to see if wants to handle any of the unbinded USB devices (using the same procedure as was done setting up the root hub). It should take over control of the device, and that's it your USB device is up and running!

Simple!

Handy resources if you want to know more: Linux Device Drivers, USB2 Specification, UHCI Design Guide.