A discussion about
commas
led to an excuse to have a play with the IA-32 processor performance
monitoring unit (PMU). To start, take two versions of a program to count
the number of commas in a text file, one in C and one in Python. The C
one runs faster on the input data set of ~60MiB of random data, but why?
The CPU performance monitors are the key to getting some idea of
where the programs spend their time. I like to use
perfmon2 because it's what I know, but
Oprofile can do it too. All the available
events for IA-32 are described in the
manual; I
currently know of no better way of finding out about them than just
reading it. On Itanium I recommend Caliper,
which, for the most common situations, does most of the work for you and
presents it in a nice report. Intel's
VTune
does something similar.
The first thing to investigate is whether the CPU is getting enough
instructions to keep busy. The IFU_MEM_STALL metric is a good place
to start, as it is triggered when the instruction fetch pipeline is
stalled, presumably waiting on either the ITLB or the trace buffer
(Icache).
$ pfmon -e CPU_CLK_UNHALTED,IFU_MEM_STALL ./comma < ./randomcommas
340559375 CPU_CLK_UNHALTED
192115 IFU_MEM_STALL
$ pfmon -e CPU_CLK_UNHALTED,IFU_MEM_STALL python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
1287100
4571257047 CPU_CLK_UNHALTED
71981750 IFU_MEM_STALL
That works out to 0.05% of total cycles for the C version and 1.5% for
the Python version, neither of which sets off immediate warning bells.
If it did, we could start drilling down into events like L2_IFETCH
and ITLB_MISS, or the BR_* branch events, to try to see
why the CPU is having to wait to get its next instruction.
Next it is useful to find the IPC (instructions per cycle). This is
calculated as the ratio of retired instructions to the number of
CPU cycles; since a superscalar machine can issue more than one
instruction per cycle, this should ideally be greater than 1 (for
example, an Itanium can execute up to 6 instructions each cycle).
$ pfmon -e INST_RETIRED,CPU_CLK_UNHALTED ./comma < ./randomcommas
542953593 INST_RETIRED
340612036 CPU_CLK_UNHALTED
$ pfmon -e INST_RETIRED,CPU_CLK_UNHALTED python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
1194455205 INST_RETIRED
4569931735 CPU_CLK_UNHALTED
This works out to an IPC of 1.59 for the C version and 0.26 for the
Python version. The Python version is clearly spending a lot of time
waiting, because on average it isn't even able to issue one instruction
every cycle.
At this point it seems the CPU has enough instructions to do, but it is
sitting around waiting to get through those instructions. This suggests
the waiting is related to getting data from the cache.
The load and store requests that reach the L2 cache are counted by the
L2_LD and L2_ST events respectively. These events can mask
cache lines by their state in the MESI protocol, but for
this we don't care, so we just ask pfmon to show us everything.
$ pfmon -e L2_LD:M:E:S:I,L2_ST:M:E:S:I ./comma < randomcommas
102505 L2_LD:M:E:S:I
167 L2_ST:M:E:S:I
$ pfmon -e L2_LD:M:E:S:I,L2_ST:M:E:S:I python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
3278774 L2_LD:M:E:S:I
10457 L2_ST:M:E:S:I
This shows us that the Python version does quite a few more stores than
the C counterpart. Considering this program should simply be reading the
input stream and counting the number of commas, we do not expect much
store traffic at all. This suggests the Python version is doing some
extra copying, for whatever reason (maybe some Python expert can
pinpoint it?).
We can drill down a bit more into the memory related latencies. The
DCU_LINES_IN event gives the total number of lines allocated in the
cache. Another event, DCU_MISS_OUTSTANDING, gives a weighted measure
of the cycles spent waiting for a cache line to be brought in. Each
cycle spent waiting is weighted by the number of outstanding cache
misses (I think the Pentium M I'm using can have up to 4 cache miss
requests outstanding at once) and has some caveats, but can be
considered a rough estimate of the time spent waiting for a cache line
to be brought in. Therefore dividing DCU_MISS_OUTSTANDING by
DCU_LINES_IN gives us an approximate metric of how long a cache miss
takes.
$ pfmon -e DCU_MISS_OUTSTANDING,DCU_LINES_IN ./comma < randomcommas
769736 DCU_MISS_OUTSTANDING
102387 DCU_LINES_IN
$ pfmon -e DCU_MISS_OUTSTANDING,DCU_LINES_IN python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
99810150 DCU_MISS_OUTSTANDING
4240179 DCU_LINES_IN
So that works out to 7.5 cycles for the C version and 23 cycles for the
Python version, which strongly suggests that memory traffic is what is
weighing the Python version down.
That is only the initial phase; the results give a high-level idea of
why one program is running slower than the other. The initial analysis
generally consists of taking the ratios of certain events to try
to build up a picture of what the program is doing. Then comes
the really hard work: drilling down to figure out how to fix it!
Some useful references: