Recently it has come up on the LKML and other
blogs about how C is not thread
safe. The nice piece of example code given is
int trylock()
{
int res;
res = pthread_mutex_trylock(&mutex);
if (res == 0)
++acquires_count;
return res;
}
At first glance it seems OK, and I wasn't exactly sure what the problem
was until I pulled it apart. GCC is well within its rights to produce
the following code (annotated for clarity).
push %ebp ; setting up stack frame
mov %esp,%ebp ; save esp to ebp
sub $0x8,%esp ; make some room on stack
movl <mutex>,(%esp) ; copy mutex to stack
call <pthread_mutex_trylock> ; call pthread_mutex_trylock
mov <acquires_count>,%edx ; copy acquires_count to %edx
cmp $0x1,%eax ; set carry flag if result == 0
adc $0x0,%edx ; add with carry
mov %edx,<acquires_count> ; write back acquires_count
leave ; release stack frame
ret ; done!
This code uses a bit of a trick to avoid a branch around code. It uses
the carry flag, which is simple enough to understand when adding.
Consider if you are adding 0001 + 1111 you end up with 11111,
where the top bit is the carry bit. You can then feed this carry bit
into another part of the addition operation if you were, say, using two
32 bit numbers to represent a 64 bit number on a 32 bit system.
The trick uses the dual nature of the carry flag as a "borrow" bit; if
a < b when a - b then the bit is set. Consider the simplest case of
0 - 1 -- the carry bit is set to indicate to the next operation that
a bit needs to be borrowed. Imagine for a moment we have a four bit
machine and want to do the 8-bit unsigned operation
1111 0000 - (240) -
0000 0001 ( 1)
==== ====
First we do 0000 - 0001 which leaves us 1111 with the borrow bit set.
Then we do 1111 - 0000 but take note that the borrow bit is set, so get
rid of the borrowed bit, leaving us 1110. Putting the two together gives
us 1110 1111 or 239.
Therefore the cmp $0x1,<result> instruction (which is really just
result - 1 with no output but setting flags) can only set the carry
flag (i.e. borrow a bit) if the value is 0; any other value in result
means there will be no need to borrow and the carry flag is not set.
Consequently the add with carry (adc) will add one to the result
only if the carry bit was set. A neat trick which avoids a jump, but not
so much thread safe: if you are interrupted after the cmp but before
the mov store you can write back the wrong value. Most people
(including me till 5 minutes ago) looking at the code would assume that
it was safe on the theory that acquires_count is not touched if you
don't have the lock because it is inside the if statement.
Interestingly, the Intel compiler does the simpler jump method
push <mutex> ; push argument
call <pthread_mutex_trylock>
pop %ecx ; get return addr on top
test %eax,%eax ; AND %eax,%eax
jne <trylock+0x16> --------+ ; if that wasn't 0, jump out
addl $0x1,<acquires_count> |
ret <-------+ ; done!
Frankly I have no idea which one is faster/better (I would guess the
second one because you don't always dirty a cache line, but you'd have
to measure that against the cost of the branch) but obviously GCC thinks
one thing and Intel thinks another and I bet if you put them in a room
together you would end up with 3 opinions! I think the major point is
that once upon a time it was mostly people writing kernels doing things
like accessing I/O ports or other memory mapped areas that needed to be
very careful the compiler didn't optimise out (or in) something
incorrectly. With multi-threaded C code that problem moves up to the
application developer too.
A different instruction set can help here. The Itanium version is
possibly less prone to this sort of thing because predication makes it
faster to implement conditional branches -- any instruction with a
predicate that is not set is skipped. The load and add is safe because
it will only be done with the lock held. However, if anyone is wondering
why Itanium come with such big caches, the difference between the 6 line
version and the Itanium version below should answer that.
[MII] alloc r32=ar.pfs,4,3,0 ; alloc stack frame
mov r33=b0 ; return pointer
mov r34=r1
[MMI] addl r14=72,r1;; ; r11 = &mutex from GOT
ld8 r11=[r14] ;
nop.i 0x0;;
[MIB] mov r35=r11 ; put into argument 1
nop.i 0x0
br.call.sptk.few b0=<pthread_mutex_trylock>
[MII] cmp4.eq.unc p0,p6=r8,r0 ; if result !0, set predicate p6
mov.i ar.pfs=r32
mov r1=r34;; ; load acquires_count from GOT
[MIB] addl r10=144,r1 ;
nop.i 0x0
(p06) br.cond.dptk.few <trylock+0x80>;; ; jump to [1] if p6 set
[MII] ld4 r9=[r10]
mov b0=r33
addl r3=144,r1;;
[MMI] adds r2=1,r9;; ; acquires_count++
st4 [r3]=r2 ; save updated value
nop.i 0x0
[MIB] nop.m 0x0 ; [1]
nop.i 0x0
br.few <trylock+0x90>;; ; jump to [2]
[MII] nop.m 0x0
mov b0=r33
nop.i 0x0;;
[MIB] nop.m 0x0 ; [2]
nop.i 0x0
br.ret.sptk.many b0;;
I guess the moral of the story is that multi-threading is hard, but we
knew that anyway.