(this post was going to be about something else, but after getting this
far, I think it stands on its own as an introduction to dynamic linking)
The shared library is an integral part of a modern system, but often the
mechanisms behind the implementation are less well understood. There
are, of course, many guides to this sort of thing. Hopefully this adds
another perspective that resonates with someone.
Let's start at the beginning — - relocations are entries in binaries
that are left to be filled in later -- at link time by the toolchain
linker or at runtime by the dynamic linker. A relocation in a binary
is a descriptor which essentially says "determine the value of X, and
put that value into the binary at offset Y" — each relocation has a
specific type, defined in the ABI documentation, which describes
exactly how "determine the value of" is actually determined.
Here's the simplest example:
$ cat a.c
extern int foo;
int function(void) {
return foo;
}
$ gcc -c a.c
$ readelf --relocs ./a.o
Relocation section '.rel.text' at offset 0x2dc contains 1 entries:
Offset Info Type Sym.Value Sym. Name
00000004 00000801 R_386_32 00000000 foo
The value of foo is not known at the time you make a.o, so the
compiler leaves behind a relocation (of type R_386_32) which is
saying "in the final binary, patch the value at offset 0x4 in this
object file with the address of symbol foo". If you take a look at
the output, you can see at offset 0x4 there are 4-bytes of zeros just
waiting for a real address:
$ objdump --disassemble ./a.o
./a.o: file format elf32-i386
Disassembly of section .text:
00000000 <function>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: a1 00 00 00 00 mov 0x0,%eax
8: 5d pop %ebp
9: c3 ret
That's link time; if you build another object file with a value of foo
and build that into a final executable, the relocation can go away. But
there is a whole bunch of stuff for a fully linked executable or
shared-library that just can't be resolved until runtime. The major
reason, as I shall try to explain, is position-independent code (PIC).
When you look at an executable file, you'll notice it has a fixed load
address
$ readelf --headers /bin/ls
[...]
ELF Header:
[...]
Entry point address: 0x8049bb0
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
[...]
LOAD 0x000000 0x08048000 0x08048000 0x16f88 0x16f88 R E 0x1000
LOAD 0x016f88 0x0805ff88 0x0805ff88 0x01543 0x01543 RW 0x1000
This is not position-independent. The code section (with permissions
R E; i.e. read and execute) must be loaded at virtual address
0x08048000, and the data section (RW) must be loaded above that
at exactly 0x0805ff88.
This is fine for an executable, because each time you start a new
process (fork and exec) you have your own fresh address space.
Thus it is a considerable time saving to pre-calculate addresses from
and have them fixed in the final output (you can make
position-independent executables, but that's another story).
This is not fine for a shared library (.so). The whole point of a
shared library is that applications pick-and-choose random permutations
of libraries to achieve what they want. If your shared library is built
to only work when loaded at one particular address everything may be
fine — until another library comes along that was built also using that
address. The problem is actually somewhat tractable — you can just
enumerate every single shared library on the system and assign them all
unique address ranges, ensuring that whatever combinations of library
are loaded they never overlap. This is essentially what prelinking
does (although that is a hint, rather than a fixed, required address
base). Apart from being a maintenance nightmare, with 32-bit systems you
rapidly start to run out of address-space if you try to give every
possible library a unique location. Thus when you examine a shared
library, they do not specify a particular base address to be loaded at:
$ readelf --headers /lib/libc.so.6
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
[...]
LOAD 0x000000 0x00000000 0x00000000 0x236ac 0x236ac R E 0x1000
LOAD 0x023edc 0x00024edc 0x00024edc 0x0015c 0x001a4 RW 0x1000
Shared libraries also have a second goal — code sharing. If a hundred
processes use a shared library, it makes no sense to have 100 copies of
the code in memory taking up space. If the code is completely read-only,
and hence never, ever, modified, then every process can share the same
code. However, we have the constraint that the shared library must still
have a unqiue data instance in each process. While it would be possible
to put the library data anywhere we want at runtime, this would require
leaving behind relocations to patch the code and inform it where to
actually find the data — destroying the always read-only property of the
code and thus sharability. As you can see from the above headers, the
solution is that the read-write data section is always put at a known
offset from the code section of the library. This way, via the magic of
virtual-memory, every process sees its own data section but can share
the unmodified code. All that is needed to access data is some simple
maths; address of thing I want = my current address + known fixed
offset.
Well, simple maths is all relative! "My current address" may or may not
be easy to find. Consider the following:
$ cat test.c
static int foo = 100;
int function(void) {
return foo;
}
$ gcc -fPIC -shared -o libtest.so test.c
So foo will be in data, at a fixed offset from the code in
function, and all we need to do is find it! On amd64, this is quite
easy, check the disassembly:
000000000000056c <function>:
56c: 55 push %rbp
56d: 48 89 e5 mov %rsp,%rbp
570: 8b 05 b2 02 20 00 mov 0x2002b2(%rip),%eax # 200828 <foo>
576: 5d pop %rbp
This says "put the value at offset 0x2002b2 from the current instruction
pointer (%rip) into %eax. That's it — we know the data is at
that fixed offset so we're done. i386, on the other hand, doesn't have
the ability to offset from the current instruction pointer. Some
trickery is required there:
0000040c <function>:
40c: 55 push %ebp
40d: 89 e5 mov %esp,%ebp
40f: e8 0e 00 00 00 call 422 <__i686.get_pc_thunk.cx>
414: 81 c1 5c 11 00 00 add $0x115c,%ecx
41a: 8b 81 18 00 00 00 mov 0x18(%ecx),%eax
420: 5d pop %ebp
421: c3 ret
00000422 <__i686.get_pc_thunk.cx>:
422: 8b 0c 24 mov (%esp),%ecx
425: c3 ret
The magic here is __i686.get_pc_thunk.cx. The architecture does not
let us get the current instruction address, but we can get a known fixed
address — the value __i686.get_pc_thunk.cx pushes into cx is the
return value from the call, i.e in this case 0x414. Then we can
do the maths for the add instruction; 0x115c + 0x414 = 0x1570,
the final move goes 0x18 bytes past that to 0x1588 ... checking
the disassembly
00001588 <global>:
1588: 64 00 00 add %al,%fs:(%eax)
i.e., the value 100 in decimal, stored in the data section.
We are getting closer, but there are still some issues to deal with. If
a shared library can be loaded at any address, then how does an
executable, or other shared library, know how to access data or call
functions in it? We could, theoretically, load the library and patch up
any data references or calls into that library; however as just
described this would destroy code-sharability. As we know, all problems
can be solved with a layer of indirection, in this case called global
offset table or GOT.
Consider the following library:
$ cat test.c
extern int foo;
int function(void) {
return foo;
}
$ gcc -shared -fPIC -o libtest.so test.c
Note this looks exactly like before, but in this case the foo is
extern; presumably provided by some other library. Let's take a
closer look at how this works, on amd64:
$ objdump --disassemble libtest.so
[...]
00000000000005ac <function>:
5ac: 55 push %rbp
5ad: 48 89 e5 mov %rsp,%rbp
5b0: 48 8b 05 71 02 20 00 mov 0x200271(%rip),%rax # 200828 <_DYNAMIC+0x1a0>
5b7: 8b 00 mov (%rax),%eax
5b9: 5d pop %rbp
5ba: c3 retq
$ readelf --sections libtest.so
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[20] .got PROGBITS 0000000000200818 00000818
0000000000000020 0000000000000008 WA 0 0 8
$ readelf --relocs libtest.so
Relocation section '.rela.dyn' at offset 0x418 contains 5 entries:
Offset Info Type Sym. Value Sym. Name + Addend
[...]
000000200828 000400000006 R_X86_64_GLOB_DAT 0000000000000000 foo + 0
The disassembly shows that the value to be returned is loaded from an
offset of 0x200271 from the current %rip; i.e. 0x0200828.
Looking at the section headers, we see that this is part of the .got
section. When we examine the relocations, we see a R_X86_64_GLOB_DAT
relocation that says "find the value of symbol foo and put it into
address 0x200828.
So, when this library is loaded, the dynamic loader will examine the
relocation, go and find the value of foo and patch the .got
entry as required. When it comes time for the code loads to load that
value, it will point to the right place and everything just works;
without having to modify any code values and thus destroy code
sharability.
This handles data, but what about function calls? The indirection used
here is called a procedure linkage table or PLT. Code does not call an
external function directly, but only via a PLT stub. Let's examine
this:
$ cat test.c
int foo(void);
int function(void) {
return foo();
}
$ gcc -shared -fPIC -o libtest.so test.c
$ objdump --disassemble libtest.so
[...]
00000000000005bc <function>:
5bc: 55 push %rbp
5bd: 48 89 e5 mov %rsp,%rbp
5c0: e8 0b ff ff ff callq 4d0 <foo@plt>
5c5: 5d pop %rbp
$ objdump --disassemble-all libtest.so
00000000000004d0 <foo@plt>:
4d0: ff 25 82 03 20 00 jmpq *0x200382(%rip) # 200858 <_GLOBAL_OFFSET_TABLE_+0x18>
4d6: 68 00 00 00 00 pushq $0x0
4db: e9 e0 ff ff ff jmpq 4c0 <_init+0x18>
$ readelf --relocs libtest.so
Relocation section '.rela.plt' at offset 0x478 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000200858 000400000007 R_X86_64_JUMP_SLO 0000000000000000 foo + 0
So, we see that function makes a call to code at 0x4d0.
Disassembling this, we see an interesting call, we jump to the value
stored in 0x200382 past the current %rip (i.e. 0x200858),
which we can then see the relocation for — the symbol foo.
It is interesting to keep following this through; let's look at the
initial value that is jumped to:
$ objdump --disassemble-all libtest.so
Disassembly of section .got.plt:
0000000000200840 <.got.plt>:
200840: 98 cwtl
200841: 06 (bad)
200842: 20 00 and %al,(%rax)
...
200858: d6 (bad)
200859: 04 00 add $0x0,%al
20085b: 00 00 add %al,(%rax)
20085d: 00 00 add %al,(%rax)
20085f: 00 e6 add %ah,%dh
200861: 04 00 add $0x0,%al
200863: 00 00 add %al,(%rax)
200865: 00 00 add %al,(%rax)
...
Unscrambling 0x200858 we see its initial value is 0x4d6 — i.e.
the next instruction! Which then pushes the value 0 and jumps to
0x4c0. Looking at that code we can see it pushes a value from the
GOT, and then jumps to a second value in the GOT:
00000000000004c0 <foo@plt-0x10>:
4c0: ff 35 82 03 20 00 pushq 0x200382(%rip) # 200848 <_GLOBAL_OFFSET_TABLE_+0x8>
4c6: ff 25 84 03 20 00 jmpq *0x200384(%rip) # 200850 <_GLOBAL_OFFSET_TABLE_+0x10>
4cc: 0f 1f 40 00 nopl 0x0(%rax)
What's going on here? What's actually happening is lazy binding — by
convention when the dynamic linker loads a library, it will put an
identifier and resolution function into known places in the GOT.
Therefore, what happens is roughly this: on the first call of a
function, it falls through to call the default stub, which loads the
identifier and calls into the dynamic linker, which at that point has
enough information to figure out "hey, this libtest.so is trying to
find the function foo". It will go ahead and find it, and then patch
the address into the GOT such that the next time the original PLT
entry is called, it will load the actual address of the function, rather
than the lookup stub. Ingenious!
Out of this indirection falls another handy thing — the ability to
modify the symbol binding order. LD_PRELOAD, for example, simply
tells the dynamic loader it should insert a library as first to be
looked-up for symbols; therefore when the above binding happens if the
preloaded library declares a foo, it will be chosen over any other
one provided.
In summary — code should be read-only always, and to make it so that you
can still access data from other libraries and call external functions
these accesses are indirected through a GOT and PLT which live at
compile-time known offsets.
In a future post I'll discuss some of the security issues around this
implementation, but that post won't make sense unless I can refer back
to this one :)