Apart from the obvious 32 bit versus 64 bit distinction between 386 and
AMD64, there are two other interesting comparisons: parameter passing
conventions and position independent code conventions.

Parameter Passing: on 386, parameters are passed via the stack. On
AMD64 the first six "integer" arguments (anything that fits in a 64 bit
register, basically) are passed via registers; similarly, some floats
can be passed via SSE registers. Only after these are used up is data
passed on the stack. On IA64, the first 8 arguments are passed in
registers, whilst the rest are put on the stack.
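
As a rough illustration of the AMD64 case, here is a small sketch. The
function name sum_args is made up for this example, and the register
assignment in the comment is what the System V AMD64 convention
prescribes rather than anything the C code itself can show.

```c
/*
 * Assumed register assignment under the System V AMD64 calling
 * convention:
 *   a..f -> %rdi, %rsi, %rdx, %rcx, %r8, %r9
 *   x    -> %xmm0
 *   g, h -> placed on the stack by the caller
 * On 386 all nine arguments would be on the stack.
 */
double sum_args(long a, long b, long c, long d, long e, long f,
                long g, long h, double x)
{
    return (double)(a + b + c + d + e + f + g + h) + x;
}
```

Calling sum_args(1, 2, 3, 4, 5, 6, 7, 8, 0.5) therefore uses all six
integer registers, one SSE register and two stack slots.
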

On both AMD64 and IA64, there is an extra area at the bottom of the
current stack frame: a 16 byte "scratch area" on IA64 and a 128 byte
"red zone" on AMD64 (the red zone sits just below the stack pointer). I
would suggest that the smaller IA64 scratch area size is because of
register windowing, which AMD64 does not support. On both architectures
this area is reserved and not modified by signal or interrupt handlers.
"Leaf functions" (functions that do not call other functions) can use
this area as their entire stack frame, saving the considerable overhead
of setting one up.
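
To make the leaf function idea concrete, here is a minimal sketch. The
name leaf_sum_squares is hypothetical, and whether the locals really end
up in the red zone is entirely up to the compiler; GCC's -mno-red-zone
flag disables the red zone altogether.

```c
/*
 * A leaf function: it calls nothing, so on AMD64 the compiler is free
 * to keep `tmp` in the 128 byte red zone below %rsp instead of
 * adjusting the stack pointer. This is an assumption about typical
 * compiler behaviour, not something the C language guarantees.
 */
int leaf_sum_squares(const int *v, int n)
{
    int tmp[8];               /* 32 bytes, well inside 128 */
    int i, total = 0;

    if (n > 8)
        n = 8;                /* stay within the local buffer */
    for (i = 0; i < n; i++)
        tmp[i] = v[i] * v[i];
    for (i = 0; i < n; i++)
        total += tmp[i];
    return total;
}
```
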

Varargs functions cause some confusion on AMD64/IA64, since arguments
might be integers or might be floats, meaning they should be passed in
either general or float/SSE registers respectively. On AMD64, functions
known to be varargs functions should have a prologue that saves all
register arguments to a "register save area" that has a known layout
(the caller also passes an upper bound on the number of floating point
args in %al, to avoid saving unnecessary registers). Then, as you use
the va_arg macro to go through the arguments, you grab them from the
register save area. On IA64, you assume that the first 8 arguments are
passed in via registers, and save these registers to your scratch area
(2 registers) and 48 bytes of your stack (remaining 6 registers). This
means all your arguments are stacked together (the incoming parameter
list sits up against the scratch area) and va_arg can simply "walk"
upwards.
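
From the C side all of this hides behind the standard <stdarg.h>
macros. The sketch below mixes integer and floating point arguments,
which is exactly the case that makes life hard for the ABI. The function
sum_mixed and its little type string are invented for this example.

```c
#include <stdarg.h>

/*
 * Sum a mixed list of ints and doubles. `types` is a hypothetical
 * format string: 'i' means the next argument is an int, 'd' means
 * it is a double.
 */
double sum_mixed(const char *types, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, types);
    for (; *types != '\0'; types++) {
        if (*types == 'i')
            total += va_arg(ap, int);      /* arrived as an integer arg */
        else if (*types == 'd')
            total += va_arg(ap, double);   /* arrived as a float arg */
    }
    va_end(ap);
    return total;
}
```

A call such as sum_mixed("idi", 1, 2.5, 3) pulls two values from the
integer side and one from the floating point side.
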

Functions called without a prototype are a bit more tricky; IA64
suggests that if a float is passed into a function whose parameter types
are unknown, it should be copied to both the first general purpose
register and the first floating point register, just to be safe. AMD64
doesn't seem to make such assumptions for you. For example, on IA64:

ianw@lime:/tmp$ cat function.c
void function(float f)
{
    printf("%f\n", f);
}
ianw@lime:/tmp$ cat test.c
extern function();
int main(void)
{
    float f = 10000.01;
    function(f);
}
ianw@lime:/tmp$ gcc -o test test.c function.c
ianw@lime:/tmp$ ./test
10000.009766

The same code on AMD64 prints 0.
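
The portable way out is to give the caller a prototype, so that both
sides agree the argument is a float. Here is a minimal sketch of that;
format_float is a hypothetical variant of function() that writes into a
buffer instead of printing, purely so the result is easy to inspect.

```c
#include <stdio.h>

/* Prototype visible to every caller: the argument is a float. */
int format_float(char *buf, size_t len, float f);

int format_float(char *buf, size_t len, float f)
{
    /* f is promoted to double only here, for the variadic snprintf. */
    return snprintf(buf, len, "%f", f);
}
```

With the prototype in scope, passing 10000.01f yields the same
10000.009766 text regardless of which register the ABI picks.
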

IP relative addressing: Position Independent Code (PIC) is code that
can be loaded anywhere in memory and still work. This is important
because shared libraries may not always be at the same address, since
other shared libraries might be loaded before or after them, etc. To
maintain position independence, you can't rely on the absolute address
of any code or data (because it might change), so you add a layer of
indirection. In Linux/ELF land this is done with a Global Offset Table
(GOT). You can think of the GOT as a big two-column list that has a
symbol and its "real address". Thus, instead of referring to the symbol
directly, you load its address from the GOT, and then use that address
to find the real thing.
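
The two-column picture can be modelled in plain C. This is only a toy:
all the names here (toy_got, call_via_got and friends) are made up, and
a real GOT is indexed by a fixed offset worked out at link time rather
than searched by name at run time.

```c
#include <stddef.h>
#include <string.h>

typedef int (*func_t)(int);

static int add_one(int x) { return x + 1; }
static int twice(int x)   { return 2 * x; }

/* A toy "GOT": one column for the symbol, one for its real address. */
struct got_entry {
    const char *symbol;
    func_t      address;
};

static struct got_entry toy_got[] = {
    { "add_one", add_one },
    { "twice",   twice   },
};

/* Load the address from the table, then call through it. */
int call_via_got(const char *symbol, int arg)
{
    size_t i;

    for (i = 0; i < sizeof toy_got / sizeof toy_got[0]; i++)
        if (strcmp(toy_got[i].symbol, symbol) == 0)
            return toy_got[i].address(arg);
    return -1;   /* unknown symbol */
}
```

The caller never names add_one or twice directly; it only ever goes
through the table, which is the extra layer of indirection.
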

Note, you always know the relative address of the GOT, because
although the base address might change, the difference between your code
and where the GOT is will not. This means that if you need to load an
address from the GOT, the easiest way is to load via an offset from the
current instruction to the GOT entry. The compiler knows the offset of
the current instruction within the binary (note it can't know the
current instruction address, because the binary might be anywhere in
memory), so it wants to say

    load the address at (CURRENT_INSTRUCTION + OFFSET_TO_GOT_ENTRY).

386 just can't do this -- there is no way to address memory relative to
the current instruction pointer. The only way you can do it is to keep a
pointer to the GOT in a register (%ebx), and then offset from that.
This wastes a whole register, and when you only have a few, as on the
386, this is a big killer.

AMD64 fixes this by allowing you to offset from the current instruction
pointer (%rip). This frees up a register, and changes the ABI by
removing the distinction between the Absolute PLT and PIC PLT.
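
To see the difference, consider reading a global variable from PIC
code. The C below is trivial; the interesting part is the comment, which
shows roughly what GCC tends to emit with -fPIC. Treat the assembly as
an illustration only: the exact instructions and the register chosen
vary with compiler version and flags, and the names counter and
read_counter are made up for this sketch.

```c
/* Defined here so the sketch is self-contained; in a real shared
 * library this would typically live in another object. */
int counter = 0;

/*
 * Built with -fPIC, the generated code is along these lines:
 *
 *   386:    call  __x86.get_pc_thunk.bx      # %ebx = current address
 *           addl  $_GLOBAL_OFFSET_TABLE_, %ebx
 *           movl  counter@GOT(%ebx), %eax    # address from the GOT
 *           movl  (%eax), %eax               # the value itself
 *
 *   AMD64:  movq  counter@GOTPCREL(%rip), %rax
 *           movl  (%rax), %eax
 */
int read_counter(void)
{
    return counter;
}
```

On 386 a register is tied up holding the GOT pointer; on AMD64 the GOT
entry is addressed directly from the instruction pointer.
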

The Procedure Linkage Table (PLT) is a further enhancement that
facilitates lazy binding. The PLT is a set of "stubs" that point to a
fix up function in the dynamic loader. At first, the GOT entry for each
function points to the PLT entry for that function.

When you call the function, you don't go directly to it; you load
its address via the GOT and then jump to that address. As mentioned, at
first this points to the PLT stub. The stub calls the lookup function in
the dynamic loader, which goes off and finds the real function (this
might actually be in another shared library that needs to be loaded, for
example). As arguments to this lookup function you pass the function
name you're looking for (obviously) and the GOT entry of the original
call. The dynamic loader finds the function, but then additionally
fixes up the GOT entry to no longer point to the PLT stub, but to
point directly to the required function. This means the next time you
load from the GOT, you get the direct address of the function without
the overhead of the PLT stub again.
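
The fix up dance can be simulated with function pointers. This is a toy
model, not how the dynamic loader is implemented: got_slot, plt_stub,
resolve and resolver_calls are all hypothetical names, and the "lookup"
here just returns a function that is already in the same file.

```c
typedef int (*func_t)(int);

static int real_function(int x) { return x * 2; }
static int plt_stub(int x);

/* Toy GOT slot: starts out pointing at the PLT stub. */
static func_t got_slot = plt_stub;

/* Counts lookups so the one-time cost is visible. */
int resolver_calls = 0;

/* Stand-in for the dynamic loader's lookup function. */
static func_t resolve(func_t *slot)
{
    resolver_calls++;
    *slot = real_function;    /* fix up the GOT entry */
    return *slot;
}

static int plt_stub(int x)
{
    /* First call only: find the real function, then continue to it. */
    return resolve(&got_slot)(x);
}

/* Every call loads the target from the GOT slot and jumps to it. */
int call_function(int x)
{
    return got_slot(x);
}
```

The first call_function() goes through plt_stub and bumps
resolver_calls; every later call reaches real_function directly.
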

IA64, which also allows IP relative addressing, similarly doesn't have a
distinction between absolute and PIC PLTs.