Some notes on ptrace for IA64

As background you need to understand the instruction format of IA64. Itanium groups instructions into groups of three called "bundles". Each of the three instructions sits in a slot (slot0-2). Each instruction is 41 bits, and there are 5 bits of template information (making for 128 bit bundles). There are rules about which instructions can be bundled together and in what order they may come (the templates). This allows the compiler to determine optimal bundling ... the theory being that the compiler has more information about what is happening (having access to the source code) so it can make best use of the processor resources, rather than the processor having to guess what is happening at runtime. This is why it is important to use a good compiler to get good results out of the Itanium processor.

Using the Linux ptrace() call we can step through the instructions a program is executing.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/ptrace.h>
#include <sys/wait.h>

char *prog_name;

#include <asm/ptrace.h>
#include <asm/ptrace_offsets.h>

static int lastslot = 0;

union bundle_t {
        struct {
                struct {
                        unsigned long template  : 5 ;
                        unsigned long slot0     : 41;
                        unsigned long bot_slot1 : 18;
                } word0;
                struct  {
                        unsigned long top_slot1 : 23;
                        unsigned long slot2     : 41;
                } word1;
        } bitfield;
        unsigned long array[2];
};

void
print_instruction (int child_pid, int state)
{
        long scnum;
        long ip, slot;
        union bundle_t bundle;

        /* cr.iip holds the address of the current bundle; psr.ri (bits
         * 41-42 of the PSR) holds the slot within that bundle; r15 holds
         * the system call number, which becomes interesting around system
         * calls as discussed below. */
        ip = ptrace (PTRACE_PEEKUSER, child_pid, PT_CR_IIP, 0);
        slot = (ptrace (PTRACE_PEEKUSER, child_pid, PT_CR_IPSR, 0) >> 41) & 0x3;
        scnum = ptrace (PTRACE_PEEKUSER, child_pid, PT_R15, 0);

        printf("%lx %d\n", ip, slot);
}

int
main (int argc, char **argv, char **envp)
{
        int status, pid, child_pid, state = 1, arg = 1;

        prog_name = argv[0];

        child_pid = fork ();
        if (child_pid == 0)
        {
                ptrace (PTRACE_TRACEME, 0, 0, 0);
                execve (argv[arg], argv + arg, envp);
                printf ("%s: execve failed (errno=%d)\n", prog_name, errno);
                exit(-2);
        }

        while (1)
        {
                pid = wait4 (-1, &status, 0, 0);
                if (pid == -1)
                {
                        if (errno == EINTR)
                                continue;

                        printf ("%s: wait4() failed (errno=%d)\n", prog_name, errno);
                }

                if (WIFSIGNALED (status) || WIFEXITED (status)
                    || (WIFSTOPPED (status) && WSTOPSIG (status) != SIGTRAP))
                {
                        if (WIFEXITED (status))
                        {
                                printf ("%s: exit status %d\n", prog_name, WEXITSTATUS (status));
                                break;
                        }
                        else if (WIFSIGNALED (status))
                        {
                                printf ("%s: terminated by signal %d\n",
                                        prog_name, WTERMSIG (status));
                                break;
                        }
                        else
                                printf ("%s: got signal %d\n", prog_name, WSTOPSIG (status));
                }

                print_instruction (child_pid, state);
                ptrace (PTRACE_SINGLESTEP, child_pid, 0, 0);

        }
        return 0;
}

This will produce a lot of output, which should show you increasing instruction pointer and slot values. The trap is raised after the instruction has executed, so to see what you just executed you have to go back one slot.

...
20000000000043d0 0
20000000000043d0 1
20000000000043d0 2
20000000000043e0 0
20000000000043e0 1
20000000000043e0 2
...
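As an aside, the bundle_t union defined above can be used to pull the raw 41-bit instruction words out of the bundle the child is stopped in. This is a sketch of my own (not part of the program above) of something that could go in print_instruction(), reusing the bundle variable already declared there:

        unsigned long insn0, insn1, insn2;

        /* Read the 128-bit bundle at ip in two 64-bit chunks. */
        bundle.array[0] = ptrace (PTRACE_PEEKTEXT, child_pid, ip, 0);
        bundle.array[1] = ptrace (PTRACE_PEEKTEXT, child_pid, ip + 8, 0);

        /* slot1 straddles the two words: its low 18 bits live at the top
         * of word0 and its high 23 bits at the bottom of word1. */
        insn0 = bundle.bitfield.word0.slot0;
        insn1 = ((unsigned long) bundle.bitfield.word1.top_slot1 << 18)
                | bundle.bitfield.word0.bot_slot1;
        insn2 = bundle.bitfield.word1.slot2;

        printf ("template %lx slots %011lx %011lx %011lx\n",
                (unsigned long) bundle.bitfield.word0.template,
                insn0, insn1, insn2);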

So far, that's all quite straightforward. The only tricky bit comes around system calls. To make a system call on IA64 you put the system call number into r15 and execute a break 0x100000 instruction (well, you used to, until fast system calls were introduced).

But you will have issues single-stepping around the break system call, because it has a higher priority than the single-step trap. This means that the system call will be handled, the instruction pointer updated and the next instruction executed before you get your fault.

To illustrate with an example, imagine a function like

400000000000e500 :
400000000000e500:       01 10 24 02 80 05       [MII]       alloc r2=ar.pfs,9,1,0
400000000000e506:       f0 00 80 00 42 00                   mov r15=r32
400000000000e50c:       00 00 00 08                         break.i 0x100000;;
400000000000e510:       13 00 fc 15 06 bb       [MBB]       cmp.eq p0,p6=-1,r10
400000000000e516:       41 00 00 42 00 00             (p06) br.ret.sptk.few b0
400000000000e51c:       30 51 00 41                         br.cond.spnt.few 4000000000013640 <__syscall_error>;;

You're going to see output like

400000000000e500 0 <--- after call into function
400000000000e500 1 <--- slot 0 of syscall first bundle
400000000000e500 2 <--- slot 1 of syscall first bundle
                   <--- system call handled (nothing printed)
400000000000e510 1 <--- slot 0 of syscall second bundle
400000000000e510 2 <--- etc

Of course, you can change the ptrace argument to PTRACE_SYSCALL and you will get two faults at 0x400000000000e50c ... one on entry and one on exit.
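If you go that way, the loop body above only changes slightly. A minimal sketch of my own (not part of the original program), reusing the r15 observation from earlier:

                /* Instead of single-stepping, stop at system call entry
                 * and exit.  At the entry stop r15 holds the system call
                 * number. */
                long scnum = ptrace (PTRACE_PEEKUSER, child_pid, PT_R15, 0);
                printf ("%s: syscall %ld\n", prog_name, scnum);
                ptrace (PTRACE_SYSCALL, child_pid, 0, 0);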

On the Linux development model

Since 2.6 it seems that the distinction between the "stable" series and "unstable" series has pretty much disappeared. As someone who needs to keep up with current developments, this is often a real pain, as it is like trying to build a house on quicksand. But one thing I hadn't thought of was that this increases compatibility, as evidenced by what happens when things go the other way ... in this post Andrew Morton suggests that SUSE may have ended up creating two interfaces to the same thing by releasing before something was accepted into the kernel.org sources.

So there's one advantage to having the moving releases as we do now -- nobody has any excuse not to keep pushing their stuff into the official trees.

Comparing some of the 386, AMD64 and IA64 ABIs

Apart from the obvious 32/64 bit distinction between 386 and AMD64, there are two other interesting comparisons: parameter passing conventions and position independent code conventions.

Parameter passing: on 386, parameters are passed on the stack. On AMD64 the first six "integer" arguments (anything that fits in a 64 bit register, basically) are passed in registers, and similarly some floats can be passed in SSE registers; only after this is data passed on the stack. On IA64, the first 8 arguments are passed in registers, whilst the rest are put on the stack.
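As a concrete example of my own (the function is made up, the register assignments are from the respective ABIs), consider a function taking eight integer arguments:

long
sum8 (long a, long b, long c, long d, long e, long f, long g, long h)
{
        /* On AMD64, a..f arrive in rdi, rsi, rdx, rcx, r8 and r9, while g
         * and h are pushed on the stack by the caller.  On IA64 all eight
         * arrive in the output registers (out0-out7), with no memory
         * traffic at all. */
        return a + b + c + d + e + f + g + h;
}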

On both AMD64 and IA64 there is an extra area just below the bottom of the current stack frame: a 16 byte "scratch area" on IA64, and a 128 byte "red zone" on AMD64. I would suggest that the smaller IA64 scratch area is because of register windowing, which AMD64 does not have. On both architectures this area is reserved and not modified by signal or interrupt handlers. "Leaf functions" (functions that do not call other functions) can use this area as their entire stack frame, saving some considerable overhead.

Varargs functions cause some confusion on AMD64/IA64, since arguments might be floats or might be integers, meaning they should be passed in either float/SSE or general registers respectively. On AMD64, functions known to be varargs functions get a prologue that saves all the argument registers to a "register save area" with a known layout (the caller also passes an upper bound on the number of floating point arguments, to avoid saving unnecessary registers). Then, as you use the va_arg macro to go through the arguments, you grab them from the register save area. On IA64, you arrange for the first 8 arguments to look as though they were passed on the stack: the callee saves the incoming argument registers into its scratch area (2 registers) and into 48 bytes of its own stack frame (the remaining 6 registers). This means all your arguments are stacked together (the incoming parameter list sits up against the scratch area) and va_arg can simply "walk" upwards.
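To make the va_arg walk concrete, here is a small example of my own; the spilling described above is all generated by the compiler behind the scenes, so nothing in the source is architecture specific:

#include <stdarg.h>
#include <stdio.h>

/* Sum 'count' long arguments.  On AMD64 the prologue has already dumped
 * the argument registers into the register save area, on IA64 into the
 * scratch area and 48 bytes of stack; either way va_arg just walks
 * through them. */
long
sum (int count, ...)
{
        va_list ap;
        long total = 0;
        int i;

        va_start (ap, count);
        for (i = 0; i < count; i++)
                total += va_arg (ap, long);
        va_end (ap);

        return total;
}

int
main (void)
{
        printf ("%ld\n", sum (3, 1L, 2L, 3L));  /* prints 6 */
        return 0;
}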

Functions without prototypes are a bit more tricky; IA64 suggests that if a float is passed to a function with undeclared parameters, it should be copied to both the first general purpose register and the first floating point register, just to be safe. AMD64 doesn't seem to make such allowances for you. For example, on IA64

ianw@lime:/tmp$ cat function.c
void function(float f)
{
        printf("%f\n", f);
}
ianw@lime:/tmp$ cat test.c
extern function();

int main(void)
{
        float f = 10000.01;
        function(f);
}
ianw@lime:/tmp$ gcc -o test test.c function.c
ianw@lime:/tmp$ ./test
10000.009766

That same code on AMD64 returns 0.
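My guess at the fix (it isn't part of the original example) is simply to give the caller a real prototype, so the compiler knows to pass a float rather than falling back to the no-prototype rules:

extern void function (float f);    /* in test.c, instead of "extern function();" */

With that declaration the argument goes out in the first floating point register on both architectures, and both should print the same 10000.009766.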

IP relative addressing: Position Independent Code (PIC) is code that can be loaded anywhere into memory and still work. This is important because shared libraries may not always be at the same address, since other shared libraries might be loaded before or after them, etc. To maintain position independence you can't rely on the base address of any code (because it might change), so you add a layer of indirection to your calls. In Linux/ELF land this is done with a Global Offset Table (GOT).

You can think of the GOT as a big two-column list that maps a symbol to its "real address". Thus, instead of loading the symbol directly, you load the entry from the GOT, and then load from that address to find the real thing.

Note that you always know the relative address of the GOT, because although the base address might change, the difference between your code and the GOT will not. This means that if you need to load an address from the GOT, the easiest way is to load the GOT entry via an offset from the current instruction. The compiler knows this offset (note it can't know the current instruction's absolute address, because the binary might be loaded anywhere in memory), so it wants to say "load the address at (CURRENT_INSTRUCTION - OFFSET_TO_GOT_ENTRY)".

386 just can't do this -- there is no way to load at an offset from the current instruction pointer. The only way you can do it is to keep a pointer to the GOT in a register (%ebx, by convention), and then offset from that. This wastes a whole register, and when you only have a few, as on the 386, that is a big killer.

AMD64 fixes this and allows you to offset from the current instruction pointer. This frees up a register, and changes the ABI by removing the distinction between the Absolute PLT and PIC PLT.

The PLT is a further enhancement that facilitates lazy binding. The PLT is a set of "stubs" that point to a fix-up function in the dynamic loader. At first, the GOT entry for each function points to the PLT stub for that function.

When you call the function, you don't go directly to it; you load its value via the GOT and then jump to that value. As mentioned, at first this points to the PLT stub. This calls the lookup function in the dynamic loader, which goes off and finds the real function (it might be in another shared library that needs to be loaded, for example). As arguments to this lookup function you pass the function name you're looking for (obviously) and the GOT entry of the original call. The dynamic loader finds the function, but then additionally fixes up the GOT entry so it no longer points to the PLT stub, but directly to the required function. This means the next time you load from the GOT you get the direct address of the function, without the overhead of going via the PLT stub again.
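The mechanics are easier to see in a little mock-up. This is purely conceptual C of my own (the real thing lives in the dynamic loader and the PLT stubs are assembly); the names got_entry, plt_stub and resolve are all made up:

typedef long (*func_t) (long);

static long
real_function (long x)              /* the target, in some shared library */
{
        return x * 2;
}

static func_t got_entry;            /* one GOT slot, for real_function */

static long plt_stub (long x);      /* forward declaration */

/* The dynamic loader's lookup: find the symbol, patch the GOT entry so
 * future calls skip the stub, and return the target. */
static func_t
resolve (const char *name, func_t *got_slot)
{
        func_t target = real_function;  /* pretend we searched the libraries */

        (void) name;                    /* a real loader would use this */
        *got_slot = target;
        return target;
}

/* The PLT stub: the first call lands here, resolves, then calls through. */
static long
plt_stub (long x)
{
        return resolve ("real_function", &got_entry) (x);
}

int
main (void)
{
        got_entry = plt_stub;           /* initially the GOT points at the stub */
        got_entry (1);                  /* slow path: resolve, patch, call */
        got_entry (2);                  /* fast path: straight to real_function */
        return 0;
}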

IA64, allowing IP relative addressing, similarly doesn't have a distinction between absolute and PIC PLTs.

Mandrake 10 for Itanium 2

Seeing as I have broken everyone's access to the main server at work by blowing out the IP quota downloading the DVD ISO of the new Mandrake 10 for Itanium 2, I should at least write a few notes.

The installation was on a stock rx2600 with nothing really very interesting. The first problem was that it doesn't seem to boot from the EFI menu. This was easily fixed by dropping into the EFI console and manually booting the included elilo. That leads to the second problem, which is that by default the console goes to the VGA output, which isn't great when you're using the management card (as would be the usual case with an Itanium machine, I'm guessing). This is also easily fixed by appending console=ttyS0 ... but the default elilo boots straight into Linux without giving you a chance to specify options, so you have to quickly break into elilo while it's decompressing the kernel.

At this stage, I managed to boot the installation. I just kept pressing enter until it stopped, when I realised that it was launching the X11 installer. This is a bad choice for an Itanium machine ... especially for a distribution aimed at clusters. I was left with no apparent way to get the installer working ... at a guess I decided to add text as a kernel command line parameter, which seems to drop you into the text installer.

You have to press enter each time it loads a module, and it seems to be loading different modules than it says it is.

The text install gives all sorts of whacky characters, and highlights don't work correctly. This might be a terminal setting for the management card, but I'm not sure.

Once I figured out how to tell the partitioner to auto-invoke and just partition the disk how it wanted, the next step simply hung, doing nothing. I restarted and tried as best I could to manually partition, and it still hung doing nothing. I tried again telling it to use the existing partitions, and it still hung.

So I currently haven't got any further. I'll update this if I ever do (I'm trying to subscribe to the beta list but it doesn't seem to exist), but so far the Debian installer is looking pretty good.

On threads under Linux

Other questions that come up about threads

How many threads can I run? This depends on a number of things:

  • The ulimit of the user. Set the number of threads with ulimit -u (or somewhere like /etc/security/limits.conf).

  • The default stack size. By default, the stack size of a new thread is quite large, in the region of megabytes, so it's not going to take you long to run out of memory for new stacks. Luckily, pthread_attr_setstacksize() allows you to use smaller stacks (see the sketch after this list).

  • The kernel. The kernel won't allow you to fill up your entire memory with thread descriptors. See kernel/fork.c:fork_init()

    /*
     * The default maximum number of threads is set to a safe
     * value: the thread structures can take up at most half
     * of memory.
     */
    max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE);
    

    I believe this works out to around 4000 threads on a 256MB x86 machine (65536 pages divided by 8 * 8KB / 4KB gives 4096); YMMV of course.
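As a rough sketch of the stack size point above (my own example; the 64kB figure is an arbitrary choice, the real floor is PTHREAD_STACK_MIN):

#include <pthread.h>
#include <stdio.h>

/* Create a thread with a much smaller stack than the default, so many
 * more threads can fit into the available memory. */
static void *
worker (void *arg)
{
        return NULL;
}

int
main (void)
{
        pthread_t tid;
        pthread_attr_t attr;

        pthread_attr_init (&attr);
        pthread_attr_setstacksize (&attr, 64 * 1024);

        if (pthread_create (&tid, &attr, worker, NULL) != 0) {
                perror ("pthread_create");
                return 1;
        }
        pthread_join (tid, NULL);
        pthread_attr_destroy (&attr);
        return 0;
}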

What does ps show me? By default, ps should only show you the parent thread. Try with -m for the child threads.