A little tour of linux-gate.so

A few people have noticed linux-gate.so.1 turning up in their binaries with newer libcs and wondered what it is.

ianw@morrison:~$ ldd /bin/ls
        linux-gate.so.1 =>  (0xffffe000)
        librt.so.1 => /lib/tls/librt.so.1 (0xb7fdb000)
        libacl.so.1 => /lib/libacl.so.1 (0xb7fd5000)
        libc.so.6 => /lib/tls/libc.so.6 (0xb7e9c000)
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0xb7e8a000)
        /lib/ld-linux.so.2 (0xb7feb000)
        libattr.so.1 => /lib/libattr.so.1 (0xb7e86000)

It's actually a shared library that is exported by the kernel to provide a way to make system calls faster. Most architectures have ways of making system calls that are less expensive than taking a full trap; sysenter on x86 (syscall on AMD I think) and epc on IA64 for example.

If you want the gist of how it works, first we can pull it apart. The following program reads and dumps the .so on an x86 machine. Note it's just a kernel page, so you could simply dump getpagesize() bytes should you want to; though you can't directly call write on it (i.e. you need to memcpy it into a buffer and then write that out). Below I pull apart the headers.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <elf.h>
#include <alloca.h>

int main(void)
{
  int i;
  unsigned size = 0;
  char *buf;

  Elf32_Ehdr *so = (Elf32_Ehdr*)0xffffe000;
  Elf32_Phdr *ph = (Elf32_Phdr*)((void*)so + so->e_phoff);

  size += so->e_ehsize + (so->e_phentsize * so->e_phnum);

  for (i = 0 ; i < so->e_phnum; i++)
    {
      size += ph->p_memsz;
      ph = (void*)ph + so->e_phentsize;
    }

  buf = alloca(size);
  memcpy(buf, so, size);

  int f = open("./kernel-gate.so", O_CREAT|O_WRONLY, S_IRWXU);

  int w = write(f, buf, size);

  printf("wrote %d (%s)\n", w, strerror(errno));

  close(f);

  return 0;
}

At this stage you should have a binary you can look at with, say, readelf.

ianw@morrison:~/tmp$ readelf --symbols ./kernel-gate.so

Symbol table '.dynsym' contains 15 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
  [--snip--]
    11: ffffe400    20 FUNC    GLOBAL DEFAULT    6 __kernel_vsyscall@@LINUX_2.5
    12: 00000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.5
    13: ffffe440     7 FUNC    GLOBAL DEFAULT    6 __kernel_rt_sigreturn@@LINUX_2.5
    14: ffffe420     8 FUNC    GLOBAL DEFAULT    6 __kernel_sigreturn@@LINUX_2.5

__kernel_vsyscall is the function you call to do the fast syscall magic. But I bet you're wondering just how that gets called?

It's easy if you poke inside the auxiliary vector that is passed by the kernel to ld, the dynamic loader. There are a couple of ways to see it: via an environment flag, by peeking into /proc/self/auxv, or on PowerPC it is passed as the fourth argument to main().

ianw@morrison:~/tmp$ LD_SHOW_AUXV=1 /bin/true
AT_SYSINFO:      0xffffe400
AT_SYSINFO_EHDR: 0xffffe000
AT_HWCAP:    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0x8048034
AT_PHENT:        32
AT_PHNUM:        7
AT_BASE:         0xb7feb000
AT_FLAGS:        0x0
AT_ENTRY:        0x8048960
AT_UID:          1000
AT_EUID:         1000
AT_GID:          1000
AT_EGID:         1000
AT_SECURE:       0
AT_PLATFORM:     i686

Notice how the AT_SYSINFO entry refers to the fast system call function in our kernel shared object? Also notice that AT_SYSINFO_EHDR points to the library itself.

If you start to poke through the glibc source code and look how the sysinfo entry is handled you can see the dynamic linker will choose to use the library function for system calls if it is available. If that flag is never passed by the kernel it can fall back to the old way of doing things.
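
If you want to poke at the auxiliary vector yourself, here's a minimal sketch (assuming a 32 bit x86 box, and that your <elf.h> provides Elf32_auxv_t, AT_SYSINFO and AT_SYSINFO_EHDR) that walks /proc/self/auxv and prints the two interesting entries:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <elf.h>

int main(void)
{
  Elf32_auxv_t auxv;
  int fd = open("/proc/self/auxv", O_RDONLY);

  if (fd == -1)
    return 1;

  /* the aux vector is a list of (type, value) pairs terminated by AT_NULL */
  while (read(fd, &auxv, sizeof(auxv)) == sizeof(auxv) &&
         auxv.a_type != AT_NULL)
    {
      if (auxv.a_type == AT_SYSINFO)
        printf("AT_SYSINFO:      0x%lx\n", (unsigned long)auxv.a_un.a_val);
      if (auxv.a_type == AT_SYSINFO_EHDR)
        printf("AT_SYSINFO_EHDR: 0x%lx\n", (unsigned long)auxv.a_un.a_val);
    }

  close(fd);
  return 0;
}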

IA64 works in the same way, although we keep our kernel shared library at 0xa000000000000000. You can see how the shared object is quite an elegant design that allows maximum compatibility across and within architectures, since it abstracts the calling mechanism away from userspace. A 386 can call through the library the same way a Pentium 4 does, and the kernel will make sure the appropriate thing happens in __kernel_vsyscall.

Why does mdadm drop a drive when creating a RAID 5 array?

I've had the opportunity to play with software RAID over the last few days, and one thing that confused the heck out of me was why, when I created a fresh RAID 5 array with a few disks, one of the disks came up missing from the start.

Eventually I found this in the mdadm man page:

Normally mdadm will not allow creation of an array with only one device, and will try to create a raid5 array with one missing drive (as this makes the initial resync work faster). With --force, mdadm will not try to be so clever.

The mdadm code enlightens me a bit more

/* If this is  raid5, we want to configure the last active slot
 * as missing, so that a reconstruct happens (faster than re-parity)

But I'm still wondering why. I think I understand from following the code (if you know the code, please correct any mis-assumptions). Firstly, md.c schedules a resync when it notices things are out of sync, like when an array is created and the drives are not marked in sync or when there is a missing drive and a spare.

md goes through the sectors in the disk one by one (doing a bit of rate limiting) and calls sync_request for the underlying RAID layer for each sector. For RAID5 this firstly converts the sector into a stripe and sets some flags on that stripe (STRIPE_SYNCING). Each stripe structure has a buffer for each disk in the array, amongst other things (struct stripe_head in include/linux/raid/raid5.h). It then calls handle_stripe to, well, handle the stripe.

/* maybe we need to check and possibly fix the parity for this stripe
 * Any reads will already have been scheduled, so we just see if enough data
 * is available
 */
if (syncing && locked == 0 &&
    !test_bit(STRIPE_INSYNC, &sh->state) && failed <= 1) {
    set_bit(STRIPE_HANDLE, &sh->state);
    if (failed == 0) {
        char *pagea;
        if (uptodate != disks)
            BUG();
        compute_parity(sh, CHECK_PARITY);
        uptodate--;
        pagea = page_address(sh->dev[sh->pd_idx].page);
        if ((*(u32*)pagea) == 0 &&
            !memcmp(pagea, pagea+4, STRIPE_SIZE-4)) {
            /* parity is correct (on disc, not in buffer any more) */
            set_bit(STRIPE_INSYNC, &sh->state);
        }
    }
    if (!test_bit(STRIPE_INSYNC, &sh->state)) {
        if (failed==0)
            failed_num = sh->pd_idx;
        /* should be able to compute the missing block and write it to spare */
        if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
            if (uptodate+1 != disks)
                BUG();
            compute_block(sh, failed_num);
            uptodate++;
        }
        if (uptodate != disks)
            BUG();
        dev = &sh->dev[failed_num];
        set_bit(R5_LOCKED, &dev->flags);
        set_bit(R5_Wantwrite, &dev->flags);
        locked++;
        set_bit(STRIPE_INSYNC, &sh->state);
        set_bit(R5_Syncio, &dev->flags);
    }
}

We can inspect one particular part of that code that gets triggered when we're running a sync. If we don't have a failed drive (i.e. failed == 0) then we call into compute_parity to check the parity of the current stripe sh (sh for stripe_head). If the parity is correct, we flag the stripe as in sync and carry on.

However, if the parity is incorrect, we fall through to the next if statement. The first check asks whether we have a failed drive; if we don't, we designate the disk holding the parity for this stripe (sh->pd_idx) as the "failed" one. This is where the optimisation comes in -- whichever drive is marked failed gets passed straight to compute_block, meaning we always bring that "failed" disk into sync with the other drives. We then set some flags so that the lower layers know to write that drive out.

The alternative is to update the parity, which in RAID5 is spread across the disks. This means you're constantly stuffing up nice sequential reads by making the disks seek backwards to write parity for previous stripes. Under this scheme, for a situation where parity is mostly wrong (e.g. creating a new array) you just read from the other drives and bring that failed drive into sync with the others. Thus the only disk doing the writing is the spare that is being rebuilt, and it is writing in a nice, sequential manner. And the other disks are reading in a nice sequential manner too. Ingenious huh!
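
If the parity side of this is unfamiliar: RAID5 parity is just the XOR of the data blocks in a stripe, so regenerating any one member -- data or parity -- is simply the XOR of all the others. The toy below (purely illustrative, nothing like the real raid5.c) shows why rebuilding the "failed" disk needs nothing but reads from the other members:

#include <stdio.h>
#include <string.h>

#define NDISKS      4
#define STRIPE_SIZE 8       /* tiny blocks, just for the demo */

/* regenerate the 'missing' block as the XOR of all the other blocks */
static void compute_block_toy(unsigned char disk[NDISKS][STRIPE_SIZE], int missing)
{
  int d, i;

  memset(disk[missing], 0, STRIPE_SIZE);
  for (d = 0; d < NDISKS; d++)
    if (d != missing)
      for (i = 0; i < STRIPE_SIZE; i++)
        disk[missing][i] ^= disk[d][i];
}

int main(void)
{
  unsigned char disk[NDISKS][STRIPE_SIZE] = {
    "data-00", "data-01", "data-02", ""   /* last member is "missing" */
  };
  int i;

  /* reads from the other members, one sequential write to the rebuilt one */
  compute_block_toy(disk, NDISKS - 1);

  for (i = 0; i < STRIPE_SIZE; i++)
    printf("%02x ", disk[NDISKS - 1][i]);
  printf("\n");
  return 0;
}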

I did run this by Neil Brown to make sure I wasn't completely wrong, and he pointed me to a recent email where he explains things. If anyone knows, he does!

remap_file_pages example

remap_file_pages modifies an existing mmap of a file to point to different pages. When you mmap a file, the first page in memory points to the first page of the file on disk, the second page in memory to the second page on disk, etc.

If you don't want this, e.g. you want the first page in memory to refer to some other page of the file, you would usually have to do multiple mmap operations. If you do this a lot, it can slow the kernel down, since it has to keep track of all those separate mappings.

Instead, mmap the file and then use remap_file_pages to modify that mmap to point into different places in the file. In the example below, we create a temporary file, map it into memory and then "turn it around" ... that is the first page in memory is the last page on the disk, and so on.

#define _GNU_SOURCE             /* for remap_file_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <string.h>

#define DEFAULT_MMAP_SIZE (1024*1024*10) //10MB

#define PASSED 0
#define FAILED 1

static inline char page_hash(int page)
{
        return (char)(page % 256);
}

static inline char *page_offset_to_addr(char *start, int page)
{
        return start + (getpagesize() * page);
}

int genfile(char *file, const size_t size)
{
        ssize_t bytes = 0;
        int fd = 0;
        char *buf;
        int i, page;

        /* mkstemp fills in the XXXXXX template and returns an open fd,
         * or -1 on error */
        fd = mkstemp(file);
        if (fd == -1)
                return FAILED;

        buf = malloc( getpagesize() );

        printf("Writing out %d pages to %s\n", (int)(size / getpagesize()), file);

        for (page = 0 ; page < (size / getpagesize()) ; page++)
        {
                for (i = 0 ; i < getpagesize() ; i++)
                        buf[i] = page_hash(page);

                bytes += write(fd, buf, getpagesize());
        }
        close(fd);
        sync();
        return PASSED;
}

int compare(char *remapped_file, char *file, unsigned long size)
{
        char *mmap_orig_file;
        int i = 0, fd = open(file, O_RDONLY);
        int err = FAILED;

        if (!remapped_file || fd == -1)
                return FAILED;

        /* map in the file from disk, again */
        if ((mmap_orig_file =
             mmap(0, size, PROT_READ, MAP_SHARED, fd,
                  0)) == MAP_FAILED) {
                goto out_mmap_fail;
        }

        /* walk the original backwards and compare it to the remapped
         * file going forwards, page by page.  they should be the
         * same.
         */
        int cur_remap_page = 0;
        int cur_orig_page  = (size / getpagesize()) - 1;

        while (cur_orig_page >= 0)
        {
                printf("compare %05d -> %05d\r", cur_remap_page, cur_orig_page);
                if ((i = memcmp(page_offset_to_addr(mmap_orig_file, cur_orig_page),
                                page_offset_to_addr(remapped_file, cur_remap_page),
                                getpagesize())) != 0) {
                        err = FAILED;
                        goto out;
                }
                cur_remap_page++;
                cur_orig_page--;
        }
        printf("\n");
        err = PASSED;
 out:
        munmap(mmap_orig_file, size);
 out_mmap_fail:
        close(fd);
        return err;
}

int main(void)
{
        int fd;
        const char *tmp_default = "/tmp/remap-testXXXXXX";
        size_t size = DEFAULT_MMAP_SIZE;
        char file[256], *addr;

        int err = FAILED;
        int sys_pagesize = getpagesize();

        strcpy(file, tmp_default);
        genfile(file, size);

        /*
         *  Map in the file
         */
        fd = open(file, O_RDWR);
        if ((addr =
             mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
                  0)) == MAP_FAILED)
                goto out;

        /* Turn the file around with remap_file_pages; that is the
         * last page of the file on disk becomes the first page in
         * memory, the second last page the second page in memory,
         * etc.
         */
        int cur_mmap_page = (size / sys_pagesize) - 1;
        int cur_file_page = 0;
        while (cur_mmap_page >= 0)
        {
                if (remap_file_pages(page_offset_to_addr(addr, cur_mmap_page),
                                     sys_pagesize, 0, cur_file_page, 0) == -1) {
                        perror("remap_file_pages");
                        goto out;
                }
                cur_mmap_page--;
                cur_file_page++;
        }

        err = compare(addr, file, size);
        if (err == FAILED)
                printf("Test Failed!\n");
        else
                printf("Test Passed!\n");
 out:
        close(fd);
        unlink(file);
        exit(err);
}

magic sysrq via /proc

If you can't type magic-sysrq for some reason, you can echo the magic sysrq key to /proc/sysrq-trigger for the same effect.

baci:~# echo t > /proc/sysrq-trigger
SysRq : Show State

                                                       sibling
  task                 PC          pid father child younger older
init          S a0000001006363c0     0     1      0     2               (NOTLB)

Call Trace:
 [] schedule+0xbf0/0x1280
                                sp=e00000407fe07d80 bsp=e00000407fe010e8
 [] schedule_timeout+0x100/0x1a0
                                sp=e00000407fe07d90 bsp=e00000407fe010b0
 [] do_select+0x270/0x4c0
                                sp=e00000407fe07dd0 bsp=e00000407fe00f80
 [] sys_select+0x590/0x8c0
                                sp=e00000407fe07df0 bsp=e00000407fe00ea0
 [] ia64_ret_from_syscall+0x0/0x20
                                sp=e00000407fe07e30 bsp=e00000407fe00ea0
migration/0   S a000000100081010     0     2      1             3       (L-TLB)

PPC to i386 Cross Compiler

If you're planning on attending the kernel tute at LCA05 but have an Apple laptop, you might be interested in my cross compiler debian packages.

I would highly recommend you don't try this yourself :) Originally I planned to write a little note on how to do it, but after about the third hour I'd forgotten what I'd hacked to get to the point I was at. If you do try, don't bother with gcc-3.4; it has some sort of heisenbug that segfaults the assembler in various ways. And just getting C, and not g++/ada/pascal/java/etc, was also harder than it should be.

Anyway, with the packages installed you need to build the kernel with two extra flags.

$ make CROSS_COMPILE=i386-linux- ARCH=i386

Other than that, the built binary boots fine for me in qemu on my iBook!

update: If you want to compile userspace apps, you will need to use tpkg-install-libc from toolchain-source.

Discontiguous Memory

Understanding why we have discontiguous memory requires some understanding of the chipsets that power Itanium systems. The ZX1 chipset, which powers the HP Integrity line of servers, is one of the more common ones. Other chipsets follow similar principles.

With the ZX1, the processor is not connected directly to the RAM chips, but goes through a memory and I/O controller, or MIO for short. It looks something like

+------------------------+   Processor Bus                       Memory Bus    +-----------------------+
|  Itanium 2 Processor 1 | <--------+        +---------------+       +-------> |       RAM DIMM        |
+------------------------+          |------> |   ZX1 MIO     |<------|         +-----------------------+
|  Itanium 2 Processor 2 | <--------+        +---------------+       +-------> |       RAM DIMM        |
+------------------------+                    | | | | | | | |                  +-----------------------+
                                     Ropes->  | | | | | | | |
                                             +---------------+
                                             |    ZX1 IOA    |
                                             +---------------+
                                                |   |   |
                                               agp pci pci-x

From a top level view, the chipset is broadly divided into the System Bus Adapter (SBA) and the Lower Bus Adapter (LBA). You can think of the SBA as an interface between the bus that the processors sit on and the ropes bus (below). The LBA is the PCI host bus adapter, and contains the IO SAPIC (Streamlined Advanced Programmable Interrupt Controller), the chip that handles interrupts from PCI devices and can be thought of as a smart "helper" that controls when the CPU sees interrupts.

A rope is described as a "fast, narrow point-to-point connection between the zx1 mio and a rope guest device (the zx1 ioa). The rope guest device is responsible for bridging from the I/O rope to an industry standard I/O bus (PCI, AGP, and PCI-X)". The ZX1 chipset can support up to 8 ropes, which can be bundled.

This is distinct from the typical north bridge/south bridge layout used on 386 systems (where the north bridge connects to the CPU/memory and the south bridge, and the south bridge connects to the north bridge and slower I/O peripherals). This arrangement does however introduce some more interesting timing problems.

A simplified map of the memory layout presented by the ZX1 chipset MIO looks like

0xFFF FFFF FFFF +-------------------------------------+
                |               Unused                |
0xF80 0000 0000 +-------------------------------------+
                |  Greater than 4GB MMIO :            |
                |   (not currently used by anything)  |
0xF00 0000 0000 +-------------------------------------+
                |               Unused                |
                                 ....
                |                                     |
0x041 0000 0000 +-------------------------------------+
                |           Memory 1 (3GB)            |
                |                                     |
                |                                     |
0x040 4000 0000 +-------------------------------------+
                |               Unused                |
0x040 0000 0000 +-------------------------------------+
                |           Memory 2 (252GB)          |
                                  ....
                |                                     |
0x001 0000 0000 +-------------------------------------+
                |  Less than 4GB MMIO :               |
                |   Access to firmware, processor     |
                |   reserved space, chipset registers |
                |   etc (3GB)                         |
0x000 8000 0000 +-------------------------------------+
                |           Virtual I/O (1GB)         |
0x000 4000 0000 +-------------------------------------+
                |           Memory 0 (1GB)            |
0x000 0000 0000 +-------------------------------------+

You can see it implements a 44 bit address space that is divided up into a range of different sections. The maximum memory it can theoretically support is 256GB (i.e. Memory 0 + Memory 1 + Memory 2), however physically I think most boxes only come with enough slots for ~16GB of memory. The ZX1 has an extension called the zx1 sme, or Scalable Memory Extension, that allows more memory to be dealt with.

The ZX1 chipset implements an IO-MMU, or I/O memory management unit, for DMA. This is very similar to the standard MMU in that it converts a virtual address into a physical address. This is particularly important for a 32 bit card that can not understand a 64 bit address. By setting up the Virtual I/O region under 4GB (i.e. below the maximum address a 32 bit card can deal with) the IO-MMU can translate the address back to a full 64 bit address and give the 32 bit card access to all memory. So the process goes something like

  1. driver issues DMA request that is in an area above 4GB that a 32 bit card can not deal with
  2. IO-MMU sets up a virtual address in the Virtual I/O region that points to the original address, and issues the DMA request with this low address
  3. The device sends its data back to the low address, which the IO-MMU then translates back to the high address.

The IOTLB in the ZX1 chipset is only big enough to map 1GB at a time, which is why this area is only 1GB. This means there can be contention for this space if many drivers are requesting large DMA transfers -- note however that 64 bit capable cards can bypass this altogether, reducing the problem.
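
To make the translation step concrete, here is a toy model of the idea (nothing to do with the real chipset registers or the kernel's DMA API; the names and the 16KB page size are made up, though the 1GB window base matches the Virtual I/O region in the map above): a small table maps slots in a sub-4GB window back to high physical addresses, so the card only ever sees a low bus address.

#include <stdio.h>

#define PAGE_SHIFT 14                                /* pretend 16KB pages */
#define PAGE_MASK  ((1UL << PAGE_SHIFT) - 1)
#define VIO_BASE   0x40000000UL                      /* 1GB window, under 4GB */
#define VIO_SLOTS  (0x40000000UL >> PAGE_SHIFT)

static unsigned long long iotlb[VIO_SLOTS];          /* slot -> physical frame */

/* map a high physical address into the window; the return value is the
 * low "bus" address handed to the 32 bit card */
static unsigned long io_map(unsigned int slot, unsigned long long phys)
{
  iotlb[slot] = phys >> PAGE_SHIFT;
  return VIO_BASE + ((unsigned long)slot << PAGE_SHIFT)
                  + (unsigned long)(phys & PAGE_MASK);
}

/* what the chipset does when the card's DMA hits the window */
static unsigned long long io_translate(unsigned long bus)
{
  unsigned int slot = (bus - VIO_BASE) >> PAGE_SHIFT;
  return (iotlb[slot] << PAGE_SHIFT) | (bus & PAGE_MASK);
}

int main(void)
{
  unsigned long long high = 0x123456789ULL;          /* a buffer above 4GB */
  unsigned long bus = io_map(0, high);

  printf("card sees 0x%lx, chipset redirects to 0x%llx\n",
         bus, io_translate(bus));
  return 0;
}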

As you can see from the layout above, system RAM is not presented to the processor in a contiguous manner; that is, all the memory in one large block. It is split up into three different sections (Memory 0, 1 and 2) which are placed in different areas of the address space. Memory 0 needs to be around for legacy applications that expect certain things at low addresses. The rest of the < 4GB area is taken up by the low memory mapped I/O region; the physical memory that would have been mapped here is moved into Memory 1, which is located at 0x040 0000 0000 (256GB). Should there be more than 4GB of memory (i.e. more than Memory 0 + Memory 1 can hold) it will be allocated in Memory 2 at 0x001 0000 0000 (4GB).

The other thing you need to know about is how the Linux kernel looks at memory. The kernel keeps track of every single physical frame of memory with an array of struct page entries in a global variable called mem_map. Now the kernel has to be able to easily (i.e. quickly) translate a struct page into a physical address; the simplest way is to say, in effect, "the difference between the address of the struct page I am looking at and the address of mem_map (remember, it's a linear array, so mem_map is the base) is the index of the physical frame in memory. I know that frames are X bytes long, so the physical address is X * index".
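
As a sketch of that idea (illustrative only -- the real kernel macros are more involved, and the 16KB frame size is just an example):

#include <stdio.h>
#include <stdlib.h>

#define FRAME_SIZE 16384UL                   /* assume 16KB frames */

struct page { unsigned long flags; };        /* per-frame bookkeeping */

static struct page *mem_map;                 /* base of the linear array */

/* linear array: (page - mem_map) is the frame index, and each frame is
 * FRAME_SIZE bytes, so the physical address drops straight out */
static unsigned long page_to_phys(struct page *page)
{
  return (unsigned long)(page - mem_map) * FRAME_SIZE;
}

int main(void)
{
  mem_map = calloc(1024, sizeof(struct page));  /* pretend 1024 frames */

  printf("frame 5 starts at physical 0x%lx\n", page_to_phys(&mem_map[5]));

  free(mem_map);
  return 0;
}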

If we keep mem_map as a linear array as above, we hit a serious problem with the memory layout from the ZX1 chipset. Say we have 4 gigabytes of RAM, pushing us into Memory 1. To use the simple indexing scheme of mem_map described above, there would need to be a struct page for each and every frame of memory between Memory 0 and Memory 1, i.e. covering the range 0x000 0000 0000 - 0x041 0000 0000. Say each struct page in the array handles a frame of 16 kilobytes: we need 17,039,360 entries, and at around 40 bytes per entry that makes about 650 megabytes of memory required for the mem_map array! That's a fair chunk of our 4 gigabytes wasted on mapping empty space.
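
The arithmetic is easy enough to check (the 16KB frame and ~40 byte struct page are the assumptions from above):

#include <stdio.h>

int main(void)
{
  unsigned long long top        = 0x4100000000ULL;  /* top of Memory 1 */
  unsigned long long frame_size = 16 * 1024;        /* 16KB frames */
  unsigned long long entry_size = 40;               /* ~bytes per struct page */

  unsigned long long entries = top / frame_size;

  printf("mem_map entries: %llu\n", entries);
  printf("mem_map size   : %llu MB\n", (entries * entry_size) >> 20);
  return 0;
}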

This is also an issue for systems that may not have such a large gap imposed by the chipset, but receive one by participating in a NUMA (non-uniform memory access) system. In a NUMA system, many individual nodes "pool" their memory into one large memory image. In this case, each node of the system will have gaps in its memory map for memory that actually physically resides on another, remote, node. The upshot is that they will suffer from the same wasted space.

VIRTUAL_MEM_MAP and discontiguous memory support are some kernel options that get around this. By mapping the mem_map virtually, you can leave the mem_map sparse and only fill in what you need. Discontiguous memory support is a further option to allocate the memory map efficiently over a number of nodes.

Fun with floating point

You probably already know that a floating point number is represented as sign * significand * radix^exponent. Thus you can represent the number 20.23 with radix 10 (i.e. base 10) as 1 * .2023 * 10^2 or as 1 * .02023 * 10^3.

Since one number can be represented in different ways, we define the normalised version as the one that satisfies ``1/radix <= significand < 1``. You can read that as saying "the leftmost digit of the significand should not be zero".

So when we convert into binary (base 2) rather than base 10, we are saying that the "leftmost digit should not be zero"; hence it can only be a one. In fact, the IEEE standard "hides" the 1 because it is implied by a normalised number, giving you an extra bit of precision in the significand.
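
Both points are easy to see from userspace. frexp() hands back exactly the normalised form defined above (1/2 <= significand < 1 for radix 2), and unpacking the raw bits of an IEEE double shows that the stored significand has no leading one -- it's implied. A quick sketch (assuming the usual 64 bit double layout of 1 sign bit, 11 exponent bits and 52 significand bits; link with -lm):

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdint.h>

int main(void)
{
  /* the normalised form from the definition above: 1/2 <= significand < 1 */
  int exp;
  double sig = frexp(20.23, &exp);
  printf("20.23 = %f * 2^%d\n", sig, exp);

  /* pull the raw IEEE double apart: the exponent is biased by 1023, and
   * the stored significand bits do not include the leading 1 */
  double d = 20.23;
  uint64_t bits;
  memcpy(&bits, &d, sizeof(bits));

  printf("sign %llu, exponent %llu (unbiased %lld), significand 0x%llx\n",
         (unsigned long long)(bits >> 63),
         (unsigned long long)((bits >> 52) & 0x7ff),
         (long long)((bits >> 52) & 0x7ff) - 1023,
         (unsigned long long)(bits & ((1ULL << 52) - 1)));

  return 0;
}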

So to normalise a floating point number you have to shift the significand left a number of times, checking each time whether the leading digit is a one. This is something the hardware can do very fast, since it has to do it a lot. Combine this with an architecture like IA64, which has a 64 bit significand, and you've just found a way to do a really cool implementation of "find the first bit that is not zero in a 64 bit value", a common operation when working with bitfields (it was really David Mosberger who originally came up with that idea in the kernel).

#include <stdio.h>

#define ia64_getf_exp(x)                                        \
({                                                              \
        long ia64_intri_res;                                    \
                                                                \
        asm ("getf.exp %0=%1" : "=r"(ia64_intri_res) : "f"(x)); \
                                                                \
        ia64_intri_res;                                         \
})


int main(void)
{
    long double d = 0x1UL;
    long exp;

    exp = ia64_getf_exp(d);

    printf("The first non-zero bit is bit %ld\n", exp - 65535);

    return 0;
}

Note the processor uses an 82 bit floating point format, with a 17 bit exponent field. The exponent is biased by 0xFFFF (65535) so we can represent positive and negative exponents (i.e. an exponent of zero is represented by 65535, 1 by 65536 and -1 by 65534) without an explicit sign bit.

IA64 uses the floating point registers in other interesting ways too. For example, the clear_page() implementation in the kernel spills zero'd floating point registers into memory because that provides you with the maximum memory bandwidth. The libc bzero() implementation does a similar thing.

rel v rela relocations

A relocation is simply a record that stores

  • an address that needs to be resolved
  • information on how to resolve it; i.e.
      • a symbol name (actually a pointer to the symbol table, which then gives you the actual symbol)
      • the type of relocation; i.e. what to do (this is defined by the ABI)

ELF defines two types of relocations

typedef struct {
  Elf32_Addr    r_offset;  <--- address to fix
  Elf32_Word    r_info;    <--- symbol table pointer and relocation type
} Elf32_Rel;

typedef struct {
  Elf32_Addr    r_offset;
  Elf32_Word    r_info;
  Elf32_Sword   r_addend;
} Elf32_Rela;

Thus RELA relocations have an extra field, the addend.

ianw@baci:~/tmp/fptest$ cat addendtest.c
extern int i[4];
int *j = i + 2;

ianw@baci:~/tmp/fptest$ cat addendtest2.c
int i[4];

ianw@baci:~/tmp/fptest$ gcc -nostdlib -shared -fpic -s -o addendtest2.so addendtest2.c

ianw@baci:~/tmp/fptest$ gcc -nostdlib -shared -fpic -o addendtest.so addendtest.c ./addendtest2.so

ianw@baci:~/tmp/fptest$ readelf -r ./addendtest.so

Relocation section '.rela.dyn' at offset 0x3b8 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
0000000104f8  000f00000027 R_IA64_DIR64LSB   0000000000000000 i + 8

So, sizeof(int) == 4 and j points two ints into i (i.e. i + 8). R_IA64_DIR64LSB is just defined as SYMBOL + ADDEND which means that to fixup this relocation, we need to write to Offset the value of i (which we find and resolve) and add 8 to it.

If you try this on a 386, you will not have the explicit addend as it will use a REL relocation. You will actually have to read the memory at Offset to find the addend (it will be 8).
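
To make the difference concrete, here's a toy of what applying each flavour of fixup looks like (purely illustrative -- the names and values are made up, and the real logic lives in the dynamic linker):

#include <stdio.h>
#include <elf.h>

/* RELA: the addend travels in the relocation record itself */
static void apply_rela(unsigned char *base, const Elf32_Rela *r, Elf32_Addr sym)
{
  Elf32_Addr *target = (Elf32_Addr *)(base + r->r_offset);
  *target = sym + r->r_addend;
}

/* REL: the addend is whatever the linker left at the target address,
 * so it must be read back before being overwritten */
static void apply_rel(unsigned char *base, const Elf32_Rel *r, Elf32_Addr sym)
{
  Elf32_Addr *target = (Elf32_Addr *)(base + r->r_offset);
  *target = sym + *target;
}

int main(void)
{
  Elf32_Addr image[2] = { 0, 8 };               /* REL addend pre-stored at slot 1 */
  unsigned char *base = (unsigned char *)image;
  Elf32_Addr sym = 0x1000;                      /* pretend &i resolved here */

  Elf32_Rela rela = { .r_offset = 0, .r_info = 0, .r_addend = 8 };
  Elf32_Rel  rel  = { .r_offset = sizeof(Elf32_Addr), .r_info = 0 };

  apply_rela(base, &rela, sym);
  apply_rel(base, &rel, sym);

  printf("RELA fixup wrote 0x%x, REL fixup wrote 0x%x\n", image[0], image[1]);
  return 0;
}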

Having to read the memory before you do the relocation has all sorts of inefficiencies, and the only positive is that it saves space (with RELA the binary carries both the extra addend field and a "blank" target spot waiting to be filled in; with REL the addend is kept in that blank spot instead). Most modern architectures have dispensed with REL relocations altogether (IA64, PPC64, AMD64).