remap_file_pages example

Mon 02 May 2005

remap_file_pages modifies an existing mmap of a file to point to different pages. When you mmap a file, the first page in memory points to the first page of the file on disk, the second page in memory to the second page on disk, etc.

If you don't want this, e.g. you want the first page in memory to refer to some other page, you would usually have to multiple mmap operations. If you do this a lot, it can slow the kernel down keeping track of it all.

Instead, mmap the file and then use remap_file_pages to modify that mmap to point into different places in the file. In the example below, we create a temporary file, map it into memory and then "turn it around" ... that is the first page in memory is the last page on the disk, and so on.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <string.h>

#define DEFAULT_MMAP_SIZE (1024*1024*10) //10MB

#define PASSED 0
#define FAILED 1

static inline char page_hash(int page)
{
        return (char)(page % 256);
}

static inline char *page_offset_to_addr(char *start, int page)
{
        return start + (getpagesize() * page);
}

int genfile(char *file, const size_t size)
{
        ssize_t bytes = 0;
        int fd = 0;
        char *buf;
        int i, page;

        if (!mkstemp(file))
                return FAILED;

        fd = open(file, O_RDWR);
        if (fd == -1)
                return FAILED;

        buf = malloc( getpagesize() );

        printf("Writing out %d pages to %s\n", (int)(size / getpagesize()), file);

        for (page = 0 ; page < (size / getpagesize()) ; page++)
        {
                for (i = 0 ; i < getpagesize() ; i++)
                        buf[i] = page_hash(page);

                bytes += write(fd, buf, getpagesize());
        }
        close(fd);
        sync();
        return PASSED;
}

int compare(char *remapped_file, char *file, unsigned long size)
{
        char *mmap_orig_file;
        int i = 0, fd = open(file, O_RDONLY);
        int err = FAILED;

        if (!remapped_file || fd == -1)
                return FAILED;

        /* map in the file from disk, again */
        if ((mmap_orig_file =
             mmap(0, size, PROT_READ, MAP_SHARED, fd,
                  0)) == MAP_FAILED) {
                goto out_mmap_fail;
        }

        /* walk the original backwards and compare it to the remapped
         * file going forwards, page by page.  they should be the
         * same.
         */
        int cur_remap_page = 0;
        int cur_orig_page  = (size / getpagesize()) - 1;

        while (cur_orig_page >= 0)
        {
                printf("compare %05d -> %05d\r", cur_remap_page, cur_orig_page);
                if ((i = memcmp(page_offset_to_addr(mmap_orig_file, cur_orig_page),
                                page_offset_to_addr(remapped_file, cur_remap_page),
                                getpagesize()) != 0)) {
                        err = FAILED;
                        goto out;
                }
                cur_remap_page++;
                cur_orig_page--;
        }
        printf("\n");
        err = PASSED;
 out:
        munmap(mmap_orig_file, size);
 out_mmap_fail:
        close(fd);
        return err;
}

int main(void)
{
        int fd;
        const char *tmp_default = "/tmp/remap-testXXXXXX";
        size_t size = DEFAULT_MMAP_SIZE;
        char file[256], *addr;

        int err = FAILED;
        int sys_pagesize = getpagesize();

        strcpy(file, tmp_default);
        genfile(file, size);

        /*
         *  Map in the file
         */
        fd = open(file, O_RDWR);
        if ((addr =
             mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
                  0)) == MAP_FAILED)
                goto out;

        /* Turn the file around with remap_file_pages; that is the
         * last page of the file on disk becomes the first page in
         * memory, the second last page the second page in memory,
         * etc.
         */
        int cur_mmap_page = (size / sys_pagesize) - 1;
        int cur_file_page = 0;
        while (cur_mmap_page >= 0)
        {
                remap_file_pages(page_offset_to_addr(addr, cur_mmap_page),
                                 sys_pagesize, 0, cur_file_page, 0);
                cur_mmap_page--;
                cur_file_page++;
        }

        err = compare(addr, file, size);
        if (err == FAILED)
                printf("Test Failed!\n");
        else
                printf("Test Passed!\n");
 out:
        close(fd);
        unlink(file);
        exit(err);
}

Xnest for remote Gnome session

Fri 29 April 2005

One fairly nice way to get a remote gnome session with a minimum of fuss is to use Xnest. This just makes a little X server in a window.

Firstly, setup xauth to allow you through

ianw@lime:~$ echo $DISPLAY
:0.0
ianw@lime:~$ xauth list
localhost.localdomain:0  MIT-MAGIC-COOKIE-1  bd72c2....6c8ab8
lime/unix:0  MIT-MAGIC-COOKIE-1  9a5c6...648d4bffd
localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1  9a5c6b555f...bffd

Copy the cookie that corresponds to your current display to a new one for display :1

ianw@lime:~$ xauth add :1 MIT-MAGIC-COOKIE-1  bd72c2....6c8ab8

Then startup Xnest on display :1

ianw@lime:~$ Xnest -auth .Xauthority :1 &

Then fire up an xterm that ssh's to the remote box and starts up something like gnome-session

ianw@lime:~$ xterm -display :1 -e ssh tutti gnome-session

and hopefully it will look something like this!

It's always nice to get a new toy ...

Thu 28 April 2005

Especially when that toy comes from SGI

Our new Altix 350 ... there's a really big box here for you ...

The box New Altix 350, $lots. Shipping it to UNSW, $lots. Watching them play with the box ... priceless

Ubunutu IA64? Hopefully we can get this to boot so we can be super-cool like everyone else :)

bouncing via mutt

Tue 26 April 2005

I seem to get messages like Unable to deliver message to the following recipients, because the message was forwarded more than the maximum allowed times when I try to bounce messages via mutt. This is because by the time an email to ianw@ieee.org gets to my inbox it has bounced around a bunch of places like the IEEE and UNSW.

What I would really like is a filter-and-bounce function in mutt to filter out the headers. This involves minimal work from me (as opposed to, say, resending the message). mutt doesn't have this so I have hacked my own.

Firstly, in .muttrc add

set pipe_decode=no
macro index "B" "python2.3 /home/bin/bounce.py"

The pipe decode is important, since otherwise mime messages will get scrambled on the bounce and attachments won't come, etc.

Then just do the bounce with this python script

SENDMAIL = "/usr/sbin/sendmail"
FROMMAIL = "ianw@ieee.org"
SMTPHOST = mail.internode.on.net

import email, sys, smtplib, os

m = email.message_from_file(sys.stdin)
del m['received']

if len(sys.argv) == 2 :
        email = sys.argv[1]
        print "Bouncing to %s" % email
else:
        newstdin = os.open("/dev/tty", os.O_RDONLY)
        os.dup2(newstdin, 0)

        print "Email to send to :",
        sys.stdout.flush()
        email = sys.stdin.readline()

server = smtplib.SMTP(SMTPHOST)
server.sendmail(FROMMAIL, email, m.as_string())
server.quit()

The only tricky bit is having to re-open stdin because mutt sets up a pipe between it and the pipe-message process. You can add your own X-Resent-From type headers if you want.

magic sysrq via /proc

Tue 19 April 2005

If you can't type magic-sysrq for some reason, you can echo the magic sysreq key to /proc/sysrq-trigger for the same effect.

baci:~# echo t > /proc/sysrq-trigger
SysRq : Show State

                                                       sibling
  task                 PC          pid father child younger older
init          S a0000001006363c0     0     1      0     2               (NOTLB)

Call Trace:
 [] schedule+0xbf0/0x1280
                                sp=e00000407fe07d80 bsp=e00000407fe010e8
 [] schedule_timeout+0x100/0x1a0
                                sp=e00000407fe07d90 bsp=e00000407fe010b0
 [] do_select+0x270/0x4c0
                                sp=e00000407fe07dd0 bsp=e00000407fe00f80
 [] sys_select+0x590/0x8c0
                                sp=e00000407fe07df0 bsp=e00000407fe00ea0
 [] ia64_ret_from_syscall+0x0/0x20
                                sp=e00000407fe07e30 bsp=e00000407fe00ea0
migration/0   S a000000100081010     0     2      1             3       (L-TLB)

PPC to i386 Cross Compiler

Sat 16 April 2005

If you're planning on attending the kernel tute at LCA05 but have an Apple laptop, you might be interested in my cross compiler debian packages.

I would highly recommend you don't try try this yourself :) Originally I planned to write a little note on how to do it, but after about the third hour I'd forgotten what I'd hacked to get to the point I was at. If you do try, don't bother with gcc-3.4; it has some sort of heisenbug that segfaults the assembler in various ways. And just getting C and not g++/ada/pascal/java/etc was also harder than it should be.

Anyway, with the packages installed you need to build the kernel with two extra flags.

$ make CROSS_COMPILE=i386-linux- ARCH=i386

Other than that, the built binary boots fine for me in qemu on my iBook!

update: If you want to compile userspace apps, you will need to use tpkg-install-libc from toolchain-source.

Discontiguous Memory

Fri 15 April 2005

Understanding why we have discontiguous memory requires some understanding of the chipsets the power Itanium systems. The ZX1 chipset, which powers the HP Integrity line of servers, is one of the more common ones. Others chipsets would follow similar principles.

With the ZX1, the processor is not connected directly to RAM chips, but runs through a memory I/O controller or MIO for short. It looks something like

+------------------------+   Processor Bus                       Memory Bus    +-----------------------+
|  Itanium 2 Processor 1 | <--------+        +---------------+       +-------> |       RAM DIMM        |
+------------------------+          |------> |   ZX1 MIO     |<------|         +-----------------------+
|  Itanium 2 Processor 2 | <--------+        +---------------+       +-------> |       RAM DIMM        |
+------------------------+                    | | | | | | | |                  +-----------------------+
                                     Ropes->  | | | | | | | |
                                             +---------------+
                                             |    ZX1 IOA    |
                                             +---------------+
                                                |   |   |
                                               agp pci pci-x

From a top level view, the chipset is broadly divided into the System Bus Adapter (or SBA) and the Lower Bus Adapater (LBA). You can think of the SBA as an interface between the bus that the processors sits on and the ropes bus (below). The LBA is the PCI Host bus adapater, and contains the IO SAPIC (Streamlined Advanced Programmable Interrupt Controller, this chip handles interrupts from PCI devices, and can be thought of as a smart "helper" for the CPU that controls when it sees interrupts).

A rope is described as a "fast, narrow point-to-point connection between the zx1 mio and a rope guest device (the zx1 ioa). The rope guest device is responsible for bridging from the I/O rope to an industry standard I/O bus (PCI, AGP, and PCI-X)". The ZX1 chipset can support up to 8 ropes, which can be bundled.

This is distinct from the typical North bridge/South Bridge (where the north bridge connects to the CPU/Memory and south bridge, and the south bridge connects to the north bridge and slower I/O pehperials) that it used in 386 architectures. This does however introduce some more interesting timing problems.

A simplified map of the memory layout presented by the ZX1 chipset MIO looks like

0xFFF FFFF FFFF +-------------------------------------+
                |               Unused                |
0xF80 0000 0000 +-------------------------------------+
                |  Greater than 4GB MMIO :            |
                |   (not currently used by anything)  |
0xF00 0000 0000 +-------------------------------------+
                |               Unused                |
                                 ....
                |                                     |
0x041 0000 0000 +-------------------------------------+
                |           Memory 1 (3GB)            |
                |                                     |
                |                                     |
0x040 4000 0000 +-------------------------------------+
                |               Unused                |
0x040 0000 0000 +-------------------------------------+
                |           Memory 2 (252GB)          |
                                  ....
                |                                     |
0x001 0000 0000 +-------------------------------------+
                |  Less than 4GB MMIO :               |
                |   Access to firmware, processor     |
                |   reserved space, chipset registers |
                |   etc (3GB)                         |
0x000 8000 0000 +-------------------------------------+
                |           Virutal I/O (1GB)         |
0x000 4000 0000 +-------------------------------------+
                |           Memory 0 (1GB)            |
0x000 0000 0000 +-------------------------------------+

You can see it implements a 44 bit address space that is divided up into a range of different sections. The maximum memory it can theoretically support is 256GB (i.e. Memory 0 + Memory 1 + Memory 2) however physically I think most boxes only comes with enough slots for ~16Gb of memory. The ZX1 has an extension called the zx1 sme or Scalable Memory Extension that allows more memory to be delt with. 1.3. Looking at the layout in more depth 1.3.1. Virtual I/O

The ZX1 chipset implements an IO-MMU, or IO Memory Managment unit, for DMA. This is very similar to the standard MMU, or memory managment unit in that converts a virtual address into a physical address. This is particularly important for a card 32 bit card that can not understand a 64 bit address. By setting up the virtual I/O region under 4GB (i.e. the maximum address a 32 bit card can deal with) the IOMMU can translate the address back to a full 64 bit address and give the 32 bit card access to all memory. So the process goes something like

driver issues DMA request that is in an area above 4GB that a 32 bit card can not deal with
IO-MMU sets up a virtual address in the Virtual I/O region that points to the original address, and issues the DMA request with this low address
The device sends it's data back to the low address, which the IOMMU then translates back to the high address.

The IOTLB in the ZX1 chipset is only big enough to map 1GB at time, which is why this area is only 1GB. This means there can be contention for this space if many drivers are requesting large DMA transfers -- however note 64 bit capable cards can bypass this all together reducing the problem.

As you can see from the layout above, system RAM is not presented to the processor in a contiguous manner; that is all the memory in one large block. It is split up into three different sections (Memory 0, 1, 2) which are placed in different areas of the address space. Memory 0 needs to be around for legacy applications that expect certain things at low addresses. The rest of the < 4Gb area is taken up by the Low Memory Mapped IO region; the physical memory that would have been mapped here is moved into Memory 1, which is located at 0x040 0000 0000 (256Gb). Should there be more than 4GB of memory (i.e. Memory 0 + 1 = 4Gb) it will be allocated in Memory 2 at 0x001 0000 0000 (4Gb).

The other thing you need to know about is how the Linux kernel looks at memory. The kernel keeps an array of every single physical frame of memory in an array of struct pages in a global variable called mem_map. Now the kernel has to be able to easily (i.e. quickly) translate a struct page to a physical address in memory; the simplest way would be to to say, in effect, "the difference between the address of this page_struct I am trying to find the physical address of and the address of mem_map (remember, it's a linear array, so mem_map is the base) is the index of the physical frame in memory. I know that frames are X bytes long, so the physical address is X * index".

If we keep mem_map as a linear array as above, we hit a serious problem with our memory layout from the zx1 chipset. Say we have 4 gigabytes of RAM, pushing us into Memory 1. To use the simple indexing scheme of mem_map described above, there would need to be a page_struct for each and every frame of memory between Memory 0 and Memory 1. This expands the range from 0x000 0000 0000 - 0x041 0000 0000. Say each struct page in the array handles a frame of 16 kilobytes, we need 17,039,360 entries, say each entry is around 40 bytes that makes 650 megabytes of memory required for the mem_map array! That's a fair chunk of our 4 gigabytes really wasted on mapping empty space.

This is also an issue for systems that may not have such a large gap imposed by the chipset, but receive such a large gap by participating inmposed by the chipset, but receive such a large gap by participating in a NUMA (non-uniform memory access) system. In a NUMA system, many individual nodes "pool" their memory into one large memory image. In this case, each node of the system will have gaps in their memory that is actually physically residing on another, remote, node. The upshot is that they will suffer from the same wasted space.

VIRTUAL_MEM_MAP and discontiguous memory support are some kernel options that get around this. By mapping the mem_map virtually, you can leave the mem_map sparse and only fill in what you need. Discontiguous memory support is a further option to allocate the memory map efficiently over a number of nodes.

Death to trailing whitespace

Fri 15 April 2005

Whitespace at the end of lines is farily annoying, and wastes space. It also generally annoys other developers if you send patches that introduce white space.

Luckily, emacs can show you with big red blocks where you've left whitespace behind. I think everyone (who hasn't already :) should add something like

(mapc (lambda (hook)
      (add-hook hook (lambda ()
          (setq show-trailing-whitespace t))))
      '(text-mode-hook
      c-mode-hook
      emacs-lisp-mode-hook
      java-mode-hook
      python-mode-hook
      shell-script-mode-hook))

to their .emacs file right now.