Buffering with streams

When you know about buffering it's trivial, but when you don't you can get some very unexpected behaviour from UNIX.

For example, the following program will print nothing.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        printf("hello world");

        while (1)
                sleep(1);
}

This is because, when attached to a terminal, your stdout stream (which printf uses) is line buffered by default: output is not actually written until a newline (\n) is seen.

You can either add the newline (as was probably intended, but it's an easy enough thing to leave off!) or manually call fflush to flush the buffer.
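
For example, a fixed version of the program above might look like this (nothing here beyond standard C; fflush(stdout) pushes the buffer out even though there is still no newline):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        printf("hello world");
        fflush(stdout);         /* force the buffered output out now */

        while (1)
                sleep(1);
}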

There are three buffering modes available: unbuffered, line buffered and block (fully) buffered. You can select between them with the setvbuf call (setbuf itself only toggles between fully buffered and unbuffered). Unbuffered is the default for stderr, while a stream attached to your tty defaults to line buffering. Streams attached to anything else (e.g. a file on a disk, or a pipe) default to block buffering, where chunks of data sized to best suit the device are written at once.
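
As a minimal sketch of selecting a mode (the _IONBF, _IOLBF and _IOFBF constants come from stdio.h; note setvbuf must be called before anything else touches the stream):

#include <stdio.h>

int main(void)
{
        /* make stdout completely unbuffered */
        setvbuf(stdout, NULL, _IONBF, 0);

        printf("hello world");  /* appears immediately, no newline required */

        return 0;
}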

But essentially, if your output is missing, check for those newlines!

running X applications in your debootstrap chroot

Say you have just made a new chroot environment with debootstrap but now need to run X applications inside it.

The first thing is not to set your display to simply :0.0, as this will try to use the UNIX domain socket under /tmp (/tmp/.X11-unix) which probably isn't in your chroot.

You need your X server to be capable of listening on the network, which means removing the -nolisten tcp flag. If you usually start X via the startx command you can edit the flag out of /etc/X11/xinit/xserverrc, while if you start from gdm you'll need to look at /etc/gdm/gdm.conf and the DisallowTCP option.
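
For gdm that means something like the following (a sketch only; I'm assuming the option lives in the [security] section, as it does on Debian installs of this vintage):

# /etc/gdm/gdm.conf
[security]
# the default is true, which adds -nolisten tcp to the server flags
DisallowTCP=false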

Once you have restarted, your X server should be ready to listen for network connections. In your chroot, set your DISPLAY variable to localhost:0 to force the network socket connection.

At this point you need to sort out access control; the easy option is to run something like xhost +localhost to allow yourself to connect.

The only other problem is that debootstrap doesn't seem to make all the devices required. If you see something like

host:/# xterm
xterm: Error 32, errno 2: No such file or directory
Reason: get_pty: not enough ptys

try running MAKEDEV pty in /dev to make the devices you need.

A very handy way of getting things into the chroot environment is to bind mount them. You can do mount --bind olddir newdir to make the contents of olddir appear at newdir. This reaches through the chroot, which is at times great and at times a security risk. You can make scripts to bind mount in home directories (and even /tmp, to avoid the UNIX socket problem above).
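
Underneath, mount --bind is just the mount(2) syscall with the MS_BIND flag. A minimal sketch (the two directory paths are hypothetical, and you need to be root):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* equivalent of `mount --bind /home /var/chroot/home`; the
         * filesystem type and data arguments are ignored for a bind */
        if (mount("/home", "/var/chroot/home", NULL, MS_BIND, NULL) == -1) {
                perror("mount");
                return 1;
        }

        return 0;
}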

sched_yield considered harmful

According to recent discussions, sched_yield can cause performance problems with some code.

I haven't looked at the OpenLDAP code, but a common problem is that calling pthread_cond_signal doesn't actually relinquish control from the current thread to the thread (if any) that has been signalled. The signalled thread has to wait for its next time slice before it realises it has been woken up, while the signalling thread loops around the mutex for the condition doing nothing. Thus a common idiom is to call pthread_cond_signal and then sched_yield to relinquish control from your thread to the waiting thread (I've certainly done this myself).
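
The idiom looks something like this (a sketch only; the work variable and locks are stand-ins for whatever your code actually protects):

#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int work;

void submit(int item)
{
        pthread_mutex_lock(&lock);
        work = item;
        pthread_cond_signal(&cond);     /* the waiter won't run yet... */
        pthread_mutex_unlock(&lock);

        sched_yield();                  /* ...so try to hand over the CPU */
}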

This can lead to very bad performance under certain circumstances, since calling sched_yield on Linux tells the scheduler "I have nothing important to do". That is often the opposite of the truth: developers usually see this problem when there is so much work coming in that the woken thread never gets a chance to make forward progress. It is especially bad on an SMP machine, where you probably don't want to relinquish control at all; you don't need to give up your timeslice for other threads to start running on other processors, so what allows forward progress on a uniprocessor may be slowing you down needlessly on SMP. On some other systems sched_yield will relinquish control to another thread in your thread group, but this is not specified by the standard, so you cannot depend on it.

One solution to such a problem is presented below. This is a fairly standard situation where a request is made and needs to be handled. To avoid sched_yield, requests are queued for a dispatcher thread to handle, and an extra lock is taken that the request side can sleep on when required. On a uniprocessor, when requests are flooding in, request() will just loop queueing work until its timeslice is complete. The typical idiom would be to place a yield after the unlock of dispatcher.mutex to let the dispatcher do some work; instead, the extra lock makes request() sleep, giving the dispatcher time to clear the queue. If you include the usleep you will notice that everything happens nice and sequentially on uniprocessor or SMP, because the dispatcher gets a chance to run as the requests go to sleep. Without it the requests flood in as fast as possible; to keep your request handling latency reasonable you don't want to be giving up the CPU to every Tom, Dick and Harry every time you handle a new request!

Compile this once as

$ gcc -Wall -o server server.c -lpthread

to see the problem, then once as

$ gcc -Wall -o server-resched server.c -lpthread -DRESCHED

to see the dispatcher handling the batched requests. I believe this is a good solution to minimise latency below a CPU timeslice when requests are flooding, without having to "fool" the scheduler with sched_yield. Something to keep in mind is that if you find yourself using sched_yield when you may very well have more useful work to do, then chances are you're doing the wrong thing.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <errno.h>

struct request_tag_t {
    struct request_tag_t *next;
    int operation;
};

struct dispatcher_tag_t {
    struct request_tag_t *first;
    struct request_tag_t *last;
    int running;
    pthread_mutex_t mutex;
    pthread_cond_t request;
    pthread_mutex_t resched;
};

struct dispatcher_tag_t dispatcher = {
    .first = NULL,
    .last = NULL,
    .running = 0,
    .mutex = PTHREAD_MUTEX_INITIALIZER,
    .request = PTHREAD_COND_INITIALIZER,
    .resched = PTHREAD_MUTEX_INITIALIZER,
};

int rand_num(int range)
{
    return (int)((range)*1.0*rand()/(RAND_MAX+1.0));
}

void *dispatcher_routine(void *arg)
{
    struct request_tag_t *request;

    printf("dispatcher starting\n");

    while (1)
    {
        /* take the dispatcher queue mutex */
        pthread_mutex_lock(&dispatcher.mutex);

        /* if nothing to do, wait on the dispatcher queue
         * condition variable until there is */
        while (dispatcher.first == NULL)
            pthread_cond_wait(&dispatcher.request, &dispatcher.mutex);

        /* take a request from the queue */
        request = dispatcher.first;
        dispatcher.first = request->next;
        if (dispatcher.first == NULL)
            dispatcher.last = NULL;

        /* handle the request; a real server would probably pass it
         * off to a new thread */
        printf("I'm dispatching request %d\n", request->operation);

        /* free request we have just finished with */
        free(request);

        /* release queue lock */
        pthread_mutex_unlock(&dispatcher.mutex);
#ifdef RESCHED
        /* release the resched lock to signal we've seen the queue */
        pthread_mutex_unlock(&dispatcher.resched);
#endif
    }
}

int tries = 1;

void request(int operation)
{
    struct request_tag_t *request;

    /* take the dispatcher queue mutex */
    pthread_mutex_lock(&dispatcher.mutex);

    /* start the dispatcher thread if it isn't already running */
    if (!dispatcher.running)
    {
        pthread_t thread;
        printf("dispatcher not running, starting\n");
        dispatcher.running = 1;
        pthread_create(&thread, NULL, dispatcher_routine, NULL);
    }

    /* allocate a new request */
    request = malloc(sizeof(struct request_tag_t));
    if (request == NULL)
        exit(1);
    request->next = NULL;
    request->operation = operation;

    /* add to the queue, maintaining first and last */
    if (dispatcher.first == NULL) {
        dispatcher.first = request;
        dispatcher.last = request;
    } else {
        (dispatcher.last)->next = request;
        dispatcher.last = request;
    }

    /* signal something to do */
    pthread_cond_signal(&dispatcher.request);

    /* free the queue lock */
    pthread_mutex_unlock(&dispatcher.mutex);

    /* tempting to put a sched_yield() here */

#ifdef RESCHED
    /* try the resched lock.  if we can't have it, that means
     * we've already got it and the dispatch thread hasn't had a
     * chance to run and unlock it.  We queue up to 10 requests
     * before allowing the dispatcher to run by sleeping on the
     * mutex
     */
    if (pthread_mutex_trylock(&dispatcher.resched) == EBUSY) {
        if (tries++ == 10) {
            pthread_mutex_lock(&dispatcher.resched);
            tries = 1;
        }
    } else
        tries = 1;
#endif
}

int main(void) {
    while(1)
    {
        int n = rand_num(3);
        printf("requesting %d\n", n);
        request(n);

        /* if you uncomment this, everything should play
         * nicely with one dispatch per request, because other
         * threads get a chance to run.  Otherwise we have
         * requests flooding in as fast as possible
         */

        /* usleep(500000); */
    }
}

A little tour of linux-gate.so

A few people have noticed linux-gate.so.1 in their binaries with newer libcs and wondered what it is.

ianw@morrison:~$ ldd /bin/ls
        linux-gate.so.1 =>  (0xffffe000)
        librt.so.1 => /lib/tls/librt.so.1 (0xb7fdb000)
        libacl.so.1 => /lib/libacl.so.1 (0xb7fd5000)
        libc.so.6 => /lib/tls/libc.so.6 (0xb7e9c000)
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0xb7e8a000)
        /lib/ld-linux.so.2 (0xb7feb000)
        libattr.so.1 => /lib/libattr.so.1 (0xb7e86000)

It's actually a shared library that is exported by the kernel to provide a way to make system calls faster. Most architectures have ways of making system calls that are less expensive than taking a full trap; sysenter on x86 (syscall on AMD I think) and epc on IA64 for example.

If you want the gist of how it works, first we can pull it apart. The following program reads and dumps the shared object on an x86 machine. Note it's just a single kernel page, so you could simply dump getpagesize() bytes should you want to; though you can't directly call write on it (i.e. you need to memcpy it somewhere first and write that). Below I pull apart the headers to find how much to copy.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <elf.h>
#include <alloca.h>

int main(void)
{
  int i;
  unsigned size = 0;
  char *buf;

  /* the vsyscall page lives at a fixed address on x86 */
  Elf32_Ehdr *so = (Elf32_Ehdr*)0xffffe000;
  Elf32_Phdr *ph = (Elf32_Phdr*)((void*)so + so->e_phoff);

  /* total up the ELF header, the program headers and each segment */
  size += so->e_ehsize + (so->e_phentsize * so->e_phnum);

  for (i = 0 ; i < so->e_phnum; i++)
    {
      size += ph->p_memsz;
      ph = (void*)ph + so->e_phentsize;
    }

  buf = alloca(size);
  memcpy(buf, so, size);

  int f = open("./kernel-gate.so", O_CREAT|O_WRONLY, S_IRWXU);

  int w = write(f, buf, size);

  printf("wrote %d (%s)\n", w, strerror(errno));

  close(f);

  return 0;
}

At this stage you should have a binary you can look at with, say, readelf.

ianw@morrison:~/tmp$ readelf --symbols ./kernel-gate.so

Symbol table '.dynsym' contains 15 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
  [--snip--]
    11: ffffe400    20 FUNC    GLOBAL DEFAULT    6 __kernel_vsyscall@@LINUX_2.5
    12: 00000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.5
    13: ffffe440     7 FUNC    GLOBAL DEFAULT    6 __kernel_rt_sigreturn@@LINUX_2.5
    14: ffffe420     8 FUNC    GLOBAL DEFAULT    6 __kernel_sigreturn@@LINUX_2.5

__kernel_vsyscall is the function you call to do the fast syscall magic. But I bet you're wondering just how that gets called?

It's easy if you poke inside the auxiliary vector that the kernel passes to ld.so, the dynamic loader. There are a couple of ways to see it: via an environment flag, by peeking into /proc/self/auxv, or on PowerPC it is even passed as the fourth argument to main().

ianw@morrison:~/tmp$ LD_SHOW_AUXV=1 /bin/true
AT_SYSINFO:      0xffffe400
AT_SYSINFO_EHDR: 0xffffe000
AT_HWCAP:    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0x8048034
AT_PHENT:        32
AT_PHNUM:        7
AT_BASE:         0xb7feb000
AT_FLAGS:        0x0
AT_ENTRY:        0x8048960
AT_UID:          1000
AT_EUID:         1000
AT_GID:          1000
AT_EGID:         1000
AT_SECURE:       0
AT_PLATFORM:     i686

Notice how the AT_SYSINFO entry refers to the fast system call function in our kernel shared object? Also notice that AT_SYSINFO_EHDR points to the library itself.
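
You can also pull these values out of /proc/self/auxv programmatically. A minimal sketch for x86 (Elf32_auxv_t and the AT_* constants come from elf.h):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <elf.h>

int main(void)
{
  Elf32_auxv_t aux;

  int fd = open("/proc/self/auxv", O_RDONLY);

  /* the vector is a list of (type, value) pairs terminated by AT_NULL */
  while (read(fd, &aux, sizeof(aux)) == sizeof(aux) && aux.a_type != AT_NULL)
    {
      if (aux.a_type == AT_SYSINFO)
        printf("AT_SYSINFO:      0x%lx\n", (unsigned long)aux.a_un.a_val);
      if (aux.a_type == AT_SYSINFO_EHDR)
        printf("AT_SYSINFO_EHDR: 0x%lx\n", (unsigned long)aux.a_un.a_val);
    }

  close(fd);
  return 0;
}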

If you start to poke through the glibc source code and look at how the sysinfo entry is handled, you can see the dynamic linker will choose to use the library function for system calls if it is available. If that flag is never passed by the kernel, it falls back to the old way of doing things (int $0x80 on x86).
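
To get a feel for what glibc is doing, you can make the call yourself by treating the AT_SYSINFO address as a function. This sketch is x86-only and assumes __kernel_vsyscall takes the syscall number in %eax just as int $0x80 does (which is how the sysenter path is set up):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <elf.h>

/* find AT_SYSINFO in our own auxiliary vector, as above */
static unsigned long find_sysinfo(void)
{
  Elf32_auxv_t aux;
  unsigned long val = 0;

  int fd = open("/proc/self/auxv", O_RDONLY);
  while (read(fd, &aux, sizeof(aux)) == sizeof(aux) && aux.a_type != AT_NULL)
    if (aux.a_type == AT_SYSINFO)
      val = aux.a_un.a_val;
  close(fd);

  return val;
}

int main(void)
{
  long pid;
  unsigned long sysinfo = find_sysinfo();

  /* 20 is __NR_getpid on x86; the result comes back in %eax */
  asm volatile("call *%2"
               : "=a" (pid)
               : "0" (20), "r" (sysinfo)
               : "memory");

  printf("getpid() via __kernel_vsyscall: %ld\n", pid);
  return 0;
}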

IA64 works in the same way, although there we keep our kernel shared library at 0xa000000000000000. You can see the shared object is quite an elegant design that allows maximum compatibility across and within architectures, since the calling mechanism is abstracted away from userspace: a 386 can make the call the same way as a Pentium 4 through the library, and the kernel will make sure the appropriate thing is done in __kernel_vsyscall.

MAC Address to Producer String

Here's a little script that can tell you the producer a (globally unique) MAC address is assigned to, according to the officially designated allocations. With this you will be able to quantify the number of Apple users who cannot follow instructions sufficiently well to stop themselves tripping the network trip wire.

If you're googling for "mac address allocations", what you really want is the "Organizationally Unique Identifier" list from the IEEE (http://standards.ieee.org/regauth/oui/oui.txt).

#!/usr/bin/python

import os, pickle, sys

OUI_ALLOCATIONS="http://standards.ieee.org/regauth/oui/oui.txt"
CACHE_FILE="/home/ianw/.mac2prod.cache"

class MACCache:

    maccache = {}

    def __init__(self, recache):

        if (recache):
            self.setup_cache()
            return
        try:
            f = open(CACHE_FILE, "r")
            unpickler = pickle.Unpickler(f)
            self.maccache = unpickler.load()
        except:
            self.setup_cache()

    def setup_cache(self):
        f = os.popen("wget --quiet -O - " + OUI_ALLOCATIONS)
        for l in f.readlines():
            if l.find("(hex)") > 0:
                sl = l.split("\t")
                name = sl[2].strip()
                mac = sl[0][0:8].replace("-",":")
                self.maccache[mac] = name

        of = open(CACHE_FILE, 'w')
        pickler = pickle.Pickler(of)
        pickler.dump(self.maccache)


def usage():
    print "usage: mac2prod [-re-cache] 00:00:00:00:00:00"
    sys.exit(1)

# sys.argv always contains at least the script name, so check for
# the presence of an actual argument
if len(sys.argv) < 2:
    usage()

if (sys.argv[1] == "-re-cache"):
    recache = True
    sys.argv.remove(sys.argv[1])
else:
    recache = False

if len(sys.argv) != 2 or len(sys.argv[1]) != 17 or len(sys.argv[1].split(":")) != 6:
    usage()

mc = MACCache(recache)

try:
    print mc.maccache[sys.argv[1][0:8].upper()]
except:
    print "Unknown Producer"

David v Goliath

With reference to Peter Hardy's day. Luckily these machines are smart enough to turn off when the ambient temperature gets to about 50°C!

David v Goliath (AKA Desk Fan v Itanium2)

Spam cleaning an mbox file

The problem is that I have a large mbox file full of messages which I want to strip spam from with spamassassin. Ordering of the messages is important (i.e. they must go in and come out the other end in order, with identified spam removed).

My solution is to run the following script from formail, thusly:

$ formail < in-mbox -s /home/ianw/spam-clean.py spam-free-box

#!/usr/bin/python
import os, sys, popen2

if len(sys.argv) != 2:
    print "usage: spam-clean.py output"
    sys.exit(1)

#read in message from stdin.
message = sys.stdin.read()

sa = popen2.Popen3("/usr/bin/spamc -c")
sa.tochild.write(message)
sa.tochild.close()

if sa.wait() != 0:
    print "discarding spam ..."
    sys.exit(1)
else:
    print "ok ..."
    f = open(sys.argv[1], 'a')
    f.write(message)
    f.close()

It's slow, but a little profiling shows me that most of the time is spent asleep (i.e. in the wait() call). One neat project would be to daemonise this and take messages in a fifo manner so we could run a few spamassassins at the same time but maintain message order.

Convert an IP address to hexadecimal

Here is a python script to convert IP addresses into hexadecimal, which may be required to name files for your bootloader if you are trying to netboot, for example. You can specify a mask if you have a large group of machines on a network (e.g. 10.1.3.2 with a mask of 24 will just give you 0x0A == 10d, while a mask of 16 gives you 0x0A01).

#!/usr/bin/python

import re
import sys
import socket

if (not len(sys.argv) == 2):
    print "Usage: ip2hex.py hostname|ip address/mask"
    sys.exit(1)

try:
    (in_str, mask) = sys.argv[1].split("/")
    # sanity check mask
    mask = int(mask)
    if (mask > 32 or mask < 0):
        print "Mask out of range"
        sys.exit(1)
except ValueError:
    mask = 0
    in_str = sys.argv[1]

try:
    ip_addr = socket.gethostbyname(in_str)
except:
    print "Invalid address!"
    sys.exit(1)

#gethostbyname really checks this for us, but you never know
ip_regex = re.compile('(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.' \
                      '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.' \
                      '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.' \
                      '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
ip_match = ip_regex.match(ip_addr)

if (ip_match == None):
    print "Invalid address"
    sys.exit(1)

hex_ip_addr = 0
for i in range(1,5):
    hex_ip_addr += int(ip_match.group(i)) << (4-i)*8

fmt = "%%0%dX" % ((32 - mask) / 4)
print fmt % (hex_ip_addr >> mask)