Django toolchain on Debian

Although Django is well packaged for Debian, I've recently come to the conclusion that the packages are really not what I want. The problem is that my server runs Debian stable, while my development laptop runs unstable, and Django revisions definitely fall into the "unstable" category. There is really no way to use a system Django 1.1 on one side and a system Django 1.0 on the other.

After a bit of work, I think I've got something together that works, and I post it here in the hope it is useful to someone else. This info has been gleaned from similar references such as this <http://www.danceric.net/2009/03/26/django-virtualenv-and-mod_wsgi/> and this <http://www.saltycrane.com/blog/2009/05/notes-using-pip-and-virtualenv-django/>.

This is aimed at running a server using Debian stable (5.0) for production and an unstable environment for development; you actually need both to get this running. The setup is based on a project called "project" that lives in /var/www.

  1. First step is to install python-virtualenv on both.

  2. Create a virtualenv on both, using the --no-site-packages flag to make it a stand-alone environment. This is like a chroot for Python.

    $ virtualenv --no-site-packages project
    New python executable in project/bin/python
    Installing setuptools............done.
    
  3. The unstable environment has a file you'll need to copy into the stable environment - bin/activate_this.py. The stable version of python-virtualenv isn't recent enough to include this file, but you need it to essentially switch the system python into the chrooted environment. This will come in handy later when setting up the webserver.

  4. There are probably better ways to keep the two environments in sync, but I simply take a manual approach of doing everything twice, once in each. So from now on, do the following in both environments.

  5. Activate the environment

    /var/www$ cd project
    /var/www/project$ . bin/activate
    (project) /var/www/project$
    
  6. Use easy_install to install pip

    (project) /var/www/project$ easy_install pip
    Searching for pip
    Reading http://pypi.python.org/simple/pip/
    Reading http://pip.openplans.org
    Best match: pip 0.4
    Downloading http://pypi.python.org/packages/source/p/pip/pip-0.4.tar.gz#md5=b45714d04f8fd38fe8e3d4c7600b91a2
    Processing pip-0.4.tar.gz
    Running pip-0.4/setup.py -q bdist_egg --dist-dir /tmp/easy_install-Wu9O-U/pip-0.4/egg-dist-tmp-xjSdxq
    warning: no previously-included files matching '*.txt' found under directory 'docs/_build'
    no previously-included directories found matching 'docs/_build/_sources'
    zip_safe flag not set; analyzing archive contents...
    pip: module references __file__
    Adding pip 0.4 to easy-install.pth file
    Installing pip script to /var/www/project/bin
    
    Installed /var/www/project/lib/python2.5/site-packages/pip-0.4-py2.5.egg
    Processing dependencies for pip
    Finished processing dependencies for pip
    
  7. Install setuptools, also using easy_install (for some reason, pip can't install it). There is a trick here: you need to specify at least version 0.6c9, or there will be issues with the SVN version on Debian stable when you try to get Django in the next step.

    (project) /var/www/project$ easy_install setuptools==0.6c9
    Searching for setuptools==0.6c9
    Reading http://pypi.python.org/simple/setuptools/
    Best match: setuptools 0.6c9
    Downloading http://pypi.python.org/packages/2.5/s/setuptools/setuptools-0.6c9-py2.5.egg#md5=fe67c3e5a17b12c0e7c541b7ea43a8e6
    Processing setuptools-0.6c9-py2.5.egg
    Moving setuptools-0.6c9-py2.5.egg to /var/www/project/lib/python2.5/site-packages
    Removing setuptools 0.6c8 from easy-install.pth file
    Adding setuptools 0.6c9 to easy-install.pth file
    Installing easy_install script to /var/www/project/bin
    Installing easy_install-2.5 script to /var/www/project/bin
    
    Installed /var/www/project/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg
    Processing dependencies for setuptools==0.6c9
    Finished processing dependencies for setuptools==0.6c9
    
  8. Create a requirements.txt with the path to the Django SVN for pip to install, and then install it.

    (project) /var/www/project$ cat requirements.txt
    -e svn+http://code.djangoproject.com/svn/django/tags/releases/1.0.3/#egg=Django
    (project) /var/www/project$ pip install -r requirements.txt
    Obtaining Django from svn+http://code.djangoproject.com/svn/django/tags/releases/1.0.3/#egg=Django (from -r requirements.txt (line 1))
      Checking out http://code.djangoproject.com/svn/django/tags/releases/1.0.3/ to ./src/django
    
    ... so on ...
    
  9. Almost there! You can keep installing more Python requirements with pip if you need, but we've got enough here to start.

  10. Create a file in /var/www/project called project-python.py. This is the handler module the webserver will use, and it first exec's the virtualenv's activation script so that Django is imported from inside the virtualenv. The file should contain the following:

    activate_this = "/var/www/project/bin/activate_this.py"
    execfile(activate_this, dict(__file__=activate_this))
    
    from django.core.handlers.modpython import handler
    
  11. Now it's time to start the Django project. I like to create a new directory called project, which will be the parent directory kept in the SCM with all the code, media, templates, database (if using SQLite) etc. That way, to keep the two environments up-to-date, I simply svn ci on one side and svn up on the other.

    (project) /var/www/project$ mkdir project
    (project) /var/www/project$ cd project
    (project) /var/www/project/project$ mkdir db django media www
    (project) /var/www/project/project$ cd django/
    (project) /var/www/project/project/django$ django-admin startproject myproject
    
  12. The last step is to wire up Apache to serve it all. The magic is making sure you specify the PythonHandler you made before (project-python) so mod_python runs inside the virtualenv, and include the right paths so it can find the handler and all the required Django settings.

    DocumentRoot /var/www/project
    
    <Location "/">
        SetHandler python-program
        PythonHandler project-python
        PythonPath "['/var/www/project/','/var/www/project/project/django/'] + sys.path"
        SetEnv DJANGO_SETTINGS_MODULE myproject.settings
        PythonDebug On
    </Location>
    
    Alias /media /var/www/project/project/media
    <Location "/media">
        SetHandler none
    </Location>
    <Directory "/var/www/project/project/media">
        AllowOverride none
        Order allow,deny
        Allow from all
        Options FollowSymLinks Indexes
    </Directory>
    

With all this, you should be up and running in a basic but stable environment. It's easy enough to update packages for security fixes, etc via pip after activating your virtualenv.

SIGTTOU and switching to canonical mode

Here's an interesting behaviour that, as far as I can tell, is completely undocumented, slightly confusing but fairly logical. Your program should receive a SIGTTOU when it is running in the background and attempts to output to the terminal -- the idea being that you shouldn't scramble the output by mixing it in while the shell is trying to operate. Here's what the bash manual has to say:

Background processes are those whose process group ID differs from the
terminal's; such processes are immune to keyboard-generated signals.
Only foreground processes are allowed to read from or write to the
terminal.  Background processes which attempt to read from (write to)
the terminal are sent a SIGTTIN (SIGTTOU) signal by the terminal
driver, which, unless caught, suspends the process.

So, consider the following short program, which writes some output and catches any SIGTTOUs, with an optional flag to switch between canonical and non-canonical mode.

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <termios.h>
#include <unistd.h>

/* Report the signal, then re-raise it with the default handler so
 * the shell sees the process stop. */
static void sig_ttou(int signo) {
   (void)signo;
   printf("caught SIGTTOU\n");
   signal(SIGTTOU, SIG_DFL);
   kill(getpid(), SIGTTOU);
}

int main(int argc, char *argv[]) {

   signal(SIGTTOU, sig_ttou);

   /* Any argument switches the terminal into non-canonical mode. */
   if (argc != 1) {
      struct termios tty;

      printf("setting non-canonical mode\n");
      tcgetattr(fileno(stdout), &tty);
      tty.c_lflag &= ~(ICANON);
      tcsetattr(fileno(stdout), TCSANOW, &tty);
   }

   int i = 0;
   while (1) {
      printf("  *** %d ***\n", i++);
      sleep(1);
   }

   return 0;
}
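
The termios half of this program can also be sketched in Python -- a hypothetical analogue of the non-canonical branch above, using a pseudo-terminal pair from the pty module so it runs even without a controlling terminal:

```python
import pty
import termios

# Open a pseudo-terminal pair; the slave end behaves like a real tty.
master_fd, slave_fd = pty.openpty()

# Fetch the current attributes and clear ICANON, as the C code does.
attrs = termios.tcgetattr(slave_fd)
attrs[3] &= ~termios.ICANON        # index 3 is the lflag word
termios.tcsetattr(slave_fd, termios.TCSANOW, attrs)

# Canonical mode is now switched off on this terminal.
print(bool(termios.tcgetattr(slave_fd)[3] & termios.ICANON))
```

The same mode switch is what triggers the SIGTTOU in case 2 below when done from the background.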

This program ends up operating in an interesting manner.

  1. Run in the background, canonical mode : no SIGTTOU and the output gets multiplexed with the shell.

    $ ./sigttou &
      *** 0 ***
    [1] 26171
    $   *** 1 ***
      *** 2 ***
      *** 3 ***
    
  2. Run in the background, non-canonical mode : SIGTTOU delivered

    $ ./sigttou 1 &
    [1] 26494
    ianw@jj:/tmp$ setting non-canonical mode
    caught SIGTTOU
    
    
    [1]+  Stopped                 ./sigttou 1
    
  3. Run in the background, canonical mode, tostop set via stty : SIGTTOU delivered, seemingly after a write proceeds

    $ stty tostop
    $ ./sigttou &
    [2] 26531
    ianw@jj:/tmp$   *** 0 ***
    caught SIGTTOU
    
    
    [2]+  Stopped                 ./sigttou
    

You can see a practical example of this by comparing the difference between cat file & and more file &. The semantics make some sense -- anything switching off canonical mode is likely to really scramble your terminal, so it's good to stop it and let its terminal-handling functions run only in the foreground. I'm not sure why canonical-mode background output mixed in with your prompt is considered useful, but someone, somewhere must have decided it was so.

Update: upon further investigation, it is the switching of terminal modes that invokes the SIGTTOU. To follow the logic through more, see the various users of tty_check_change in the tty driver.

Using frozen chocolate to visualise microwave heat distribution

My attempt at answering that most important of questions : where should one place their plate in the microwave to achieve maximal heating?

Review : The Race for a New Game Machine

I recently finished The Race for a New Game Machine: Creating the Chips Inside the XBox 360 and the Playstation 3 (David Shippy and Mickie Phipps); an interesting insight into the processor development process from some of the lead architects.

The executive summary is : Sony, Toshiba and IBM (STI) decided to get together to create the core of the Playstation 3 — the Cell processor. Sony, with their graphics and gaming experience, would do the Synergistic Processing Elements; extremely fast but limited sub-units specialising in doing 3D graphics and physics work (i.e. great for games). IBM would do a Power based core that handled the general purpose computing requirements.

The twist comes when Microsoft came to IBM looking for the Xbox 360 processor, and someone at IBM mentioned the Power core that was being worked on for the Playstation. Unsurprisingly, the features being built for the Playstation also interested Microsoft, and the next thing you know, IBM was working on the same core for Microsoft and Sony at the same time, without telling either side.

This whole chain of events makes for a very interesting story. The book is written for a general audience, but you'll probably get the most out of it if you already have some knowledge of computer architecture; if you're trying to understand some of the concepts referred to from the two line descriptions you'll get a bit lost (H&P it is not).

The only small criticism is that it sometimes falls into reading a bit like a long LinkedIn recommendation. However, the book is very well paced, and throws in just enough technical tidbits amongst the corporate and personal dramas to make it a very fun read.

One thing that is talked about a bit is the fan-out of four (FO4) metric used in the designers' quest to push the chip as fast as possible (and, as mentioned many times in the book, faster than what Intel could do!). I thought it might be useful to expand on this interesting metric a bit.

FO4

One problem facing chip architects is that, thanks to Moore's Law, it is hard to find a constant to compare design versus implementation. For example, you may design an amazing logic-block to factor large integers into products of prime numbers, but somebody else with better fabrication facilities might be able to essentially brute-force a better solution by producing faster hardware using a much less innovative design.

Some metric is needed that can compare the two designs discounting who has the better fabrication process. This is where the FO4 comes in.

When you change the input to a logic gate, it does not magically flip the output to the correct level instantaneously. There is a latency while everything settles to its correct level — the gate delay. The more gates connected to the output of a gate, the more current is required, which further increases the latency. The FO4 latency is defined as the time required to flip an inverter gate connected (fanned out) to four other inverter gates.

Fan-out of four

Thus you can describe the latency of other logic blocks in multiples of FO4 latencies. As this avoids measuring against wall-time it is an effective description of the relative efficiency of logic designs. For example, you may calculate that your factoriser has a latency of 100 FO4. Just because someone else's 200 FO4 factoriser gets a result a few microseconds faster thanks to their fancy ultra-low-FO4-latency fabrication process, you can still show that your design, at least a priori, is better.

The book refers several times to efforts to reduce the FO4 of the processor as much as possible. The reason this is important in this context is that the maximum latency on the critical path will determine the fastest clock speed you can run the processor at. For reasons explained in the book high clock speed was a primary goal, so every effort had to be made to reduce latencies.

All modern processors operate as a production line, with each stage doing some work and passing it on to the next stage. Clearly the slowest stage determines the maximum speed that the production line can run at (weakest link in the chain and all that). For example, if you clock at 1GHz, each cycle takes 1 nanosecond (1s / 1,000,000,000Hz). If you have an FO4 latency of, say, 10 picoseconds, any given stage can have a latency of no more than 100 FO4, otherwise that stage would not have enough time to settle and actually produce the correct result.
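
The arithmetic in that example is simple enough to sanity-check (a toy calculation; the 10 picosecond FO4 figure is assumed purely for illustration):

```python
clock_hz = 10**9                  # a 1GHz clock
cycle_ps = 10**12 // clock_hz     # one cycle is 1000 picoseconds (1ns)
fo4_ps = 10                       # assume a 10ps FO4 latency

# No pipeline stage can be longer than this many FO4 delays, or it
# won't settle within one clock cycle.
max_fo4_per_stage = cycle_ps // fo4_ps
print(max_fo4_per_stage)          # -> 100
```

Halve the FO4 budget per stage and, all else being equal, you can double the clock -- which is exactly the trade-off discussed next.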

Thus the smaller you can get the FO4 latencies of your various stages, the higher you can safely up the clock speed. One way around long latencies might be to split up your logic into smaller stages, making a much longer pipeline (production line). For example, split your 100 FO4 block into two 50 FO4 stages. You can now clock the processor higher, but this doesn't necessarily mean you'll get actual results out the end of the pipeline any faster (as Intel discovered with the Pentium 4 and its notoriously long pipelines and correspondingly high clock rates).

Of course, this doesn't even begin to describe the issues with superscalar design, instruction level parallelism, cache interaction and the myriad of other things the architects have to consider.

Anyway, after reading this book I guarantee you'll have an interesting new insight the next time you fire-up Guitar Hero.

Dig Jazz Applet, V2

It seems the ABC updated the DIG Jazz now-playing list format, breaking V1. Some quick flash disassembly and a bit of hacking, and order is restored. As a bonus, it now shows the upcoming songs.

DIG Jazz now-playing Gnome applet

Source or Debian package.

Quickly describing hash utilisation

I think the most correct way to describe utilisation of a hash-table is with chi-squared distributions, hypothesis tests, degrees of freedom and a bunch of other things nobody but an actuary remembers. So I was looking for a quick method that was close enough but didn't require digging out a statistics text-book.

I'm sure I've re-invented some well-known measurement, but I'm not sure what it is. The idea is to add up the total steps required to look-up all elements in the hash-table, and compare that to the theoretical ideal of a uniformly balanced hash-table. You can then get a ratio that tells you if you're in the ball-park, or if you should try something else. A diagram should suffice.

Scheme for acquiring a hash-utilisation ratio

This seems to give quite useful results with a bare minimum of effort, and most importantly no tricky floating point math. For example, on the standard Unix words file with a 2048-entry hash-table, the standard DJB hash came out very well (as expected):

Ideal 2408448
Actual 2473833
----
Ratio 0.973569

To contrast, a simple "add each character" type hash:

Ideal 2408448
Actual 6367489
----
Ratio 0.378241

Example code is hash-ratio.py. I expect this measurement is most useful when you have a largely static bunch of data for which you are attempting to choose an appropriate hash-function. I guess if you are really trying to hash more or less random incoming data and hence only have a random sample to work with, you can't avoid doing the "real" statistics.
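
The measurement itself is only a few lines. Here is a sketch of the idea (not the hash-ratio.py linked above; the key list and bucket count are made up for illustration):

```python
def djb_hash(s):
    # Bernstein's hash: h = h * 33 + c, truncated to 32 bits.
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xffffffff
    return h

def add_hash(s):
    # The naive "add each character" hash from the comparison above.
    return sum(ord(ch) for ch in s)

def utilisation_ratio(keys, nbuckets, hashfn):
    counts = [0] * nbuckets
    for k in keys:
        counts[hashfn(k) % nbuckets] += 1
    # Looking up every element of a chain of length c costs 1+2+...+c steps.
    actual = sum(c * (c + 1) // 2 for c in counts)
    # Ideal: the keys spread as evenly as possible over the buckets.
    q, r = divmod(len(keys), nbuckets)
    ideal = (nbuckets - r) * q * (q + 1) // 2 + r * (q + 1) * (q + 2) // 2
    return float(ideal) / actual

keys = ["key-%d" % i for i in range(10000)]
print(utilisation_ratio(keys, 256, djb_hash))   # DJB spreads the keys well
print(utilisation_ratio(keys, 256, add_hash))   # the additive hash clusters badly
```

A perfectly uniform table scores exactly 1.0, and the further the ratio drops below that, the more lopsided the chains are.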

Relocation truncated to fit - WTF?

If you code for long enough on x86-64, you'll eventually hit an error such as:

(.text+0x3): relocation truncated to fit: R_X86_64_32S against symbol `array' defined in foo section in ./pcrel8.o

Here's a little example that might help you figure out what you've done wrong.

Consider the following code:

$ cat foo.s
.globl foovar
  .section   foo, "aw",@progbits
  .type foovar, @object
  .size foovar, 4
foovar:
   .long 0

.text
.globl _start
 .type _start, @function
_start:
  movq $foovar, %rax

In case it's not clear, that would look something like:

int foovar = 0;

void function(void) {
  int *bar = &foovar;
}

Let's build that code, and see what it looks like

$ gcc -c foo.s

$ objdump --disassemble-all ./foo.o

./foo.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_start>:
   0:        48 c7 c0 00 00 00 00   mov    $0x0,%rax

Disassembly of section foo:

0000000000000000 <foovar>:
   0:        00 00          add    %al,(%rax)
   ...

We can see that the mov instruction has only allocated 4 bytes (00 00 00 00) for the linker to put in the address of foovar. If we check the relocations:

$ readelf --relocs ./foo.o

Relocation section '.rela.text' at offset 0x3a0 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000003  00050000000b R_X86_64_32S      0000000000000000 foovar + 0

The R_X86_64_32S relocation is indeed only a 32-bit relocation. Now we can tickle this error. Consider the following linker script, which puts the foo section about 5 gigabytes away from the code.

$ cat test.lds
SECTIONS
{
 . = 10000;
 .text : { *(.text) }
 . = 5368709120;
 .data : { *(foo) }
}

This now means that we cannot fit the address of foovar inside the space allocated by the relocation. When we try it:

$ ld -Ttest.lds ./foo.o
./foo.o: In function `_start':
(.text+0x3): relocation truncated to fit: R_X86_64_32S against symbol `foovar' defined in foo section in ./foo.o

What this means is that the full 64-bit address of foovar, which now lives somewhere above 5 gigabytes, can't be represented within the 32-bit space allocated for it.
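
It's easy to check the arithmetic: the address the linker script assigns simply doesn't fit in the sign-extended 32-bit field that the relocation provides (the 5368709120 below comes straight from test.lds):

```python
foo_addr = 5368709120            # where test.lds places the foo section (5GB)

# R_X86_64_32S patches a sign-extended 32-bit field, so the value
# must lie within the signed 32-bit range.
int32_min = -(2**31)
int32_max = 2**31 - 1

print(int32_min <= foo_addr <= int32_max)   # -> False: truncated to fit
```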

For code optimisation purposes, the default immediate size to the mov instructions is a 32-bit value. This makes sense because, for the most part, programs can happily live within a 32-bit address space, and people don't do things like keep their data so far away from their code it requires more than a 32-bit address to represent it. Defaulting to using 32-bit immediates therefore cuts the code size considerably, because you don't have to make room for a possible 64-bit immediate for every mov.

So, if you want to really move a full 64-bit immediate into a register, you want the movabs instruction. Try it out with the code above - with movabs you should get a R_X86_64_64 relocation and 64-bits worth of room to patch up the address, too.

If you're seeing this and you're not hand-coding, you probably want to check out the -mcmodel argument to gcc.

YUI ButtonGroup Notes

Some tips and things to check if your YUI ButtonGroup isn't behaving as you wish it would.

  • Double-check your <body> tag has class="yui-skin-sam"

  • Unlike in the documentation example, you can't just put a call to YAHOO.widget.ButtonGroup pointing to your div anywhere in your HTML and expect it to work. You've got to wait for it to be ready with something like:

    <script type="text/javascript">
    YAHOO.util.Event.onContentReady("my_button_div", function() {
      var oButtonGroup = new YAHOO.widget.ButtonGroup("my_button_div");
    });
    </script>
    
  • You can easily get an image in each button. For example, if your button is defined as:

    <span id="my-button-id" class="yui-button yui-radio-button yui-button-checked">
     <span class="first-child">
       <button type="button" hidefocus="true"></button>
     </span>
    </span>
    

    Simply add a CSS class something like:

    .yui-button#my-button-id button { background:url(http://server/image.jpg) 50% 50% no-repeat; }
    

Hopefully, this will save someone else a few hours!