valuable reading

If you only have one RSS feed, I would suggest it should be the Freshmeat.net release announcements; I'm surprised Bloglines tells me there are only 7 people subscribed to it. It's quite high volume, but I make time to skim it because it's simply a great way to keep up to date with what cool software people are developing. How else would you find out about genius like the wheel o' yum?

on structure alignment

C99 states (6.7.2.1 - 12)

Each non-bit-field member of a structure or union object is aligned in an implementation-defined manner appropriate to its type.

For example, gcc 4 has changed the way structures line up on the stack for IA64.

#include <stdio.h>

struct disk_stat {
        int fd;
        unsigned count;
};

int main(void)
{
        int blah;
        struct disk_stat test;

        printf("%p\n", (void *)&blah);
        printf("%p\n", (void *)&test);
}
ianw@lime:/tmp$ vi test.c
ianw@lime:/tmp$ gcc-4.0 -o test test.c
ianw@lime:/tmp$ ./test
0x60000fffff8eb480
0x60000fffff8eb484
ianw@lime:/tmp$ gcc-3.4 -o test test.c
ianw@lime:/tmp$ ./test
0x60000fffffafb470
0x60000fffffafb480

This is allowable because the two members of the structure (ints) only require 4-byte alignment, although it may make for worse code. I guess the lesson is to think about how your structure might be laid out and, if required, give it explicit alignment.
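As a rough sketch of what explicit alignment could look like, assuming gcc's aligned attribute is available: the struct name disk_stat_aligned and the helpers is_aligned and stack_instance_is_aligned are made up for illustration, and 16 bytes is just an example value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same members as disk_stat, but with an explicit 16-byte alignment
 * request (gcc extension). The name is hypothetical. */
struct disk_stat_aligned {
        int fd;
        unsigned count;
} __attribute__((aligned(16)));

/* Returns 1 if p sits on an 'align'-byte boundary.
 * 'align' must be a power of two. */
int is_aligned(const void *p, size_t align)
{
        assert(align != 0 && (align & (align - 1)) == 0);
        return ((uintptr_t)p & (align - 1)) == 0;
}

/* Places one instance on the stack and reports whether it landed on
 * the requested 16-byte boundary. */
int stack_instance_is_aligned(void)
{
        struct disk_stat_aligned test = { 0, 0 };
        return is_aligned(&test, 16);
}
```

With the attribute in place the compiler is asked to put every instance on a 16-byte boundary regardless of which gcc version builds it, at the cost of some padding.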

const with functions

gcc 4 now throws up a warning about functions defined with const; it appears to be taking the const as a qualifier on the return type.

ianw@lime:/tmp$ cat const.c
static const int const_function(int i)
{
    return i + 100;
}

void call_const_function(void)
{
        int b = const_function(20);
}
ianw@lime:/tmp$ gcc-4.0 -Wall -c const.c
const.c:2: warning: type qualifiers ignored on function return type
const.c: In function 'call_const_function':
const.c:11: warning: unused variable 'b'

When you define a function const you are telling the compiler that you don't examine anything but your arguments (i.e. no globals either) and have no side effects (other than what you return). This allows it to be smarter in compiling the code.

Type qualifiers (like const) are ignored for function return values, for reasons explained here.

Thus the way to indicate to gcc that the function is const is via __attribute__((const)).

Actually, I built two IA64 kernels, one with a const function properly attributed and another without, and the code wasn't actually any different. But it might be, one day! I am told that in the past measurable code improvements have been seen when a const function is called twice from within the same function.
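For reference, a minimal sketch of the attributed version, reusing const_function from the example above. The caller call_const_function_twice is hypothetical and only there to show the two-calls case; whether gcc actually merges the calls depends on the version and optimisation level.

```c
/* The attribute, not a const qualifier on the return type, is what
 * tells gcc the result depends only on the arguments. */
__attribute__((const)) int const_function(int i)
{
        return i + 100;
}

/* Hypothetical caller: both calls pass the same argument, so gcc is
 * allowed (but not required) to evaluate const_function once and
 * reuse the result. */
int call_const_function_twice(int i)
{
        int a = const_function(i);
        int b = const_function(i);

        return a + b;
}
```

Comparing the output of gcc -O2 -S with and without the attribute is an easy way to see whether your compiler takes advantage of it.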

A few notes from an NFS debugging session

We were seeing intermittent failures with NFS, particularly when a user would send a large file to the NFS client machine. From that point on, all access to the directory caused the accessing process to hang.

Analysis with tethereal showed that the NFS client was sending many retransmits to the NFS server, which was never responding. As it happens, NFS uses UDP by default. Watching both ends of the connection, it became clear that packets were being dropped somewhere in between.

The solution to this was to mount NFS with TCP, rather than UDP, since we have no control over the intervening network and its (probably numerous) firewalls. To do this you need to make sure TCP/NFS is configured into your kernel, and then just specify the tcp option to mount.

If you're using automount for home directories and such, you might modify your auto.home file to look something like

--- auto.home   2005-07-15 16:46:57.000000000 +1000
+++ auto.home.new       2005-07-15 16:46:45.000000000 +1000
@@ -1 +1 @@
-*      eisbock.ken.nicta.com.au:/home/&
+*      -tcp    eisbock.ken.nicta.com.au:/home/&

The other solution was to tunnel the NFS connection via SSH, or maybe a VPN.

In summary, if you're asked to debug an unreliable NFS server, checking for UDP packet loss or switching over to TCP is a good place to start.

goodbye to an old friend

morrison, my old AlphaPC 164, has finally had its last shutdown -h. Rescued on a mercy mission to Canberra, it has served me well for many years. Sure, it had its quirks: the ATX power supply modified with a paper clip, no off switch, noise and heat to rival a coal mine and an infuriating ability to not detect a keyboard and thus default to the serial port, rather than the VGA card (that one had me going for hours).

Ultimately its 266MHz Alpha has been replaced by a processor almost 10 times its clock speed and half its word size, but not appreciably faster at handling the mail and file serving tasks it did so well. It's tough to survive in a world where you can be replaced by a processor, all-in-one motherboard and RAM for just over $200.

Now I have to double check all the security alerts to make sure I'm not vulnerable to the latest stupid buffer overflow. But I do have a nice big RAID5 array to hold MP3s.

If anyone is interested in a slightly used Alpha with a few extras it could be exchanged for some form of alcohol.

Why does mdadm drop a drive when creating a RAID 5 array?

I've had the opportunity to play with software RAID over the last few days, and one thing that confused the heck out of me was why, when I created a fresh RAID 5 array with a few disks, one of the disks came up missing from the start.

Eventually I found in the mdadm man page

Normally mdadm will not allow creation of an array with only one device, and will try to create a raid5 array with one missing drive (as this makes the initial resync work faster). With --force, mdadm will not try to be so clever.

The mdadm code enlightens me a bit more

/* If this is  raid5, we want to configure the last active slot
 * as missing, so that a reconstruct happens (faster than re-parity)

But I'm still wondering why. I think I understand from following the code (if you know the code, please correct any mis-assumptions). Firstly, md.c schedules a resync when it notices things are out of sync, like when an array is created and the drives are not marked in sync or when there is a missing drive and a spare.

md goes through the sectors in the disk one by one (doing a bit of rate limiting) and calls sync_request for the underlying RAID layer for each sector. For RAID5 this firstly converts the sector into a stripe and sets some flags on that stripe (STRIPE_SYNCING). Each stripe structure has a buffer for each disk in the array, amongst other things (struct stripe_head in include/linux/raid/raid5.h). It then calls handle_stripe to, well, handle the stripe.

/* maybe we need to check and possibly fix the parity for this stripe
 * Any reads will already have been scheduled, so we just see if enough data
 * is available
 */
if (syncing && locked == 0 &&
    !test_bit(STRIPE_INSYNC, &sh->state) && failed <= 1) {
    set_bit(STRIPE_HANDLE, &sh->state);
    if (failed == 0) {
        char *pagea;
        if (uptodate != disks)
            BUG();
        compute_parity(sh, CHECK_PARITY);
        uptodate--;
        pagea = page_address(sh->dev[sh->pd_idx].page);
        if ((*(u32*)pagea) == 0 &&
            !memcmp(pagea, pagea+4, STRIPE_SIZE-4)) {
            /* parity is correct (on disc, not in buffer any more) */
            set_bit(STRIPE_INSYNC, &sh->state);
        }
    }
    if (!test_bit(STRIPE_INSYNC, &sh->state)) {
        if (failed==0)
            failed_num = sh->pd_idx;
        /* should be able to compute the missing block and write it to spare */
        if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
            if (uptodate+1 != disks)
                BUG();
            compute_block(sh, failed_num);
            uptodate++;
        }
        if (uptodate != disks)
            BUG();
        dev = &sh->dev[failed_num];
        set_bit(R5_LOCKED, &dev->flags);
        set_bit(R5_Wantwrite, &dev->flags);
        locked++;
        set_bit(STRIPE_INSYNC, &sh->state);
        set_bit(R5_Syncio, &dev->flags);
    }
}

We can inspect one particular part of that code that can get triggered if we're running a sync. If we don't have a failed drive (i.e. failed == 0) then we call into compute_parity to check the parity of the current stripe sh (for stripe head). If the parity is correct, then we flag the stripe as in sync, and carry on.
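To make the parity check a bit more concrete, here is a small user-space sketch of the idea. It is not the kernel's compute_parity; the function names and the plain byte buffers are my own assumptions. XOR-ing every block of a stripe together, parity included, gives all zeros exactly when the parity is consistent, and the all-zero test uses the same compare-the-buffer-with-itself trick as the snippet above, one byte at a time rather than a u32.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* XOR 'len' bytes of src into dst. */
void xor_into(unsigned char *dst, const unsigned char *src, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                dst[i] ^= src[i];
}

/* All-zero test in the style of the kernel snippet: the first byte is
 * zero and every byte equals the one after it. */
int buffer_is_zero(const unsigned char *buf, size_t len)
{
        assert(len > 0);
        return buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0;
}

/* blocks[0..ndisks-1] are the per-disk buffers of one stripe, parity
 * included (hypothetical layout). XOR them all into scratch; the
 * parity is consistent exactly when the result is all zeros. */
int stripe_parity_ok(unsigned char *const blocks[], size_t ndisks,
                     unsigned char *scratch, size_t len)
{
        size_t d;

        memset(scratch, 0, len);
        for (d = 0; d < ndisks; d++)
                xor_into(scratch, blocks[d], len);
        return buffer_is_zero(scratch, len);
}
```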

However, if the parity is incorrect, we will fall out to the next if statement. The first if checks if we have a failed drive; if we don't then we flag the failed drive as the disk with the parity for this stripe (sh->pd_idx). This is where the optimisation comes in -- if we have a failed drive then we pass it in directly to compute_block meaning that we will always be bringing the "failed" disk into sync with the other drives. We then set some flags so that the lower layers know to write out that drive.

The alternative is to update the parity, which in RAID5 is spread across the disks. This means you're constantly stuffing up nice sequential reads by making the disks seek backwards to write parity for previous stripes. Under this scheme, for a situation where parity is mostly wrong (e.g. creating a new array) you just read from the other drives and bring that failed drive into sync with the others. Thus the only disk doing the writing is the spare that is being rebuilt, and it is writing in a nice, sequential manner. And the other disks are reading in a nice sequential manner too. Ingenious huh!
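The rebuild step itself is just XOR again. As a sketch only (rebuild_block is a made-up name, not the kernel's compute_block, and the buffers are ordinary arrays), the block for the "failed" disk is recovered by XOR-ing the blocks of every other disk in the stripe, data and parity alike:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rebuild the buffer of disk 'failed' in one stripe by XOR-ing the
 * buffers of every other disk. Only blocks[failed] is written; all
 * the other buffers are read-only here. */
void rebuild_block(unsigned char *const blocks[], size_t ndisks,
                   size_t failed, size_t len)
{
        size_t d, i;

        assert(failed < ndisks);
        memset(blocks[failed], 0, len);
        for (d = 0; d < ndisks; d++) {
                if (d == failed)
                        continue;
                for (i = 0; i < len; i++)
                        blocks[failed][i] ^= blocks[d][i];
        }
}
```

Note that only one buffer is ever written, which matches the point above: every other disk just streams reads while the spare streams writes.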

I did run this by Neil Brown to make sure I wasn't completely wrong, and he pointed me to a recent email where he explains things. If anyone knows, he does!

Hey, you can see my house from here!

Hey, you can see my house from here! (at least until Google changes something).

update: it only took them a day to break it, but I did notice they embed a "Powered by Google" icon now and a link to the terms of use. The terms of use on the photo imagery seem to rule out a nice API being on the cards, but it would certainly open the door to some really cool stuff.

update 2: Egg on my face!

But in the terms and conditions you get if you click on the "Terms and conditions" link displayed on your map, it still says

The photographic imagery made available for display through Google maps is provided under a nonexclusive, non-transferable license for use only by you. You may not use the imagery in any commercial or business environment or for any commercial or business purposes for yourself or any third parties.

However, the API terms and conditions says

1.2 Photographic Imagery. The Google map images accessible to you through the Service may contain photographic imagery. Your use of this photographic imagery is limited to displaying it to end users within the Service itself, and in the same manner, form, format, and appearance as it is provided by the Service. You may not, nor may you allow others to, copy, distribute, display, alter, or otherwise use, this photographic imagery except as it is provided to you through the Service. Google reserves the sole right and discretion to determine whether your display of photographic images through the Service is in conformance with this Section, and also reserves the right to terminate or suspend your access to photographic imagery at any time for any reason, without notice.

Obviously the intent is that you can embed a map in your page but don't fiddle with it too much. But it would be nice if the terms and conditions were cleared up.

Looking at the Top500

I wanted to see how Itanium was placing in the Top500 against one of its main competitors, IBM Power, so I drew up a little table. In terms of Gflops per processor, Itanium stacks up very well against the competition. I believe I've identified the IBM Power offerings for comparison (the Power4+, Power4, Power5 and Power rows, which are totalled in the "IBM Power" row).

                  Rmax Sum (Gflops)   Processors   Rmax/Processor
Power4+                       97098        25194             3.85
PowerPC 440                  364691       169984             2.15
Power4                        31092        11552             2.69
PowerPC                       64377        11636             5.53
Power5                        35581         6216             5.72
Power                         23574        26920             0.88
All Power                    616413       251502             2.45
IBM Power                    187345        69882             2.68
Itanium 2                    237385        50668             4.69
Pentium 4 Xeon               398724       135348             2.95
Xeon EM64T                   146050        35214             4.15
Intel                        782159       221230             3.54
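The last column is simply the Rmax sum divided by the processor count. A trivial sketch, with a function name of my own choosing:

```c
#include <assert.h>

/* Rmax per processor in Gflops; both inputs come straight from one
 * row of the table. The function name is hypothetical. */
double rmax_per_processor(double rmax_sum_gflops, long processors)
{
        assert(processors > 0);
        return rmax_sum_gflops / (double)processors;
}
```

For example, rmax_per_processor(237385, 50668) gives roughly 4.69 for Itanium 2, and rmax_per_processor(187345, 69882) gives roughly 2.68 for IBM Power.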

Note there are several minor generations of Itanium lumped together, whilst the Power line gets a bit more granularity thanks to different versions. Of course the great thing about the Top500 results is you can get a different result depending on what sort of press release you want to write today, so the whole thing should be taken with a grain of salt.