I've had the opportunity to play with software RAID over the last few
days, and one thing that confused the heck out of me was why, when I
created a fresh RAID 5 array with a few disks, one of the disks came up
missing right from the start.
Eventually I found this in the mdadm man page:
Normally mdadm will not allow creation of an array with only one device,
and will try to create a raid5 array with one missing drive (as this
makes the initial resync work faster). With --force, mdadm will not try
to be so clever.
The mdadm code enlightens me a bit more:
/* If this is raid5, we want to configure the last active slot
* as missing, so that a reconstruct happens (faster than re-parity)
But that still left me wondering why. I think I understand it now from
following the kernel code (if you know the code, please correct any
mistaken assumptions).
Firstly, md.c schedules a resync when it notices things are out of
sync, like when an array is created and the drives are not marked in
sync or when there is a missing drive and a spare.
md goes through the sectors on the disk one by one (doing a bit of
rate limiting) and calls sync_request for the underlying RAID layer
for each sector. For RAID5, sync_request first maps the sector to a
stripe and sets some flags on that stripe (STRIPE_SYNCING). Each
stripe structure has a buffer for each disk in the array, amongst other
things (struct stripe_head in include/linux/raid/raid5.h). sync_request
then calls handle_stripe to, well, handle the stripe.
    /* maybe we need to check and possibly fix the parity for this stripe
     * Any reads will already have been scheduled, so we just see if enough data
     * is available
     */
    if (syncing && locked == 0 &&
        !test_bit(STRIPE_INSYNC, &sh->state) && failed <= 1) {
        set_bit(STRIPE_HANDLE, &sh->state);
        if (failed == 0) {
            char *pagea;
            if (uptodate != disks)
                BUG();
            compute_parity(sh, CHECK_PARITY);
            uptodate--;
            pagea = page_address(sh->dev[sh->pd_idx].page);
            if ((*(u32*)pagea) == 0 &&
                !memcmp(pagea, pagea+4, STRIPE_SIZE-4)) {
                /* parity is correct (on disc, not in buffer any more) */
                set_bit(STRIPE_INSYNC, &sh->state);
            }
        }
        if (!test_bit(STRIPE_INSYNC, &sh->state)) {
            if (failed==0)
                failed_num = sh->pd_idx;
            /* should be able to compute the missing block and write it to spare */
            if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
                if (uptodate+1 != disks)
                    BUG();
                compute_block(sh, failed_num);
                uptodate++;
            }
            if (uptodate != disks)
                BUG();
            dev = &sh->dev[failed_num];
            set_bit(R5_LOCKED, &dev->flags);
            set_bit(R5_Wantwrite, &dev->flags);
            locked++;
            set_bit(STRIPE_INSYNC, &sh->state);
            set_bit(R5_Syncio, &dev->flags);
        }
    }
This is the part of handle_stripe that gets triggered when we're
running a sync. If we don't have a failed drive (i.e.
failed == 0) then we call into compute_parity to check the
parity of the current stripe sh (for stripe head). If the parity
is correct, then we flag the stripe as in sync, and carry on.
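To make the parity check concrete, here is a minimal user-space sketch of what I understand that branch to be doing: XOR the data blocks into the parity buffer, then test whether the buffer is all zeroes (the "first word is zero, and the buffer equals itself shifted by four bytes" trick from the kernel snippet). The names xor_into_parity, buffer_is_zero and BLOCK_SIZE are made up for illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16  /* toy size; stands in for the kernel's STRIPE_SIZE */

/* XOR every other block of the stripe into the parity block, in place.
 * If the stored parity was correct, the parity buffer ends up all zeroes. */
static void xor_into_parity(uint8_t blocks[][BLOCK_SIZE], int disks, int pd_idx)
{
    assert(pd_idx >= 0 && pd_idx < disks);
    for (int d = 0; d < disks; d++) {
        if (d == pd_idx)
            continue;
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            blocks[pd_idx][i] ^= blocks[d][i];
    }
}

/* All-zero test: the first four bytes are zero, and every byte equals the
 * byte four positions before it, so the whole buffer must be zero. */
static int buffer_is_zero(const uint8_t *buf, size_t len)
{
    uint32_t first;

    assert(len >= 4);
    memcpy(&first, buf, sizeof(first));
    return first == 0 && memcmp(buf, buf + 4, len - 4) == 0;
}
```

With two data blocks filled with 0xAA and 0x0F and a parity block filled with 0xA5, the parity buffer comes out all zero after xor_into_parity, which corresponds to the STRIPE_INSYNC case above.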
However, if the parity is incorrect, we fall through to the next if
statement. Its first inner if checks whether we have a failed drive; if
we don't, then the disk holding the parity for this stripe
(sh->pd_idx) is treated as the failed one. This is where the
optimisation comes in: if we do have a failed drive then it is passed
directly to compute_block, meaning that we will always be bringing the
"failed" disk into sync with the other drives. We then set some flags
so that the lower layers know to write out that drive.
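As a rough sketch of that decision (my reading of the code, not the kernel's actual implementation), the choice of which device to rewrite and the XOR rebuild look something like this. pick_write_target, rebuild_block and BLK are hypothetical names; the real work happens in handle_stripe and compute_block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16  /* toy block size, an assumption for the example */

/* With no failed device, the parity device is the one to rewrite;
 * otherwise the failed (or missing) device is. */
static int pick_write_target(int failed, int failed_num, int pd_idx)
{
    assert(failed == 0 || failed == 1);
    return failed == 0 ? pd_idx : failed_num;
}

/* Rebuild blocks[target] as the XOR of every other block in the stripe.
 * This works for a data block and for the parity block alike. */
static void rebuild_block(uint8_t blocks[][BLK], int disks, int target)
{
    assert(target >= 0 && target < disks);
    memset(blocks[target], 0, BLK);
    for (int d = 0; d < disks; d++) {
        if (d == target)
            continue;
        for (size_t i = 0; i < BLK; i++)
            blocks[target][i] ^= blocks[d][i];
    }
}
```

The point of the sketch is that the same XOR serves both cases; only the target changes, and with a missing drive the target is the same disk for every stripe.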
The alternative is to update the parity, which in RAID5 is spread across
the disks. This means you're constantly stuffing up nice sequential
reads by making the disks seek backwards to write parity for previous
stripes. With the missing-drive scheme, in a situation where parity is
mostly wrong (e.g. creating a new array), you just read from the other
drives and bring that "failed" drive into sync with the others. Thus the
only disk doing the writing is the spare that is being rebuilt, and it is
writing in a nice, sequential manner. And the other disks are reading in
a nice sequential manner too. Ingenious, huh!
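A toy model shows the difference in write patterns. It assumes every stripe's parity is wrong (as on a fresh array) and uses a deliberately simplified parity rotation, stripe modulo number of disks, which is not md's real layout. parity_disk and count_sync_writes are hypothetical helpers.

```c
#include <assert.h>

/* Simplified parity rotation; an assumption, md's real layouts differ. */
static int parity_disk(long stripe, int disks)
{
    return (int)(stripe % disks);
}

/* Count how many stripe writes each disk receives during an initial sync.
 * missing_disk < 0 models "fix the parity in place" (re-parity);
 * otherwise every stripe is reconstructed onto missing_disk. */
static void count_sync_writes(long stripes, int disks, int missing_disk,
                              long writes[])
{
    assert(disks > 0 && stripes >= 0 && missing_disk < disks);
    for (int d = 0; d < disks; d++)
        writes[d] = 0;
    for (long s = 0; s < stripes; s++) {
        int target = missing_disk < 0 ? parity_disk(s, disks) : missing_disk;
        writes[target]++;
    }
}
```

In this model, re-parity over 1000 stripes on 4 disks spreads 250 writes onto each disk, interleaved with that disk's reads, while reconstruction puts all 1000 writes on the one rebuilt disk and leaves the other three purely reading.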
I did run this by Neil Brown to make sure I wasn't completely wrong, and
he pointed me to a recent email where he explains things. If anyone
knows, he does!