Exploring cmp8xchg16

The latest version of Itanium comes with a new instruction, cmp8xchg16. This compares on 8 bytes (64 bits) but exchanges 16 bytes (128 bits). Seems like a strange instruction?

Firstly, a bit of an overview of compare-and-exchange (CAS). As a function, it would be expressed as compare_and_exchange(memory, new_value, expected_value). If the memory location contains the expected value, the new value is swapped in atomically; nothing can interrupt us between the compare and the exchange.
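
In rough pseudo-C, the semantics look like the sketch below. This is only to show the logic; a real CAS is a single atomic instruction (cmpxchg on x86, cmpxchg1/2/4/8 on Itanium), and this plain C version is not atomic at all.

long compare_and_exchange(long *memory, long new_value, long expected_value)
{
    /* NOT atomic as written -- the real thing is a single instruction
       that does the compare and the store without interruption. */
    long old = *memory;
    if (old == expected_value)
        *memory = new_value;
    return old;   /* caller checks whether old == expected_value */
}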

This is useful for implementing lock-free algorithms, particularly the case of updating a linked-list stack. For example, below we show how you can safely insert an element into the linked list using compare and exchange.

[Figure: compare and exchange push]
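
Something like the following sketch, with GCC's __sync_bool_compare_and_swap builtin standing in for the CAS instruction (the node layout is made up for illustration):

struct node {
    struct node *next;
    int value;
};

struct node *head;   /* top of the stack */

void push(struct node *new_node)
{
    struct node *old;
    do {
        old = head;
        new_node->next = old;   /* point the new element at the current head */
    } while (!__sync_bool_compare_and_swap(&head, old, new_node));
    /* if someone else moved head in the meantime, the CAS fails and we retry */
}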

Note that nothing can interrupt the update of the head of list pointer; if the value is incorrect we just loop around, update and try again. Eventually we hope we can do it uninterrupted.

However, removing an element from the list is not as simple.

[Figure: compare and exchange pop]
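
Continuing the sketch above, a pop looks something like this; note the CAS only checks the head pointer itself, which is where the trouble described below comes from:

struct node *pop(void)
{
    struct node *old, *next;
    do {
        old = head;
        if (old == NULL)
            return NULL;        /* stack is empty */
        next = old->next;       /* anything can happen between here... */
    } while (!__sync_bool_compare_and_swap(&head, old, next));   /* ...and here */
    return old;
}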

We could potentially be interrupted, and have the list change underneath us. As long as the elements move around in memory, the CAS will fail and we will be OK. But if the next element is in the same place as the "old" element (the one we are trying to pop) we have a problem. This is known as the "ABA" problem (or the "A-B-A" problem, or something else; if this still doesn't make sense try this page).

To avoid this, one common scheme is to have a version number on the head-of-list pointer. The compare then checks the version number as well as the pointer; if both match, you know that nothing has been updated underneath you and you are safe to pop the entry. This requires either a double-wide compare-and-exchange instruction (so you can swap both a pointer and a version number at once) or tricky schemes like aligning the variables and using the spare low bits as version numbers (see An Almost Non-Blocking Stack, Boehm 2004).
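
As a rough sketch of the versioned scheme, reusing the node type from above. Here dwcas() is a stand-in for whatever double-wide CAS the platform provides (e.g. cmpxchg16b on x86-64); it is not a real API, and atomically reading the 16-byte head is glossed over.

struct versioned_head {
    struct node  *ptr;       /* current top of stack */
    unsigned long version;   /* bumped so a recycled address still fails the CAS */
};

struct versioned_head stack_head;

/* hypothetical double-wide (128-bit) compare and exchange */
extern int dwcas(struct versioned_head *addr,
                 struct versioned_head expected,
                 struct versioned_head replacement);

struct node *versioned_pop(void)
{
    struct versioned_head old, new_head;
    do {
        old = stack_head;            /* assumed to be an atomic 16-byte read */
        if (old.ptr == NULL)
            return NULL;
        new_head.ptr     = old.ptr->next;
        new_head.version = old.version + 1;
    } while (!dwcas(&stack_head, old, new_head));
    return old.ptr;
}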

Now, we can't just check the version with a normal-width compare-and-swap and then update the pointer later, because we've just created our race condition again. But we don't actually need to compare the pointer, just swap it in if the version is OK. Hence (I think) a cmp8xchg16 instruction. It allows you to check a version number and swap in a new pointer if it is all OK, but without the (what I presume is significant) overhead of a full double-wide compare-and-swap implementation. Just when to update the version I'm still thinking about; I think you could still use a blacklist scheme like the Boehm approach, but it needs further thought.
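
As I read it, the trick is to lay the structure out so that the version is the 8 bytes that get compared, with the pointer riding along in the other half of the 16-byte exchange. The intrinsic below is entirely made up for illustration, and it glosses over exactly when the version should be bumped (which, as above, I'm still thinking about).

struct tagged_head {
    unsigned long version;   /* first 8 bytes: the part that is compared */
    struct node  *ptr;       /* second 8 bytes: only exchanged, never compared */
};

/* made-up wrapper around cmp8xchg16: compare 8 bytes, exchange 16 */
extern int cmp8xchg16_hypothetical(struct tagged_head *addr,
                                   unsigned long expected_version,
                                   struct tagged_head replacement);

struct node *tagged_pop(struct tagged_head *h)
{
    struct tagged_head old, new_head;
    do {
        old = *h;                    /* again assumed to be an atomic 16-byte read */
        if (old.ptr == NULL)
            return NULL;
        new_head.version = old.version + 1;
        new_head.ptr     = old.ptr->next;
    } while (!cmp8xchg16_hypothetical(h, old.version, new_head));
    return old.ptr;
}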

Comparing Montecito and Prescott cores

With the release of the Montecito manuals I wondered about the relative sizes of Itanium and Pentium.

Below is a quick hack comparison between a Montecito dual-core processor and a Prescott (early Pentium 4) core (I used the Prescott because both were made on a 90nm process). I've very roughly chopped off the L2 cache of the Prescott and tried to superimpose it over the Itanium core. Remembering that the extra yellow bits are L2 cache on the Itanium, and that the two have roughly the same L1 cache (16K L1I on both, 16K L1D on the Itanium, 12K L1D on the Prescott), the Itanium core logic comes out looking about 10-15% smaller.

[Figure: Montecito and Prescott]

I got the Prescott die sizes from chip-architect.com and the Montecito sizes from the picture in the Wikipedia article. If you want to check my maths, one pixel across is 0.0573mm and one pixel down is 0.575mm.

As you can see, the Itanium has a lot of cache. I will be interested to see how the Montecito stands up against the new Sun and IBM offerings over the next few months.

Terse guide to getting a cross compiler on Debian

It's not exactly point and click, but it does work, eventually. This talks about ia64; substitute your target architecture as appropriate.

  • apt-get source binutils.
  • Build with TARGET=ia64 fakeroot debian/rules binary-cross, install.
  • apt-get source gcc-4.0 and run GCC_TARGET=ia64 ./debian/rules control.
  • Now have a look at debian/control and take note of the packages with -ia64-cross. You need to make these with dpkg-cross: download the packages from the archive and run dpkg-cross -a ia64 -b package.deb, then install them. There is also a tool in unstable called apt-cross which makes this painless.
  • Try building gcc with GCC_TARGET=ia64 DEB_CROSS_INDEPENDENT=yes dpkg-buildpackage -rfakeroot. It will tell you if there are any missing dependencies.
  • Build should complete, and you should have ia64-linux-gnu-blah tools (a quick sanity check is sketched below).
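
As a quick sanity check (assuming the tools landed on your PATH with the usual ia64-linux-gnu- prefix), cross-compiling a trivial program should give you an ia64 ELF binary you can inspect with file(1):

/* hello.c -- compile with something like: ia64-linux-gnu-gcc -o hello hello.c */
#include <stdio.h>

int main(void)
{
    printf("hello from the cross compiler\n");
    return 0;
}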

The only problem with this approach is that your packages depend on the gcc-4.0-base package, and if this is updated in the archives you need to rebuild all your cross compilers. Considering this package consists of readme files, this is slightly annoying.

Update: I have written a program get-from-pool.py to help you find the files for the other architecture which you need to build with dpkg-cross.

Update 2: I have a patch that enables you to build packages without depending on the underlying system. See Debian bug #347484.

Update 3: updated for today's environment.

Horus

Our new Sun Fire V20z just arrived, and whilst looking around for the (GPL) source code for the Linux firmware of the onboard PowerPC-based management processor (which can be found by following the download link here), I realised it is just a re-badged Newisys box.

Further investigation led to an interesting white paper on Horus, a large-scale SMP chipset being developed by Newisys (the paper is a bit rough and doesn't appear to have been edited heavily, but it is worth a read). Four Opteron processors connect to a Horus chip, and the Horus chips then link up to form a larger cache-coherent SMP system. Presumably they put four to a chip since four processors is about the sweet spot for the Opteron cache-coherency protocol (cHT). Looks like it will scale up to 8 quads, or 32 processors, talking over InfiniBand. It supports dynamic partitioning too. Will be an interesting challenger and worth watching.

To save you looking up Horus he is the Egyptian God of the sun, with the head of a hawk.

Sun's Niagara

Discussions with Matt Chapman led me to thinking more about processors. I read up on Sun's Niagara (Ars Technica or IEEE Micro vol 25, issue 2). Notes which may serve as a short introduction below.

In essence, code generally shows low instruction-level parallelism (ILP). This means the processor finds it hard to find things it can do in parallel, because instructions more often than not end up depending on each other. However, server workloads show high thread-level parallelism (TLP) -- we've all written multithreaded applications because we realised we could do some parts of a program in parallel.

Rather than look for ILP, Niagara goes after TLP. It does this by grouping four threads into a thread group, which then executes on a "SPARC pipeline". With 8 SPARC pipelines you have 8*4 = 32 threads in flight. It's not far off to think of each SPARC pipeline like a hyper-threaded Pentium processor; each of the threads has its own registers but shares the processor (pipeline) back end. Each pipeline is 6 stages (fetch, thread select, decode, execute, memory, write-back) with the first two stages replicated for each thread. ALU and shift instructions have single-cycle latency, whilst multiply and divide take multiple cycles and cause a thread switch.

Under ideal circumstances where all threads are ready to go, the on-chip thread scheduler will attempt to run the waiting threads in a round-robin fashion (i.e. the least recently executed ready thread runs next).

Each pipeline has a small 8K L1 cache which implements random replacement (rather than the more common LRU-type schemes; the paper quotes ~10% miss rates). All the pipelines share an L2 cache of only 3MB; compare this to about 27MB of cache for a Montecito Itanium (dual thread, dual core). The theory is that cache miss latency is hidden, since when one thread misses and has to go back to memory you switch to another thread which is ready to go. For the same reason you can ditch branch prediction logic too; you just switch another thread in while the branch resolves.

There are a lot of registers to keep all this happening; there is a register window scheme where the current window of registers is kept in "fast" registers, whilst registers outside the current window frame are moved into slower SRAM registers. There is a 1-2 cycle latency to move between them.

The processor predicts loads as cache hits, and thus will speculatively execute other instructions that depend on the load; speculative instructions have a lower thread priority, though, so the thread scheduler will do "real" work first. It doesn't do any fancy instruction re-ordering on the fly.

Cache coherency traffic over buses is eliminated by having all of this running on a single chip. A single package also keeps power down; all this requires a quoted 60W, whilst Montecito quotes ~100W.

This has very little in common with a Pentium. The Pentium breaks instructions down into micro-ops, which the hardware then tries to extract as much ILP out of as possible. Hyper-threading has not shown great wins on the Pentium. Itanium moves the task of finding ILP to the compiler, and then provides a pipeline optimised for running bundles of instructions together. The Itanium has a lot more cache; Niagara trades this off for threads. Something like POWER sits somewhere in the middle of all that.

So, are more threads or more cache better? Can workloads keep a processor doing 32 separate things to take advantage of all these threads? I would suggest that both a desktop and a technical workload will not; a busy webserver running JSP maybe. Time will tell.

Looking at the Top500

I wanted to see how Itanium was placing in the Top500 against one of the main competitors, IBM Power, so I drew up a little table. In terms of Gflops/Processor Itanium stacks up very well against the competition. I believe I've identified the IBM power offerings for comparison (in blue).

Processor         Rmax Sum (Gflops)   Processors   Rmax/Processor
Power4+                       97098        25194             3.85
PowerPC 440                  364691       169984             2.15
Power4                        31092        11552             2.69
PowerPC                       64377        11636             5.53
Power5                        35581         6216             5.72
Power                         23574        26920             0.88
All Power                    616413       251502             2.45
IBM Power                    187345        69882             2.68
Itanium 2                    237385        50668             4.69
Pentium 4 Xeon               398724       135348             2.95
Xeon EM64T                   146050        35214             4.15
Intel                        782159       221230             3.54

Note there are several minor generations of Itanium lumped together, whilst the Power line gets a bit more granularity thanks to the different versions. Of course, the great thing about the Top500 results is that you can get a different result depending on what sort of press release you want to write today, so the whole thing should be taken with a grain of salt.

x86 Architecture manuals from Intel

Someone once gave me a tip that, thanks to some sort of anti-trust agreement, if you go to this site and do a search on "IA-32" by title, you can fill your shopping cart with manuals and Intel will ship them out to you for free.

(one reason) why gcc has a hard time on IA64

James Wilson was kind enough to point me to some of the problems that gcc has with the peculiarities of IA64. Take, for example, the following little function:

void function(int *a, int *b)
{
    *a = 3;
    *b = 4;
}

Now let's inspect the code that gcc generates for this on 386 and IA64

--- 386 ---
push   %ebp
mov    %esp,%ebp
mov    0x8(%ebp),%edx
movl   $0x3,(%edx)
mov    0xc(%ebp),%edx
movl   $0x4,(%edx)
pop    %ebp
ret

--- ia64 ---
[MMI]       mov r2=3;;
            st4 [r32]=r2
            mov r2=4;;
[MIB]       st4 [r33]=r2
            nop.i 0x0
            br.ret.sptk.many b0;;

Now those of you who know IA64 assembly will see that the function takes three cycles when it really only needs to take two. The ;; signifies a stop, which means there is a dependency -- i.e. after the mov into r2 you need to put a stop before storing it to [r32]. This is called a "read after write" dependency; you have to tell the processor about it so that the pipeline gets things right. On x86 you don't have to worry about that, as internally the processor breaks instructions down into smaller RISC operations and schedules them itself -- the whole point of IA64 and EPIC is that the compiler should figure this out for you.

What we could do is put the two constants into separate registers (remember, we have plenty) and then stop, saving a cycle. Indeed, the Intel compiler does as good a job as one could do by hand:

[MII]       alloc r8=ar.pfs,2,2,0
            mov r3=3
            mov r2=4;;
[MMB]       st4 [r32]=r3
            st4 [r33]=r2
            br.ret.sptk.many b0;;

Now, if you're really good (like James, who did this) you can inspect the gcc optimisation passes (via the -d flags) and realise that at first gcc gets it right, putting the two constant moves before a stop. But the register optimisation pass jumps in and tries to reduce the lifespan of registers holding constants, breaking up the instructions. On x86 you want this -- it reduces register pressure at the cost of instruction-level parallelism, a trade-off you are willing to take. On IA64, with plenty of registers and a reliance on ILP to run fast, it hurts.

People are starting to look at issues like this; but if you've ever peered under the hood of gcc you'll know that it's not for the faint of heart!