Discussions with Matt Chapman led me to thinking more about processors. I read up on Sun's Niagara (Ars Technica, or IEEE Micro vol. 25, issue 2). The notes below may serve as a short introduction.
In essence, code generally shows low instruction-level parallelism (ILP): the processor finds it hard to find things to do in parallel because instructions more often than not depend on each other. However, server workloads show high thread-level parallelism (TLP) -- we've all written multithreaded applications because we realised we could do some parts of a program in parallel.
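A toy cycle-count model makes the distinction concrete (the machine here is hypothetical -- a 4-wide issue core with single-cycle instructions -- not Niagara itself): a wide core gains nothing on a chain of dependent instructions, but independent threads keep it busy.

```python
# Toy cycle model: a 4-wide core can issue up to 4 *independent*
# single-cycle instructions per cycle (hypothetical machine, for
# illustration only).
def cycles(chains, width=4):
    """chains: one dependent-instruction-chain length per thread.
    Instructions within a chain must issue one per cycle (each depends
    on the last); instructions from different chains are independent."""
    remaining = list(chains)
    n = 0
    while any(remaining):
        issued = 0
        for i, r in enumerate(remaining):
            # At most one instruction per chain (dependency),
            # at most `width` instructions in total (issue width).
            if r and issued < width:
                remaining[i] -= 1
                issued += 1
        n += 1
    return n

# One thread, 32 dependent instructions: ILP of 1, so 32 cycles.
print(cycles([32]))          # 32
# Four independent 8-instruction threads: 32 instructions in 8 cycles.
print(cycles([8, 8, 8, 8]))  # 8
```

Exploiting TLP gets the same 4x speedup that the issue width promises but that the dependent code can't deliver on its own.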
Rather than look for ILP, Niagara goes after TLP. It does this by grouping four threads into a thread group, which then executes on a "SPARC pipeline". With 8 SPARC pipelines you have 8*4 = 32 threads in flight. It's not far off to think of each SPARC pipeline as being like a hyper-threaded Pentium processor: each of the threads has its own registers but shares the processor (pipeline) back end. Each pipeline has 6 stages (fetch, thread select, decode, execute, memory, write-back), with the first two stages replicated for each thread. ALU and shift instructions have single-cycle latency, whilst multiply and divide take multiple cycles and cause a thread switch.
Under ideal circumstances where all threads are ready to go, the on-chip thread scheduler will run the waiting threads in a round-robin fashion (i.e. the least recently executed ready thread runs next).
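That selection policy can be sketched as follows (my own simplification, not Sun's actual logic): among the ready threads, pick the least recently run one, which degenerates to round-robin when every thread is ready.

```python
# Least-recently-run selection among ready threads (simplified sketch,
# not Sun's implementation).
class ThreadSelect:
    def __init__(self, n_threads=4):
        # Front of the list = least recently run thread.
        self.order = list(range(n_threads))

    def pick(self, ready):
        """ready: set of runnable thread ids; returns the next to run."""
        for tid in self.order:
            if tid in ready:
                self.order.remove(tid)
                self.order.append(tid)  # now the most recently run
                return tid
        return None  # every thread stalled (e.g. waiting on memory)

ts = ThreadSelect()
# With all four threads ready, selection is plain round-robin:
print([ts.pick({0, 1, 2, 3}) for _ in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

When a thread stalls on a long-latency event it simply drops out of the ready set, and the other three share its slots until it returns.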
Each pipeline has a small 8K L1 cache which implements random replacement rather than the more common LRU-type schemes (the paper quotes ~10% miss rates). All the pipelines share an L2 cache of only 3MB; compare this to about 27MB of cache for a Montecito Itanium (dual-thread, dual-core). The theory is that cache-miss latency is hidden: when one thread misses and has to go back to memory, you switch to another thread which is ready to go. For the same reason you can ditch branch prediction logic too; you just switch a new thread in while the branch resolves.
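Random replacement is attractive precisely because it needs almost no hardware. A sketch of a set-associative cache using it (the geometry below is arbitrary, not Niagara's actual L1 parameters):

```python
import random

# Set-associative cache with random victim selection (illustrative
# sketch; set count, associativity and line size are made up, not
# Niagara's real L1 geometry).
class RandomCache:
    def __init__(self, n_sets=16, ways=4, line=16):
        self.sets = [[] for _ in range(n_sets)]  # each set holds tags
        self.n_sets, self.ways, self.line = n_sets, ways, line

    def access(self, addr):
        """Returns True on a hit, False on a miss (filling the line)."""
        tag, idx = divmod(addr // self.line, self.n_sets)
        tags = self.sets[idx]
        if tag in tags:
            return True  # hit: note no recency bookkeeping at all
        if len(tags) >= self.ways:
            tags.pop(random.randrange(len(tags)))  # evict a random way
        tags.append(tag)
        return False
```

The point of contrast with LRU is visible in the hit path: nothing is updated, so no per-way age bits need to be stored or maintained.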
A lot of registers are needed to keep this all happening. There is a register window scheme where the current window's registers are kept in "fast" registers, whilst registers outside the current window frame are moved into slower SRAM registers. There is a 1-2 cycle latency to move between them.
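A rough model of that split (my own simplification: the window count and register count here are arbitrary, and only the 1-2 cycle transfer latency comes from the paper):

```python
# Sketch of a windowed register file: the current window lives in fast
# registers; the other windows sit in slower SRAM. Shifting the window
# (as SPARC save/restore does) pays the quoted transfer latency.
class WindowedRegFile:
    TRANSFER_CYCLES = 2  # paper quotes 1-2 cycles; assume worst case

    def __init__(self, n_windows=8, regs_per_window=8):
        self.sram = [[0] * regs_per_window for _ in range(n_windows)]
        self.cwp = 0                 # current window pointer
        self.fast = self.sram[0][:]  # fast copy of the current window
        self.cycles = 0              # latency spent moving windows

    def read(self, r):
        return self.fast[r]          # hits the fast registers

    def write(self, r, val):
        self.fast[r] = val

    def shift_window(self, delta):
        """Write the current window back to SRAM, load the new one."""
        self.sram[self.cwp] = self.fast[:]
        self.cwp = (self.cwp + delta) % len(self.sram)
        self.fast = self.sram[self.cwp][:]
        self.cycles += self.TRANSFER_CYCLES
```

Ordinary reads and writes never touch the SRAM; the transfer cost is only paid when a call or return rotates the window.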
The processor predicts loads as cache hits, and thus will speculatively execute other instructions that depend on the load; speculative instructions have a lower thread priority, though, so the thread scheduler will do "real" work first. It doesn't do any fancy instruction reordering on the fly.
Cache coherency traffic over external buses is eliminated by having all this run on a single chip. A single package also keeps power down: all this requires a quoted 60W, whilst Montecito quotes ~100W.
This has very little in common with a Pentium. The Pentium breaks instructions down into micro-ops, from which the hardware then tries to extract as much ILP as possible. Hyperthreading has not shown great wins on the Pentium. The Itanium moves the task of finding ILP to the compiler, and then provides a pipeline optimised for running bundles of instructions together. The Itanium has a lot more cache; Niagara trades this off for threads. Something like POWER sits somewhere in the middle of all that.
So, are more threads or more cache better? Can workloads keep a processor doing 32 separate things to take advantage of all these threads? I would suggest that both a desktop and a technical workload will not; a busy webserver running JSP maybe. Time will tell.