Fun with -funroll-loops

I was playing with loop unrolling and decided to investigate it a little. The basis is this very simple program, which looks like it could do with some unrolling.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define SIZE (1500 * 1024 * 1024)

int fn(int *buf)
{
    int i;

    for (i = 0; i < (SIZE / sizeof(int)); i += 1)
        buf[i] = i;

    return 0;
}

int main(void)
{
    int *buf = malloc(SIZE);
    assert(buf);
    return fn(buf);
}
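
Something like the following builds the two versions (a sketch only: I'm assuming gcc and a source file called loop.c, and the optimisation level is up to you; the only intended difference between the two binaries is -funroll-loops).

$ gcc -O2 loop.c -o loop
$ gcc -O2 -funroll-loops loop.c -o loop-unroll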

Firstly, let's have a look at what happens to the function when -funroll-loops is applied. Below is the disassembly, with comments on what each line is doing.

Non-optimised

                         | nop.m 0x0
r14=0                    | mov r14=r0
save the "loop counter"  | mov.i r2=ar.lc
                         | nop.m 0x0
one less than iterations | movl r15=0x176fffff;;
                         | nop.m 0x0
initialise loop counter  | mov.i ar.lc=r15
                         | nop.i 0x0;;
                         | loop:
*buf = i, buf++          | st4 [r32]=r14,4
i++                      | adds r14=1,r14
dec lc, branch if > 0    | br.cloop.sptk.few .loop
                         | nop.m 0x0
restore loop counter     | mov.i ar.lc=r2

Loop unrolled

                         | nop.m 0x0
r26 = 0                  | mov r26=r0
save lc                  | mov.i r2=ar.lc
                         | nop.m 0x0
one less than iters/8    | movl r14=0x2edffff;;
                         | nop.m 0x0
setup loop counter       | mov.i ar.lc=r14
                         | nop.i 0x0
                         | loop:
r3 = buf                 | mov r3=r32
r8 = 1                   | adds r8=1,r26
r25 = 3                  | adds r25=3,r26
r23 = &buf[3]            | adds r23=12,r32
r24 = 4                  | adds r24=4,r26
r22 = &buf[4]            | adds r22=16,r32;;
buf[0] = 0; r3=buf+1     | st4 [r3]=r26,4
r21 = 5                  | adds r21=5,r26
r19 = &buf[5]            | adds r19=20,r32
r20 = 6                  | adds r20=6,r26
r18 = &buf[6]            | adds r18=24,r32
r16 = 7                  | adds r16=7,r26;;
                         | nop.m 0x0
buf[1] = 1, r3=buf+2     | st4 [r3]=r8,4
r15 = &buf[7]            | adds r15=28,r32
r17 = 2                  | adds r17=1,r8
r26 += 8                 | adds r26=8,r26
buf += 8 (for next iter) | adds r32=32,r32;;
buf[2] = 2               | st4 [r3]=r17
buf[3] = 3               | st4 [r23]=r25
                         | nop.i 0x0
buf[4] = 4               | st4 [r22]=r24
buf[5] = 5               | st4 [r19]=r21
                         | nop.i 0x0
buf[6] = 6               | st4 [r18]=r20
buf[7] = 7               | st4 [r15]=r16
loop, dec branch         | br.cloop.sptk.few .loop
                         | nop.m 0x0
restore lc               | mov.i ar.lc=r2
                         | br.ret.sptk.many b0;;

We can see that in the first case the loop counter is set up to run 393216000 times (i.e. 1500*1024*1024/sizeof(int)) and a very simple loop of storing the value and incrementing is undertaken. In the second version the code becomes more complex: the loop now executes 49152000 times (i.e. 1500*1024*1024/sizeof(int)/8), so we can infer that each iteration has been unrolled to write 8 values into our buf array. Indeed, when pulled apart we can see the 8 stores (the values in the comments just reflect the first time through).
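
At the source level the transformation is roughly equivalent to hand-unrolling fn() as below (just a sketch to show the shape of it; the compiler does this at the instruction level and interleaves the address calculations, as seen in the listing):

int fn_unrolled(int *buf)
{
    int i;

    /* 8 stores per iteration; SIZE/sizeof(int) happens to be
       divisible by 8, so no clean-up loop is needed */
    for (i = 0; i < (SIZE / sizeof(int)); i += 8) {
        buf[i]     = i;
        buf[i + 1] = i + 1;
        buf[i + 2] = i + 2;
        buf[i + 3] = i + 3;
        buf[i + 4] = i + 4;
        buf[i + 5] = i + 5;
        buf[i + 6] = i + 6;
        buf[i + 7] = i + 7;
    }

    return 0;
}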

So, is it faster?

$ time ./loop

real    0m1.209s
user    0m0.388s
sys     0m0.820s

$ time ./loop-unroll

real    0m1.056s
user    0m0.244s
sys     0m0.812s

Yes! Great. We could suggest a number of reasons why, but it would be interesting to get some hard statistics to back them up. Thanks to the Itanium performance counters, we can. Let's start by seeing how the cycles are being spent.

$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop
 547923609 CPU_CYCLES
1179995250 IA64_INST_RETIRED_THIS
    104431 NOPS_RETIRED
 154285534 BACK_END_BUBBLE_ALL

$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop-unroll
 311570590 CPU_CYCLES
1327451286 IA64_INST_RETIRED_THIS
 147560435 NOPS_RETIRED
  16088889 BACK_END_BUBBLE_ALL

Now, the Itanium can issue up to six instructions (IA64_INST_RETIRED_THIS) per cycle (CPU_CYCLES), as two bundles of three instructions. However, instructions in the same instruction group can't have register dependencies between them (the ;; stops in the listings above mark the group boundaries), and branches and interruptions play havoc, so sometimes you need to pad with no-ops (this is why you need a good compiler). We can work out some figures from this, namely the useful instructions per cycle, i.e. (instructions retired - NOPs) / cycles.

Test     Useful instructions/cycle
loop     (1179995250 - 104431) / 547923609 = 2.15
unroll   (1327451286 - 147560435) / 311570590 = 3.79
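
(If you would rather script this than reach for a calculator, the sum is trivial; the little program below is mine, not anything pfmon produces.)

#include <stdio.h>

/* useful instructions per cycle = (retired - nops) / cycles */
static double useful_ipc(double retired, double nops, double cycles)
{
    return (retired - nops) / cycles;
}

int main(void)
{
    printf("loop:   %.2f\n", useful_ipc(1179995250, 104431, 547923609));
    printf("unroll: %.2f\n", useful_ipc(1327451286, 147560435, 311570590));
    return 0;
}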

This tells us that in any given cycle the unrolled version is doing more work (ideally we'd like that to approach the theoretical maximum of 6). So what is the simple loop version doing with all those cycles? Bubbles are where the CPU is waiting for something; a prime candidate is moving data from the caches into registers. Bubbles make up (154285534/547923609) 28% of the cycles in the loop version, but only (16088889/311570590) 5% of the cycles in the unrolled version. This means the processor is spending a lot less time waiting for things to happen with the unrolled version.

We can drill down further to see what bubbles are occurring.

$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop
 14229581 BE_L1D_FPU_BUBBLE_ALL
    97669 BE_EXE_BUBBLE_ALL
     1260 BE_RSE_BUBBLE_ALL
131354866 BACK_END_BUBBLE_FE

$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop-unroll
22676169 BE_L1D_FPU_BUBBLE_ALL
   83064 BE_EXE_BUBBLE_ALL
    1263 BE_RSE_BUBBLE_ALL
  718850 BACK_END_BUBBLE_FE

We are now breaking the stalls down to see where they are occurring. BE_L1D_FPU bubbles are essentially related to micro-architectural issues with the caches; things like the L1D cache being overwhelmed and interactions with the hardware page-table walker. BE_EXE bubbles are more simply described as waiting for data to come from the caches (or even main memory) into registers. BE_RSE bubbles are related to the register stack engine, and front-end bubbles (BACK_END_BUBBLE_FE) are where there are not enough instructions coming from the front-end for the back-end to issue to the functional units of the processor.

In this case, the major cause of problems for the simple loop seems to be that not enough instructions are getting through to keep the processor busy. We can try to pin down what is happening in the front-end:

$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop
   27727 FE_BUBBLE_IMISS
    4910 FE_BUBBLE_TLBMISS
98094189 FE_BUBBLE_BRANCH
52615907 FE_BUBBLE_BUBBLE

$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop-unroll
780729 FE_BUBBLE_IMISS
  4858 FE_BUBBLE_TLBMISS
291620 FE_BUBBLE_BRANCH
111089 FE_BUBBLE_BUBBLE

What we can see here is that the increased code size of the unrolled version causes a lot more i-cache misses, as expected. However, the real problem for the simple loop is the time the front-end spends dealing with the branch. This is not surprising, since rumour has it there is a front-end buffer which fills up after 3 repeated cycles with branch instructions; apparently this is a problem addressed in the new Montecito chips (if anyone has one, I'll update this if you send me the numbers). Since we are doing 393216000 branches, with the whole loop body fitting in a single bundle, it's not surprising we've hit it!

Just to confirm, we can break down the BE_EXE bubbles. Remember, BE_EXE bubbles are caused by stalls in the execution phase, primarily waiting on cache latencies or inter-register dependencies. Looking at our code, we don't re-use the same register over and over, so we would not expect to spend a long time waiting on registers, and indeed this is what we see.

$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop
79601 BE_EXE_BUBBLE_GRALL
  919 BE_EXE_BUBBLE_GRGR

$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop-unroll
76627 BE_EXE_BUBBLE_GRALL
  918 BE_EXE_BUBBLE_GRGR

Therefore, in this case, the speed increase isn't coming from better cache behaviour as we might have expected, but from a greater ability to keep the processor busy with instructions between branches.
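
For contrast, the sort of code where you would expect BE_EXE bubbles to dominate is something with a serial dependence chain, where each instruction has to wait on the result of the one before it; the classic example is pointer chasing (just an illustration, not something measured here):

struct node {
    struct node *next;
};

/* each load depends on the value returned by the previous load, so
   the core sits waiting on registers and caches instead of issuing
   independent stores the way our loop does */
static long chase(struct node *n)
{
    long count = 0;

    while (n) {
        n = n->next;
        count++;
    }
    return count;
}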

We can also examine the major events happening in the unrolled case. Here we see BE_L1D_FPU bubbles being the major overhead. On Itanium floating-point data bypasses the L1 cache (it is too big, and not re-used enough, to make caching it there useful), hence the combined name of the event. These bubbles can be broken down into the following events (measured over two runs, since you can only count 4 events at a time).

$ pfmon -e be_l1d_fpu_bubble_l1d,be_l1d_fpu_bubble_l1d_dcurecir,\
  be_l1d_fpu_bubble_l1d_tlb,be_l1d_fpu_bubble_l1d_stbufrecir ./loop-unroll
 9623193 BE_L1D_FPU_BUBBLE_L1D_DCURECIR
     414 BE_L1D_FPU_BUBBLE_L1D_TLB
     282 BE_L1D_FPU_BUBBLE_L1D_STBUFRECIR
$ pfmon -e be_l1d_fpu_bubble_l1d_fullstbuf,\
    be_l1d_fpu_bubble_l1d_l2bpress,be_l1d_fpu_bubble_l1d_tlb ./loop-unroll
      23 BE_L1D_FPU_BUBBLE_L1D_FULLSTBUF
17220177 BE_L1D_FPU_BUBBLE_L1D_L2BPRESS
     425 BE_L1D_FPU_BUBBLE_L1D_TLB

These events expose micro-architectural details, the full extent of which can only be gleaned by reading the processor manuals (a PhD in computer architecture probably helps too!). The store buffers are not an issue here -- they are clearly able to keep up with our writes. The TLB is also not involved; we have pretty good TLB coverage thanks to the virtual linear page table and the hardware page-table walker.

The two clues are DCURECIR and L2BPRESS. Essentially, DCURECIR happens when references have to "re-circulate"; this is a side-effect of the data not being available. The L2BPRESS figure suggests why: this event happens when the L2 cache cannot handle any more requests. The L2 cache keeps a queue of outstanding requests; when this queue is full (for example, full of long-latency requests fetching data from main memory) the L2 cache exerts "back-pressure", which stops the L1 cache from issuing new requests. This suggests the program is chewing through a lot of data and not using the caches very effectively, which is exactly what it is doing. Pre-fetching hints can help (though, as described in a previous entry on hacking Enigma, they don't always), but in this case there probably isn't any spare bandwidth to pre-fetch in, because essentially no processing is happening between the stores.
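
(If you did want to experiment with pre-fetching here, gcc's __builtin_prefetch is the easy way to emit the hint instructions, which should come out as lfetch on Itanium. The sketch below is untested, the distance of 64 ints ahead is a guess, and as argued above it probably won't buy much since there is no real work to overlap the memory traffic with.)

int fn_prefetch(int *buf)
{
    int i;

    for (i = 0; i < (SIZE / sizeof(int)); i += 1) {
        /* hint that we will write this region soon: the second
           argument (1) means "for write", the third (0) means low
           temporal locality; prefetching a little past the end of
           the buffer should be harmless as the hint is non-faulting */
        __builtin_prefetch(&buf[i + 64], 1, 0);
        buf[i] = i;
    }

    return 0;
}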

What does all this mean? I don't know, but playing with performance monitors is fun!