I was playing with loop unrolling, and decided to investigate it a little. The basis is this very simple program which looks like it could do with some loop unrolling.
```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define SIZE (1500 * 1024 * 1024)

int fn(int *buf)
{
    int i;

    for (i = 0; i < (SIZE / sizeof(int)); i += 1)
        buf[i] = i;

    return 0;
}

int main(void)
{
    int *buf = malloc(SIZE);
    assert(buf);
    return fn(buf);
}
```
First, let's have a look at what happens to the function when -funroll-loops is applied. Below is the disassembly, with some comments on what each line is doing.
Non-optimised
```
                           | nop.m 0x0
r14=0                      | mov r14=r0
save the "loop counter"    | mov.i r2=ar.lc
                           | nop.m 0x0
one less than iterations   | movl r15=0x176fffff;;
                           | nop.m 0x0
initialise loop counter    | mov.i ar.lc=r15
                           | nop.i 0x0;;
                           | loop:
*buf = i, buf++            | st4 [r32]=r14,4
i++                        | adds r14=1,r14
dec lc, branch if > 0      | br.cloop.sptk.few .loop
                           | nop.m 0x0
restore loop counter       | mov.i ar.lc=r2
```
Loop unrolled
```
                           | nop.m 0x0
r26 = 0                    | mov r26=r0
save lc                    | mov.i r2=ar.lc
                           | nop.m 0x0
iter / 32                  | movl r14=0x2edffff;;
                           | nop.m 0x0
setup loop counter         | mov.i ar.lc=r14
                           | nop.i 0x0
                           | loop:
r3 = buf                   | mov r3=r32
r8 = 1                     | adds r8=1,r26
r25 = 3                    | adds r25=3,r26
r23 = buf[3]               | adds r23=12,r32
r24 = 4                    | adds r24=4,r26
r22 = buf[4]               | adds r22=16,r32;;
buf[0] = 0; r3=buf+1       | st4 [r3]=r26,4
r21 = 5                    | adds r21=5,r26
r19 = buf[5]               | adds r19=20,r32
r20 = 6                    | adds r20=6,r26
r18 = buf[6]               | adds r18=24,r32
r16 = 7                    | adds r16=7,r26;;
                           | nop.m 0x0
buf[1] = 1, r3=buf+2       | st4 [r3]=r8,4
r15 = buf[7]               | adds r15=28,r32
r17 = 2                    | adds r17=1,r8
r26 += 8                   | adds r26=8,r26
buf += 8 (for next iter)   | adds r32=32,r32;;
buf[2] = 2                 | st4 [r3]=r17
buf[3] = 3                 | st4 [r23]=r25
                           | nop.i 0x0
buf[4] = 4                 | st4 [r22]=r24
buf[5] = 5                 | st4 [r19]=r21
                           | nop.i 0x0
buf[6] = 6                 | st4 [r18]=r20
buf[7] = 7                 | st4 [r15]=r16
loop, dec branch           | br.cloop.sptk.few .loop
                           | nop.m 0x0
restore lc                 | mov.i ar.lc=r2
                           | br.ret.sptk.many b0;;
```
We can see that in the first case the loop counter is set up to run 393216000 times (i.e. 1500*1024*1024/sizeof(int)) and a very simple loop of storing the value and incrementing is undertaken. In the second version the code becomes more complex. The loop now executes 49152000 times (i.e. 1500*1024*1024/sizeof(int)/8), so we can infer that each loop iteration writes 8 unrolled values into our buf array. Indeed, when pulled apart we can see the 8 stores (the numbers in the comments just reflect the first time through).
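In C terms the transformation is roughly equivalent to unrolling the loop body by hand eight times. The sketch below just illustrates the idea and is not gcc's literal output (the real code also interleaves the address and value computations across bundles, as the listing above shows); fn_unrolled is a hypothetical name, and it relies on the iteration count being divisible by 8, which it is here.

```c
/* Sketch of an 8-way unroll of fn(); illustrative only, not gcc's output.
 * SIZE/sizeof(int) = 393216000 is divisible by 8, so no clean-up loop
 * for a remainder is needed. */
int fn_unrolled(int *buf)
{
    unsigned int i;

    for (i = 0; i < (SIZE / sizeof(int)); i += 8) {
        buf[i + 0] = i + 0;
        buf[i + 1] = i + 1;
        buf[i + 2] = i + 2;
        buf[i + 3] = i + 3;
        buf[i + 4] = i + 4;
        buf[i + 5] = i + 5;
        buf[i + 6] = i + 6;
        buf[i + 7] = i + 7;
    }
    return 0;
}
```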
So, is it faster?
```
$ time ./loop
real    0m1.209s
user    0m0.388s
sys     0m0.820s

$ time ./loop-unroll
real    0m1.056s
user    0m0.244s
sys     0m0.812s
```
Yes! Great. We can suggest a number of reasons why, but it would be interesting to get some hard statistics to back them up. Thanks to the Itanium performance counters, we can. First, let's start by seeing how the cycles are being spent.
```
$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop
 547923609 CPU_CYCLES
1179995250 IA64_INST_RETIRED_THIS
    104431 NOPS_RETIRED
 154285534 BACK_END_BUBBLE_ALL

$ pfmon -e cpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all ./loop-unroll
 311570590 CPU_CYCLES
1327451286 IA64_INST_RETIRED_THIS
 147560435 NOPS_RETIRED
  16088889 BACK_END_BUBBLE_ALL
```
Now, the Itanium can issue up to six instructions (IA64_INST_RETIRED_THIS) per cycle (CPU_CYCLES), in two bundles of three instructions. However, you can't have dependencies between registers within an instruction group (the ;; stops in the listings above mark the group boundaries), and branches and interruptions play havoc, so sometimes you need to pad with no-ops (this is why you need a good compiler). We can work out some figures from this, namely the useful instructions per cycle (i.e. (instructions retired - nop instructions) / cycles).
| Test | Useful instructions/cycle |
|---|---|
| loop | (1179995250 - 104431) / 547923609 = 2.15 |
| unroll | (1327451286 - 147560435) / 311570590 = 3.78 |
This tells us that for each given cycle the unrolled version is doing more work (ideally we'd like that to be at the theoretical maximum of 6). So what is the non-unrolled version doing with all this time? Bubbles are where the CPU is waiting for something; a prime candidate is moving data from the caches into registers. Bubbles make up (154285534/547923609) 28% of the cycles in the loop version, but only (16088889/311570590) 5% of the cycles in the unrolled version. This means the processor spends a lot less time waiting for things to happen with the unrolled version.
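For reference, the derived figures above are just simple arithmetic on the raw counts. A trivial sketch, with the counter values from the two runs hard-coded:

```c
#include <stdio.h>

/* Sketch of the derived metrics used above: useful instructions per cycle
 * and the fraction of cycles lost to back-end bubbles, computed from the
 * pfmon counts of the two runs. */
static void metrics(const char *name, double cycles, double insns,
                    double nops, double bubbles)
{
    printf("%-7s %.2f useful insns/cycle, %.0f%% bubbles\n",
           name, (insns - nops) / cycles, 100.0 * bubbles / cycles);
}

int main(void)
{
    metrics("loop",   547923609, 1179995250,    104431, 154285534);
    metrics("unroll", 311570590, 1327451286, 147560435,  16088889);
    return 0;
}
```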
We can drill down further to see what bubbles are occurring.
```
$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop
 14229581 BE_L1D_FPU_BUBBLE_ALL
    97669 BE_EXE_BUBBLE_ALL
     1260 BE_RSE_BUBBLE_ALL
131354866 BACK_END_BUBBLE_FE

$ pfmon -e be_l1d_fpu_bubble_all,be_exe_bubble_all,be_RSE_bubble_all,Back_end_bubble_fe ./loop-unroll
 22676169 BE_L1D_FPU_BUBBLE_ALL
    83064 BE_EXE_BUBBLE_ALL
     1263 BE_RSE_BUBBLE_ALL
   718850 BACK_END_BUBBLE_FE
```
We are now breaking the stalls down to see where they are occurring. L1D_FPU bubbles are essentially related to micro-architectural issues with the caches; things like the L1D cache being overwhelmed and interactions with the hardware page-table walker. BE_EXE bubbles are more simply described as waiting for data to come from the caches (or even main memory) into registers. BE_RSE bubbles are related to the register stack engine, and "front-end" bubbles (BACK_END_BUBBLE_FE) are where there are not enough instructions coming from the front-end for the back-end to issue to the processor's functional units.
In this case, the major cause of problems for the simple loop seems to be not being able to get enough instructions through to keep the processor busy. We can try to pin down what is happening in the front-end:
```
$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop
   27727 FE_BUBBLE_IMISS
    4910 FE_BUBBLE_TLBMISS
98094189 FE_BUBBLE_BRANCH
52615907 FE_BUBBLE_BUBBLE

$ pfmon -e fe_bubble_imiss,fe_bubble_tlbmiss,fe_bubble_branch,fe_bubble_bubble ./loop-unroll
  780729 FE_BUBBLE_IMISS
    4858 FE_BUBBLE_TLBMISS
  291620 FE_BUBBLE_BRANCH
  111089 FE_BUBBLE_BUBBLE
```
What we can see here is that the increased code size of the unrolled version causes a lot more i-cache misses, as expected. However, the real problem for the simple loop version is the time the front-end spends dealing with the branch. This is not surprising, since there is a rumour of a front-end buffer which fills up after 3 repeated cycles with branch instructions; apparently this is addressed in the new Montecito chips (if anyone has one, I'll update this if you send me the numbers). Since we are doing 393216000 branches, all within a single bundle, it's not surprising we've hit it!
Just to confirm, we can break down the BE_EXE bubbles. Remember, BE_EXE bubbles are caused by stalls in the execution phase, primarily due to cache latencies or inter-register dependencies. Looking at our code, we don't re-use the same register over and over, so we would not expect to spend a long time waiting on registers, and indeed this is what we see.
```
$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop
79601 BE_EXE_BUBBLE_GRALL
  919 BE_EXE_BUBBLE_GRGR

$ pfmon -e be_exe_bubble_grall,be_exe_bubble_grgr ./loop-unroll
76627 BE_EXE_BUBBLE_GRALL
  918 BE_EXE_BUBBLE_GRGR
```
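(For contrast, the kind of code that would rack up GR-to-GR dependency stalls is something like a reduction, where each operation feeds the next. The sketch below is a hypothetical illustration, not code from this test.)

```c
/* Hypothetical contrast for the BE_EXE_BUBBLE_GRGR discussion above.
 * The first loop is like fn(): each store is independent of the last,
 * so there is no register-to-register dependency between iterations.
 * The second loop carries a dependency through 'sum', so every add
 * must wait for the previous one to complete. */
void independent(int *buf, int n)
{
    int i;
    for (i = 0; i < n; i++)
        buf[i] = i;          /* nothing feeds the next iteration */
}

long dependent(const int *buf, int n)
{
    long sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += buf[i];       /* each add depends on the previous sum */
    return sum;
}
```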
Therefore, in this case the speed increase isn't coming from better cache behaviour, as we might have expected, but from a greater ability to keep the processor busy with instructions between branches.
We can also examine the major events happening in the unrolled case. Here we see BE_L1D_FPU bubbles being the major overhead. On the Itanium, floating-point data bypasses the L1 cache (it is too big, and not re-used enough, for the cache to be useful), hence the two units sharing the one event acronym. These bubbles can be broken down into the following events (measured over two runs, since you can only count four events at a time).
```
$ pfmon -e be_l1d_fpu_bubble_l1d,be_l1d_fpu_bubble_l1d_dcurecir,\
  be_l1d_fpu_bubble_l1d_tlb,be_l1d_fpu_bubble_l1d_stbufrecir ./loop-unroll
9623193 BE_L1D_FPU_BUBBLE_L1D_DCURECIR
    414 BE_L1D_FPU_BUBBLE_L1D_TLB
    282 BE_L1D_FPU_BUBBLE_L1D_STBUFRECIR

$ pfmon -e be_l1d_fpu_bubble_l1d_fullstbuf,\
  be_l1d_fpu_bubble_l1d_l2bpress,be_l1d_fpu_bubble_l1d_tlb ./loop-unroll
      23 BE_L1D_FPU_BUBBLE_L1D_FULLSTBUF
17220177 BE_L1D_FPU_BUBBLE_L1D_L2BPRESS
     425 BE_L1D_FPU_BUBBLE_L1D_TLB
```
These events expose micro-architectural details, the full extent of which can only be gleaned by reading the processor manuals (a PhD in computer architecture probably helps too!). The store buffers are not an issue here -- the caches are more than able to keep up with our few writes. The TLB is also not involved; we have pretty good TLB coverage thanks to the virtual linear page table and the hardware page-table walker.
The two clues are DCURECIR and L2BPRESS. Essentially, DCURECIR happens when references have to "re-circulate"; this is a side-effect of the data not being available. The L2BPRESS figure suggests why: this event happens when the L2 cache can not handle any more requests. The L2 cache keeps a queue of outstanding requests; when this queue is full (for example, full of long-latency requests fetching data from main memory) the L2 cache applies "back-pressure", which stops the L1 cache from issuing new requests. This suggests the program is chewing through a lot of data and not using the cache very effectively, which is exactly what it does. Pre-fetching hints can help (though, as described in a previous entry on hacking Enigma, they may not always), but in this case there probably isn't any spare bandwidth to pre-fetch into, because essentially no processing is happening.
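For what it's worth, a pre-fetch hint in C terms would look something like the sketch below, using gcc's __builtin_prefetch (which on Itanium should become an lfetch). The function name and the pre-fetch distance of 64 ints (256 bytes) are arbitrary choices for illustration, and as argued above it is unlikely to help this particular loop, since there is no computation to overlap the memory traffic with.

```c
/* Sketch of fn() with a software pre-fetch hint added.  The distance
 * (64 ints ahead) is an arbitrary illustrative choice; a serious
 * version would tune it and issue one pre-fetch per cache line rather
 * than per element.  Pre-fetching past the end of the buffer is
 * harmless, since a pre-fetch is only a hint and cannot fault. */
int fn_prefetch(int *buf)
{
    unsigned int i;

    for (i = 0; i < (SIZE / sizeof(int)); i += 1) {
        __builtin_prefetch(&buf[i + 64], 1, 0);  /* for write, low locality */
        buf[i] = i;
    }
    return 0;
}
```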
What does all this mean? I don't know, but playing with performance monitors is fun!