GCC beats the Intel compiler at cracking Enigma?

I read about attempts to crack old Enigma messages on Bruce Schneier's Blog (then I think it hit Slashdot).

I've always liked Enigma since I read The Code Book, and I instantly grabbed the binary and put it on my local IA64 box, a 900MHz Itanium 2.

Usually for this sort of thing, the Intel compiler produces much better code than GCC. So I thought it might be fun to see how much better! Jeff Gilchrist gave me the incantation to run a little benchmark, and the results surprised me:

$ time ./enigma -c -f B:524:AZ:AAA -t B:524:AZ:ZZZ dict/00trigr.gen dict/00bigr.gen ct/orig1
gcc -O3

real    3m30.219s
user    3m25.492s
sys     0m0.256s

icc -O3

real    5m25.663s
user    5m18.900s
sys     0m0.252s

That seems to be the wrong way around! The first step seemed to be to fiddle with compiler options. -O3 can be a bit aggressive for the Intel compiler, and not knowing anything about the code base I wondered if profile guided optimisation might help (where you run the binary once with profiling code, and then recompile so the compiler has more information).

icc -O3 pgo

real    5m24.414s
user    5m18.324s
sys     0m0.264s

icc -O2 pgo

real    5m23.997s
user    5m12.680s
sys     0m0.284s

icc -O2

real    5m18.919s
user    5m12.224s
sys     0m0.212s

Clearly fiddling with options isn't helping, but -O2 seems to be the sensible baseline. The next step was to break out a profiler; q-syscollect showed me what was taking up the most time.

icc -O2 profile

% time      self     cumul     calls self/call  tot/call name
 51.87    147.51    147.51     50.2M     2.94u     2.94u triscore
 33.83     96.20    243.72     46.6M     2.07u     2.07u biscore
 10.36     29.47    273.19     13.1M     2.25u     2.27u icscore
  2.75      7.83    281.02         -         -         - hillclimb

gcc -O3 profile

% time      self     cumul     calls self/call  tot/call name
 54.71    102.35    102.35     49.9M     2.05u     2.05u triscore
 26.69     49.93    152.29     46.4M     1.08u     1.08u biscore
 12.69     23.74    176.03     13.1M     1.81u     1.89u icscore
  2.54      4.75    180.78     14.5k      327u      327u init_path_lookup_H_M3
  1.74      3.26    184.04         -         -         - hillclimb

So, triscore seems to be the place to start. Trying to analyse code that runs for 5 minutes doesn't lead to a good debugging experience, so I pulled triscore out so I could look at it separately.

$ time ./triscore-gcc

real    0m1.399s
user    0m1.384s
sys     0m0.004s

$ time ./triscore-icc

real    0m2.346s
user    0m2.316s
sys     0m0.000s

Ok, now we have a small test case that runs in a few seconds. The code I used is available here.
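
For some context on why triscore behaves the way it does below: roughly speaking, it scores a candidate plugboard setting by running the ciphertext through it and summing weights out of a 26x26x26 trigram table. The following is just my own self-contained sketch of that kind of loop (the signature is inferred from the gdb output further down; it is not the enigma-suite source, which is what the extracted test above actually uses):

/* toy-triscore.c -- an illustrative stand-in for the real triscore().
 * The table layout and the function signature are guesses based on the
 * gdb session further down; the point is that the hot loop is a chain
 * of data-dependent table lookups, so it lives or dies by the data
 * cache. */
#include <stdio.h>
#include <stdlib.h>

#define A 26
static int trigraph[A][A][A];                 /* trigram log-weights */

static int triscore(const int *stbrett, const int *ciphertext, int len)
{
    int score = 0;
    for (int i = 0; i < len - 2; i++) {
        int c1 = stbrett[ciphertext[i]];      /* plugboard substitution */
        int c2 = stbrett[ciphertext[i + 1]];
        int c3 = stbrett[ciphertext[i + 2]];
        score += trigraph[c1][c2][c3];        /* the cache-hostile lookup */
    }
    return score;
}

int main(void)
{
    int stbrett[A], text[1000];
    long total = 0;

    for (int i = 0; i < A; i++)
        stbrett[i] = i;                       /* identity plugboard */
    for (int i = 0; i < 1000; i++)
        text[i] = rand() % A;

    for (int run = 0; run < 50000; run++)     /* long enough to time */
        total += triscore(stbrett, text, 1000);

    printf("%ld\n", total);                   /* keep the result live */
    return 0;
}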

Time to pull out Caliper and start peeking under the hood.

Caliper dcache output gcc
=========================
L1 data cache miss percentage:
   9.79 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)

Percent of data references accessing L1 data cache:
  99.89 = 100 * (L1D_READS / DATA_REFERENCES)

Caliper dcache output icc
=========================
L1 data cache miss percentage:
  17.63 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)

Percent of data references accessing L1 data cache:
  81.76 = 100 * (L1D_READS / DATA_REFERENCES)

We can see the average latency for the misses:

% Total                                        Avg.
 Dcache  Cumulat        Sampled       Dcache  Dcache
Latency    % of          Dcache      Latency  Laten.
 Cycles    Total         Misses       Cycles  Cycles   Load Module
------------------------------------------------------------------
100.00   100.00            1919        12664     6.6   triscore-icc
100.00   100.00             995         6919     7.0   triscore-gcc

Together with the raw miss counts, this lets us approximate how much extra time we spend waiting for the cache:

gcc
L1D_READ_MISSES.ALL     x___   0       91013769

icc
L1D_READ_MISSES.ALL     x___   0      166681605

With a cycle taking ~1.1ns (it's a 900MHz part) and an average miss latency of around 7 cycles, that's (166681605 - 91013769) * 7 * 1.1ns ≈ 0.6 seconds of extra time spent waiting for the cache.
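
For the record, here's that back-of-the-envelope calculation written out, using only numbers from the output above (and the 900MHz clock of the box mentioned at the top):

/* cache-cost.c -- sanity check of the "extra time waiting for cache"
 * estimate.  Miss counts and the ~7 cycle average latency come from
 * the Caliper output above; the cycle time is the 900MHz Itanium 2. */
#include <stdio.h>

int main(void)
{
    long icc_misses = 166681605;
    long gcc_misses = 91013769;
    double miss_latency_cycles = 7.0;        /* avg from the latency table */
    double ns_per_cycle = 1000.0 / 900.0;    /* ~1.1ns at 900MHz */

    double extra_seconds = (icc_misses - gcc_misses)
                           * miss_latency_cycles * ns_per_cycle * 1e-9;
    printf("extra L1D miss stall: ~%.1f seconds\n", extra_seconds);
    return 0;
}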

Let's have a look at what the TLB thinks about all this.

icc
===
Percentage of data references covered by L1 and L2 DTLB:
  99.02 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)

Percentage of data references covered by the HPW:
   0.00 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)

Percentage of data references covered by software trap:
   0.98 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)

gcc
===
Percentage of data references covered by L1 and L2 DTLB:
  100.00 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)

Percentage of data references covered by the HPW:
   0.00 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)

Percentage of data references covered by software trap:
   0.00 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)

Software trap? That sounds slow. We can ask Caliper to show us the instructions that are causing the software trap.

Function Details
------------------------------------------------------------------------------------------------------
% Total                                                 %       %       %
Sampled       Sampled      DTLB      DTLB      DTLB   DTLB    DTLB    DTLB          Line|
   DTLB          DTLB       L2        HPW      Soft    L2      HPW    Soft          Slot|  >Statement|
 Misses        Misses     Fills     Fills     Fills   Fill    Fill    Fill     Col,Offset  Instruction
------------------------------------------------------------------------------------------------------
100.00     [test-icc::triscore, 0x40000000000008c0, triscore.c]
                 600       429         0       171    71.5     0.0    28.5              6  Function Totals
         ----------------------------------------------------------------------------------------------
[** cut out other stuff **]
                 144         0         0       144     0.0     0.0   100.0       0x0c70:0    M_      lfetch        [r87] ;;
                   0         0         0         0     0.0     0.0     0.0             :1    M       lfetch        [r67]
                   0         0         0         0     0.0     0.0     0.0             :2    I       nop.i         0

Ah, huh! lfetch is an instruction that pre-fetches data you know you're about to use into the cache. After discussions with the Caliper people, it turns out this doesn't actually raise a software fault to the operating system, but it does go to memory via the hashed page table walker (more on that later). Let's check out that load.

(gdb) break *(triscore + 0xc70)
Breakpoint 1 at 0x4000000000001530: file triscore.c, line 26.
(gdb) r
Starting program: /home/ianw/programs/enigma-test/extract/enigma-test/triscore-icc

Breakpoint 1, 0x4000000000001530 in triscore (stbrett=0x0, ciphertext=0x0, len=0) at triscore.c:26
26        for (i = 2; i < len-15; i += 16) {
(gdb) print/x $r87
$1 = 0x600010010fe8f9c0
(gdb) print/x *($r87)
Cannot access memory at address 0x600010010fe8f9c0
(gdb) print/x $r67
$2 = 0x6000000000005be8
(gdb) print/x *($r67)
$3 = 0x0

Well, that address in r87 looks really bogus. It seems that ICC is being very aggressive with its prefetching algorithm, to the point where it gets it wrong. This brings us close to the answer, but let's just check out a few other things first.

Firstly, these programs use a lot of local registers. Maybe it is register stack engine traffic; after all, ICC might be more aggressive with register allocation.

-- gcc -------------------------------------------
                            PLM
Event Name                 U..K  TH          Count
--------------------------------------------------
BACK_END_BUBBLE.ALL        x___   0      249132805
BE_RSE_BUBBLE.ALL          x___   0           3727
BE_RSE_BUBBLE.OVERFLOW     x___   0           1792
BE_RSE_BUBBLE.UNDERFLOW    x___   0           1935
--------------------------------------------------

-- icc -------------------------------------------
                            PLM
Event Name                 U..K  TH          Count
--------------------------------------------------
BACK_END_BUBBLE.ALL        x___   0      440096824
BE_RSE_BUBBLE.ALL          x___   0           4128
BE_RSE_BUBBLE.OVERFLOW     x___   0           1792
BE_RSE_BUBBLE.UNDERFLOW    x___   0           2336
--------------------------------------------------

Nup, but we can see that the pipeline stalls are coming from the backend (BACK_END_BUBBLE.ALL). This mostly rules out bad instruction scheduling (i.e. the processor not having enough to do) and probably shows why PGO didn't make much difference before.

By poking into the backend pipeline a bit more, we can see the overhead really is coming from increased TLB activity.

-- gcc --------------------------------------------------
                                   PLM
Event Name                        U..K  TH          Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR    x___   0      100968826
BE_L1D_FPU_BUBBLE.L1D_HPW         x___   0           8937
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS    x___   0           1358
BE_L1D_FPU_BUBBLE.L1D_TLB         x___   0        6354270
---------------------------------------------------------

-- icc --------------------------------------------------
                                   PLM
Event Name                        U..K  TH          Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR    x___   0      431700491
BE_L1D_FPU_BUBBLE.L1D_HPW         x___   0      181649893
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS    x___   0           3207
BE_L1D_FPU_BUBBLE.L1D_TLB         x___   0       33004161
---------------------------------------------------------

Here we have confirmation of what appears to be going on. The time spent waiting for the hardware page table walker is orders of magnitude greater with ICC.

On Itanium, there is a cache of page table entries kept in memory, acting as a third-level TLB. When translating an address, the first-level TLB is checked, then the larger and slower second level, and finally the hardware goes out to main memory and checks the "virtually hashed page table". Obviously this third level, being in system memory and further down the memory hierarchy, is much slower.

The lfetch instruction asks to bring in a cache line for a far-out, bogus virtual address. This requires a TLB mapping, which obviously doesn't exist, so the normal TLB miss path takes over, finally ending up going to main memory for the hashed page table check. The hashed page table page itself also has to be pulled into the cache, causing even more cache misses.

At this point the processor can't find a mapping and gives up; only the more aggressive lfetch.fault variant would push the fault through to the operating system (which would then walk the OS page tables).
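
To make that concrete, here is roughly the kind of loop transformation at issue, written out by hand. This is only a sketch: GCC's __builtin_prefetch stands in for the lfetch that ICC emits, and the prefetch distance and table layout are my own inventions, not taken from ICC's actual output.

/* Hand-written approximation of an over-aggressive software prefetch.
 * Like lfetch, __builtin_prefetch never faults on a bad address -- but
 * the address still has to be translated, so a wild prefetch sends the
 * hardware walker off to the hashed page table for nothing. */
int score_blocks(const int *text, int len,
                 const int table[26][26][26], const int *stbrett)
{
    int score = 0;
    /* same loop shape as triscore.c line 26 in the gdb session above */
    for (int i = 2; i < len - 15; i += 16) {
        /* Prefetch data for an iteration far in the future.  Towards
         * the end of the loop this address runs well past the buffer,
         * and a prefetch computed from data that hasn't been loaded
         * yet can be a completely wild address, like the one in r87. */
        __builtin_prefetch(&text[i + 512]);

        for (int j = i; j < i + 16; j++)
            score += table[stbrett[text[j - 2]]]
                          [stbrett[text[j - 1]]]
                          [stbrett[text[j]]];
    }
    return score;
}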

As a final step, let's recompile and run the Enigma test with -no-prefetch.

real    3m39.243s
user    3m33.648s
sys     0m0.176s

Still about 9 seconds slower than gcc, but close enough for me! I still think that with this code, ICC is being too smart.

So, what's the moral of the story? Firstly, check that your code actually does run faster with another compiler (I only started testing on the assumption that I would find out how much faster ICC was). Secondly, if your code is running slowly, check for aggressive prefetching of cache lines; it might be doing more harm than good by wasting time looking up TLB entries that will never exist. Thirdly, Caliper (and the underlying perfmon and Itanium PMU system) is really cool, and next time someone says Itanium is crap, ask them how they would track a problem like this down on the processor du jour!