I read about attempts to crack old Enigma messages on Bruce Schneier's
Blog (then I think it hit Slashdot). I've always liked Enigma since I
read The Code Book, and I instantly grabbed the binary and put it on my
local IA64 box, a 900MHz Itanium 2.
Usually for this sort of thing, the Intel compiler produces much better
code than GCC. So I thought it might be fun to see how much better! Jeff
Gilchrist gave me the incantation to do a little benchmark, and the
results surprised me:
$ time ./enigma -c -f B:524:AZ:AAA -t B:524:AZ:ZZZ dict/00trigr.gen dict/00bigr.gen ct/orig1
gcc -O3
real 3m30.219s
user 3m25.492s
sys 0m0.256s
icc -O3
real 5m25.663s
user 5m18.900s
sys 0m0.252s
That seems to be the wrong way around! The first step seemed to be
to fiddle with compiler options. -O3 can be a bit aggressive for the
Intel compiler, and not knowing anything about the code base I wondered
if profile guided optimisation might help (where you run the binary once
with profiling code, and then recompile so the compiler has more
information).
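For reference, the two-pass PGO build looks roughly like the following; I'm going from memory on the exact icc flags here, so treat these lines as indicative rather than exact:
icc -O3 -prof-gen -o enigma *.c     # instrumented build
./enigma -c -f B:524:AZ:AAA ...     # training run, writes out profile data
icc -O3 -prof-use -o enigma *.c     # rebuild using the collected profile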
icc -O3 pgo
real 5m24.414s
user 5m18.324s
sys 0m0.264s
icc -O2 pgo
real 5m23.997s
user 5m12.680s
sys 0m0.284s
icc -O2
real 5m18.919s
user 5m12.224s
sys 0m0.212s
Clearly we're not helping here, but -O2 seems to be the place to
start. The next step was to run a profiler and see where the time was
going; q-syscollect showed me what was taking up the most time.
icc -O2 profile
% time self cumul calls self/call tot/call name
51.87 147.51 147.51 50.2M 2.94u 2.94u triscore
33.83 96.20 243.72 46.6M 2.07u 2.07u biscore
10.36 29.47 273.19 13.1M 2.25u 2.27u icscore
2.75 7.83 281.02 - - - hillclimb
gcc -O3 profile
% time self cumul calls self/call tot/call name
54.71 102.35 102.35 49.9M 2.05u 2.05u triscore
26.69 49.93 152.29 46.4M 1.08u 1.08u biscore
12.69 23.74 176.03 13.1M 1.81u 1.89u icscore
2.54 4.75 180.78 14.5k 327u 327u init_path_lookup_H_M3
1.74 3.26 184.04 - - - hillclimb
So, triscore seems to be the place to start. Iterating on code that
runs for 5 minutes doesn't lead to a good debugging experience, so I
pulled triscore out into a stand-alone test so I could analyse it separately.
$ time ./triscore-gcc
real 0m1.399s
user 0m1.384s
sys 0m0.004s
$ time ./triscore-icc
real 0m2.346s
user 0m2.316s
sys 0m0.000s
Ok, now we have a small test case that runs in a few seconds. The code I
used is available
here.
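The extracted test is nothing more exotic than a driver that hammers the function; a minimal sketch of its shape (the triscore() prototype here is inferred from the gdb session further down, so treat it as an assumption rather than the real interface):

/* Hypothetical stand-alone driver: call triscore() in a tight loop on a
 * fixed ciphertext so that the function dominates the run time. */
#include <stdio.h>

#define LEN        250
#define ITERATIONS 1000000

extern int triscore(const int *stbrett, const int *ciphertext, int len);

int main(void)
{
    int stbrett[26];
    int ciphertext[LEN];
    int i;
    long sum = 0;

    for (i = 0; i < 26; i++)       /* identity plugboard */
        stbrett[i] = i;
    for (i = 0; i < LEN; i++)      /* dummy ciphertext, values 0..25 */
        ciphertext[i] = i % 26;

    for (i = 0; i < ITERATIONS; i++)
        sum += triscore(stbrett, ciphertext, LEN);

    printf("%ld\n", sum);          /* stop the calls being optimised away */
    return 0;
}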
Time to pull out Caliper and start
peeking under the hood.
Caliper dcache output gcc
=========================
L1 data cache miss percentage:
9.79 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)
Percent of data references accessing L1 data cache:
99.89 = 100 * (L1D_READS / DATA_REFERENCES)
Caliper dcache output icc
=========================
L1 data cache miss percentage:
17.63 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)
Percent of data references accessing L1 data cache:
81.76 = 100 * (L1D_READS / DATA_REFERENCES)
We can see the average latency for the misses:
% Total Avg.
Dcache Cumulat Sampled Dcache Dcache
Latency % of Dcache Latency Laten.
Cycles Total Misses Cycles Cycles Load Module
------------------------------------------------------------------
100.00 100.00 1919 12664 6.6 triscore-icc
100.00 100.00 995 6919 7.0 triscore-gcc
That lets us approximate how much extra time the ICC version spends
waiting for the cache. The raw miss counts are:
gcc
L1D_READ_MISSES.ALL x___ 0 91013769
icc
L1D_READ_MISSES.ALL x___ 0 166681605
With a cycle taking ~1ns and an average miss latency of ~7 cycles,
(166681605 - 91013769) * 7 * 1ns works out to roughly half a second of
extra time spent waiting for the cache.
Let's have a look at what the TLB thinks about all this.
icc
===
Percentage of data references covered by L1 and L2 DTLB:
99.02 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)
Percentage of data references covered by the HPW:
0.00 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)
Percentage of data references covered by software trap:
0.98 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)
gcc
===
Percentage of data references covered by L1 and L2 DTLB:
100.00 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)
Percentage of data references covered by the HPW:
0.00 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)
Percentage of data references covered by software trap:
0.00 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)
Software trap? That sounds slow. We can ask Caliper to show us the
instructions that are causing the software trap.
Function Details
------------------------------------------------------------------------------------------------------
% Total % % %
Sampled Sampled DTLB DTLB DTLB DTLB DTLB DTLB Line|
DTLB DTLB L2 HPW Soft L2 HPW Soft Slot| >Statement|
Misses Misses Fills Fills Fills Fill Fill Fill Col,Offset Instruction
------------------------------------------------------------------------------------------------------
100.00 [test-icc::triscore, 0x40000000000008c0, triscore.c]
600 429 0 171 71.5 0.0 28.5 6 Function Totals
----------------------------------------------------------------------------------------------
[** cut out other stuff **]
144 0 0 144 0.0 0.0 100.0 0x0c70:0 M_ lfetch [r87] ;;
0 0 0 0 0.0 0.0 0.0 :1 M lfetch [r67]
0 0 0 0 0.0 0.0 0.0 :2 I nop.i 0
Ah, huh! lfetch is an instruction that pre-fetches data you know you're
about to use into the cache. After discussions with the Caliper people,
it turns out this doesn't actually raise a software fault to the
operating system, but it does go to memory via the hashed page table
walker (more on that later). Let's check out that load.
(gdb) break *(triscore + 0xc70)
Breakpoint 1 at 0x4000000000001530: file triscore.c, line 26.
(gdb) r
Starting program: /home/ianw/programs/enigma-test/extract/enigma-test/triscore-icc
Breakpoint 1, 0x4000000000001530 in triscore (stbrett=0x0, ciphertext=0x0, len=0) at triscore.c:26
26 for (i = 2; i < len-15; i += 16) {
(gdb) print/x $r87
$1 = 0x600010010fe8f9c0
(gdb) print/x *($r87)
Cannot access memory at address 0x600010010fe8f9c0
(gdb) print/x $r67
$2 = 0x6000000000005be8
(gdb) print/x *($r67)
$3 = 0x0
Well, that address in r87 looks really bogus. It seems that ICC is
being very aggressive with its prefetching, to the point of getting it
wrong. That brings us close to the answer, but it's worth ruling out a
few other things first.
Firstly, these programs use a lot of local registers. Maybe it is
register stack engine traffic; after all, ICC might be more aggressive
with register allocation.
-- gcc -------------------------------------------
PLM
Event Name U..K TH Count
--------------------------------------------------
BACK_END_BUBBLE.ALL x___ 0 249132805
BE_RSE_BUBBLE.ALL x___ 0 3727
BE_RSE_BUBBLE.OVERFLOW x___ 0 1792
BE_RSE_BUBBLE.UNDERFLOW x___ 0 1935
--------------------------------------------------
-- icc -------------------------------------------
PLM
Event Name U..K TH Count
--------------------------------------------------
BACK_END_BUBBLE.ALL x___ 0 440096824
BE_RSE_BUBBLE.ALL x___ 0 4128
BE_RSE_BUBBLE.OVERFLOW x___ 0 1792
BE_RSE_BUBBLE.UNDERFLOW x___ 0 2336
--------------------------------------------------
Nup; but we can see that the pipeline stalls are coming from the backend
(BACK_END_BUBBLE.ALL). This mostly rules out bad instruction
scheduling (i.e. the processor not having enough to do) and probably
explains why PGO didn't make much difference before.
By poking into the backend pipeline even more, we can see the overhead
is really coming from increased TLB activity.
-- gcc --------------------------------------------------
PLM
Event Name U..K TH Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR x___ 0 100968826
BE_L1D_FPU_BUBBLE.L1D_HPW x___ 0 8937
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS x___ 0 1358
BE_L1D_FPU_BUBBLE.L1D_TLB x___ 0 6354270
---------------------------------------------------------
-- icc --------------------------------------------------
PLM
Event Name U..K TH Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR x___ 0 431700491
BE_L1D_FPU_BUBBLE.L1D_HPW x___ 0 181649893
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS x___ 0 3207
BE_L1D_FPU_BUBBLE.L1D_TLB x___ 0 33004161
---------------------------------------------------------
Here we have confirmation of what appears to be going on. The time spent
waiting for the hardware page table walker is orders of magnitude
greater with ICC.
On Itanium, a cache of the page tables is kept in memory, acting as a
third-level TLB. When a TLB miss is taken, the 1st-level TLB is checked
first, then the larger and slower second level, and finally the hardware
goes to main memory and checks the "virtually hashed page table".
Obviously this third level, being in system memory and lower down the
memory hierarchy, is much slower.
The lfetch instruction is asking to bring in a cache line for a bogus,
far-out virtual address. That requires a TLB mapping, which obviously
doesn't exist, so the normal TLB miss path takes over, finally ending
up going to main memory for the hashed page table check. The hashed
page table page itself also has to be moved into the cache, which causes
even more cache misses.
At that point the processor can't find the mapping and gives up; only an
aggressive lfetch.fault would send the fault through to the operating
system (which would then walk the OS page tables).
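To make the failure mode concrete, here is an illustrative sketch in C using GCC's __builtin_prefetch; it is not what ICC actually emitted, just the general pattern of prefetching ahead of a loop until the computed addresses run off the end of valid data:

#include <stdio.h>

/* Illustrative only: prefetching a fixed distance ahead of the loop
 * index means the later iterations compute prefetch addresses well past
 * the end of the array.  The prefetch itself is non-faulting, but on
 * Itanium the bogus address still has to be looked up in the TLB (and,
 * failing that, in the hashed page table) before it can be dropped. */
static long sum_with_prefetch(const int *a, int len)
{
    long total = 0;
    int i, j;

    for (i = 2; i < len - 15; i += 16) {
        __builtin_prefetch(&a[i + 256]);   /* may point far beyond a[len-1] */
        for (j = 0; j < 16; j++)
            total += a[i + j];
    }
    return total;
}

int main(void)
{
    int data[1024];
    int i;

    for (i = 0; i < 1024; i++)
        data[i] = i;
    printf("%ld\n", sum_with_prefetch(data, 1024));
    return 0;
}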
As a final step, let's recompile and run the Enigma test with
-no-prefetch.
real 3m39.243s
user 3m33.648s
sys 0m0.176s
Still about 9 seconds slower than gcc, but close enough for me! I still
think that with this code, ICC is being too smart.
So, moral of the story? Firstly, check that your code actually does run
faster with another compiler (I only started testing on the assumption
that I would find out how much faster ICC was). Secondly, if your code is
running slowly, check for aggressive prefetching of cache lines; it might
be doing more harm than good by wasting time looking up TLB entries
that never exist. Thirdly, Caliper (and the underlying perfmon and
Itanium PMU system) is really cool, and next time someone says Itanium
is crap, ask them how they would track this sort of problem down on the
processor du jour!