I read about attempts to crack old Enigma messages on Bruce Schneier's Blog (then I think it hit Slashdot).
I've liked Enigma ever since I read The Code Book, and I instantly grabbed the code and put it on my local IA64 box, a 900MHz Itanium 2.
Usually for this sort of thing, the Intel compiler produces much better code than GCC. So I thought it might be fun to see how much better! Jeff Gilchrist gave me the incantation to do a little benchmark, and the results surprised me:
$ time ./enigma -c -f B:524:AZ:AAA -t B:524:AZ:ZZZ dict/00trigr.gen dict/00bigr.gen ct/orig1
gcc -O3
real    3m30.219s
user    3m25.492s
sys     0m0.256s

icc -O3
real    5m25.663s
user    5m18.900s
sys     0m0.252s
That seems to be the wrong way around! The first step seemed to be to fiddle with compiler options. -O3 can be a bit aggressive for the Intel compiler, and not knowing anything about the code base I wondered if profile guided optimisation might help (where you run the binary once with profiling instrumentation, then recompile so the compiler has more information to work with).
icc -O3 pgo
real    5m24.414s
user    5m18.324s
sys     0m0.264s

icc -O2 pgo
real    5m23.997s
user    5m12.680s
sys     0m0.284s

icc -O2
real    5m18.919s
user    5m12.224s
sys     0m0.212s
Clearly we're not helping here, but -O2 seems to be the best place to start from. The next step was to profile and see where the time was actually going; q-syscollect showed me what was taking up the most time.
icc -O2 profile
% time     self    cumul    calls  self/call  tot/call  name
 51.87   147.51   147.51    50.2M      2.94u     2.94u  triscore
 33.83    96.20   243.72    46.6M      2.07u     2.07u  biscore
 10.36    29.47   273.19    13.1M      2.25u     2.27u  icscore
  2.75     7.83   281.02        -          -         -  hillclimb

gcc -O3 profile
% time     self    cumul    calls  self/call  tot/call  name
 54.71   102.35   102.35    49.9M      2.05u     2.05u  triscore
 26.69    49.93   152.29    46.4M      1.08u     1.08u  biscore
 12.69    23.74   176.03    13.1M      1.81u     1.89u  icscore
  2.54     4.75   180.78    14.5k       327u      327u  init_path_lookup_H_M3
  1.74     3.26   184.04        -          -         -  hillclimb
So, triscore seems to be the place to start. Working with code that takes 5 minutes per run doesn't lead to a good debugging experience, so I pulled triscore out into a standalone test so I could analyse it separately.
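Conceptually the extracted test is nothing more than a loop hammering the function. Here is a minimal sketch of that kind of harness, assuming a signature like int triscore(const int *stbrett, const int *ciphertext, int len); the parameter names come from the gdb session further down, while the types, array contents and iteration count are invented purely for illustration.

/* Minimal standalone harness sketch: link against the triscore()
 * pulled out of the enigma source.  The signature is an assumption
 * based on the gdb session below; the array sizes, contents and
 * iteration count are made up purely so there is something to time. */
#include <stdio.h>
#include <stdlib.h>

extern int triscore(const int *stbrett, const int *ciphertext, int len);

#define CT_LEN 1000      /* invented ciphertext length */
#define ITERS  1000000   /* enough calls to get a stable wall-clock time */

int main(void)
{
    int stbrett[26];
    int ciphertext[CT_LEN];
    long check = 0;
    int i;

    for (i = 0; i < 26; i++)
        stbrett[i] = i;                  /* identity plugboard */
    for (i = 0; i < CT_LEN; i++)
        ciphertext[i] = rand() % 26;     /* junk "ciphertext" */

    for (i = 0; i < ITERS; i++)
        check += triscore(stbrett, ciphertext, CT_LEN);

    printf("%ld\n", check);              /* stop the loop being optimised away */
    return 0;
}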
$ time ./triscore-gcc
real    0m1.399s
user    0m1.384s
sys     0m0.004s

$ time ./triscore-icc
real    0m2.346s
user    0m2.316s
sys     0m0.000s
Ok, now we have a small test case that runs in a few seconds. The code I used is available here.
Time to pull out Caliper and start peeking under the hood.
Caliper dcache output gcc
=========================
L1 data cache miss percentage:                         9.79
  = 100 * (L1D_READ_MISSES.ALL / L1D_READS)
Percent of data references accessing L1 data cache:   99.89
  = 100 * (L1D_READS / DATA_REFERENCES)

Caliper dcache output icc
=========================
L1 data cache miss percentage:                        17.63
  = 100 * (L1D_READ_MISSES.ALL / L1D_READS)
Percent of data references accessing L1 data cache:   81.76
  = 100 * (L1D_READS / DATA_REFERENCES)
We can see the average latency for the misses:
% Total                              Avg.
 Dcache  Cumulat  Sampled   Dcache  Dcache
Latency   % of     Dcache  Latency  Laten.
 Cycles   Total    Misses   Cycles  Cycles  Load Module
------------------------------------------------------------------
 100.00   100.00     1919    12664     6.6  triscore-icc
 100.00   100.00      995     6919     7.0  triscore-gcc
Which lets us approximate how much extra time we spend waiting for cache.
gcc    L1D_READ_MISSES.ALL    x___   0    91013769
icc    L1D_READ_MISSES.ALL    x___   0   166681605
With a cycle taking ~1.1ns (900MHz) and an average miss latency of ~7 cycles, that's (166681605 - 91013769) * 7 * 1.1ns = ~0.6 seconds of extra time waiting for the cache.
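The same back-of-the-envelope sum as a trivial program, using the counter values above; the cycle time and average latency are approximations, so this is only a ballpark figure.

/* Rough estimate of the extra time the icc build spends waiting on
 * L1D misses, using the counter values quoted above. */
#include <stdio.h>

int main(void)
{
    long gcc_misses = 91013769;
    long icc_misses = 166681605;
    double cycle_ns = 1.1;      /* ~1 cycle at 900MHz */
    double avg_latency = 7.0;   /* cycles, from the Caliper dcache output */

    double extra_ns = (icc_misses - gcc_misses) * avg_latency * cycle_ns;
    printf("~%.1f seconds of extra cache-miss latency\n", extra_ns / 1e9);
    return 0;
}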
Let's have a look at what the TLB thinks about all this.
icc
===
Percentage of data references covered by L1 and L2 DTLB:    99.02 %
  = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)
Percentage of data references covered by the HPW:            0.00 %
  = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)
Percentage of data references covered by software trap:      0.98 %
  = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)

gcc
===
Percentage of data references covered by L1 and L2 DTLB:   100.00 %
  = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)
Percentage of data references covered by the HPW:            0.00 %
  = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)
Percentage of data references covered by software trap:      0.00 %
  = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)
Software trap? That sounds slow. We can ask Caliper to show us the instructions that are causing the software trap.
Function Details
------------------------------------------------------------------------------------------------------
% Total                                      %     %     %
Sampled  Sampled  DTLB   DTLB   DTLB       DTLB  DTLB  DTLB   Line|
  DTLB    DTLB     L2     HPW    Soft       L2    HPW   Soft   Slot|       >Statement|
 Misses   Misses  Fills  Fills  Fills      Fill  Fill  Fill   Col,Offset  Instruction
------------------------------------------------------------------------------------------------------
 100.00                         [test-icc::triscore, 0x40000000000008c0, triscore.c]
            600    429      0    171      71.5   0.0  28.5            6  Function Totals
         ---------------------------------------------------------------------------------------------
         [** cut out other stuff **]
            144      0      0    144       0.0   0.0 100.0    0x0c70:0   M_  lfetch [r87] ;;
              0      0      0      0       0.0   0.0   0.0         :1    M   lfetch [r67]
              0      0      0      0       0.0   0.0   0.0         :2    I   nop.i 0
Ah, huh! lfetch is an operation to pre-fetch data you know you're about to use into the cache. After discussions with the Caliper people, it turns out this doesn't actually raise a software fault to the operating system, but it does go out to memory via the hashed page table walker (more on that later). Let's check out that load.
(gdb) break *(triscore + 0xc70)
Breakpoint 1 at 0x4000000000001530: file triscore.c, line 26.
(gdb) r
Starting program: /home/ianw/programs/enigma-test/extract/enigma-test/triscore-icc

Breakpoint 1, 0x4000000000001530 in triscore (stbrett=0x0, ciphertext=0x0, len=0) at triscore.c:26
26          for (i = 2; i < len-15; i += 16) {
(gdb) print/x $r87
$1 = 0x600010010fe8f9c0
(gdb) print/x *($r87)
Cannot access memory at address 0x600010010fe8f9c0
(gdb) print/x $r67
$2 = 0x6000000000005be8
(gdb) print/x *($r67)
$3 = 0x0
Well, that address in r87 looks really bogus. It seems that ICC is being very aggressive with its prefetching algorithm, to the point where it gets it wrong. This brings us close to the answer, but we can just check out a few other things.
Firstly, these programs use a lot of local registers. Maybe the extra time is register stack engine traffic; perhaps ICC is simply being more aggressive with register allocation.
-- gcc -------------------------------------------
PLM Event Name                  U..K  TH       Count
--------------------------------------------------
BACK_END_BUBBLE.ALL             x___   0   249132805
BE_RSE_BUBBLE.ALL               x___   0        3727
BE_RSE_BUBBLE.OVERFLOW          x___   0        1792
BE_RSE_BUBBLE.UNDERFLOW         x___   0        1935
--------------------------------------------------

-- icc -------------------------------------------
PLM Event Name                  U..K  TH       Count
--------------------------------------------------
BACK_END_BUBBLE.ALL             x___   0   440096824
BE_RSE_BUBBLE.ALL               x___   0        4128
BE_RSE_BUBBLE.OVERFLOW          x___   0        1792
BE_RSE_BUBBLE.UNDERFLOW         x___   0        2336
--------------------------------------------------
Nup; but we can see that the pipeline stalls are coming from the backend (BACK_END_BUBBLE.ALL). This mostly rules out bad instruction scheduling (i.e. the processor not having enough to do) and probably shows why PGO didn't make much difference earlier.
By poking into the backend pipeline a bit more, we can see that the overhead really is coming from increased TLB activity.
-- gcc --------------------------------------------------
PLM Event Name                         U..K  TH       Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR         x___   0   100968826
BE_L1D_FPU_BUBBLE.L1D_HPW              x___   0        8937
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS         x___   0        1358
BE_L1D_FPU_BUBBLE.L1D_TLB              x___   0     6354270
---------------------------------------------------------

-- icc --------------------------------------------------
PLM Event Name                         U..K  TH       Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR         x___   0   431700491
BE_L1D_FPU_BUBBLE.L1D_HPW              x___   0   181649893
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS         x___   0        3207
BE_L1D_FPU_BUBBLE.L1D_TLB              x___   0    33004161
---------------------------------------------------------
Here we have confirmation of what appears to be going on. The time spent waiting for the hardware page table walker is orders of magnitude greater with ICC.
On Itanium, there is a cache of page table entries kept in main memory, acting as a third level of TLB. On a data access the small first-level TLB is checked first, then the larger and slower second level, and finally the hardware goes out to main memory and checks the "virtually hashed page table". Obviously this third level, being in system memory and further down the memory hierarchy, is much slower.
The lfetch instruction is asking to bring in a cache line for a bogus, far-out virtual address. That requires a TLB mapping, which obviously doesn't exist, so the normal TLB miss path takes over, eventually going to main memory for the hashed page table check. The hashed page table page itself also has to be pulled into the cache, which causes even more cache misses.
At that point the processor can't find a mapping and gives up; only the more aggressive lfetch.fault variant would push the fault through to the operating system (which then walks the OS page tables).
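For reference, the prefetch-ahead pattern the compiler is generating can be written by hand with GCC's __builtin_prefetch, which on Itanium should end up as an lfetch. This is a sketch only: the loop, the prefetch distance and the data are all invented, but it shows how prefetches near the end of an array (or, in the ICC case, computed from a completely bogus base) point at addresses with no TLB mapping at all.

/* Illustrative sketch only: roughly the prefetch-ahead pattern the
 * compiler inserts on its own.  The prefetch distance (64) is made up.
 * Near the end of the loop the prefetch address runs past the data;
 * a plain lfetch for an unmapped address is silently dropped rather
 * than faulting to the OS, but it still costs a trip through the
 * TLBs and the in-memory hashed page table first. */
#include <stdio.h>
#include <stdlib.h>

static long sum_with_prefetch(const int *a, size_t len)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        /* Hint: start pulling a[i + 64] towards the L1 data cache. */
        __builtin_prefetch(&a[i + 64], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];
    }
    return sum;
}

int main(void)
{
    size_t len = 1 << 20;
    int *a = calloc(len, sizeof(*a));

    printf("%ld\n", sum_with_prefetch(a, len));
    free(a);
    return 0;
}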
As a final step, let's recompile and run the Enigma test with -no-prefetch.
real    3m39.243s
user    3m33.648s
sys     0m0.176s
Still about 9 seconds slower than gcc, but close enough for me! I still think that with this code, ICC is being too smart.
So, moral of the story? Firstly, check that your code actually does run faster with another compiler (I only started testing on the assumption I would find out how much faster ICC was). Secondly, if your code is running slowly, check for aggressive prefetching of cache lines; it might be doing more harm than good by wasting time looking up TLB entries that never exist. Thirdly, Caliper (and the underlying perfmon and Itanium PMU infrastructure) is really cool, and next time someone says Itanium is crap, ask them how they would track a problem like this down on the processor du jour!