A discussion about
commas
led to an excuse to have a play with the IA-32 processor performance
monitoring unit (PMU). To start, take two versions of a program to count
the number of commas in a text file, one in C and one in Python. The C
one runs faster on the input data set of ~60MiB of random data, but why?
The CPU performance monitors are the key to getting some idea of
where the programs spend their time. I like to use
perfmon2 because it's what I know, but
Oprofile can do it too. All the available
events for IA-32 are described in the
manual; I
currently know of no better way of finding out about them than just
reading it. On Itanium I recommend Caliper,
which, for the most common situations, does most of the work for you and
presents it in a nice report. Intel's
VTune
does something similar.
The first thing to investigate is whether the CPU is getting enough
instructions to keep busy. The IFU_MEM_STALL metric is a good place
to start, as it is triggered when the instruction fetch pipeline is
stalled, presumably waiting on either the ITLB or the trace buffer
(Icache).
$ pfmon -e CPU_CLK_UNHALTED,IFU_MEM_STALL ./comma < ./randomcommas
340559375 CPU_CLK_UNHALTED
192115 IFU_MEM_STALL
$ pfmon -e CPU_CLK_UNHALTED,IFU_MEM_STALL python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
1287100
4571257047 CPU_CLK_UNHALTED
71981750 IFU_MEM_STALL
That works out to 0.05% of total cycles for the C version and 1.5% for
the Python version, neither of which sets off immediate warning bells.
If it did, we could start drilling down into events like L2_IFETCH
and ITLB_MISS, or the BR_* branch events, to try to see
why the CPU is having to wait to get its next instruction.
Next it is useful to find the IPC (instructions per cycle). This is
calculated as the ratio of retired instructions to the number of
CPU cycles; since a superscalar machine can issue more than one
instruction per cycle, this should ideally be greater than 1 (for
example, an Itanium can execute up to 6 instructions each cycle).
$ pfmon -e INST_RETIRED,CPU_CLK_UNHALTED ./comma < ./randomcommas
542953593 INST_RETIRED
340612036 CPU_CLK_UNHALTED
$ pfmon -e INST_RETIRED,CPU_CLK_UNHALTED python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
1194455205 INST_RETIRED
4569931735 CPU_CLK_UNHALTED
This works out to an IPC of 1.59 for the C version and 0.26 for the
Python version. The Python version is clearly spending a lot of time
waiting, because on average it isn't even able to issue one instruction
every cycle.
At this point it seems the CPU has enough instructions to do, but it is
sitting around waiting to get through those instructions. This suggests
the waiting is related to getting data from the cache.
The load and store requests that reach the L2 cache are counted by the
L2_LD and L2_ST events respectively. These events can mask
cache lines by their state in the MESI protocol, but for
this we don't care, so we just ask pfmon to show us everything.
$ pfmon -e L2_LD:M:E:S:I,L2_ST:M:E:S:I ./comma < randomcommas
102505 L2_LD:M:E:S:I
167 L2_ST:M:E:S:I
$ pfmon -e L2_LD:M:E:S:I,L2_ST:M:E:S:I python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
3278774 L2_LD:M:E:S:I
10457 L2_ST:M:E:S:I
This shows us that the Python version does quite a few more stores than
the C counterpart. Considering this program should simply be reading the
input stream and counting the number of commas, we do not expect much
store traffic at all. This suggests the Python version is doing some
extra copying, for whatever reason (maybe some Python expert can
pinpoint it?).
We can drill down a bit more into the memory related latencies. The
DCU_LINES_IN event gives the total number of lines allocated in the
cache. Another event, DCU_MISS_OUTSTANDING, gives a weighted measure
of the cycles spent waiting for a cache line to be brought in. Each
cycle spent waiting is weighted by the number of outstanding cache
misses (I think the Pentium M I'm using can have up to 4 cache miss
requests outstanding at once) and has some caveats, but can be
considered a rough estimate of the time spent waiting for a cache line
to be brought in. Therefore dividing DCU_MISS_OUTSTANDING by
DCU_LINES_IN gives us an approximate metric of how long a cache miss
takes.
$ pfmon -e DCU_MISS_OUTSTANDING,DCU_LINES_IN ./comma < randomcommas
769736 DCU_MISS_OUTSTANDING
102387 DCU_LINES_IN
$ pfmon -e DCU_MISS_OUTSTANDING,DCU_LINES_IN python -Sc "import sys; print sum(l.count(',') for l in sys.stdin)" < ./randomcommas
99810150 DCU_MISS_OUTSTANDING
4240179 DCU_LINES_IN
So that works out to 7.5 cycles for the C version and 23 cycles for the
Python version, which strongly suggests that memory traffic is what is
weighing the Python version down.
That is only the initial phase; the results give a high-level idea of
why one program is running slower than the other. The initial analysis
generally consists of taking the ratios of certain events to try
to build up a picture of what the program is doing. Then comes
the really hard work: drilling down to figure out how to fix it!
Some useful references: