Chronology of a waste of time

First, vendors start releasing things with SD card readers, and people quickly realise Linux can't use them. Then several very talented hackers start a project <http://mmc.drzeus.cx/wiki/Linux/Drivers/sdhci> to reverse engineer SDHCI controllers, since the SD Card Association won't release specs publicly (apparently citing "trade secrets").

Talented hackers get things working so well that it makes it into 2.6.17-rc1 <http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;h=6246b6128bbe34d0752f119cf7c5111c85fe481d;f=drivers/mmc/sdhci.c>.

After all this is done, the SD Card Association, seemingly without any prior communication, releases a simplified spec <http://www.sdcard.org/HostController/index.html>, now that the community has figured it out.

Of course, the spec acts as a good check that what was reverse engineered is correct, and reverse engineering things can certainly be a lot of fun. But really, what a waste of resources thanks to a backwards vendor. What exactly has the SD Card Association gained, apart from ill will?

JavaScript Captcha

Comment spam recently started annoying me again. Even though it only takes a few clicks to kill, some crappy robot is still taking up my time. Archreality has a nice little JavaScript-based Captcha which has the potential to be completely useless, as you can see if you click here.

However, I think I've managed to plug it into PyBlosxom and have it be useful. The trick is to have the form submission write a password into a hidden field once the Captcha test has been passed.

So, you set up your form like this:

<form action="$base_url/$file_path.html" method="post"
name="comments_form" id="comments_form"
onsubmit="try {
           var myValidator = validate_comments_form;
          } catch(e) { return true; }
         return myValidator(this);
         "
>
 ... blah ...
 <input type="hidden" name="secret" value="" />
 ... blah ...
</form>

Then in validate_comments_form do something like:

function validate_comments_form(frm) {

  // jcap() runs the Captcha check; it returns true if the test was passed
  var captcha = jcap();
  if (captcha) {
        frm.elements['secret'].value = "you_may_pass";
  }
  return captcha;
}

Then, finally, modify comments.py from PyBlosxom to check for this value in cb_prepare:

def cb_prepare(args):
    # ... somewhere near the top ...
    # captcha: bail out unless the hidden 'secret' field was filled in
    if not form.has_key('secret'):
        return False
    if form['secret'].value != 'you_may_pass':
        return False

Now, if the spammer is human they will pass anyway, so we don't have to worry about that. Any bot stupidly submitting the form won't have filled out the secret field, so it will get dropped. If the bot somehow interprets the JavaScript, then it needs enough logic to parse the code and realise it must put in that secret value for anything to happen. If your bot can do that, then I'm willing to let your spam remain.

I don't like the fact that these things lock out people without great eyesight. I don't see why an alt tag saying "Put the value blah into the next box" wouldn't be sufficient, but currently the image is written in with JavaScript, so I don't know how that would work with a screen reader. On the todo list...

xfig and colours

User-defined colours in xfig are per file. To get a decent selection of colours, a good way is to have the xfig-libs package installed and import one of the complex example pictures from the library; this way you get all of its colours. But be careful: these colours can make it into exported PostScript and cause your interpreter to really crawl.

This tip courtesy of Charles Gray, who is currently in the process of obtaining a PhD in xfigology.

Unrolling and Duff's Device

Loop unrolling can be a fascinating art.

The canonical example of loop unrolling is something like:

for (i = 0; i < 100; i++)
    x[i] = i;

becoming

for (i = 0; i < 100; i += 4) {
    x[i] = i;
    x[i+1] = i+1;
    x[i+2] = i+2;
    x[i+3] = i+3;
}

Longer runs of instructions give you a better chance of finding instruction-level parallelism; that is, instructions that can execute together. You can keep the pipeline fuller and hence run faster, at the cost of some expanded code size.

But unlike that example, you generally do not know where your loop ends. In the above example, we made four copies of the loop body, reducing the number of iterations by a factor of four. For an indeterminate number of iterations n, we would need to do the original loop n % 4 times and then the unrolled body floor(n / 4) times. So to spell it out: if we want to do the loop 7 times, we execute the original loop 3 times and the unrolled portion once.

Our new code might look something like:

int extras = n % 4;
for (i = 0; i < extras; i++)
    x[i] = i;

for ( ; i < n; i += 4) {
    x[i] = i;
    x[i+1] = i+1;
    x[i+2] = i+2;
    x[i+3] = i+3;
}

That leads to a little trick called "Duff's Device", references to which can be found almost anywhere (however, I found most of them a little unclear, hence this post).

Duff's device looks like this:

send(to, from, count)
register short *to, *from;
register count;
{
        register n=(count+7)/8;

        switch(count%8) {
        case 0: do {    *to = *from++;
        case 7:         *to = *from++;
        case 6:         *to = *from++;
        case 5:         *to = *from++;
        case 4:         *to = *from++;
        case 3:         *to = *from++;
        case 2:         *to = *from++;
        case 1:         *to = *from++;
                } while(--n>0);
        }
}

Now, that doesn't look like loop unrolling, but it really is. In this case, we have unwound a loop that simply does *to = *from++ eight times. But we still have the leftover-iterations problem, since count is indeterminate.
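
For reference, the straightforward loop being unrolled would be something like this (reconstructed here in the same old-style C; count is assumed to be greater than zero):

send(to, from, count)
register short *to, *from;
register count;
{
        do {
                *to = *from++;
        } while (--count > 0);
}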

To understand it, let's work through what happens for possible inputs. For values of count from 1 to 7, n is one and count % 8 is just count, so we jump straight to the matching case label; the cases fall through to each other (there are no breaks) and the loop never repeats. The nifty bit comes when count is greater than 7. In that case the first pass through the loop body starts part-way in: we fall through from the matching case label and do the extra count % 8 copies. After that, the do-while loops back to the start of the unrolled body and does the full 8-copy pass as many times as we require. For example, with count = 10, n is 2 and count % 8 is 2, so the switch jumps to case 2, two copies happen, and one further full pass of eight copies brings the total to 10.

Now, by post-incrementing *to as well, you can use this to implement memcpy(). An interesting exercise might be to see how much slower this is than the highly optimised memcpy() that comes with glibc.
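
A sketch of that variant (not from the original post; like Duff's code it copies shorts, and count must again be greater than zero):

void duff_copy(short *to, short *from, int count)
{
        int n = (count + 7) / 8;

        switch (count % 8) {
        case 0: do {    *to++ = *from++;
        case 7:         *to++ = *from++;
        case 6:         *to++ = *from++;
        case 5:         *to++ = *from++;
        case 4:         *to++ = *from++;
        case 3:         *to++ = *from++;
        case 2:         *to++ = *from++;
        case 1:         *to++ = *from++;
                } while (--n > 0);
        }
}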

The code is most certainly ingenious, and this page explains some of the history of it. Today I think it is mostly good as an exercise in "why it works".

If you're wondering how gcc handles unrolling: it creates the unrolled loop and then does a sort of fall-through test which works out how many extra iterations are required and branches into the unrolled portion at the correct point. A little extract is below; you can see r14 holds the number of extra iterations required, and the code branches into the unrolled body at an earlier point the more extra iterations are needed.

...
  70: [MFB]       cmp4.eq p6,p7=0,r14
  76:             nop.f 0x0
  7c:       (p06) br.cond.dpnt.few 150 ;;
  80: [MFB]       cmp4.eq p6,p7=1,r14
  86:             nop.f 0x0
  8c:       (p06) br.cond.dpnt.few 130 ;;
  90: [MFB]       cmp4.eq p6,p7=2,r14
  96:             nop.f 0x0
  9c:       (p06) br.cond.dpnt.few 120 ;;
  a0: [MFB]       cmp4.eq p6,p7=3,r14
  a6:             nop.f 0x0
  ac:       (p06) br.cond.dpnt.few 110 ;;
  b0: [MFB]       cmp4.eq p6,p7=4,r14
  b6:             nop.f 0x0
  bc:       (p06) br.cond.dpnt.few 100 ;;
  c0: [MFB]       cmp4.eq p6,p7=5,r14
  c6:             nop.f 0x0
  cc:       (p06) br.cond.dpnt.few f0 ;;
  d0: [MMI]       cmp4.eq p6,p7=6,r14;;
  d6:       (p07) st4 [r16]=r17,4
  dc:             nop.i 0x0
...

GCC beats the Intel compiler at cracking Enigma?

I read about attempts to crack old Enigma messages on Bruce Schneier's Blog (then I think it hit Slashdot).

I've always liked Enigma since I read The Code Book, and I instantly grabbed the binary and put it on my local IA64 box, a 900MHz Itanium 2.

Usually for this sort of thing the Intel compiler produces much better code than GCC, so I thought it might be fun to see how much better! Jeff Gilchrist gave me the incantation to do a little benchmark, and the results surprised me:

$ time ./enigma -c -f B:524:AZ:AAA -t B:524:AZ:ZZZ dict/00trigr.gen dict/00bigr.gen ct/orig1
gcc -O3

real    3m30.219s
user    3m25.492s
sys     0m0.256s

icc -O3

real    5m25.663s
user    5m18.900s
sys     0m0.252s

That seems to be the wrong way around! The first step seemed to be to fiddle with compiler options. -O3 can be a bit aggressive for the Intel compiler, and, not knowing anything about the code base, I wondered if profile-guided optimisation might help (where you run the binary once with profiling code built in, and then recompile so the compiler has more information).

icc -O3 pgo

real    5m24.414s
user    5m18.324s
sys     0m0.264s

icc -O2 pgo

real    5m23.997s
user    5m12.680s
sys     0m0.284s

icc -O2

real    5m18.919s
user    5m12.224s
sys     0m0.212s

Clearly we're not helping here, but -O2 seems to be the place to start. The next step was to use a profiler to see where to look; q-syscollect showed me what was taking up the most time.

icc -O2 profile

% time      self     cumul     calls self/call  tot/call name
 51.87    147.51    147.51     50.2M     2.94u     2.94u triscore
 33.83     96.20    243.72     46.6M     2.07u     2.07u biscore
 10.36     29.47    273.19     13.1M     2.25u     2.27u icscore
  2.75      7.83    281.02         -         -         - hillclimb

gcc -O3 profile

% time      self     cumul     calls self/call  tot/call name
 54.71    102.35    102.35     49.9M     2.05u     2.05u triscore
 26.69     49.93    152.29     46.4M     1.08u     1.08u biscore
 12.69     23.74    176.03     13.1M     1.81u     1.89u icscore
  2.54      4.75    180.78     14.5k      327u      327u init_path_lookup_H_M3
  1.74      3.26    184.04         -         -         - hillclimb

So, triscore seems to be the place to start. Trying to analyse code that runs for 5 minutes doesn't lead to a good debugging experience, so I pulled triscore out so I could look at it separately.

$ time ./triscore-gcc

real    0m1.399s
user    0m1.384s
sys     0m0.004s

$ time ./triscore-icc

real    0m2.346s
user    0m2.316s
sys     0m0.000s

OK, now we have a small test case that runs in a few seconds. The code I used is available here.
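
The extracted test case itself isn't reproduced here, but a standalone harness for a hot function generally looks something like this (a sketch only, not the code from the post; the triscore signature is guessed from the gdb session further down, and the driver loop and data are made up):

/* triscore-harness.c: link against the extracted triscore.o */
#include <stdio.h>
#include <time.h>

/* assumed signature, based on the gdb backtrace below */
extern int triscore(int *stbrett, int *ciphertext, int len);

int main(void)
{
    int stbrett[26], ciphertext[1000];
    int i, total = 0;
    clock_t start, end;

    for (i = 0; i < 26; i++)
        stbrett[i] = i;                 /* identity plugboard */
    for (i = 0; i < 1000; i++)
        ciphertext[i] = i % 26;         /* dummy ciphertext */

    start = clock();
    for (i = 0; i < 1000000; i++)       /* hammer the hot function */
        total += triscore(stbrett, ciphertext, 1000);
    end = clock();

    printf("total=%d time=%.2fs\n", total,
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}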

Time to pull out Caliper and start peeking under the hood.

Caliper dcache output gcc
=========================
L1 data cache miss percentage:
   9.79 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)

Percent of data references accessing L1 data cache:
  99.89 = 100 * (L1D_READS / DATA_REFERENCES)

Caliper dcache output icc
=========================
L1 data cache miss percentage:
  17.63 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)

Percent of data references accessing L1 data cache:
  81.76 = 100 * (L1D_READS / DATA_REFERENCES)

We can see the average latency for the misses:

% Total                                        Avg.
 Dcache  Cumulat        Sampled       Dcache  Dcache
Latency    % of          Dcache      Latency  Laten.
 Cycles    Total         Misses       Cycles  Cycles   Load Module
------------------------------------------------------------------
100.00   100.00            1919        12664     6.6   triscore-icc
100.00   100.00             995         6919     7.0   triscore-gcc

This lets us approximate how much extra time we spend waiting for the cache.

gcc
L1D_READ_MISSES.ALL     x___   0       91013769

icc
L1D_READ_MISSES.ALL     x___   0      166681605

With a cycle taking ~1ns and an average miss latency of ~7 cycles, the extra misses cost roughly (166681605 - 91013769) * 7 * 1ns ≈ 0.5 seconds.

Let's have a look at what the TLB thinks about all this.

icc
===
Percentage of data references covered by L1 and L2 DTLB:
  99.02 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)

Percentage of data references covered by the HPW:
   0.00 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)

Percentage of data references covered by software trap:
   0.98 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)

gcc
===
Percentage of data references covered by L1 and L2 DTLB:
  100.00 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)

Percentage of data references covered by the HPW:
   0.00 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)

Percentage of data references covered by software trap:
   0.00 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)

Software trap? That sounds slow. We can ask Caliper to show us the instructions that are causing the software trap.

Function Details
------------------------------------------------------------------------------------------------------
% Total                                                 %       %       %
Sampled       Sampled      DTLB      DTLB      DTLB   DTLB    DTLB    DTLB          Line|
   DTLB          DTLB       L2        HPW      Soft    L2      HPW    Soft          Slot|  >Statement|
 Misses        Misses     Fills     Fills     Fills   Fill    Fill    Fill     Col,Offset  Instruction
------------------------------------------------------------------------------------------------------
100.00     [test-icc::triscore, 0x40000000000008c0, triscore.c]
                 600       429         0       171    71.5     0.0    28.5              6  Function Totals
         ----------------------------------------------------------------------------------------------
[** cut out other stuff **]
                 144         0         0       144     0.0     0.0   100.0       0x0c70:0    M_      lfetch        [r87] ;;
                   0         0         0         0     0.0     0.0     0.0             :1    M       lfetch        [r67]
                   0         0         0         0     0.0     0.0     0.0             :2    I       nop.i         0

Ah huh! lfetch is an operation to pre-fetch data you know you're about to use into the cache. After discussions with the Caliper people, it turns out this doesn't actually raise a software fault to the operating system, but it does go to memory via the hashed page table walker (more on that later). Let's check out that load.

(gdb) break *(triscore + 0xc70)
Breakpoint 1 at 0x4000000000001530: file triscore.c, line 26.
(gdb) r
Starting program: /home/ianw/programs/enigma-test/extract/enigma-test/triscore-icc

Breakpoint 1, 0x4000000000001530 in triscore (stbrett=0x0, ciphertext=0x0, len=0) at triscore.c:26
26        for (i = 2; i < len-15; i += 16) {
(gdb) print/x $r87
$1 = 0x600010010fe8f9c0
(gdb) print/x *($r87)
Cannot access memory at address 0x600010010fe8f9c0
(gdb) print/x $r67
$2 = 0x6000000000005be8
(gdb) print/x *($r67)
$3 = 0x0

Well, that address in r87 looks really bogus. It seems that ICC is being very aggressive with its prefetching algorithm, to the point that it is getting it wrong. This brings us close to the answer, but let's just check a few other things.

Firstly, these programs are using a lot of local registers. Maybe it is register stack engine traffic; after all, maybe ICC is more aggressive with register allocation.

-- gcc -------------------------------------------
                            PLM
Event Name                 U..K  TH          Count
--------------------------------------------------
BACK_END_BUBBLE.ALL        x___   0      249132805
BE_RSE_BUBBLE.ALL          x___   0           3727
BE_RSE_BUBBLE.OVERFLOW     x___   0           1792
BE_RSE_BUBBLE.UNDERFLOW    x___   0           1935
--------------------------------------------------

-- icc -------------------------------------------
                            PLM
Event Name                 U..K  TH          Count
--------------------------------------------------
BACK_END_BUBBLE.ALL        x___   0      440096824
BE_RSE_BUBBLE.ALL          x___   0           4128
BE_RSE_BUBBLE.OVERFLOW     x___   0           1792
BE_RSE_BUBBLE.UNDERFLOW    x___   0           2336
--------------------------------------------------

Nup; but we can see that the pipeline stalls are coming from the backend (BACK_END_BUBBLE.ALL). This mostly rules out bad instruction scheduling (i.e. the processor not having enough to do) and probably shows why PGO didn't make much difference before.

By poking into the backend pipeline even more, we can see the overhead is really coming from increased TLB activity.

-- gcc --------------------------------------------------
                                   PLM
Event Name                        U..K  TH          Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR    x___   0      100968826
BE_L1D_FPU_BUBBLE.L1D_HPW         x___   0           8937
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS    x___   0           1358
BE_L1D_FPU_BUBBLE.L1D_TLB         x___   0        6354270
---------------------------------------------------------

-- icc --------------------------------------------------
                                   PLM
Event Name                        U..K  TH          Count
---------------------------------------------------------
BE_L1D_FPU_BUBBLE.L1D_DCURECIR    x___   0      431700491
BE_L1D_FPU_BUBBLE.L1D_HPW         x___   0      181649893
BE_L1D_FPU_BUBBLE.L1D_L2BPRESS    x___   0           3207
BE_L1D_FPU_BUBBLE.L1D_TLB         x___   0       33004161
---------------------------------------------------------

Here we have confirmation of what appears to be going on. The time spent waiting for the hardware page table walker (BE_L1D_FPU_BUBBLE.L1D_HPW) is orders of magnitude greater with ICC.

On Itanium, there is a cache of the page tables kept in memory, acting as a third-level TLB. On a miss, first the first-level TLB is checked, then the larger and slower second level, and finally the hardware goes to main memory and checks the "virtually hashed page table". Obviously this third level, being in system memory and lower down the memory hierarchy, is much slower.

The lfetch instruction is asking to bring in a cache line for a bogus, far-out virtual address. This requires a TLB mapping, which obviously doesn't exist, so the normal TLB fault path takes over, eventually going to main memory for the hashed page table check. That hashed page table page also has to be moved into the cache, which causes even more cache misses.

At this point the processor can't find the mapping and gives up; only with a more aggressive lfetch.fault would the fault go through to the operating system (which would then walk the OS page tables).

As a final step, let's recompile and run the Enigma test with -no-prefetch.

real    3m39.243s
user    3m33.648s
sys     0m0.176s

Still about 9 seconds slower than gcc, but close enough for me! I still think that with this code, ICC is being too smart.

So, moral of the story? Firstly, check that your code actually does run faster with another compiler (I only started testing on the assumption that I would find out how much faster ICC was). Secondly, if your code is running slowly, check for aggressive prefetching of cache lines; it might be doing more harm than good by wasting time looking up TLB entries that never exist. Thirdly, Caliper (and the underlying perfmon and Itanium PMU system) is really cool, and next time someone says Itanium is crap, ask them how they would track that problem down on the processor du jour!

moinmoin RSS import macro

RSSReader.py is a macro I wrote to import an RSS feed into MoinMoin. I do realise this defeats the point of a Wiki, in that text should be editable and fluid, but on the other hand it seems logical to keep related things presented within your Wiki, even if they are generated externally. I think this is safer than using XSLT scripts to present the RSS on a public Wiki, since MoinMoin doesn't seem to offer a great deal of flexibility in restricting arbitrary code injection.

This whole idea was conceived and built within a few hours thanks to the Universal Feed Parser. Vive le logiciel libre!

Linux powered ice cream

Now, the next time someone emails me at work asking about machines to make gelato (it happens more often than you would think), I can point them at the MooBella on-demand ice cream maker.

numbers2ppm

I'm not sure of the best way to describe this, but numbers2ppm.py is a Python script that turns a list of numbers into a (plain-format) PPM image filled with coloured boxes. Perhaps an example is easiest.

$ cat test.in
01234567899876543210

$ ./numbers2ppm.py -W 40 -H 40 -c 10 ./test.in > test.ppm

$ convert test.ppm test.png

You should end up with something like this:

[image: numbers2ppm script output]

If you're me, you could use something like this to visualise a dump of reference counts for physical memory frames taken from the kernel, giving a nice graphical view of memory usage and sharing. I imagine it may come in handy for other things too.