Spam cleaning an mbox file

The problem is that I have a large mbox file full of messages which I want to strip spam from with spamassassin. Ordering of the messages is important (i.e. they must go in and come out the other end in order, with identified spam removed).

My solution is to run the following script from formail thusly

$ formail < in-mbox -s /home/ianw/spam-clean.py spam-free-box

where spam-clean.py is

#!/usr/bin/python
import os, sys, popen2

if len(sys.argv) != 2:
    print "usage: spam-clean.py output"
    sys.exit(1)

#read in message from stdin.
message = sys.stdin.read()

sa = popen2.Popen3("/usr/bin/spamc -c")
sa.tochild.write(message)
sa.tochild.close()

if sa.wait() != 0:
    print "discarding spam ..."
    sys.exit(1)
else:
    print "ok ..."
    f = open(sys.argv[1], 'a')
    f.write(message)
    f.close()

It's slow, but a little profiling shows me that most of the time is spent asleep (i.e. in the wait() call). One neat project would be to daemonise this and take messages in a fifo manner so we could run a few spamassassins at the same time but maintain message order.
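One way to get most of the way there without a full daemon is a small thread pool: since nearly all the time goes on waiting for spamc, several checks can be in flight at once and the verdicts can still be collected in input order. What follows is only a sketch, in Python 3 rather than the Python 2 of the script above. The names filter_in_order, check and workers are made up for illustration, and it assumes spamc -c keeps signalling spam through a non-zero exit status.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def is_spam(raw_message: bytes) -> bool:
    """Ask spamc for a verdict; 'spamc -c' exits non-zero when it sees spam."""
    result = subprocess.run(
        ["/usr/bin/spamc", "-c"],
        input=raw_message,
        stdout=subprocess.DEVNULL,  # the score line is not needed here
    )
    return result.returncode != 0


def filter_in_order(messages, check=is_spam, workers=4):
    """Run the checks concurrently, return non-spam messages in input order."""
    messages = list(messages)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # no matter which check happens to finish first.
        verdicts = list(pool.map(check, messages))
    return [msg for msg, spam in zip(messages, verdicts) if not spam]
```

Splitting the mbox into messages and appending the survivors to the output box is left out; the point is only that the pool hands verdicts back in submission order, so the ordering requirement survives the concurrency.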

Hey, you can see my house from here!

Hey, you can see my house from here! (at least until Google changes something).

update: it only took them a day to break it, but I did notice they embed a "Powered by Google" icon now and a link to the terms of use. The terms of use on the photo imagery seem to rule out a nice API being on the cards, but one would certainly open the door to some really cool stuff.

update 2: Egg on my face!

But in the terms and conditions you get if you click on the "Terms and conditions" link displayed on your map, it still says

The photographic imagery made available for display through Google maps is provided under a nonexclusive, non-transferable license for use only by you. You may not use the imagery in any commercial or business environment or for any commercial or business purposes for yourself or any third parties.

However, the API terms and conditions say

1.2 Photographic Imagery. The Google map images accessible to you through the Service may contain photographic imagery. Your use of this photographic imagery is limited to displaying it to end users within the Service itself, and in the same manner, form, format, and appearance as it is provided by the Service. You may not, nor may you allow others to, copy, distribute, display, alter, or otherwise use, this photographic imagery except as it is provided to you through the Service. Google reserves the sole right and discretion to determine whether your display of photgraphic images through the Service is in conformance with this Section, and also reserves the right to terminate or suspend your access to photographic imagery at any time for any reason, without notice.

Obviously the intent is that you can embed a map in your page but don't fiddle with it too much. But it would be nice if the terms and conditions were cleared up.

Zeller's Congruence

Here's one for your junkcode if you haven't already come across it (maybe I'm the only one). Zeller's Congruence (or rule, or algorithm, or ...) allows you to find the day of the week for any given date. Most people would probably use mktime(), but it recently came up on a glibc list where a guy was doing millions of calls; it can get pretty slow.

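If you're keen, there is an explanation of how to derive (one version of) it. The most trouble-free version I could find is below; it takes the date packed into an integer as YYYYMMDD and returns 0 for Sunday through to 6 for Saturday.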

#include <stdio.h>

/* Zeller's Congruence to determine day of week */
int get_dayofweek(int date)
{

 int d = date % 100;
 int m = (date % 10000) / 100;
 int y = date / 10000;

 if (m < 3)
 {
     m = m + 12 ;
     y = y - 1;
 }

 return ((2 + d + (13*m-2)/5 + y + y/4 - y/100 + y/400) % 7);

}

int main(void)
{

    /* my birthday */
    int bday = get_dayofweek(19800110);
    const char *d = "unknown";  /* always overwritten: the result is 0..6 */

    switch(bday)
    {
    case 0:
        d = "Sunday";
        break;
    case 1:
        d = "Monday";
        break;
    case 2:
        d = "Tuesday";
        break;
    case 3:
        d = "Wednesday";
        break;
    case 4:
        d = "Thursday";
        break;
    case 5:
        d = "Friday";
        break;
    case 6:
        d = "Saturday";
        break;
    }

    printf("%s\n", d);
    return 0;
}

So it looks like I was born on a Thursday. Cool!
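As a quick sanity check, the same arithmetic can be written in Python and compared against the standard library. This is only a sketch: day_of_week and agrees_with_datetime are made-up names, the function takes the year, month and day separately rather than as a packed integer, and it assumes Gregorian dates.

```python
import datetime


def day_of_week(year, month, day):
    """Return 0 for Sunday through 6 for Saturday."""
    # January and February count as months 13 and 14 of the previous year.
    if month < 3:
        month += 12
        year -= 1
    return (2 + day + (13 * month - 2) // 5
            + year + year // 4 - year // 100 + year // 400) % 7


def agrees_with_datetime(year, month, day):
    """Compare the congruence with the standard library's answer."""
    # isoweekday() is 1 for Monday ... 7 for Sunday, so % 7 maps Sunday to 0.
    expected = datetime.date(year, month, day).isoweekday() % 7
    return day_of_week(year, month, day) == expected


print(day_of_week(1980, 1, 10))  # 4, i.e. Thursday
```

Integer division (//) matters here; it plays the part of C's truncating / in the version above.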

Super Co-Contributions

The Australian government has a scheme where they will contribute up to $1.50 for every dollar you save in superannuation.

The maximum amount they will match goes down by 5c for every $1 you earn over $28,000, starting from a cap of $1500. For example, on $40,000 the most the government will put in is 1500 - (12,000 x 0.05) = $900, which takes $600 of your own money to earn. The official calculator annoyingly doesn't tell you the minimum amount you should invest to get the maximum return (the brochure shows it with less granularity). It's hardly rocket science, but the Python program below should do that for you.

#figure out super co-contributions
import sys

def best_contribution(income):
    greatest_cocontribution = 1500 - ((income - 28000) * 0.05)
    print "$%7d | %5d | %5d " % (income,
                                 greatest_cocontribution / 1.5,
                                 greatest_cocontribution)

def header():
    print " %7s | %5s | %5s " % ("income", "you", "govt")
    print "---------+-------+-------"


if __name__ == '__main__':
    try:
        if sys.argv[1] == 'all':
            incomes = range(28000, 59000, 1000)
            header()
            for i in incomes:
                best_contribution(int(i))
        else:
            income = int(sys.argv[1])
            if (income < 28000 or income > 58000):
                print "income out of range"
                sys.exit(0)
            header()
            best_contribution(income)
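    except (IndexError, ValueError):
        # missing or non-numeric argument; a bare except would also catch sys.exit()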
        print "super.py [income|all]"

To give an idea, here is the output for $1000 increments

  income |   you |  govt
---------+-------+-------
$  28000 |  1000 |  1500
$  29000 |   966 |  1450
$  30000 |   933 |  1400
$  31000 |   900 |  1350
$  32000 |   866 |  1300
$  33000 |   833 |  1250
$  34000 |   800 |  1200
$  35000 |   766 |  1150
$  36000 |   733 |  1100
$  37000 |   700 |  1050
$  38000 |   666 |  1000
$  39000 |   633 |   950
$  40000 |   600 |   900
$  41000 |   566 |   850
$  42000 |   533 |   800
$  43000 |   500 |   750
$  44000 |   466 |   700
$  45000 |   433 |   650
$  46000 |   400 |   600
$  47000 |   366 |   550
$  48000 |   333 |   500
$  49000 |   300 |   450
$  50000 |   266 |   400
$  51000 |   233 |   350
$  52000 |   200 |   300
$  53000 |   166 |   250
$  54000 |   133 |   200
$  55000 |   100 |   150
$  56000 |    66 |   100
$  57000 |    33 |    50
$  58000 |     0 |     0

Bouncing via mutt

I seem to get messages like "Unable to deliver message to the following recipients, because the message was forwarded more than the maximum allowed times" when I try to bounce messages via mutt. This is because by the time an email to ianw@ieee.org gets to my inbox it has bounced around a bunch of places like the IEEE and UNSW.

What I would really like is a filter-and-bounce function in mutt to filter out the headers. This involves minimal work from me (as opposed to, say, resending the message). mutt doesn't have this so I have hacked my own.

Firstly, in .muttrc add

set pipe_decode=no
macro index "B" "<pipe-message>python2.3 /home/bin/bounce.py<enter>"

The pipe_decode setting is important, since otherwise MIME messages will get scrambled on the bounce and attachments won't come through.

Then just do the bounce with this python script

SENDMAIL = "/usr/sbin/sendmail"
FROMMAIL = "ianw@ieee.org"
SMTPHOST = "mail.internode.on.net"

import email, sys, smtplib, os

m = email.message_from_file(sys.stdin)
del m['received']

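if len(sys.argv) == 2:
        # use a name that doesn't shadow the email module imported above
        recipient = sys.argv[1]
        print "Bouncing to %s" % recipient
else:
        # stdin is the pipe from mutt, so re-open the terminal to prompt
        newstdin = os.open("/dev/tty", os.O_RDONLY)
        os.dup2(newstdin, 0)

        print "Email to send to :",
        sys.stdout.flush()
        # readline() keeps the trailing newline, so strip it
        recipient = sys.stdin.readline().strip()

server = smtplib.SMTP(SMTPHOST)
server.sendmail(FROMMAIL, recipient, m.as_string())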
server.quit()

The only tricky bit is having to re-open stdin because mutt sets up a pipe between it and the pipe-message process. You can add your own X-Resent-From type headers if you want.
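For instance, stripping the Received trail and stamping a header of your own might look something like the sketch below. It is Python 3 rather than the Python 2 of the script above, prepare_bounce is a hypothetical helper, and X-Resent-From is just the illustrative header name from the text, not a standard one.

```python
import email


def prepare_bounce(raw_text, resent_from):
    """Parse a message, drop its Received trail and note who re-sent it."""
    msg = email.message_from_string(raw_text)
    # Deleting by name removes every header with that name, and does
    # nothing if there is none.
    del msg["Received"]
    msg["X-Resent-From"] = resent_from
    return msg
```

The returned object can be handed to smtplib in the same way as before, via its as_string() method.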

Death to trailing whitespace

Whitespace at the end of lines is fairly annoying, and wastes space. It also generally annoys other developers if you send patches that introduce trailing whitespace.

Luckily, emacs can show you with big red blocks where you've left whitespace behind. I think everyone (who hasn't already :) should add something like

(mapc (lambda (hook)
      (add-hook hook (lambda ()
          (setq show-trailing-whitespace t))))
      '(text-mode-hook
      c-mode-hook
      emacs-lisp-mode-hook
      java-mode-hook
      python-mode-hook
      shell-script-mode-hook))

to their .emacs file right now.
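If you'd rather check a file outside the editor before sending it off, a few lines of Python will do. A rough sketch (Python 3; trailing_whitespace_lines is a name made up here) that reports which lines end in spaces or tabs:

```python
import re


def trailing_whitespace_lines(text):
    """Return the 1-based numbers of lines that end in spaces or tabs."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if re.search(r"[ \t]+$", line)]


print(trailing_whitespace_lines("ok\nbad \n\tx\t\nfine"))  # [2, 3]
```

It is no use on a unified diff as-is, since an unchanged empty line shows up there as a single space.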

Finding the parent function in emacs

Have you ever been in the middle of a really long function and wondered just exactly what it was called? Angus Lees came up with

(defun print-defun-name ()
  (interactive)
  (save-excursion
    (beginning-of-defun)
    (beginning-of-line 0)
    (let ((start (point))
          string)
      (end-of-line 2)
      (setq string (buffer-substring start (point)))
      (message "%s" (replace-regexp-in-string "[ \t\n]" " " string)))))

I came up with a slightly different version that works a little better for C code

(defun c-print-defun-name ()
  (interactive)
  (save-excursion
    (c-beginning-of-defun)
    (end-of-line 0)
    (let ((end (point)))
      (c-beginning-of-statement)
      (let* ((start (point))
             (string (buffer-substring start end)))
        (message "%s" (replace-regexp-in-string "[ \t\n]+" " " string))))))

Add it to your .emacs and if you really want, bind it to a key.

Python gallery generator

It's so hard to find a good gallery generator these days :) All I wanted was something really simple that could show a few photos in a slide-show style interface. It should use static HTML and leave the important configuration up to a style sheet.

All the ones on the market seemed to be a bit of overkill, so I wrote my own. Of course, a sample is worth a thousand words.