Crikey Cleaner

I quite enjoy reading the thinking man's New Idea that is Crikey. I do not, however, enjoy the HTML of their daily email; for a new-media company it is very 1996. For example, every paragraph repeats inline the styling that should be declared once in a style-sheet (although there is a weird mix of using CSS and not), multiple breaks are used to add space, and links are hidden behind a nasty redirector that gives you absolutely no clue where you're going to end up. You can choose plain text, but it comes so mangled, with random Unicode characters thrown in, that it's more hassle than it's worth. Should you try to run it through the W3C validator, the result is a sea of red; not that the validator really knows what to do, because there's no DOCTYPE declaration. Here's a particularly bad sample:

<p style="font-size: 14px; font-family: Arial, Helvetica, sans-serif"> <i class="credit">Sophie Black writes:</i> <br> <br> <p style="font-size: 14px; font-family: Arial, Helvetica, sans-serif"> Although the AFP is reporting that the <a href="http://tracking.crikey.com.au/LinkRedirector.aspx?clid=3976b864-c145-409c-bfd6-6ffab8303e50&rid=2ad7a942-b9db-4727-8ab3-078bdec50e97" target="_blank">noon deadline</a> ... </p>

As given, the bad HTML and the huge advertisements in big fixed-size tables make it all but impossible to read on a Palm Pilot screen, which is my preferred medium for the bus. Thus I wrote clean-crikey.py to strip out all the crap and leave you with something reasonable (and, as a bonus, save you anywhere up to 40 KiB).
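To give a flavour of the sort of clean-up involved, here is a minimal sketch. It is not the actual clean-crikey.py; the function name and the regular expressions are assumptions made up for illustration, and it only handles the inline styles, `target` attributes and stacked breaks seen in the sample above. The redirector links are left alone, since untangling them would mean asking the tracking server where they go.

```python
import re

# Hypothetical patterns for illustration only; the real script may differ.
STYLE_ATTR = re.compile(r'\s+style="[^"]*"', re.IGNORECASE)
TARGET_ATTR = re.compile(r'\s+target="[^"]*"', re.IGNORECASE)
BR_RUN = re.compile(r'(?:<br\s*/?>\s*){2,}', re.IGNORECASE)


def clean_fragment(html):
    """Drop inline style/target attributes and collapse runs of <br>."""
    html = STYLE_ATTR.sub('', html)    # styling belongs in a style-sheet
    html = TARGET_ATTR.sub('', html)   # no need to force new windows
    html = BR_RUN.sub('<br>\n', html)  # two or more breaks become one
    return html


sample = ('<p style="font-size: 14px; font-family: Arial, Helvetica, sans-serif"> '
          '<i class="credit">Sophie Black writes:</i> <br> <br> ')
cleaned = clean_fragment(sample)
print(cleaned)
```

Regular expressions are a blunt tool for HTML, but when the input does not even validate, a forgiving text-level pass is arguably a reasonable starting point.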

What I now do is run the daily email through this and rsync the result to a private location, which I subscribe to via an AvantGo user-created channel. Then I just sync using the built-in wireless of the T|X and I'm done. But crikey, it shouldn't be that hard! Just send valid HTML with a range of sensible style-sheets...

Update: in response to a letter requesting a way to remove all articles by Christian Kerr published in today's Crikey, I have updated the script to strip articles with a particular byline if -b is passed.
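The byline filter could be sketched roughly as follows. This is a guess at the idea rather than the script's real code: it assumes the email has already been split into one chunk of HTML per article, that each article carries a `<i class="credit">Name writes:</i>` line like the sample above, and that `-b` takes the name to drop as its argument. The names `strip_byline` and `parse_args` are hypothetical.

```python
import argparse
import re

# Matches the credit line seen in the sample, e.g. "Sophie Black writes:".
CREDIT = re.compile(r'<i class="credit">\s*(.*?)\s+writes:\s*</i>',
                    re.IGNORECASE | re.DOTALL)


def strip_byline(articles, byline):
    """Return the articles whose credit line does not name `byline`."""
    wanted = byline.strip().lower()
    kept = []
    for article in articles:
        match = CREDIT.search(article)
        if match and match.group(1).strip().lower() == wanted:
            continue  # skip articles by the unwanted author
        kept.append(article)
    return kept


def parse_args(argv):
    # Assumption: -b is followed by the byline to remove.
    parser = argparse.ArgumentParser(description="Byline filter sketch")
    parser.add_argument("-b", dest="byline", default=None,
                        help="drop articles credited to this name")
    return parser.parse_args(argv)


articles = [
    '<p><i class="credit">Sophie Black writes:</i><br>First article</p>',
    '<p><i class="credit">Christian Kerr writes:</i><br>Second article</p>',
]
args = parse_args(["-b", "Christian Kerr"])
remaining = strip_byline(articles, args.byline) if args.byline else articles
print(len(remaining))
```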