An RSS reader

A couple of years ago, I went looking for an RSS reader. For those not familiar with the concept, an RSS reader is a piece of software that maintains a list of blogs, pulls their RSS feeds, and displays a list of articles in them. And, if the blog maintainer puts the full text of his articles in the feed, it lets you read them. A decent RSS reader also remembers which articles you’ve read, and either marks them accordingly or only shows you the new ones. For subscribing to blogs that are only occasionally updated (like this one), an RSS reader is almost a necessity: it does the boring work of repeatedly checking for new articles when there seldom are any.

My criteria were:

  1. Good integration with the web browser. I don’t want to flip back and forth between two different programs, one to read the RSS feed, and another to read the things on the web that it points to. I also want hyperlinks in the RSS feed to have the usual color indicators of whether or not I’ve already read them, which probably won’t work if the RSS reader is a separate program, unless someone makes an extraordinary effort at browser integration. Thus the RSS readers I’d tried previously were Firefox extensions. But I didn’t particularly like any of them, because I wanted:

  2. Something better than the usual three-panel view (one panel for a list of blogs, another for a list of articles in a blog, and a third for the article itself). For one thing, that layout requires a lot of clicking: you typically have to click on each blog, then on each article. I can go through blogs faster than that just using the normal browser features: middle-click on each blog in a list to open it in a new tab, then hit Control-W to close each tab when I’m done with it. Also, the three-panel layout wastes a lot of screen real estate: I’m going to spend 99% of my time reading the articles, yet those appear in only one of the three panels. That’s annoying even on a desktop-sized screen, let alone tablets or phones.

  3. No web-based services. I prefer to be in control.

What I fairly quickly landed on was mtve’s program RSSaggressor. This is a Perl script that takes a list of blogs (a plaintext file with one RSS or Atom feed URL per line), checks each, and spits out a long HTML file containing everything new in every blog. The user then views that HTML file in a web browser. This way, instead of all the mouse clicking one does with a three-panel reader, reading the updates is just a matter of scrolling. (Well, except if the blog chooses not to put the full text in the RSS feed; then it’s back to the ‘middle-click to open each article in a new tab’ procedure.) There are also no browser integration issues, aside from running the program in the first place and then opening its output HTML file in the browser. The original author runs the program as a cron job; I run it whenever I feel like checking what people have to say.
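To make the idea concrete, here is a minimal sketch in Python of the "parse a feed, emit one long HTML page" step. It is not RSSaggressor's code (which is Perl); the function names `parse_rss_items` and `render_page` and the sample feed are made up for illustration, it handles only plain RSS 2.0, and it skips fetching and the "what is new" bookkeeping entirely.

```python
import html
import xml.etree.ElementTree as ET


def parse_rss_items(feed_xml):
    """Return a list of dicts (title, link, author, text) from an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "author": item.findtext("author", default=""),
            "text": item.findtext("description", default=""),
        })
    return items


def render_page(items):
    """Concatenate all items into one long HTML page meant for scrolling."""
    parts = ["<html><body>"]
    for it in items:
        parts.append('<h2><a href="%s">%s</a></h2>'
                     % (html.escape(it["link"], quote=True), html.escape(it["title"])))
        if it["author"]:
            parts.append("<p><i>%s</i></p>" % html.escape(it["author"]))
        # The feed's description is already HTML, so it is inserted as-is.
        parts.append("<div>%s</div>" % it["text"])
    parts.append("</body></html>")
    return "\n".join(parts)


# Hypothetical sample feed, standing in for a downloaded one.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>First post</title><link>http://example.com/1</link>
<author>Alice</author><description>&lt;p&gt;Hello&lt;/p&gt;</description></item>
<item><title>Second post</title><link>http://example.com/2</link>
<description>&lt;p&gt;World&lt;/p&gt;</description></item>
</channel></rss>"""

page = render_page(parse_rss_items(SAMPLE_FEED))
```

Writing `page` to a file and opening that file in the browser is all the "integration" there is.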

I’ve also made several changes, the notable ones of which are:

1. It displays the author of each article, which is useful for multiple-author blogs.

2. It allows you to specify a condition which has to be met for each article to be added to the output HTML file. To use this, after each URL in the list you add a condition, which is distinguished from the URLs by being indented. For instance, to subscribe to the Volokh Conspiracy weblog, but only to posts by Orin Kerr or Eugene Volokh, not the twenty or so other “co-conspirators”, you can write:
        $author eq "Eugene Volokh" or
        $author eq "Orin Kerr"

The syntax for the condition is Perl syntax; the code uses Perl’s eval() function to evaluate it. The accessible variables are $author, $title, $link (a hyperlink to the web version of the post), and $text (its full text, if you’re lucky). And since the condition can be arbitrary Perl code, you can also do other things with it, including changing any or all of those variables. For instance, Twitter had an RSS API, until they decided that offering RSS wasn’t predatory enough, but in it they didn’t publish hyperlinks as hyperlinks, just as plain text. To enable such hyperlinks (or at least nine tenths of them), you can write, for instance (fill in the blank with the screen name of the person to follow):
        $text =~ s((https?://[a-zA-Z0-9_/.+]*[a-zA-Z0-9]))
           (<a href="$1">$1</a>)g ; 1

(If those last two lines look like gobbledygook, welcome to the world of regular expressions.)
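For readers more at home in Python than Perl, the same mechanism can be sketched roughly as follows. This is an illustration, not a port of RSSaggressor: `apply_condition`, `linkify`, and the `keep` variable are names invented here, and where the Perl version uses the value of the evaluated expression (hence the trailing `; 1` above), this sketch has the condition assign to `keep` instead. As with Perl’s eval(), the condition is arbitrary code, so it should only ever come from a file you wrote yourself.

```python
import re

# Same pattern as the Perl substitution above.
URL_RE = re.compile(r"(https?://[a-zA-Z0-9_/.+]*[a-zA-Z0-9])")


def linkify(text):
    """Wrap bare URLs in anchor tags."""
    return URL_RE.sub(r'<a href="\1">\1</a>', text)


def apply_condition(article, condition):
    """Run `condition` (Python source) against one article dict.

    The article's fields are exposed as variables and may be modified.
    The condition sets `keep` to say whether the article is wanted;
    if it never touches `keep`, the article is kept.
    """
    scope = dict(article, keep=True, linkify=linkify)
    exec(condition, scope)  # arbitrary code: trusted input only
    for key in article:
        article[key] = scope[key]
    return bool(scope["keep"])


# Counterparts of the two Perl conditions shown above.
only_two = 'keep = author in ("Eugene Volokh", "Orin Kerr")'
fix_links = "text = linkify(text)"
```

An article dict here has the keys `author`, `title`, `link`, and `text`, mirroring the Perl variables.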

This version of RSSaggressor, like the original, can be found on Github. If all my modifications seem to date from the last few days, that’s because I was using an older version of the program previously, and ported my changes to this newer version.

One odd thing about RSSaggressor is that it remembers whether you’ve read articles by storing their MD5 checksum. Thus if the author goes back and edits the article, you get to see it again in its entirety. This has good aspects (you get to see updates to articles you’ve already read) and bad ones (you get to see the whole article again just because the author corrected a typo). I’d prefer to see some sort of diff between the old and new articles, but haven’t found a good diff library that operates on HTML and plays well with Perl. (Suggestions are welcome; patches or pull requests are even more welcome.)
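The behaviour described above falls straight out of how checksums work, as this small Python sketch shows. The names `digest`, `is_new`, and `text_diff` are invented for the example, and the diff uses the standard library’s line-based `difflib`, which knows nothing about HTML; it is only a rough stand-in for the HTML-aware diff wished for above.

```python
import difflib
import hashlib


def digest(article_html):
    """MD5 hex digest of the article text, used only as a fingerprint."""
    return hashlib.md5(article_html.encode("utf-8")).hexdigest()


def is_new(article_html, seen):
    """Return True (and remember the digest) if this exact text is unseen."""
    d = digest(article_html)
    if d in seen:
        return False
    seen.add(d)
    return True


def text_diff(old, new):
    """Line-based unified diff of two versions of an article."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(), "old", "new", lineterm=""))


seen = set()
v1 = "<p>Teh cat</p>\n<p>More</p>"
v2 = "<p>The cat</p>\n<p>More</p>"  # the author fixed one typo

first_time = is_new(v1, seen)   # unseen text
second_time = is_new(v1, seen)  # same text again
after_edit = is_new(v2, seen)   # one character changed, so a new digest
changes = text_diff(v1, v2)
```

A one-character edit changes the digest completely, so the edited article counts as new; the diff, by contrast, would show only the changed line.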

This program hasn’t been packaged up for the casual user who doesn’t know to pull programs from github or install Perl modules (or, on some systems, install Perl itself). Those are not complicated things to do, but instructions would vary depending on the system. (As regards Perl modules, this version of the program might not need any extra ones: the ones it uses are pretty basic, and might already be there.)

Setting text width in HTML

This blog quite intentionally has very little formatting. “Quite intentionally”, because not only does it save me effort, but it also lets mobile devices with tiny screens format the text the way they want, without having to fight my formatting. But there’s one piece of formatting code I use: limiting the width of the text column. That is a principle of typesetting that I disliked at first, but eventually accepted: long lines are just too hard to read; the eye too easily loses its place when scanning back to the left to get to the start of the next line.

Though a lot of sites limit text width, usually, from what I’ve seen, it’s done badly:

  • Specifying text width in terms of pixels. This produces annoying results for people with bad eyesight who use huge fonts, and for people who have portable devices with lots of microscopic pixels (such as what Apple calls a “retina display”), and who thus also use huge fonts (that is, huge when measured in pixels). It also can fail for people who have displays narrower than the specified number of pixels, since they can end up with lines that go off the edge of the screen, and need to keep scrolling the screen back and forth for each line that they read.

  • Specifying text width as a proportion of the screen width. This won’t overflow the screen, but may produce columns with annoyingly many or annoyingly few characters.

The best way to specify text width is relative to the font size. CSS provides the “em” unit, which is equal to the font size of the element (traditionally, the width of a capital “M”). About 35 of those translate into about 75 characters of average text, which is what Lamport’s LaTeX manual says is the maximum width one should ever use. (Personally, being an exceptionally fast reader, I don’t mind twice that width; but this blog is for other people to read, not for me. And above twice that width, even I start to get annoyed.)

One can set the width using HTML tables to divide up the screen into columns whose width is specified in “em” units; and there’s not too much wrong with that. But a width specified that way might be too large for smaller screens. Fortunately the CSS standard provides a way to set an upper bound on the width, without using tables:

<style type="text/css">
    .foo { max-width:35em }
</style>

The above goes in the “head” section of the HTML file. To use that style, one then writes:

<div class="foo">
    Text whose width is to be limited goes here.
</div>

It’s simple, and precisely what is needed: it produces a column 35em wide, unless the screen is narrower than that, in which case the column fits the screen. The “class” attribute can also be set for other HTML elements, such as <body> or <p>, so one doesn’t need to add extra <div>s if one doesn’t want to.

Blogging software

The weblog software that people seem to choose by default these days is Wordpress. Wordpress has a lot of features, is widely used and liked, and is offered as a free single-click install by a lot of web hosting providers. But several of the Wordpress blogs I follow have been hacked at some point. When I looked into blogging software, the reason became clear: Wordpress is a large piece of software, written in PHP, a language which originally arose in a world where security concerns were much less significant, and which has addressed those security concerns (and other evolving needs) by adding things, not by a fundamental redesign. (UPDATE: I originally wrote that PHP ‘was designed’ in such a world; it appears that was far too generous to PHP.) The result is a rather large, complicated language, which is hard to learn well enough to master all the security issues. Also, Wordpress uses an SQL database to store weblog entries, comments, and such, which opens up possibilities of SQL injection attacks. The single-click install is easy, but upgrading is not so easy; and if one runs the software for any length of time, one has to upgrade much more often than one has to install.

A lot of other blogging software, too, uses SQL databases to store weblog data. But databases add complexity; for one thing, to back up a database-driven weblog means issuing special commands to back up the database, in addition to doing the normal backup of the weblog’s files. The added complexity might be worthwhile if there were any real need for a database, but there normally are few enough weblog entries that using a file for each one is quite practical; and once written, they seldom change.

I suspect that the reason why blog software commonly uses databases is that PHP makes using SQL easy, and doesn’t make other ways of storing data as easy. In any case, it’s quite inefficient: even though weblog pages hardly ever change, the PHP/SQL combination means that each time a user asks to view a web page, a PHP process gets started up (or woken up), sends queries to an SQL server, receives the results, and rebuilds the web page using them, adding the headers, sidebar, and other formatting that the user has chosen. The sidebars often take further SQL queries. Due to this inefficiency, database-driven blogs are routinely brought to their knees when they draw huge traffic (as in “slashdotting” or “instalanche”). Right when a weblog is getting the most attention is exactly the wrong time for it to fail. There are various optimizations that can improve this: for one thing, PHP can be run inside Apache (mod_php) rather than re-started for every request (CGI); and there are also plugins which cache the resulting web pages rather than rebuilding them every time. But installing and maintaining one of those plugins is additional work; and even they don’t bring the efficiency up to the level that static web pages naturally have.
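The difference between rebuilding a page on every request and caching it can be sketched in a few lines of Python. This is a toy model, not how any particular Wordpress caching plugin works: the `PageCache` class, its `renders` counter, and the in-memory `entries` dict (standing in for the SQL database) are all invented for illustration.

```python
class PageCache:
    """Serve rendered pages from memory, rebuilding only when asked to."""

    def __init__(self, entries):
        self.entries = entries  # entry id -> (title, body); stands in for the database
        self.pages = {}         # entry id -> rendered HTML
        self.renders = 0        # how many times a page was actually rebuilt

    def _render(self, entry_id):
        """The expensive step: look up the entry and build the page."""
        self.renders += 1
        title, body = self.entries[entry_id]
        return "<html><body><h1>%s</h1>%s</body></html>" % (title, body)

    def get(self, entry_id):
        """Return the page, rendering it only on the first request."""
        if entry_id not in self.pages:
            self.pages[entry_id] = self._render(entry_id)
        return self.pages[entry_id]

    def invalidate(self, entry_id):
        """Forget a cached page, for example after the entry is edited."""
        self.pages.pop(entry_id, None)


cache = PageCache({"a": ("Title A", "<p>Body</p>")})
for _ in range(1000):  # a burst of traffic
    page = cache.get("a")
```

A thousand requests cost one render. The price is the extra `invalidate` bookkeeping, which is the kind of additional moving part the paragraph above complains about.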

Of course you can easily move a Wordpress blog to the hosting service run by its developers, and let them handle issues like caching and keeping the software up to date. That’s how they make their money: by selling advertising on the blogs they host, and/or charging those blogs for premium features. The blogging software they give away is not a revenue source; indeed, if they were to make it too easy to maintain, they’d be sabotaging their revenue source.

I don’t grudge them their revenue — the people who write blogging software do need to eat — but personally, I feel like going to the other extreme. Thus this blog is done in PyBlosxom, a small file-based blogging package written in Python, which I’m using in static-rendering mode, where rather than being run each time someone visits, it is run once and generates all the web pages for the entire blog. PyBlosxom’s default mode has the author writing blog entries in HTML; I’m using a plugin that provides for writing them in Markdown.
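Static rendering is simple enough to sketch in Python. What follows is not PyBlosxom’s code, only an illustration of the idea under a few assumptions stated here: each entry is a `.txt` file whose first line is the title and whose remaining lines are HTML, and the `render_all` function, the `PAGE` template, and the directory names are all made up for the example.

```python
import html
import tempfile
from pathlib import Path

PAGE = """<html><head><title>{title}</title></head>
<body><h1>{title}</h1>
{body}
</body></html>
"""


def render_all(entry_dir, out_dir):
    """Render every *.txt entry to a static .html page; return the files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for entry in sorted(entry_dir.glob("*.txt")):
        # Assumed entry format: first line is the title, the rest is the body.
        title, _, body = entry.read_text(encoding="utf-8").partition("\n")
        target = out_dir / (entry.stem + ".html")
        target.write_text(PAGE.format(title=html.escape(title), body=body),
                          encoding="utf-8")
        written.append(target)
    return written


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    entries = Path(tmp) / "entries"
    entries.mkdir()
    (entries / "hello.txt").write_text("Hello\n<p>First entry.</p>\n", encoding="utf-8")
    pages = render_all(entries, Path(tmp) / "site")
    rendered = pages[0].read_text(encoding="utf-8")
```

The script runs once, after an entry is written or edited; from then on the web server only hands out files, with no interpreter and no database involved per request.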