Thursday, December 27, 2007

RSS gripes

I should have a gripes tag.

I am working on an RSS aggregator. I want it to get files for me. That's all. Put them where I want them. Is that so much to ask? I think it is, so I had to do it myself. All the better; I can customize it a lot this way. (It's written in Python using feedparser. I will post it to this blog when I'm done, probably as public domain.) I want it to download my music for immediate initiation into my library, videos too perhaps, images definitely (comics, probably for a desktop changer later), and maybe text as well, though I'm not sure what for yet.

So at this point I have it pretty much working. Some of the nice things I don't quite have yet: filtering for certain entries, deleting files, putting the timestamp in the filename to further prevent collisions, etc. But for now I can use it, and it made me happy. Until xkcd gave me this:

<img src="http://imgs.xkcd.com/comics/blade_runner.png" title="Blade Runner: Classic, but incredibly slow." alt="Blade Runner: Classic, but incredibly slow." />

This, and the HTML page (linked from the feed), are the only references to the actual PNG. There is no direct link, so I now have to parse either this bit of HTML or the page that the RSS feed links to. No way am I going to make a custom bit of code just for some emo nerd comic. (The fanboys already made me irritated with this comic, so it's convenient.) But it turns out that Perry Bible Fellowship does the same thing; the supposed image link is actually a link to an HTML page that contains the image. All the podcasts have direct links to the mp3 files; after all, podcast players like to have direct links. I guess there aren't enough comic readers yet, if there are any at all.

And this is worse than parsing the feed, because there's no standard place to look for the link within the HTML; it's different for every comic. And the HTML page is subject to change at any moment. At least there's a standard in RSS. Some RSS feeds may have stuff in non-standard places, but I made a very easy way to define a non-standard place to look for stuff*, and I don't think they'll move it. And XML parsing is less error-prone than HTML parsing.

Well, fortunately there are a couple of HTML parsers for Python; I'll probably use pullparser. My hope now is that it's as easy as feedparser, and that I can make my custom settings in a similar way*. Though there are bound to be multiple images in the HTML page, so I have to somehow identify the one I want. Perhaps by tag id, or maybe by the directory that holds the image. God help me if I have to write regexes; I want this to be simple. It's weird enough that my configs are a list of dictionaries in Python.
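To make the "directory that holds the image" idea concrete, here is a rough sketch. It is not the aggregator's code, and it uses the standard library's html.parser in modern Python 3 instead of pullparser, purely so it runs without extra installs. The names ImageCollector and find_images, the sample page, and the example.com logo URL are all made up for illustration.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags
        # like <img ... /> also end up here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_images(html, prefix):
    """Return the image URLs in html that start with prefix."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return [src for src in collector.sources if src.startswith(prefix)]


# Hypothetical page: one site logo plus the comic itself.
page = """<html><body>
<img src="http://example.com/static/logo.png" alt="logo" />
<img src="http://imgs.xkcd.com/comics/blade_runner.png" alt="Blade Runner" />
</body></html>"""

comics = find_images(page, "http://imgs.xkcd.com/comics/")
print(comics)
```

The prefix would be one more per-feed setting in the config, so no regexes are needed as long as the comic keeps its images in one directory.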

*Here's how I have it get files in non-standard fields. feedparser puts the whole feed XML file into a tree of dictionaries and lists. Within each entry, your podcast item (for instance) is usually at entry["enclosures"][0]["href"]. This is what I have as the default. But what if I wanted something at entry["link"][1]["href"] (semi-madeup example)? My config file is written in Python too. I have a list of dictionaries that define each feed. To a feed's dictionary I add "url-basis":["link", 1, "href"], and it knows to look at entry["link"][1]["href"]. It's very convenient that in Python the same x[a] syntax works whether x is a dictionary (with a as a key) or a list (with a as an index), and of course it throws an exception if you guessed wrong.
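The lookup described in this footnote can be sketched in a few lines. This is a guess at the shape of it, not the actual aggregator code: get_by_path, file_url, and DEFAULT_URL_BASIS are hypothetical names, and the entries are plain dictionaries standing in for what feedparser returns.

```python
# Default place to look for the file URL within a feed entry.
DEFAULT_URL_BASIS = ["enclosures", 0, "href"]


def get_by_path(entry, path):
    """Walk entry one step at a time: x[a] works for dicts and lists alike."""
    value = entry
    for step in path:
        value = value[step]
    return value


def file_url(entry, feed_config):
    """Return the file URL for entry, or None if the path does not fit."""
    path = feed_config.get("url-basis", DEFAULT_URL_BASIS)
    try:
        return get_by_path(entry, path)
    except (KeyError, IndexError, TypeError):
        # Missing key, list index out of range, or wrong container type.
        return None


# Stand-in entries with made-up URLs.
podcast_entry = {"enclosures": [{"href": "http://example.com/show.mp3"}]}
odd_entry = {"link": [{"href": "http://example.com/page.html"},
                      {"href": "http://example.com/comic.png"}]}

# One feed uses the default; the other overrides it with "url-basis".
podcast_feed = {"url": "http://example.com/podcast.rss"}
odd_feed = {"url": "http://example.com/odd.rss",
            "url-basis": ["link", 1, "href"]}

print(file_url(podcast_entry, podcast_feed))
print(file_url(odd_entry, odd_feed))
```

Catching the three exception types together is one way to handle the "guessed wrong" case; whether to skip the entry or stop the whole feed at that point is a separate choice.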

Wednesday, December 19, 2007

Idea for an anti-phishing plugin

I just thought of a decent anti-phishing scheme, which would probably make a good plugin for Firefox. Particularly useful for those sites that use unicode for evil.

The plugin would give you a button you can push that tells you if you're on one of a whitelist of sites. If you're not, and you thought you were, you're being phished. It would be tedious to whitelist every single site out there, of course, so you would list mainly your important accounts (bank, PayPal, email). Listing as many as 20 things sounds like it's worth my time for safety.
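The check behind that button could be as simple as the following sketch. A real Firefox plugin would not be written in Python; this only illustrates the logic. The whitelist entries, normalized_host, and is_whitelisted are hypothetical, and converting the hostname to its ASCII (IDNA) form is my assumption about how to make a Unicode lookalike visibly differ from the real name.

```python
from urllib.parse import urlsplit

# Hypothetical whitelist of exact hostnames for important accounts.
WHITELIST = {"www.paypal.com", "mail.example.com", "bank.example.com"}


def normalized_host(url):
    """Return the URL's hostname in lowercase ASCII (IDNA) form, or None."""
    host = urlsplit(url).hostname
    if not host:
        return None
    try:
        # Non-ASCII labels become "xn--..." so lookalikes cannot match.
        return host.encode("idna").decode("ascii").lower()
    except UnicodeError:
        return None


def is_whitelisted(url, whitelist=WHITELIST):
    """True only if the URL's hostname is exactly one of the listed hosts."""
    host = normalized_host(url)
    return host is not None and host in whitelist


real = "https://www.paypal.com/login"
# Same name with a Cyrillic "a" (U+0430) in place of the Latin one.
lookalike = "https://www.p\u0430ypal.com/login"
# Real name used as a prefix of someone else's domain.
prefixed = "https://www.paypal.com.evil.example/login"

for url in (real, lookalike, prefixed):
    print(is_whitelisted(url))
```

Matching the whole hostname exactly, instead of looking for the name anywhere in the URL, is what keeps the third example from passing.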

Another variation: a small popup (not a dialog you have to click on), or maybe a change of color of a widget on your browser, that appears when you are on a whitelisted site. Sounds a bit backwards at first, but really, you can't get warned when you're on a phishing site unless you have a perfect blacklist (and if there were one, there would be no use for my plugin). My thought is, eventually you'll get used to seeing the popup every time you're at one of your important sites. Then, one day you go to a phishing site. Because of habit, you'll hopefully think that something is a little off when the popup doesn't show, and take notice. Unlike the first variation, I think this variation could work for general carelessness, not just the unicode trick. You would probably combine the two, really.