I am working on an RSS aggregator. I want it to get files for me. That's all. Put them where I want them. Is that so much to ask? I think it is, so I had to do it myself. All the better, since I can customize it a lot this way. (It's written in Python using feedparser. I will post it to this blog when I'm done, probably as public domain.) I want it to download my music so it goes straight into my library, videos too perhaps, images definitely (comics; probably for a desktop changer later), and maybe text as well; not sure what for yet.
So at this point I have it pretty much working. Some of the nice things I don't quite have yet: filtering for certain entries, deleting files, putting the timestamp in the filename to further prevent collisions, etc. But for now I can use it, and it made me happy. Until xkcd gave me this:
This, and the HTML page (linked from the feed), are the only references to the actual PNG. There's no direct link, so now I have to parse either this bit of HTML or the page that the RSS feed points to. No way am I going to make a custom bit of code just for some emo nerd comic. (The fanboys already made me irritated with this comic, so it's convenient.) But it turns out that Perry Bible Fellowship does the same thing; the supposed image link is actually a link to an HTML page that contains the image. All the podcasts have direct links to the MP3 files; after all, podcast players like to have direct links. I guess there aren't enough comic readers yet, if any.
And this is worse than parsing the feed, because there's no standard place to look for the link within the HTML; it's different for every comic. And the HTML page is subject to change at any moment. At least there's a standard in RSS. Some RSS feeds may have stuff in non-standard places, but I made a very easy way to define a non-standard place to look for stuff*, and I don't think they'll move it. And XML parsing is less error-prone than HTML parsing.
Well, fortunately there are a couple of HTML parsers for Python; I'll probably use pullparser. My hope now is that it's as easy as feedparser, and that I can make my custom settings in a similar way*. There are bound to be multiple images in the HTML page, though, so I have to somehow identify the one I want. Perhaps by tag id, or maybe by the directory that holds the image. God help me if I have to write regexes; I want this to be simple. It's weird enough that my configs are a list of dictionaries in Python.
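To make the "pick the image by its directory" idea concrete, here is a rough sketch. It uses the standard library's html.parser rather than pullparser, purely as a stand-in. The names ImageFinder and find_images, the "/comics/" directory, and the sample page are all made up for illustration, not taken from any real comic site.

```python
from html.parser import HTMLParser


class ImageFinder(HTMLParser):
    """Collect the src of every <img> whose path contains a given fragment."""

    def __init__(self, path_fragment):
        super().__init__()
        self.path_fragment = path_fragment  # hypothetical, e.g. "/comics/"
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and self.path_fragment in src:
            self.matches.append(src)


def find_images(html, path_fragment):
    """Return the src of each matching image, in document order."""
    finder = ImageFinder(path_fragment)
    finder.feed(html)
    finder.close()
    return finder.matches


# Made-up page: one site logo, one comic image.
page = (
    '<html><body>'
    '<img src="/static/logo.png">'
    '<img id="comic" src="/comics/today.png" />'
    '</body></html>'
)
comic_images = find_images(page, "/comics/")
```

The directory fragment could live in the per-feed config dictionary next to the other settings, so no regexes are needed as long as the comic keeps its images in one place.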
*Here's how I have it get files from non-standard fields. feedparser puts the whole feed XML file into a tree of dictionaries and lists. Within each entry, your podcast item (for instance) is usually at entry["enclosures"][0]["href"]. This is what I have as the default. But what if I wanted something at entry["link"][1]["href"] (a semi-made-up example)? My config file is written in Python too. I have a list of dictionaries, one defining each feed. To a feed's dictionary I add "url-basis":["link", 1, "href"], and it knows to look at entry["link"][1]["href"]. It's very convenient that in Python you can say x[a], and it will work whether x is a dictionary or a list, as long as a is the right kind of key or index (and of course, it throws an exception if you guessed wrong).
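A minimal sketch of that lookup, assuming the "url-basis" key and default path described above. The function names resolve and file_url are hypothetical, and the sample entry is a plain dictionary standing in for what feedparser would return.

```python
from functools import reduce

# Default place to look, as described above.
DEFAULT_URL_BASIS = ["enclosures", 0, "href"]


def resolve(entry, path):
    """Index into entry with each key or index in path, one step at a time."""
    return reduce(lambda node, key: node[key], path, entry)


def file_url(entry, feed_config):
    """Return the file URL for an entry, or None if the path doesn't fit."""
    path = feed_config.get("url-basis", DEFAULT_URL_BASIS)
    try:
        return resolve(entry, path)
    except (KeyError, IndexError, TypeError):
        # Wrong guess: missing key, index out of range, or wrong container type.
        return None


# Made-up entry for illustration only.
entry = {
    "enclosures": [{"href": "http://example.com/show.mp3"}],
    "link": [
        {"href": "http://example.com/page"},
        {"href": "http://example.com/comic.png"},
    ],
}

default_url = file_url(entry, {})
custom_url = file_url(entry, {"url-basis": ["link", 1, "href"]})
```

Because each step is just node[key], the same path list can mix dictionary keys and list indexes without the code caring which is which.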