Web scraping a friend's blog

A friend who has a blog on a popular blogging farm asked me for help backing up her blog. The site doesn't offer a backup download, so we'll use Web scraping!

  1. Web scraping > Interactive prototyping
  2. Requirements (features)
    1. Back up what? Plain text of all her posts ever, timestamps, maybe some styling applied (using HTML) to the text, (text of) comments per post, and images embedded in posts.
    2. Periodically harvest new posts and new comments. Automatically?
    3. Archive format? Enable local browsing, i.e., convert each post to an HTML document, inserting related comments and images. A frames-based index page could be convenient. What for?
    4. How to deploy (distribute, install, maintain?) the solution? (Pure Python, plus Beautiful Soup, I guess?)
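
      However we answer the archive-format question, each harvested post will need to be rendered as a standalone local HTML page. A minimal sketch of such a renderer; the field names (title, timestamp, body_html, comments) are placeholders until we settle what the scraper actually captures:

```python
import html

def render_post(title, timestamp, body_html, comments):
    """Render one harvested post as a standalone local HTML page.

    body_html is the post's (already-sanitised) markup, passed through
    as-is; comments is a list of (author, text) pairs. All field names
    are guesses at this stage.
    """
    parts = [
        "<!DOCTYPE html>",
        f"<html><head><title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
        f"<p><em>{html.escape(timestamp)}</em></p>",
        body_html,
        "<h2>Comments</h2>",
    ]
    for author, text in comments:
        parts.append(f"<p><b>{html.escape(author)}:</b> {html.escape(text)}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

      Escaping the scraped strings matters: commenters' names and text may contain angle brackets that would otherwise break the generated page.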
  3. Crawling
    1. Algorithm? Tentatively: fetch the front page, find the link to the newest entry (assuming posts are in reverse chronological order), harvest that post, then follow the link to the previous post and repeat. But: don't harvest twice (efficiency); yet what if there are new comments on an old entry, or the entry was edited (if the site allows that)? Robustness? "Staggering" (throttling, out of civility)?
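
      The tentative walk described above can be sketched as a loop over the chain of "previous post" links. The three hooks (fetch, find_first, find_prev) are hypothetical; their real implementations depend on the site's markup, which we haven't dissected yet:

```python
import time

def crawl(fetch, front_url, find_first, find_prev, seen=None, delay=1.0):
    """Walk the reverse-chronological chain of posts.

    Hypothetical hooks: fetch(url) -> html; find_first(html) -> URL of
    the newest post on the front page; find_prev(html) -> URL of the
    previous post, or None at the oldest entry.
    Returns {url: raw_html} for later parsing.
    """
    seen = set() if seen is None else seen
    pages = {}
    url = find_first(fetch(front_url))
    while url and url not in seen:
        seen.add(url)           # don't harvest twice
        html = fetch(url)
        pages[url] = html       # stash raw HTML; parsing happens elsewhere
        url = find_prev(html)
        time.sleep(delay)       # throttle, out of civility
    return pages
```

      Passing `seen` in from a previous run skips already-harvested posts, but as noted, that also skips any new comments on old entries; we'd need a separate re-check pass for those.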
  4. Parsing the HTML
    1. Site's HTML is horrendous beyond words. Unspeakable, really. We'll probably use Beautiful Soup to extract the content, or fall back on regular expressions.

      Saved a WAR archive (it's a tarball) of the root page. Unpacked its contents: 65 files, 1.6MiB total. index.html is 846KB (actual text less than 1KB!). 33 small (<2KB) icons.
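
      With an 846KB page carrying under 1KB of actual text, extraction is the whole game. Until we decide on Beautiful Soup, the standard library's HTMLParser is enough to prototype the idea. A sketch that pulls the text out of a hypothetical <div class="entry"> wrapper; the class name "entry" is a pure guess pending a look at the real (horrendous) markup:

```python
from html.parser import HTMLParser

class EntryExtractor(HTMLParser):
    """Collect the text inside <div class="entry">...</div>.

    The "entry" class name is a placeholder; tolerates nested divs
    inside the entry by counting depth.
    """
    def __init__(self):
        super().__init__()
        self.depth = 0       # div-nesting depth once inside the entry
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("class") == "entry":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def entry_text(html_source):
    parser = EntryExtractor()
    parser.feed(html_source)
    return "".join(parser.chunks).strip()
```

      Beautiful Soup would shrink this to a one-liner and, more importantly, cope better with the site's broken markup; the stdlib version just keeps the prototype dependency-free for now.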

  5. Operation…




Last modified 2009-08-06 07:21:21 +0000