corpusifier.py

A script I'm writing to extract the text of emails in a Gmail mailbox into a single plain text file. Written in Python, talks IMAP.

We'll use it to record a corpus (hence the ugly name) for the 1,000 Tables event (1KT).

Issues

  1. Extract how? IMAP, for efficiency, because we'll want to scrape the mailbox repeatedly as 1KT unfolds. IMAP lets us download just the index and track which emails we've already read. Many people will be accessing that mailbox, so we can't delete processed emails, nor rely on Gmail's POP3, I guess?
  2. Avoid polling? To improve efficiency and reduce latency, we can use IMAP IDLE.
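
A rough sketch of the "download just the index" idea, using imaplib from the standard library rather than imaplib2. The names fetch_index and parse_header_fetch and the choice of header fields are made up for illustration. The read-only SELECT and BODY.PEEK are there so that the scrape doesn't touch the \Seen flag other people using the mailbox rely on.

```python
import email
import imaplib
import re

UID_RE = re.compile(rb"UID (\d+)")


def parse_header_fetch(data):
    """Turn imaplib's FETCH response list into {uid: parsed headers}."""
    index = {}
    headers = None
    for item in data:
        if isinstance(item, tuple):
            # A literal arrives as an (envelope, payload) tuple.
            envelope, payload = item
            headers = email.message_from_bytes(payload)
            trailer = envelope
        elif isinstance(item, bytes):
            # Plain bytes: the closing b")" or a trailing b" UID n)".
            trailer = item
        else:
            continue
        match = UID_RE.search(trailer)
        if match and headers is not None:
            index[int(match.group(1))] = headers
            headers = None
    return index


def fetch_index(conn, folder):
    """Fetch a few headers of every message without marking anything as read."""
    # readonly=True issues EXAMINE, so this session cannot change flags.
    # Folder names containing spaces have to be quoted by hand.
    conn.select(f'"{folder}"', readonly=True)
    typ, data = conn.uid(
        "FETCH", "1:*", "(BODY.PEEK[HEADER.FIELDS (FROM SUBJECT DATE)])"
    )
    if typ != "OK":
        raise RuntimeError(f"FETCH failed: {data!r}")
    return parse_header_fetch(data)
```

Usage would be something like fetch_index(imaplib.IMAP4_SSL("imap.gmail.com"), "[Gmail]/All Mail") after logging in; the host and folder name here are assumptions about the account.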

Resources

Googling produces lots:
  1. Bruno Renié wrote an excellent tutorial.
  2. Piers Lauder's imaplib2 (src) encapsulates a threaded connection. Mostly compatible with imaplib from Python's standard library. Many methods optionally accept callbacks, so calls can be non-blocking. The implementations below use this library.
  3. Tim posted an example script using threading based on offlineimap's code. Chris Kirkham's IMAPPush.py is a more complete client.
  4. IMAPClient's API is so much nicer than imaplib's. Its idle_check method blocks (with a timeout)! No threading headaches.
  5. Lots more. I'll add links later…
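
For comparison, a minimal sketch of waiting for new mail with IMAPClient's blocking idle_check. The function names, the timeout, and the max_checks parameter are assumptions for illustration; only idle, idle_check, idle_done, login, and select_folder are IMAPClient's own calls.

```python
def connect(host, user, password, folder):
    """Open a read-only IMAPClient session (third-party: pip install imapclient)."""
    from imapclient import IMAPClient

    client = IMAPClient(host, ssl=True)
    client.login(user, password)
    client.select_folder(folder, readonly=True)
    return client


def wait_for_new_mail(client, timeout=30, max_checks=None):
    """Sit in IDLE until the server reports EXISTS; return the message count.

    Returns None if max_checks timeouts pass without new mail.
    """
    client.idle()
    try:
        checks = 0
        while max_checks is None or checks < max_checks:
            # Blocks for up to `timeout` seconds, then returns a (possibly
            # empty) list of untagged responses.
            for response in client.idle_check(timeout=timeout):
                # New mail shows up as (message_count, b"EXISTS").
                if len(response) == 2 and response[1] == b"EXISTS":
                    return response[0]
            checks += 1
        return None
    finally:
        # Always leave IDLE mode, even on errors, so the session stays usable.
        client.idle_done()
```

RFC 2177 advises re-issuing IDLE at least every 29 minutes, so a long-running listener would call wait_for_new_mail in a loop with max_checks chosen to stay under that.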

Solutions

  1. Extract stuff (plain text, headers, etc.) by going over the entire local mailbox and outputting to a single text file and a CSV. Scrub the data while at it. Run this periodically, or ad lib.
  2. Listen to new messages using IMAP IDLE. (Cheap and keeps us up-to-date.)
  3. Clone Gmail mailbox in Maildir format, I guess, assuming this'll avoid concurrency problems? (Expensive, so run once.)
    1. goerz-gmailbkp isn't std Maildir.
    2. drewbuschhorn-gmail_imap and ProcImap both seem usable.
    3. Ended up using Ryan Tucker's imap2maildir v1.10.2 r20101018 (GitHub): seems ready-to-use; uses SQLite?
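
A minimal sketch of solution 1, assuming the local clone is a standard Maildir readable by Python's mailbox module. The function names, the CSV columns, and the blank-line separator between bodies are my own choices for illustration, and no scrubbing is done yet.

```python
import csv
import email
import mailbox
from email import policy


def plain_text(msg):
    """Return the text/plain body of a message, or '' if it has none."""
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else ""


def extract(maildir_path, text_path, csv_path):
    """Write all bodies to one text file and selected headers to a CSV."""
    box = mailbox.Maildir(maildir_path, create=False)
    count = 0
    with open(text_path, "w", encoding="utf-8") as text_out, \
            open(csv_path, "w", encoding="utf-8", newline="") as csv_out:
        writer = csv.writer(csv_out)
        writer.writerow(["key", "from", "date", "subject"])
        for key in sorted(box.keys()):
            # policy.default gives an EmailMessage with get_body/get_content.
            msg = email.message_from_bytes(
                box.get_bytes(key), policy=policy.default
            )
            writer.writerow([key, msg["From"], msg["Date"], msg["Subject"]])
            text_out.write(plain_text(msg).strip() + "\n\n")
            count += 1
    return count
```

Messages with broken or unknown charsets may make get_content raise, so a real run over thousands of emails would probably want a try/except around the body extraction.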

mailbox module

  1. Testing imap2maildir: edited example config file and saved as imap2maildir.conf to local directory.
  2. First attempt:
    $ ./imap2maildir --verbose
    ...
    imap2maildir: error: Directory '/home/head/Desktop/J14/1,000 Tables/tmp/rtucker-imap2maildir-02a0c05/test-maildir' exists, but it isn't a maildir.
    That's because the sub-directories new, cur, and tmp didn't exist, and imap2maildir needs them to distinguish a maildir from an mbox (lines 418, 248).
  3. And merging another mailbox into the same directory? Works. How?
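
The same maildir check, and one way merging can work, sketched with Python's mailbox module. This is not imap2maildir's code: is_maildir, open_maildir, and merge are hypothetical helpers. In this sketch merging is safe because Maildir.add stores each message under a freshly generated unique file name; whether imap2maildir relies on the same thing I haven't verified.

```python
import mailbox
from pathlib import Path


def is_maildir(path):
    """A maildir is a directory containing new, cur, and tmp sub-directories."""
    root = Path(path)
    return all((root / sub).is_dir() for sub in ("new", "cur", "tmp"))


def open_maildir(path):
    """Open a maildir, creating it only if the directory doesn't exist yet."""
    root = Path(path)
    if root.exists() and not is_maildir(root):
        raise ValueError(f"{root} exists, but it isn't a maildir")
    return mailbox.Maildir(str(root), create=True)


def merge(source_path, target_path):
    """Copy every message from one maildir into another; return the count.

    No de-duplication: a message present in both ends up stored twice.
    """
    source = mailbox.Maildir(str(source_path), create=False)
    target = open_maildir(target_path)
    copied = 0
    for message in source:
        target.add(message)  # add() picks a new unique file name
        copied += 1
    return copied
```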

Gmail quirks?

  1. OK, rmdir, and again:
    $ ./imap2maildir --verbose --create
    ...
    Exception: folder [Gmail]/All Mail: [NONEXISTENT] Unknown Mailbox: [Gmail]/All Mail (Failure)
  2. Foo? On this account the folder prefix is [Google Mail], not [Gmail]. Again:
    $ ./imap2maildir --verbose --create --remote-folder="[Google Mail]/All Mail"
    ...
     NEW: {'uid': 1, 'envfrom': 'gmail-noreply@google.com', 'msgid': '<6a78310c0978541290ffb9673b@mail.gmail.com>', 'envdate': 'Sun, 6 Mar 2005 08:59:18 -0800', 'date': '06-Mar-2005 16:59:18 +0000', 'size': 4303}
    ...
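
Rather than guessing the prefix, the folder could be looked up from the server's LIST response. A sketch, assuming the server tags the folder with the \All special-use flag (as far as I know Gmail does) and falling back to the "/All Mail" suffix otherwise. find_all_mail is a made-up helper; its input is the list of lines that imaplib's list() returns.

```python
import re

# One LIST line looks like: (\HasNoChildren \All) "/" "[Gmail]/All Mail"
LIST_RE = re.compile(rb'\((?P<flags>[^)]*)\) (?P<delim>"[^"]*"|NIL) (?P<name>.+)')


def find_all_mail(list_lines):
    """Return the name of the All Mail folder, or None if it can't be found."""
    fallback = None
    for line in list_lines:
        if not isinstance(line, bytes):
            continue  # skip anything that isn't a plain LIST line
        match = LIST_RE.match(line)
        if not match:
            continue
        flags = match.group("flags").split()
        name = match.group("name").decode("ascii").strip('"')
        if b"\\All" in flags:
            return name  # special-use flag: the most reliable signal
        if name.endswith("/All Mail"):
            fallback = name  # name-based guess, used only if no flag is seen
    return fallback
```

With a logged-in imaplib connection this would be called as find_all_mail(conn.list()[1]), and the result passed to --remote-folder.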

Incremental scraping

  1. imap2maildir's README says "incremental", possibly because it defaults to using SEARCH SEEN. So I guess I don't need an IDLE script? At least not urgently?
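
Another way to be incremental, independent of read flags, is to remember the highest UID already downloaded and ask only for higher ones. A sketch with the standard imaplib; this is not how imap2maildir does it, and uid_criterion, new_uids, and search_new are hypothetical names. UIDs are only comparable while the folder's UIDVALIDITY stays the same, which this sketch doesn't check.

```python
def uid_criterion(last_uid):
    """Build the UID range that selects everything after last_uid."""
    return f"UID {last_uid + 1}:*"


def new_uids(search_data, last_uid):
    """Parse a UID SEARCH result and keep only UIDs above last_uid.

    The filter matters: a range like "n:*" always matches the newest
    message, even when its UID is lower than n.
    """
    uids = [int(uid) for uid in search_data[0].split()]
    return sorted(uid for uid in uids if uid > last_uid)


def search_new(conn, last_uid):
    """Ask a selected imaplib connection for UIDs newer than last_uid."""
    typ, data = conn.uid("SEARCH", None, uid_criterion(last_uid))
    if typ != "OK":
        raise RuntimeError(f"SEARCH failed: {data!r}")
    return new_uids(data, last_uid)
```

The last UID would have to be persisted between runs, for example in a small state file next to the maildir.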

Completeness

  1. Gmail says there are some 4,000 messages in the mailbox, but only about 1,800 were downloaded? Ah, imap2maildir defaults to using SEARCH SEEN, which per RFC 3501 matches messages with the \Seen flag set (set by whom? still to check); use --search=all instead.
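
To see how much a search criterion leaves out before cloning, the counts can be compared directly. A small sketch using the standard imaplib's search on an already selected folder; search_counts is a made-up helper name.

```python
def search_counts(conn, criteria=("ALL", "SEEN", "UNSEEN")):
    """Return {criterion: number of matching messages} for the selected folder."""
    counts = {}
    for criterion in criteria:
        typ, data = conn.search(None, criterion)
        if typ != "OK":
            raise RuntimeError(f"SEARCH {criterion} failed: {data!r}")
        # data[0] is a space-separated list of message numbers.
        counts[criterion] = len(data[0].split())
    return counts
```

If SEEN comes out at roughly 1,800 and ALL at roughly 4,000, that would confirm the default criterion as the cause of the gap.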

Scrubbing data


--
The real world is a special case

Canonical
https://decodecode.net/elitist/corpusifier