Searching my email

A few days ago I got an email from Google saying “hey, did you know we’ve just added Gmail and Google Calendar to Google Takeout?”

I did not.

Google Takeout is the entirely laudable effort by Google to make it possible to get all the data you have stored in a particular Google service out of that Google service, whether because you want to leave or just because backups are a good idea. I’ve been a Gmail user for quite a long time, and have quite a lot of mail in there, and it’d be nice to have a backup of it. So I clicked “create an archive”, and some hours later[1] I got a nudge from Google saying “we’ve created an archive of all your mail, and now you can download it”. So that’s exactly what I did.

what shall we do with the drunken mailbox, err-lie in the mornin’

OK, so I’ve got a 4GB .mbox file of all my mail since 2004.[2] It’s good to have a backup. What else can I do with it?

One obvious thing is to point a search engine at it. Gmail is pretty good at searching mail, don’t get me wrong, but it’s nice to be able to search locally without needing internet access, especially since sometimes Gmail goes down[3] or my cable connection decides that connections to gmail and twitter should be slow today.[4] The clear leader for this seems to be notmuch, which bills itself as “the mail indexer”. Notmuch doesn’t fetch mail, it doesn’t send mail; it just indexes and searches it.

a brief digression into mail storage formats

First step, though, is to put the mail in Maildir format. Google has you download the mail in the standard mbox format: one file, with all your mail in it. Mbox format has been around pretty much exactly as long as there has been email at all: it shows up in a man page from 1975. Maildir was invented as a better format in 1995; instead of having one epic file with all your mail in it, you have one folder and each email is a separate file in that folder. This is approximately thirteen billion times easier to deal with for applications, especially those trying to deal with a lot of mail, which notmuch is. So we need to convert the Gmail export mbox into a Maildir. I dropped the Gmail mbox into a folder ~/gmail-backup, and then ran mb2md -s gmail-backup. That creates ~/Maildir and puts your stuff in it.
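If mb2md isn't to hand, Python's standard-library mailbox module can do a similar mbox-to-Maildir copy. What follows is a minimal sketch, not what was used above: the function name mbox_to_maildir is made up for illustration, and it writes one flat Maildir rather than the per-folder layout that mb2md produces, so the paths in the rest of this post would not match its output.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir.

    Hypothetical helper for illustration; returns the number of
    messages copied. The mbox file itself is left untouched.
    """
    # create=False: fail loudly if the mbox path is wrong instead of
    # silently creating an empty mailbox there.
    source = mailbox.mbox(mbox_path, create=False)
    # create=True: make the Maildir (with tmp/, new/ and cur/) if needed.
    dest = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for message in source:
            # add() writes the message out as its own file.
            dest.add(message)
            copied += 1
    finally:
        dest.close()
        source.close()
    return copied
```

Calling mbox_to_maildir with the path to the Takeout .mbox file and a destination directory should leave one file per message in that directory. For a multi-gigabyte mbox this is likely to be slow, which is one reason a dedicated tool like mb2md is still the easier route.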

Next, install notmuch and run notmuch setup, which walks you through a few basic questions about your mail. Then notmuch new reads and indexes it all. This takes a little while.

a sidebar: “Ignoring non-mail file”

Either gmail or mb2md did something weird: notmuch rejected a whole bunch of my mails because they had a blank second line. If you get the same thing, notmuch will print a bunch of lines like Note: Ignoring non-mail file: /home/myself/Maildir/.All mail Including Spam and Trash_mbox/cur/1234567890.123456.mbox:2,. If that happens, take a look at the file it says it’s ignoring. If it looks like a legitimate email but it’s got a blank second line, then you’ve hit the same problem. I needed something to walk through my mail folder and patch these up, so as usual in these situations I wrote an ultranoddy Python script.

an ultranoddy Python script

#!/usr/bin/python
"""mb2md seems to export a bunch of mails from the gmail takeout
   mbox dump with a blank second line. Fix those."""
import glob
import os

counter = 0
pattern = os.path.join(os.path.expanduser("~"), "Maildir", ".All*", "cur", "*")
for path in glob.glob(pattern):
    # Peek at the start of the file as bytes; mail is not always valid text.
    with open(path, "rb") as fp:
        head = fp.read(1024).split(b"\n")
    # Skip one-line files and files whose second line is not blank.
    if len(head) < 2 or head[1].strip() != b"":
        continue
    with open(path, "rb") as fp:
        lines = fp.read().split(b"\n")
    del lines[1]  # drop the stray blank second line
    with open(path, "wb") as fp:
        fp.write(b"\n".join(lines))
    counter += 1
    if counter % 50 == 0:
        print("fixed another fifty files: total fixed %d" % counter)

Once you’ve done that, touch ~/Maildir; touch ~/Maildir/.All* to let things know that you changed something, and then notmuch new again should read in all the fixed mail (and keep the previously-read lot around too).

There’ll still be a bunch that notmuch ignores: gmail (handily) stores chat logs as emails, but (unhandily) these are not actually emails, and notmuch will dislike them. That’s fine.

seek and ye shall find

Now all your mail is searchable. Try notmuch search whatever and, lo, you get all the matching mails. Very cool. Notmuch can handle some pretty complicated searches: check their website for details.
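If you want to drive that search from a script rather than by hand, one option is to ask the command-line tool for JSON and parse it. This is a rough sketch under a few assumptions: the notmuch binary is on your PATH, your version supports --format=json for search, and each result carries date_relative, subject and authors keys (the code falls back to empty strings if any are missing). The function names are invented for this example.

```python
import json
import subprocess


def build_search_command(query):
    """Return the argv list for a JSON-formatted notmuch search."""
    return ["notmuch", "search", "--format=json", query]


def summarise_threads(json_text):
    """Turn notmuch's JSON search output into one summary line per thread."""
    threads = json.loads(json_text)
    return [
        "%s  %s (%s)" % (
            thread.get("date_relative", ""),
            thread.get("subject", ""),
            thread.get("authors", ""),
        )
        for thread in threads
    ]


def search(query):
    """Run notmuch and return summary lines; raises if notmuch fails."""
    result = subprocess.run(
        build_search_command(query),
        check=True,           # surface a non-zero exit as an exception
        capture_output=True,  # needs Python 3.7 or later
        text=True,
    )
    return summarise_threads(result.stdout)
```

Passing the command as a list rather than a shell string means the search terms never go through a shell, so quotes and spaces in a query need no extra escaping.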

ultranoddy II: this time it’s personal

Of course, I don’t want to ssh into my home server (which is where this stuff is) and type commands to search my mail. So instead I wrote the world’s simplest notmuch web search UI in Python. It is ugly, it doesn’t do formatting properly, it hates foreigners and so smashes Unicode down to question marks, and I don’t care because all I need is to get search results over the web, and it does that fine. There’s notmuch-web, which seems very nice[5] but requires notmuch v0.15 or better, and Ubuntu 12.04 only has 0.12. So, once more forth into noddy Python scripts.

#!/usr/bin/python

import notmuch, BaseHTTPServer, cgi, urlparse, time, json

BASE="""<!doctype html><html><style>
div.thread { margin-left: 2em; }
div[data-match] pre { display: none; }
div[data-match=match] pre { display: block; }
details summary { color: #999; }
details[open] > summary { color: black; }
</style>
%s"""
IDX= BASE % """<form>Search: <input name="q"></form>"""
MSG = ('<div data-match="%(match)s"><details %(matchopen)s>'
  '<summary>%(from)s %(date)s</summary><pre>%(body)s</pre>'
  '</details></div>')
def format_message(m):
    j = json.loads(m.format_message_as_json())
    dic = {
        "body": cgi.escape("\n".join(
           [x for x in j.get(
             "body", [{"content":"no body"}])[0].get("content", 
              "no content").split("\n") if not x.startswith(">")])),
        "subject": cgi.escape(j["headers"]["Subject"]),
        "from": cgi.escape(j["headers"]["From"]),
        "match": "match" if j["match"] else "",
        "matchopen": "open" if j["match"] else "",
        "date": cgi.escape(j["headers"]["Date"])
    }
    return MSG % dic

def get_message_and_children(m):
    ret = [format_message(m)]
    for child in m.get_replies():
        ret += get_message_and_children(child)
    return ret

class NMHandler(BaseHTTPServer.BaseHTTPRequestHandler):
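    def do_GET(self):
        # Send a status line and headers before any body; without them
        # the reply is not a valid HTTP/1.x response.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if self.path == "/":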
            self.wfile.write(IDX)
        else:
            qs = dict(cgi.parse_qsl(urlparse.urlparse(self.path).query))
            if "q" not in qs:
                self.wfile.write(IDX)
                return
            db=notmuch.Database()
            q=notmuch.Query(db, qs["q"])
            thrs=[x for x in q.search_threads()]
            res = []
            for thr in thrs:
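                data = {
                    # Escape these too, so a "<" in a subject or author
                    # name cannot break the page markup.
                    "authors": cgi.escape(thr.get_authors() or ""),
                    "subject": cgi.escape(thr.get_subject() or ""),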
                    "tid": thr.get_thread_id(),
                    "date": time.asctime(time.gmtime(thr.get_oldest_date())),
                    "msgs": [],
                }
                for m in thr.get_toplevel_messages():
                    msgs = get_message_and_children(m)
                    data["msgs"] += msgs
                data["msgs"] = "\n".join(data["msgs"])
                res.append(data)
            LST = "\n".join([('<li><details><summary>%(subject)s (%(authors)s, '
             '%(date)s)</summary><div class="thread">'
             '%(msgs)s</div></details></li>') % r for r in res])
            out = BASE % ("<ul>%s</ul>" % LST)
            self.wfile.write(out.encode("ascii", "replace"))

def run(server_class=BaseHTTPServer.HTTPServer,
        handler_class=NMHandler):
    server_address = ('', 8411)
    httpd = server_class(server_address, handler_class)
    httpd.serve_forever()

if __name__ == "__main__":
    run()

To be clear, this is pretty horrid. All the HTML is baked into it; it does the absolute bare minimum required. It does what I need it to, though. I just did crontab -e to edit my list of scheduled apps and added @reboot python /home/me/noddy-search-server.py and now I can just connect to http://homeserver:8411 and search my mail. Nice.
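The script above is Python 2. On Python 3, BaseHTTPServer lives in http.server, urlparse in urllib.parse, and html.escape takes the place of cgi.escape. Here is a hedged sketch of the same shape of server using those names. It is not a port of the script: render, make_handler and serve are invented names, and search is a placeholder for whatever function turns a query into a list of result strings (a notmuch lookup, say). It also sends UTF-8 instead of squashing everything to ASCII.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

FORM = '<form>Search: <input name="q"></form>'
PAGE = ('<!doctype html><html><head><meta charset="utf-8"></head>'
        '<body>%s</body></html>')


def render(path, search):
    """Build the page for a request path and return it as UTF-8 bytes.

    `search` is a placeholder callable: query string in, list of
    result strings out.
    """
    query = parse_qs(urlparse(path).query).get("q", [""])[0]
    body = FORM
    if query:
        # Escape every hit so mail content cannot inject markup.
        items = "".join(
            "<li>%s</li>" % html.escape(hit) for hit in search(query))
        body += "<ul>%s</ul>" % items
    return (PAGE % body).encode("utf-8")


def make_handler(search):
    """Return a request handler class bound to the given search callable."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            payload = render(self.path, search)
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
    return Handler


def serve(search, port=8411):
    """Serve forever on the given port (8411 matches the script above)."""
    HTTPServer(("", port), make_handler(search)).serve_forever()
```

Keeping render separate from the handler means the page-building logic can be tried out without opening a socket; serve(some_search_function) then starts the actual server.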

  1. modulo that it weirdly didn’t work the first time, as per me mithering on Google+ about it
  2. yes, I know I could have been doing this with offlineimap. I never got around to setting it up, and gmail’s imap implementation is odd because it treats folders as labels, meaning that a message with two labels appears in two imap folders. I might set it up now, though, since I don’t care about the offline imap Maildir other than so that notmuch can index it, and notmuch is clever about finding two mails with the same message ID
  3. vanishingly rare
  4. nowhere near as rare
  5. except for being written in Haskell, but I’m not bigoted