I’m on vacation this week so as part of that I figured it would be a good time to try and get into the habit of writing blog posts again. So let’s talk about this project I came up for myself this week.
Some initial challenges to this I predicted:
- what NN do I use?
- how do I featurize the text? is it different from other NLP?
- how do I extract human text (instead of HTML)?
- how do I extract my e-mails from my server?
- which e-mails do I use? do I need to do manual classification?
- how do I remove identifiable information?
- what constitutes identifiable information? (names? phone numbers? companies? roles? skills?)
This led me down a rabbit hole of reading about neural networks and GPT-2 briefly, (which I’ll save that topic for another day), so next up was looking among my e-mails for potential candidates for my training corpus. Thankfully for various reasons I run my own mailserver+filtering instead of using a 3rd-party (also another topic for another day), so it was relatively easy to extract all the raw text from the maildir and start really examining the contents. I used a combination of manually classifying e-mails directed to me, in addition to just pulling all the e-mails I got via LinkedIn.
When it came down to actually looking at e-mail content, it was a mess:
Specifically nowadays e-mails are formatted a little bit like HTTP responses. You have a series of headers at the top that define information about the e-mail itself, (including Content-Type and Content-Transfer-Encoding), followed by a double-newline, followed by the content body. Depending on the e-mail client, you can encounter combinations ranging from nested multipart containing varying levels of html and plaintext, to base64-encoded plaintext. For multipart e-mails, there are even more headers (again you can identify with the double-newline), that may redirect you to a different portion of the multipart by redefining the boundary text. That coupled with inconsistencies in indentation in headers makes overall the problem very entertaining indeed.
I’ve pushed some initial attempts that I’ve only managed to sneak in between drinks and hikes here: https://github.com/worldwise001/experiments/, with a moderate amount of success, but I am confident I will be able to more or less get full plaintext e-mails shortly.