This month's challenge theme is NLP, and in this article we'll help you explore one possibility: using pandas and Python's Natural Language Toolkit (NLTK) to analyze your Gmail inbox.
NLP-style projects are full of possibilities:
- Sentiment analysis measures the emotional content of text such as online comments, social media posts, and so on. For example, do tweets about a topic tend to be positive or negative? Does a news site cover topics using more positive/negative words, or words often associated with certain emotions? Is this "positive" Yelp review being ironic? (Good luck with that last one!)
- Analyze the use of language in literature to measure trends in vocabulary or writing style over time, region, or author.
- Flag spam by identifying key features of the language it uses.
- Use topic extraction to group comments into similar categories based on the topics they cover.
- Combine Elasticsearch and WordNet, via NLTK's corpus access, to measure word similarity over the Twitter streaming API, creating a better real-time Twitter search.
- Join the NaNoGenMo project and generate your own novel in code; there are plenty of ideas and resources to get you started.
Load Gmail inbox into pandas
Let's dive into an example project! First we need some data: request an archive of your Gmail data (including your Spam and Trash folders) at:
https://www.google.com/settings/takeout
Now go take a walk; for my 5.1 GB mailbox, the 2.8 GB archive took about an hour to be delivered.
Once you have the data and have set up a local environment for the project, use the following script to read the data into pandas (I strongly recommend using IPython for data analysis):
from mailbox import mbox
import pandas as pd

def store_content(message, body=None):
    if not body:
        body = message.get_payload(decode=True)
    if len(message):
        contents = {
            "subject": message['subject'] or "",
            "body": body,
            "from": message['from'],
            "to": message['to'],
            "date": message['date'],
            "labels": message['X-Gmail-Labels'],
            "epilogue": message.epilogue,
        }
        return df.append(contents, ignore_index=True)

# Create an empty DataFrame with the relevant columns
df = pd.DataFrame(columns=("subject", "body", "from", "to",
                           "date", "labels", "epilogue"))

# Import your downloaded mbox file
box = mbox('All mail Including Spam and Trash.mbox')

fails = []
for message in box:
    try:
        if message.get_content_type() == 'text/plain':
            df = store_content(message)
        elif message.is_multipart():
            # Grab any plaintext from multipart messages
            for part in message.get_payload():
                if part.get_content_type() == 'text/plain':
                    df = store_content(message, part.get_payload(decode=True))
                    break
    except:
        fails.append(message)
This uses Python's mailbox module to read and parse messages in mbox format. It could certainly be done more elegantly (for example, messages contain lots of redundant, duplicated data, such as the ">>>"-quoted text embedded in replies). Another issue is that some special characters can't be handled; for simplicity, we simply discard those messages. Make sure you aren't skipping an important chunk of your inbox at this step.
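For instance, one crude way to drop quoted-reply lines from a body before storing it might look like this; the strip_quoted helper is purely hypothetical and not part of the script above:

def strip_quoted(body):
    # Keep only lines that don't start with the ">" quote marker
    return "\n".join(line for line in body.splitlines()
                     if not line.lstrip().startswith(">"))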
Note that other than the subject lines, we aren't actually going to use any of the other content here, but you could run all sorts of interesting analyses on the timestamps, the message bodies, classification by label, and so on. Since this is just an article to get you started (one that happens to show results from my own inbox), I won't go into too much detail.
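Before moving on, it's worth a quick sanity check on what was loaded; this is just an illustrative peek, and your columns' contents will vary:

# How many messages parsed cleanly, and how many failed?
print df.shape
print len(fails)

# The most common Gmail labels in the archive
df['labels'].value_counts()[:10]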
Find common words
Now that we have some data, let's find the ten most common words across all the subject lines:
# Top 10 most common subject words
from collections import Counter

subject_word_bag = df.subject.apply(lambda t: t.lower() + " ").sum()
Counter(subject_word_bag.split()).most_common()[:10]
[('re:', 8508), ('-', 1188), ('the', 819), ('fwd:', 666), ('to', 572),
 ('new', 530), ('your', 528), ('for', 498), ('a', 463), ('course', 452)]
Well, those are too generic to be interesting, so let's try filtering out common stopwords:
from nltk.corpus import stopwords

stops = [unicode(word) for word in stopwords.words('english')] + ['re:', 'fwd:', '-']
subject_words = [word for word in subject_word_bag.split() if word.lower() not in stops]
Counter(subject_words).most_common()[:10]
[('new', 530), ('course', 452), ('trackmaven', 334), ('question', 334), ('post', 286),
 ('content', 245), ('payment', 244), ('blog', 241), ('forum', 236), ('update', 220)]
In addition to manually removing a few of the least valuable words ('re:', 'fwd:', '-'), we used NLTK's stopwords corpus, which requires its own download before first use. Now we can see some words that are typical of my inbox, but not necessarily typical of English text in general.
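If you haven't fetched that corpus yet, the one-time download looks like this:

import nltk
nltk.download('stopwords')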
Bigrams and collocations
Another interesting measurement NLTK can perform is the principle of collocation. First, let's look at commonly used "bigrams", pairs of words that often appear together in sequence:
from nltk import collocations

bigram_measures = collocations.BigramAssocMeasures()
bigram_finder = collocations.BigramCollocationFinder.from_words(subject_words)

# Filter to the top 20 results; otherwise this will take a LONG time to analyze
bigram_finder.apply_freq_filter(20)
for bigram in bigram_finder.score_ngrams(bigram_measures.raw_freq)[:10]:
    print bigram
(('forum', 'content'), 0.005839453284373725)
(('new', 'forum'), 0.005839453284373725)
(('blog', 'post'), 0.00538045695634435)
(('domain', 'names'), 0.004870461036311709)
(('alpha', 'release'), 0.0028304773561811506)
(('default', 'widget.'), 0.0026519787841697267)
(('purechat:', 'question'), 0.0026519787841697267)
(('using', 'default'), 0.0026519787841697267)
(('release', 'third'), 0.002575479396164831)
(('trackmaven', 'application'), 0.002524479804161567)
We can repeat the same steps for trigrams (or any n-grams) to find longer phrases. In this case, "new forum content" is the most common trigram, but in the list above it gets split in two and sits near the top of the bigram list.
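For reference, the trigram version mirrors the bigram code above; this is just a sketch reusing the same subject_words list:

trigram_measures = collocations.TrigramAssocMeasures()
trigram_finder = collocations.TrigramCollocationFinder.from_words(subject_words)

# Again, filter out rare n-grams before scoring
trigram_finder.apply_freq_filter(20)
for trigram in trigram_finder.score_ngrams(trigram_measures.raw_freq)[:10]:
    print trigram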
A slightly different measure of collocation is based on pointwise mutual information. Essentially, it measures how likely we are to see one word in a given text where we also see the other, relative to how often each word appears on its own across all documents. For example, if my email subjects use the words "blog" and/or "post" a lot overall, then the bigram "blog post" is not an interesting signal, because either word may still frequently appear without the other. By this rule, we get a different set of bigrams:
for bigram in bigram_finder.nbest(bigram_measures.pmi, 5):
    print bigram
('4:30pm', '5pm')
('motley', 'fool')
(',', '900,')
('population', 'cap')
('simple', 'goods')
So: I don't receive many messages containing the words "motley" or "fool", but whenever I see either one, "motley fool" is the likely pairing.
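To make that intuition concrete, here is a toy version of the pointwise mutual information calculation; this is a simplified sketch of the idea, not NLTK's actual scoring code:

import math

def toy_pmi(pair_count, count_1, count_2, total_words):
    # How much more often do the two words co-occur than we'd
    # expect if they each appeared independently?
    p_pair = pair_count / float(total_words)
    p_1 = count_1 / float(total_words)
    p_2 = count_2 / float(total_words)
    return math.log(p_pair / (p_1 * p_2), 2)

# Rare words that almost always co-occur score high:
print toy_pmi(10, 11, 10, 50000)      # ~12.2
# Words that are each common on their own score much lower:
print toy_pmi(200, 2000, 1500, 50000) # ~1.7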
Sentiment analysis
Finally, let's try some sentiment analysis. To get started quickly, we can use TextBlob, a library built on top of NLTK that provides simple API access to a large number of common NLP tasks. We can use its built-in sentiment analysis (based on the pattern library) to compute each subject's "polarity": a score from -1 for highly negative sentiment to +1 for highly positive, with 0 as neutral (or lacking a clear signal).
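Here is a minimal sketch of that computation, assuming TextBlob is installed (pip install textblob); the 'polarity' column name is just my own choice:

from textblob import TextBlob

# Polarity ranges from -1 (most negative) to +1 (most positive)
df['polarity'] = df.subject.apply(lambda s: TextBlob(s).sentiment.polarity)

# Peek at the most positive subject lines
df[['subject', 'polarity']].sort_values('polarity', ascending=False)[:10]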
Next steps:
- Analyze your inbox over time (see the sketch below for a starting point).
- See whether you can classify messages by sender, label, or spam status based on basic properties of the body text.
- Use latent semantic indexing to uncover the most common general topics covered.
- Feed your outbox into a Markov model, combined with part-of-speech (POS) tagging, to generate seemingly coherent automatic replies.
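For the first of those ideas, here is a rough sketch of counting messages per month; the errors='coerce' and utc=True arguments are defensive choices on my part, since raw email Date headers are messy and anything unparseable is simply dropped:

# Parse the raw Date headers; unparseable values become NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)

# Count messages per month and plot the trend
df.dropna(subset=['date']).set_index('date').resample('M').size().plot()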
Let us know if you've tried any fun NLP project spin-offs; bonus points for including an open-source library. You can look through previous showcases at challenge.hackpad.com for more inspiration!