Using Python to Create a Vector Space Model for Text
We need to start thinking about how to turn a collection of texts into something quantifiable. The simplest approach is to look at word frequency.
I am going to try to avoid the NLTK and scikit-learn packages for now. First, we will use plain Python to explain some basic concepts.
Basic Term Frequency
First, let's review how to get the word counts for each document: a word frequency vector.
#examples taken from here: http://stackoverflow.com/a/1750187
mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']
#mydoclist = ['sun sky bright', 'sun sun bright']

from collections import Counter

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print tf.items()

[('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]
[('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]
[('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]
Here we introduce a new Python object called Counter, which is only available in Python 2.7 and later. Counters are very flexible; you can use them to count things as you loop over a collection.
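For instance, a Counter can also be built directly from an iterable and queried for its most common items. The quick check below is just an illustration, reusing the first sentence from mydoclist:

    from collections import Counter

    # build a Counter straight from the split words of the first document
    tf = Counter('Julie loves me more than Linda loves me'.split())
    print tf['loves']          # 2
    print tf.most_common(2)    # e.g. [('loves', 2), ('me', 2)]; the order of ties may vary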
Counting the words in each document gives us our first attempt at quantifying the documents. But if you have already met the concept of a "vector" in the vector space model, you will notice that these first quantifications cannot be compared with one another, because they are not expressed over the same vocabulary.
What we really want is for every document's representation to have the same length, where that length is determined by the total vocabulary of our corpus.
import string #allows for format()

def build_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    return document.split().count(term)

vocabulary = build_lexicon(mydoclist)

doc_term_matrix = []
print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
for doc in mydoclist:
    print 'The doc is "' + doc + '"'
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)
    print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string)
    doc_term_matrix.append(tf_vector)
    # here's a test: why did I wrap mydoclist.index(doc)+1 in parens?  it returns an int...
    # try it!  type(mydoclist.index(doc) + 1)

print 'All combined, here is our master document term matrix: '
print doc_term_matrix
Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
The doc is "Julie loves me more than Linda loves me"
The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
The doc is "Jane likes me more than Julie loves me"
The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
The doc is "He likes basketball more than baseball"
The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
All combined, here is our master document term matrix: 
[[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]
Well, that seems reasonable. If you have any machine learning experience, what you just saw was the creation of a feature space. Every document now lives in the same feature space, which means we can represent the entire corpus in a space of the same dimensions without losing too much information.
Normalize the vector so that its L2 norm is 1
Once your data is in the same feature space, you can start applying machine learning methods such as classification and clustering. But there is still a problem: words do not all carry the same amount of information.
If some words appear extremely frequently in a single document, they will distort our analysis. We want to rescale each term frequency vector into something more representative. In other words, we need to normalize the vectors.
We really don't have time to go too deeply into the mathematics here. For now, just accept that we need to make the L2 norm of each vector equal to 1. Here is some code showing how this is done.
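As a quick worked example (not from the original walkthrough, just to make the arithmetic concrete): the vector [2, 0, 1] has an L2 norm of sqrt(2^2 + 0^2 + 1^2) = sqrt(5) ≈ 2.236, so the normalized vector is [2/2.236, 0, 1/2.236] ≈ [0.894, 0, 0.447], and the sum of the squares of its elements is again 1.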
import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))

print 'A regular old document term matrix: '
print np.matrix(doc_term_matrix)
print '\nA document term matrix with row-wise L2 norms of 1:'
print np.matrix(doc_term_matrix_l2)

# if you want to check this math, perform the following:
# from numpy import linalg as la
# la.norm(doc_term_matrix[0])
# la.norm(doc_term_matrix_l2[0])
A regular old document term matrix: 
[[2 0 1 0 0 2 0 1 0 1 1]
 [2 0 1 0 1 1 1 0 0 1 1]
 [0 1 0 1 1 0 0 0 1 1 1]]

A document term matrix with row-wise L2 norms of 1:
[[ 0.57735027  0.          0.28867513  0.          0.          0.57735027
   0.          0.28867513  0.          0.28867513  0.28867513]
 [ 0.63245553  0.          0.31622777  0.          0.31622777  0.31622777
   0.31622777  0.          0.          0.31622777  0.31622777]
 [ 0.          0.40824829  0.          0.40824829  0.40824829  0.          0.
   0.          0.40824829  0.40824829  0.40824829]]
Pretty good. Even without a deep understanding of linear algebra, you can see that we have scaled each vector so that all of its elements lie between 0 and 1, without losing too much valuable information. Notice also that a word with a count of 1 no longer has the same value in every vector.
Why do we care about this normalization? Consider the following: if you wanted to make a document look more relevant to a particular topic than it really is, you could simply repeat the same word over and over to increase the chance of matching that topic. Frankly, at some point that repetition degrades the informational value of the word. So we need to scale down the weight of words that appear very frequently within a document.
IDF Frequency Weighting
We still haven't quite got the result we want. Just as not all words within a single document are equally valuable, not all words are equally valuable across the whole corpus. We can try to adjust the weight of each word using its inverse document frequency (IDF), which reflects how widely a word is spread across the documents in the corpus. Let's take a look at what this involves:
def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / 1+df)

my_idf_vector = [idf(word, mydoclist) for word in vocabulary]

print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
print 'The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']'
Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
The inverse document frequency vector is [1.609438, 1.386294, 1.609438, 1.386294, 1.609438, 1.609438, 1.386294, 1.386294, 1.386294, 1.791759, 1.791759]
Now, for every word in the vocabulary, we have an information value that describes its relative frequency across the entire corpus. Recall that IDF is an "inverse" measure: the idea is that words appearing in many documents carry less information and should end up with smaller weights than rarer words.
We are almost at the result we want. To get the TF-IDF weighted term vectors, we just have to do a simple calculation: tf * idf.
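To make that concrete with the numbers we already have: in the first document, "loves" has a term frequency of 2 and, from the vector above, an IDF of 1.609438, so its unnormalized TF-IDF weight is 2 × 1.609438 ≈ 3.218876. After the L2 normalization we perform below, it becomes the 0.57211257 you will see in the output.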
Now let's take a step back and recall some linear algebra. If you multiply an A x B matrix by a B x A matrix, you get an A x A result (a scalar when A is 1). That is not what we want here: we want a term vector of the same dimension (1 x number of vocabulary words) in which every element has been scaled by its own IDF weight. We can get exactly that by multiplying the term frequency vector by a square matrix that holds the IDF values on its diagonal. How can we implement this in Python?
We could write a function to do all of this ourselves, but instead we'll use it as an opportunity to introduce numpy.
import numpy as np

def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

my_idf_matrix = build_idf_matrix(my_idf_vector)

#print my_idf_matrix
Great! Now we have converted the IDF vector into an N x N matrix (where N is the size of the vocabulary) with the IDF values on the diagonal. This means we can multiply each term frequency vector by the inverse document frequency matrix. Then, to make sure we also account for words that appear too frequently within a document, we will normalize each document's vector so that its L2 norm equals 1.
doc_term_matrix_tfidf = []

#performing tf-idf matrix multiplication
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))

#normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))

print vocabulary
print np.matrix(doc_term_matrix_tfidf_l2) # np.matrix() just to make it easier to look at

set(['me', 'basketball', 'Julie', 'baseball', 'likes', 'loves', 'Jane', 'Linda', 'He', 'than', 'more'])
[[ 0.57211257  0.          0.28605628  0.          0.          0.57211257
   0.          0.24639547  0.          0.31846153  0.31846153]
 [ 0.62558902  0.          0.31279451  0.          0.31279451  0.31279451
   0.26942653  0.          0.          0.34822873  0.34822873]
 [ 0.          0.36063612  0.          0.36063612  0.41868557  0.          0.
   0.          0.36063612  0.46611542  0.46611542]]
Great! You've just seen an example of how tedious it is to build a TF-IDF weighted document-term matrix by hand.
The best part is: You don't even need to manually calculate the above variables, just use scikit-learn.
Remember, everything in Python is an object, and objects take up memory and time to operate on. Using the scikit-learn package means you don't have to worry about the efficiency of all the preceding steps.
Note: the values you get from TfidfVectorizer/TfidfTransformer will differ from the ones we calculated by hand. This is because scikit-learn uses a smoothed variant of TF-IDF that, among other things, avoids division by zero (the scikit-learn documentation discusses this in more depth).
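As a rough illustration only (this is not the exact code scikit-learn runs, and the details can vary between versions), the smoothed IDF that scikit-learn documents for TfidfTransformer with smooth_idf=True looks roughly like this, reusing the numDocsContaining helper defined above:

    # a minimal sketch of scikit-learn's smoothed IDF (smooth_idf=True);
    # assumes numDocsContaining() and np are already defined above
    def smoothed_idf(word, doclist):
        n_samples = len(doclist)
        df = numDocsContaining(word, doclist)
        # add one to both counts, as if an extra document contained every term,
        # then add 1 so terms occurring in every document are not zeroed out entirely
        return np.log(float(1 + n_samples) / (1 + df)) + 1.0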
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print "Vocabulary:", count_vectorizer.vocabulary_

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print tf_idf_matrix.todense()

Vocabulary: {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.          0.          0.          0.          0.28945906  0.
   0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
 [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
   0.          0.3172591   0.6345182   0.24637999  0.24637999]
 [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
   0.          0.          0.          0.28561676  0.28561676]]
In fact, you can do all of these steps with a single class: TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)

print tfidf_matrix.todense()
[[ 0.          0.          0.          0.          0.28945906  0.
   0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
 [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
   0.          0.3172591   0.6345182   0.24637999  0.24637999]
 [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
   0.          0.          0.          0.28561676  0.28561676]]
In addition, we can use this vocabulary space to process new documents that we observe later, like this:
new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print tfidf_vectorizer.vocabulary_
print new_term_freq_matrix.todense()
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.
   0.4736296   0.          0.          0.        ]]
Note that the word "watches" does not appear anywhere in new_term_freq_matrix. This is because we trained on the documents in mydoclist, and "watches" never occurs in that corpus's vocabulary; in other words, it is outside our vocabulary dictionary.
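You can confirm this directly against the fitted vocabulary (a small illustrative check, not part of the original walkthrough):

    # 'watches' was never seen during fit_transform, so it has no column index
    print 'watches' in tfidf_vectorizer.vocabulary_      # False
    print 'basketball' in tfidf_vectorizer.vocabulary_   # True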
Back to the Amazon reviews
Exercise 2
Now it's time to try out what you've learned. Using TfidfVectorizer, try creating a TF-IDF weighted document-term matrix from the list of Amazon review text strings.
import os
import csv

#os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text1/fileformats/')

with open('amazon/sociology_2010.csv', 'rb') as csvfile:
    amazon_reader = csv.DictReader(csvfile, delimiter=',')
    amazon_reviews = [row['review_text'] for row in amazon_reader]

    #your code here!!!
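Try it yourself first. If you get stuck, here is one possible sketch (it assumes the amazon_reviews list loaded above, and is only one way to approach the exercise):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # fit a TF-IDF vectorizer on the review texts and inspect the resulting matrix
    amazon_tfidf_vectorizer = TfidfVectorizer(min_df=1)
    amazon_tfidf_matrix = amazon_tfidf_vectorizer.fit_transform(amazon_reviews)
    print amazon_tfidf_matrix.shape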