We need to start thinking about how to translate a collection of texts into something quantifiable. The easiest way to do this is to consider word frequency.
To begin with, I will avoid the NLTK and Scikit-learn packages and use plain Python to explain some basic concepts.
Basic frequency
First, let's review how to count the occurrences of each word in each document: a term frequency vector.
#examples taken from here: http://stackoverflow.com/a/1750187
mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'he likes basketball more than baseball']
#mydoclist = ['sun sky bright', 'sun sun bright']

from collections import Counter

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print tf.items()
[('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]
[('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]
[('basketball', 1), ('baseball', 1), ('likes', 1), ('he', 1), ('than', 1), ('more', 1)]
Here we introduce a new Python object, the Counter. Counter objects are only available in Python 2.7 and later. They are very flexible, and you can use them to do the work of counting inside a loop.
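As a quick aside (this little snippet is mine, not part of the original example), a Counter can also be built directly from any iterable, and it can report the most common items for you:

from collections import Counter

tf = Counter('Julie loves me more than Linda loves me'.split())
print tf.most_common(2)  # e.g. [('loves', 2), ('me', 2)] -- the two highest counts (tie order may vary)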
Counting the words in each document gives us a first attempt at quantifying the documents. But for anyone who has studied the concept of a "vector" in the vector space model, these first results cannot be compared with one another, because they do not live in the same vocabulary space.
What we really want is a quantified result of the same length for every document, where that length is determined by the total vocabulary of our corpus.
import string #allows for format()

def build_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    return document.split().count(term)

vocabulary = build_lexicon(mydoclist)

doc_term_matrix = []
print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
for doc in mydoclist:
    print 'The doc is "' + doc + '"'
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)
    print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string)
    doc_term_matrix.append(tf_vector)
    # here's a test: why did I wrap mydoclist.index(doc)+1 in parens?  it returns an int...
    # try it!  type(mydoclist.index(doc) + 1)

print 'All combined, here is our master document term matrix: '
print doc_term_matrix
Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, he, than, more]
The doc is "Julie loves me more than Linda loves me"
The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
The doc is "Jane likes me more than Julie loves me"
The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
The doc is "he likes basketball more than baseball"
The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
All combined, here is our master document term matrix:
[[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]
Well, this seems quite reasonable. If you have any experience with machine learning, what you just saw was the creation of a feature space. Now every document is in the same feature space, which means we can represent the entire corpus in a space of the same dimensions without losing too much information.
Normalizing vectors so that the L2 norm equals 1
Once you have the data in the same feature space, you can start applying machine learning methods: classification, clustering, and so on. But there is still a problem: words do not all carry the same amount of information.
If some words appear too frequently in a single document, they will disrupt our analysis. We want to scale each term frequency vector so that it becomes more representative; in other words, we need to normalize the vectors.
We really don't have time to go deeply into the math here. For now, just accept that we need to make the L2 norm of each vector equal to 1. Here is some code showing how this is done.
import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))

print 'A regular old document term matrix: '
print np.matrix(doc_term_matrix)
print '\nA document term matrix with row-wise L2 norms of 1: '
print np.matrix(doc_term_matrix_l2)

# if you want to check this math, perform the following:
# from numpy import linalg as la
# la.norm(doc_term_matrix[0])
# la.norm(doc_term_matrix_l2[0])
A regular old document term matrix:
[[2 0 1 0 0 2 0 1 0 1 1]
 [2 0 1 0 1 1 1 0 0 1 1]
 [0 1 0 1 1 0 0 0 1 1 1]]

A document term matrix with row-wise L2 norms of 1:
[[ 0.57735027  0.          0.28867513  0.          0.          0.57735027
   0.          0.28867513  0.          0.28867513  0.28867513]
 [ 0.63245553  0.          0.31622777  0.          0.31622777  0.31622777
   0.31622777  0.          0.          0.31622777  0.31622777]
 [ 0.          0.40824829  0.          0.40824829  0.40824829  0.
   0.          0.          0.40824829  0.40824829  0.40824829]]
Well, without too much linear algebra you can see right away that we scaled the vectors down proportionally, so that each element lies between 0 and 1, without losing much valuable information. Notice also that a word with a count of 1 no longer has the same value in one vector as it does in another.
Why do we care about this normalization? Consider this: if you wanted a document to look more relevant to a particular topic than it actually is, you could try to increase its chances of being associated with that topic simply by repeating the same word over and over. Frankly, at a certain point each additional occurrence of a word carries diminishing value. So we need to scale down the values of words that appear very frequently in a document.
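To see the dampening concretely, here is a tiny sketch (not part of the original walkthrough) that reuses the l2_normalizer defined above on a hypothetical three-word vocabulary:

print l2_normalizer([1, 1, 1])   # [0.5773..., 0.5773..., 0.5773...]
print l2_normalizer([10, 1, 1])  # [0.9901..., 0.0990..., 0.0990...]
# both normalized vectors have length 1: repeating one word can only redistribute
# weight within the document, it cannot inflate the document's overall magnitude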
IDF frequency weighting
We still haven't quite got the result we want. Just as not all words within a single document are equally valuable, not all words are valuable across all documents. We try to adjust each word's weight using its inverse document frequency (IDF). Let's take a look at what this involves:
def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    # note: because of operator precedence this evaluates as log(n_samples + df),
    # which is what produces the values printed below; the more conventional
    # smoothed form would be np.log(n_samples / (1.0 + df))
    return np.log(n_samples/1+df)
my_idf_vector = [idf(word, mydoclist) for word in vocabulary]

print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
print 'The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']'
Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, he, than, more]
The inverse document frequency vector is [1.609438, 1.386294, 1.609438, 1.386294, 1.609438, 1.609438, 1.386294, 1.386294, 1.386294, 1.791759, 1.791759]
Now, for every word in the vocabulary, we have an information value in the general sense, one that reflects the word's relative frequency across the entire corpus. Recall that this value is an "inverse": the smaller the value, the more frequently the word appears in the corpus.
We are getting close to the result we want. To obtain the TF-IDF weighted word vectors, you just have to do a simple calculation: TF * IDF.
Now let's take a step back and think about it. Think back to linear algebra: if you take the dot product of two vectors of the same length, you get a single scalar. That is not what we want here, because we want a word vector of the same dimension (1 x number of words) in which every element has been weighted by its own IDF weight. How do we implement such a calculation in Python?
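The trick, sketched below with made-up numbers rather than our corpus, is to put the IDF weights on the diagonal of a square matrix: multiplying a 1 x n term frequency vector by that n x n diagonal matrix scales each element by its own weight and gives back a 1 x n vector.

import numpy as np

tf_example = np.array([2, 0, 1])        # a hypothetical 1 x 3 term frequency vector
idf_example = np.diag([1.6, 1.4, 1.8])  # hypothetical IDF weights on the diagonal
print np.dot(tf_example, idf_example)   # [ 3.2  0.   1.8] -- each count scaled by its own IDF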
We could write the complete function here by hand, but instead we will take this opportunity to introduce NumPy.
import numpy as np

def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

my_idf_matrix = build_idf_matrix(my_idf_vector)

#print my_idf_matrix
That's great! Now we have converted the IDF vector into a B x B matrix whose diagonal is the IDF vector. That means we can multiply each term frequency vector by this inverse document frequency matrix. Then, to make sure we also account for words that appear too frequently within a document, we will normalize each document's vector so that its L2 norm equals 1.
doc_term_matrix_tfidf = []

#performing tf-idf matrix multiplication
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))

#normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))

print vocabulary
print np.matrix(doc_term_matrix_tfidf_l2) # np.matrix() just to make it easier to look at
set(['me', 'basketball', 'Julie', 'baseball', 'likes', 'loves', 'Jane', 'Linda', 'he', 'than', 'more'])
[[ 0.57211257  0.          0.28605628  0.          0.          0.57211257
   0.          0.24639547  0.          0.31846153  0.31846153]
 [ 0.62558902  0.          0.31279451  0.          0.31279451  0.31279451
   0.26942653  0.          0.          0.34822873  0.34822873]
 [ 0.          0.36063612  0.          0.36063612  0.41868557  0.
   0.          0.          0.36063612  0.46611542  0.46611542]]
That's great! You've just seen an example of how to create a TF-IDF weighted document-term matrix.
Now for the best part: you don't even need to compute the quantities above by hand; you can use Scikit-learn.
Remember, everything in Python is an object, objects take up memory, and operations on objects take time. Using the Scikit-learn package means you don't have to worry about the efficiency of all the previous steps.
Note: the values you get from TfidfVectorizer/TfidfTransformer will differ from those we computed by hand. This is because Scikit-learn uses a smoothed variant of TF-IDF to avoid zero-division problems; there are more in-depth discussions of this elsewhere.
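For reference, recent versions of scikit-learn document the smoothed formula (with smooth_idf=True, the default) as idf(t) = ln((1 + n) / (1 + df(t))) + 1, with the resulting rows L2-normalized afterwards. Here is a rough sketch of that calculation; it is only an illustration, not a re-implementation of the library internals, and older releases may differ slightly:

import numpy as np

def sklearn_style_idf(word, doclist):
    # smoothed IDF as documented for recent scikit-learn versions
    n = len(doclist)
    df = sum(1 for doc in doclist if word in doc.split())
    return np.log((1.0 + n) / (1.0 + df)) + 1.0

print sklearn_style_idf('me', mydoclist)    # appears in 2 of 3 docs
print sklearn_style_idf('than', mydoclist)  # appears in all 3 docs, so it gets a lower weight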
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print "Vocabulary:", count_vectorizer.vocabulary_

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print tf_idf_matrix.todense()
Vocabulary: {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.          0.          0.          0.          0.28945906  0.
   0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
 [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
   0.          0.3172591   0.6345182   0.24637999  0.24637999]
 [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
   0.          0.          0.          0.28561676  0.28561676]]
In fact, you can do all of these steps with a single class: TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)

print tfidf_matrix.todense()
[[ 0.          0.          0.          0.          0.28945906  0.
   0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
 [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
   0.          0.3172591   0.6345182   0.24637999  0.24637999]
 [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
   0.          0.          0.          0.28561676  0.28561676]]
And we can use this vocabulary space to process new, previously unseen documents, like this:
new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print tfidf_vectorizer.vocabulary_
print new_term_freq_matrix.todense()
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.
   0.          0.4736296   0.          0.          0.        ]]
Please note that the word "watches" does not appear in new_term_freq_matrix. This is because the documents we trained on are those in mydoclist, and "watches" does not occur in that corpus's vocabulary; in other words, it is outside our vocabulary dictionary.
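If you did want new words such as "watches" to get columns of their own, one option (just a sketch, not something this walkthrough goes on to do) is to refit the vectorizer on the combined corpus:

# hypothetical: refit on the old and new documents together so that every word,
# including 'watches', ends up in the vocabulary
combined_vectorizer = TfidfVectorizer(min_df=1)
combined_matrix = combined_vectorizer.fit_transform(mydoclist + new_docs)
print 'watches' in combined_vectorizer.vocabulary_  # True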
Back to the Amazon review text
Exercise 2
Now it's time to try out what you've learned. Using TfidfVectorizer, try to build a TF-IDF weighted document-term matrix from the list of strings in the Amazon review text.
import os
import csv

#os.chdir('/users/rweiss/dropbox/presentations/iriss2013/text1/fileformats/')

with open('amazon/sociology_2010.csv', 'rb') as csvfile:
    amazon_reader = csv.DictReader(csvfile, delimiter=',')
    amazon_reviews = [row['review_text'] for row in amazon_reader]

#your code here!!!
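If you want to check your work afterwards, here is one minimal sketch of how the exercise could be approached, assuming amazon_reviews is the list of review strings built above (this is not the only possible solution):

from sklearn.feature_extraction.text import TfidfVectorizer

# build a TF-IDF weighted document-term matrix over the review texts
amazon_vectorizer = TfidfVectorizer(min_df=1)
amazon_tfidf_matrix = amazon_vectorizer.fit_transform(amazon_reviews)

print amazon_tfidf_matrix.shape           # (number of reviews, vocabulary size)
print len(amazon_vectorizer.vocabulary_)  # how many distinct terms were found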