Python Implementation of VSM-Based Cosine Similarity Calculation


Entity alignment and attribute-value decisions during knowledge graph construction, judging whether an article matches your interests, and comparing the similarity of two articles all involve the Vector Space Model (VSM) and cosine similarity calculation.
This article first covers the theoretical background of VSM and cosine similarity, then walks through the well-known example from Ruan Yifeng's blog, and finally uses Python to compute the cosine similarity between the Baidu Encyclopedia and Interactive Encyclopedia InfoBoxes.

I. Basic Knowledge

For the first part, refer to my earlier article: Named Entity Recognition, Ambiguity Resolution, and Coreference Resolution Based on VSM.

Step 1: Vector Space Model (VSM)
The Vector Space Model (VSM) represents text by means of vectors: a document is described as a vector of keywords (terms).
Put simply, to decide whether an article matches your interests, the article is abstracted into a vector of n terms, each carrying a weight (term weight); different words influence the relevance score according to their weights in the document.
Document = {term1, term2, ..., termN}
Document Vector = {weight1, weight2, ..., weightN}


where ti (i = 1, ..., n) is a sequence of distinct terms, and wi(d) is the weight of term ti in document d.
When selecting feature words, you need to reduce the dimensionality by choosing representative feature words; the selection can be done manually or automatically.
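
To make this representation concrete, here is a minimal sketch of my own (not part of the original article) of a document as a term-weight vector in Python; the terms are made-up examples, and raw counts stand in for the weights:

# A document as a bag of terms with weights (VSM).
# Hypothetical terms, for illustration only.
document = ["knowledge", "graph", "entity", "graph"]

# Represent the document as {term: weight}; here the weight
# is simply the raw count of each term.
vector = {}
for term in document:
    vector[term] = vector.get(term, 0) + 1

print(vector)  # {'knowledge': 1, 'graph': 2, 'entity': 1}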
Step 2: TF-IDF
After feature extraction, each word needs to be assigned a weight, because different words contribute differently to the document. The standard way to weight the terms in the vector is TF-IDF.
TF-IDF is the product of TF (term frequency) and IDF (inverse document frequency):

TF-IDF(w, d) = TF(w, d) × IDF(w)

Term frequency (TF) is the number of times a feature word appears in the article divided by the total number of words in the article:

TF(w, d) = (number of occurrences of w in d) / (total number of words in d)

TF measures how frequently a keyword appears. IDF is the logarithm of the total number of documents divided by the number of documents containing the word:

IDF(w) = log(|D| / |Dw|)

where |D| is the total number of documents and |Dw| is the number of documents containing the word w.
Because words such as "is" and "this" appear in almost every document, the IDF factor is needed to reduce their weight. Dimensionality reduction here means cutting down the number of words considered in the document similarity calculation; the words commonly removed are function words and stop words. In fact, a dimensionality reduction strategy can often improve not only efficiency but also accuracy.
The greater the TF-IDF weight, the more important the term is to the text.
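
As a sketch of the two formulas above (my own illustration, with a hypothetical three-document toy corpus), TF-IDF can be computed like this:

import math

# Toy corpus: three tokenized documents (hypothetical data).
docs = [["this", "is", "a", "scenic", "spot"],
        ["this", "is", "a", "museum"],
        ["the", "palace", "museum", "is", "a", "museum"]]

def tf(word, doc):
    # term frequency: occurrences of word / total words in doc
    return doc.count(word) / float(len(doc))

def idf(word, docs):
    # inverse document frequency: log(|D| / number of docs containing word)
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / float(containing))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "museum" is rarer across the corpus than "is", so it gets a higher weight.
print(tf_idf("museum", docs[2], docs))  # ~0.135
print(tf_idf("is", docs[2], docs))      # 0.0 (appears in every document)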

Step 3: Cosine Similarity Calculation
To use this, you need a set of articles you like in order to compute the IDF values; then compute the weights of the n keywords of your favorite article, D = (w1, w2, ..., wn). Given a new article E, compute E = (q1, q2, ..., qn) in the same way, and then measure the similarity between D and E.
The similarity between two articles is described by the cosine of the angle between their two vectors. For texts D1 and D2 the similarity formula is:

sim(D1, D2) = cos θ = (D1 · D2) / (|D1| × |D2|)

The numerator is the dot product of the two vectors, and the denominator is the product of their moduli.
The calculation yields the similarity. We can also hand-pick two highly similar documents, compute their similarity, and use that to define a threshold. Similarly, for a class of articles we can average their vectors or take the centroid of the class for the calculation. The point is to reduce a linguistic problem to a mathematical one.
Disadvantages: the amount of computation is large, word weights must be recomputed whenever new texts are added, and associations between words are not considered; the cosine value should be taken as a reference for similarity rather than an absolute measure.
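
The formula above translates directly into a few lines of Python. This is a generic sketch of my own, not the author's code (the author's implementation appears in Part III):

import math

def cosine_similarity(v1, v2):
    # dot product over the product of the two vector norms
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1, 1, 0], [1, 1, 1]))  # ~0.816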

 

II. Example Explanation

This second part mainly follows an example from Ruan Yifeng's personal blog to explain how VSM achieves cosine similarity calculation; I strongly recommend reading his post: The Application of TF-IDF and Cosine Similarity.
This part is reprinted. Below is a simple example (the third part gives a more complex one):

Sentence A: I like watching TV, and do not like watching movies.

Sentence B: I do not like watching TV, and also do not like watching movies.

How can we calculate the similarity between the above two statements?
The basic idea is: the more similar two sentences' word usage is, the more similar the sentences are. Therefore, we can start from word frequency and compute their similarity.

Step 1: word segmentation.

Sentence A: I / like / watch / TV, not / like / watch / movie.

Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.

Step 2: List all words.

I, like, watch, TV, movie, not, also.

Step 3: Calculate the word frequency.

Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.

Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.

Step 4: write out the word frequency vector.

Sentence A: [1, 2, 2, 1, 1, 1, 0]

Sentence B: [1, 2, 2, 1, 1, 2, 1]

Here, the question is how to calculate the similarity between the two vectors.

Using the cosine formula, we obtain the cosine of the angle between sentence A and sentence B:

cos θ = (1×1 + 2×2 + 2×2 + 1×1 + 1×1 + 1×2 + 0×1) / (√12 × √16) = 13 / (√12 × 4) ≈ 0.938

The closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; this is called cosine similarity. So sentence A and sentence B above are very similar; in fact, their angle is only about 20.3 degrees.
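
For the skeptical reader, here is a quick check of these numbers (my own verification, not part of the reprinted text):

import math

a = [1, 2, 2, 1, 1, 1, 0]  # word-frequency vector of sentence A
b = [1, 2, 2, 1, 1, 2, 1]  # word-frequency vector of sentence B

dot = sum(x * y for x, y in zip(a, b))     # 13
norm_a = math.sqrt(sum(x * x for x in a))  # sqrt(12)
norm_b = math.sqrt(sum(y * y for y in b))  # sqrt(16) = 4
cos = dot / (norm_a * norm_b)

print(cos)                           # ~0.938
print(math.degrees(math.acos(cos)))  # ~20.3 degrees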
As a result, we obtain an algorithm for finding similar articles:

(1) Use the TF-IDF algorithm to find the keywords of the two articles;
(2) From each article, take several keywords (for example, 20), merge them into one set, and count each article's word frequency over this set (relative term frequency can be used to offset differences in article length);
(3) Generate the word frequency vectors of the two articles;
(4) Compute the cosine similarity between the two vectors; a larger value means the articles are more similar.

Cosine similarity is a very useful algorithm: whenever you want to measure how similar two vectors are, you can use it.
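
For completeness, the whole four-step pipeline can be expressed in a few lines if scikit-learn happens to be available. This is an alternative sketch of my own, not what the original article uses (Part III builds everything by hand):

# A sketch of steps (1)-(4) using scikit-learn (assumed installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = ["i like watching tv i do not like watching movies",
            "i do not like watching tv i also do not like watching movies"]

tfidf = TfidfVectorizer().fit_transform(articles)  # TF-IDF vectors
print(cosine_similarity(tfidf[0], tfidf[1]))       # 1x1 similarity matrix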

PS: This part is copied entirely from Ruan Yifeng's blog, because it really is easy to understand, and frankly it puts my own writing to shame. If reproducing it poses any copyright problem, I will delete it, and I recommend everyone read the original.
Ruan Yifeng's personal blog: http://www.ruanyifeng.com/home.html

 

III. Code Implementation

Finally, here is a brief explanation of how to implement, in Python, the similarity calculation between the InfoBox message boxes of Baidu Encyclopedia and Interactive Encyclopedia. For the crawler part, refer to my post:
[Python crawler] Selenium gets the InfoBox message box of Baidu encyclopedia Tourist Attractions



I crawled the InfoBox message boxes of all "national 5A scenic spots" with Selenium and segmented the text with an open-source word segmentation tool; the segmented "Forbidden City" data serves as the input for the code below.

The following code computes the message box similarity between "Baidu Encyclopedia - Forbidden City" and "Interactive Encyclopedia - Forbidden City". Basic steps:
1. Compute the keywords and counts of the two documents: read the txt files and count words in the CountKey() function.
2. Merge the keywords of the two articles into one set in the MergeKeys() function: identical words are merged, new words are appended.
3. For each article, compute a weight for every word in this set; the TF-IDF algorithm can be used, but here word frequency alone suffices.
4. Generate the word frequency vectors of the two articles.
5. Compute the cosine similarity between the two vectors; a larger value indicates more similar documents.
# -*- coding: utf-8 -*-
# (Python 2 script)
import math

''' Count keywords and their frequencies '''
def CountKey(fileName, resultName):
    try:
        # count the number of lines in the file
        lineNums = len(open(fileName, 'rU').readlines())
        print u'Number of lines in file: ' + str(lineNums)

        # results are written in the format <attribute:count>
        i = 0
        table = {}
        source = open(fileName, 'r')
        result = open(resultName, 'w')
        while i < lineNums:
            line = source.readline()
            line = line.rstrip('\n')
            words = line.split(' ')                    # words are separated by spaces
            print str(words).decode('string_escape')   # show Chinese inside the list
            # insert into the dictionary
            for word in words:
                if word != '' and table.has_key(word):  # already present: add 1
                    table[word] = table[word] + 1
                elif word != '':                        # otherwise initialize to 1
                    table[word] = 1
            i = i + 1

        # sort by value; prototype: sorted(dic, key, reverse)
        dic = sorted(table.iteritems(), key=lambda asd: asd[1], reverse=True)
        for i in range(len(dic)):
            result.write('<' + dic[i][0] + ':' + str(dic[i][1]) + '>\n')
        return dic
    except Exception, e:
        print 'Error:', e
    finally:
        source.close()
        result.close()
        print 'END'

''' Merge the keyword sets and compute the similarity '''
def MergeKeys(dic1, dic2):
    # merge the keywords of the two documents into one array
    arrayKey = []
    for i in range(len(dic1)):
        arrayKey.append(dic1[i][0])
    for i in range(len(dic2)):
        if dic2[i][0] in arrayKey:
            print 'has_key', dic2[i][0]
        else:  # new keyword: append it
            arrayKey.append(dic2[i][0])
    print str(arrayKey).decode('string_escape')  # character conversion for display

    # compute word frequencies; for the InfoBox, TF-IDF can be skipped
    arrayNum1 = [0] * len(arrayKey)
    arrayNum2 = [0] * len(arrayKey)

    # fill arrayNum1 with the frequencies from document 1
    for i in range(len(dic1)):
        key = dic1[i][0]
        value = dic1[i][1]
        j = 0
        while j < len(arrayKey):
            if key == arrayKey[j]:
                arrayNum1[j] = value
                break
            else:
                j = j + 1

    # fill arrayNum2 with the frequencies from document 2
    for i in range(len(dic2)):
        key = dic2[i][0]
        value = dic2[i][1]
        j = 0
        while j < len(arrayKey):
            if key == arrayKey[j]:
                arrayNum2[j] = value
                break
            else:
                j = j + 1

    print arrayNum1
    print arrayNum2
    print len(arrayNum1), len(arrayNum2), len(arrayKey)

    # dot product of the two vectors
    x = 0
    i = 0
    while i < len(arrayKey):
        x = x + arrayNum1[i] * arrayNum2[i]
        i = i + 1
    print x

    # squared modulus of each vector
    i = 0
    sq1 = 0
    while i < len(arrayKey):
        sq1 = sq1 + arrayNum1[i] * arrayNum1[i]
        i = i + 1
    print sq1

    i = 0
    sq2 = 0
    while i < len(arrayKey):
        sq2 = sq2 + arrayNum2[i] * arrayNum2[i]
        i = i + 1
    print sq2

    result = float(x) / (math.sqrt(sq1) * math.sqrt(sq2))
    return result

''' Main function: steps 1-5 described above '''
def main():
    # document 1 - Baidu: keywords and counts
    fileName1 = 'BaiduSpider.txt'
    resultName1 = 'Result_Key_BD.txt'
    dic1 = CountKey(fileName1, resultName1)

    # document 2 - Hudong: keywords and counts
    fileName2 = 'HudongSpider\\001.txt'
    resultName2 = 'HudongSpider\\Result_Key_001.txt'
    dic2 = CountKey(fileName2, resultName2)

    # merge the keywords of the two documents and compute the similarity
    result = MergeKeys(dic1, dic2)
    print result

if __name__ == '__main__':
    main()
 
Because only the similarity of InfoBox message boxes needs to be calculated, computing full TF-IDF values is unnecessary; plain word frequency serves as the weight. After adding a loop to the code, the similarity between Baidu Encyclopedia's "Forbidden City" and the various entities of Interactive Encyclopedia can be calculated. In the run results, the highest similarity is between "Beijing Forbidden City" and "Forbidden City", which amounts to a simple form of entity alignment.
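
The script above is Python 2. As a rough Python 3 equivalent (my own sketch, using collections.Counter rather than the author's hand-rolled loops; the file names are the ones from the script above):

# -*- coding: utf-8 -*-
import math
from collections import Counter

def count_key(filename):
    # read a pre-segmented text file and count word frequencies
    with open(filename, encoding='utf-8') as f:
        return Counter(f.read().split())

def cosine(c1, c2):
    # align the two frequency vectors over the merged key set
    keys = set(c1) | set(c2)
    dot = sum(c1[k] * c2[k] for k in keys)  # Counter returns 0 for missing keys
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

if __name__ == '__main__':
    dic1 = count_key('BaiduSpider.txt')
    dic2 = count_key('HudongSpider/001.txt')
    print(cosine(dic1, dic2))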


I hope this article helps you, especially the code section. If there are errors or shortcomings, please bear with me; after all, the author is still learning. If you have good methods or implementation code for entity alignment and attribute alignment, please recommend them to me, thanks.
Finally, here are some referenced and recommended articles on VSM and cosine similarity calculation:
The Application of TF-IDF and Cosine Similarity (1): Automatically Extracting Keywords
The Application of TF-IDF and Cosine Similarity (2): Finding Similar Articles
Lucene Learning: The VSM (Vector Space Model) Similarity Calculation Model
VSM Vector Space Model for Text Classification and a Simple Implementation in Java
Precision, Recall, and F-Measure - silence1214
Vector Space Model (VSM) - wyy_820211, NetEase blog
The Cosine Theorem Formula of the Vector Space Model (VSM) - live41
Implementing Document Similarity Calculation with the Vector Space Model (C#) - felomeng
A Brief Introduction to the Vector Space Model (VSM) in Document Similarity Calculation - felomeng
A Summary of Hidden Markov Model Learning - a123456ei
Vector Space Model VSM - ljiabin
 
