Reprinted from http://www.ruanyifeng.com/blog/
Last time, I used the TF-IDF algorithm to extract keywords automatically.
Today, let's look at another issue: sometimes, in addition to finding keywords, we also want to find other articles similar to a given one. For example, Google News displays several similar news items beneath the main story.
To find similar articles, we need "cosine similarity". Below, I will use an example to explain what cosine similarity is.
For the sake of simplicity, let's start with two simple sentences.
Sentence A: I like watching TV, I do not like watching movies.
Sentence B: I do not like watching TV, and I also do not like watching movies.
How can we calculate the similarity between these two sentences?
The basic idea is: the more similar their wording, the more similar the content of the two sentences should be. Therefore, we can start from word frequency and calculate their similarity.
Step 1: Word segmentation.
Sentence A: I / like / watch / TV, not / like / watch / movie.
Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.
Step 2: List all words.
I, like, watch, TV, movie, not, also.
Step 3: Calculate the word frequency.
Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.
Step 4: Write out the word frequency vectors.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
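For readers who want to follow along in code, here is a minimal sketch of steps 1 through 4 in Python (the tokenized sentences are written out by hand to mirror the example; in practice a word segmenter would produce them):

```python
from collections import Counter

# Step 1: word segmentation (done by hand here)
sentence_a = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
sentence_b = ["I", "not", "like", "watch", "TV", "also", "not", "like", "watch", "movie"]

# Step 2: list all words, fixing an order for the vector dimensions
vocabulary = ["I", "like", "watch", "TV", "movie", "not", "also"]

# Steps 3 and 4: count word frequencies and write them out as vectors
freq_a = Counter(sentence_a)
freq_b = Counter(sentence_b)
vector_a = [freq_a[word] for word in vocabulary]
vector_b = [freq_b[word] for word in vocabulary]

print(vector_a)  # [1, 2, 2, 1, 1, 1, 0]
print(vector_b)  # [1, 2, 2, 1, 1, 2, 1]
```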
Here, the question is how to calculate the similarity between the two vectors.
We can think of them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, they point in the same direction and the segments overlap; if the angle is 90 degrees, they form a right angle and point in completely unrelated directions; if the angle is 180 degrees, they point in opposite directions. We can therefore judge how similar two vectors are by their angle: the smaller the angle, the more similar they are.
Taking two-dimensional space as an example, A and B are two vectors, and we want to find the angle θ between them. The law of cosines tells us that it can be computed with the following formula (where a and b are the lengths of the two vectors, and c is the length of the side connecting their endpoints):

cosθ = (a² + b² − c²) / (2ab)
If vector A is [x1, y1] and vector B is [x2, y2], the law of cosines can be rewritten in the following form:

cosθ = (x1·x2 + y1·y2) / (√(x1² + y1²) × √(x2² + y2²))
Mathematicians have proved that this way of computing the cosine also holds for n-dimensional vectors. Assume A and B are two n-dimensional vectors, A is [a1, a2, ..., an] and B is [b1, b2, ..., bn]; then the cosine of the angle θ between A and B equals:

cosθ = (a1·b1 + a2·b2 + ... + an·bn) / (√(a1² + a2² + ... + an²) × √(b1² + b2² + ... + bn²))
Using this formula, we can obtain the cosine of the angle between sentence A and sentence B:

cosθ = (1×1 + 2×2 + 2×2 + 1×1 + 1×1 + 1×2 + 0×1) / (√(1²+2²+2²+1²+1²+1²+0²) × √(1²+2²+2²+1²+1²+2²+1²)) = 13 / (√12 × √16) ≈ 0.938
The closer the cosine value is to 1, the closer the angle is to 0 degrees, i.e., the more similar the two vectors are; this is called "cosine similarity". So sentence A and sentence B above are very similar; in fact, their angle is only about 20.3 degrees.
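This calculation is easy to check programmatically. A small Python function, applied to the two word frequency vectors above, reproduces both the cosine value and the 20.3-degree angle:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vector_a = [1, 2, 2, 1, 1, 1, 0]
vector_b = [1, 2, 2, 1, 1, 2, 1]

cos = cosine_similarity(vector_a, vector_b)
print(round(cos, 3))                           # 0.938
print(round(math.degrees(math.acos(cos)), 1))  # 20.3
```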
As a result, we get an algorithm for "finding similar articles":
(1) Use the TF-IDF algorithm to find the keywords of the two articles;
(2) Take a number of keywords from each article (say, 20) and merge them into one set, then compute each article's word frequency over this set (relative term frequency can be used, to neutralize differences in article length);
(3) Generate the word frequency vectors of the two articles;
(4) Calculate the cosine similarity of the two vectors; the larger the value, the more similar the articles (a rough sketch in code follows below).
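Putting the four steps together, an end-to-end sketch might look like the following. Note one simplification that is my assumption, not part of the article: `top_words` just takes the k most frequent words, standing in for the TF-IDF keyword extraction covered last time.

```python
import math
from collections import Counter

def top_words(tokens, k=20):
    # Simplified stand-in for TF-IDF keyword extraction:
    # take the k most frequent words in the article.
    return [word for word, _ in Counter(tokens).most_common(k)]

def relative_tf(word, tokens):
    # Relative term frequency, to neutralize differences in article length.
    return tokens.count(word) / len(tokens)

def article_similarity(tokens_a, tokens_b, k=20):
    # (1)+(2): extract each article's keywords and merge them into one set
    vocabulary = sorted(set(top_words(tokens_a, k)) | set(top_words(tokens_b, k)))
    # (2)+(3): build each article's word frequency vector over that set
    vec_a = [relative_tf(w, tokens_a) for w in vocabulary]
    vec_b = [relative_tf(w, tokens_b) for w in vocabulary]
    # (4): cosine similarity of the two vectors
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(x * x for x in vec_b))
    return dot / (norm_a * norm_b)
```

Each article is passed in as a list of tokens, i.e., the output of the word segmentation step.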
Cosine similarity is a very useful algorithm: whenever you need to measure how similar two vectors are, you can use it.
Next time, I want to talk about how to automatically generate a summary of an article based on word frequency statistics.