Application of TF-IDF and Cosine Similarity (Part 2): Finding Similar Articles

Last time, I used the TF-IDF algorithm to automatically extract keywords.
Today, let's look at another, related problem. Sometimes, in addition to finding keywords, we also want to find other articles similar to a given article. For example, "Google News" shows several similar stories beneath each main news item.
In order to find similar articles, we need to use "cosine similarity". Let me give an example to explain what "cosine similarity" is.
For the sake of simplicity, let's start with sentences.
Sentence A: I like watching TV, and don't like watching movies.
Sentence B: I don't like watching TV, and also don't like watching movies.
How can we calculate the similarity of these two sentences?
The basic idea is: the more similar the wording of two sentences, the more similar their content should be. Therefore, we can start from word frequency and calculate their degree of similarity.
The first step is word segmentation.
Sentence A: I / like / watch / TV, not / like / watch / movie.
Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.
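To make step 1 concrete, here is a minimal sketch in Python. The tokenize helper is a hypothetical name of mine; for English text a simple regex pass is enough, while text without spaces between words (such as Chinese) would need a dedicated segmenter like jieba.

```python
import re

def tokenize(sentence):
    # Hypothetical helper: lowercase the sentence and keep only
    # letters and apostrophes, dropping punctuation.
    return re.findall(r"[a-z']+", sentence.lower())

print(tokenize("I like watching TV, and don't like watching movies."))
# ['i', 'like', 'watching', 'tv', 'and', "don't", 'like', 'watching', 'movies']
```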
The second step is to list all the words.
I, like, watch, TV, movie, not, also.
The third step is to calculate the word frequency.
Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.
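Steps 2 and 3 map naturally onto Python's collections.Counter. A minimal sketch, assuming the hand-segmented token lists from step 1:

```python
from collections import Counter

# Token lists mirroring the hand-segmented sentences above.
tokens_a = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
tokens_b = ["I", "not", "like", "watch", "TV", "also", "not", "like", "watch", "movie"]

# Step 2: all words that appear in either sentence, in the article's order.
all_words = ["I", "like", "watch", "TV", "movie", "not", "also"]

# Step 3: count how often each word occurs in each sentence.
freq_a = Counter(tokens_a)
freq_b = Counter(tokens_b)

for word in all_words:
    print(f"{word}: {freq_a[word]} / {freq_b[word]}")
```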
The fourth step is to write out the word frequency vectors.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
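Continuing the sketch, step 4 just reads the counts off in a shared, fixed word order:

```python
from collections import Counter

tokens_a = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
tokens_b = ["I", "not", "like", "watch", "TV", "also", "not", "like", "watch", "movie"]
all_words = ["I", "like", "watch", "TV", "movie", "not", "also"]

freq_a = Counter(tokens_a)
freq_b = Counter(tokens_b)

# Step 4: one frequency per word, in the same order for both sentences.
vector_a = [freq_a[w] for w in all_words]
vector_b = [freq_b[w] for w in all_words]

print(vector_a)  # [1, 2, 2, 1, 1, 1, 0]
print(vector_b)  # [1, 2, 2, 1, 1, 2, 1]
```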
Here, the question becomes how to calculate the similarity between the two vectors.
We can think of them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle: if the angle is 0 degrees, the directions are identical and the segments coincide; if the angle is 90 degrees, the directions are completely unrelated; if the angle is 180 degrees, the directions are exactly opposite. Therefore, we can judge the similarity of the vectors by the size of the angle. The smaller the angle, the more similar the vectors.
Taking two-dimensional space as an example, suppose a and b are two vectors, and we want to calculate the angle θ between them. The law of cosines tells us that it can be found with the following formula (where a, b, and c are the side lengths of the triangle formed by the two vectors and the segment joining their endpoints):

cos θ = (a² + b² − c²) / (2ab)

Assuming vector a is [x1, y1] and vector b is [x2, y2], the law of cosines can be rewritten in the following form:

cos θ = (x1×x2 + y1×y2) / (√(x1² + y1²) × √(x2² + y2²))
Mathematicians have proven that this way of calculating the cosine also holds for n-dimensional vectors. Suppose A and B are two n-dimensional vectors, A is [A1, A2, ..., An] and B is [B1, B2, ..., Bn]; then the cosine of the angle θ between A and B equals:

cos θ = (A1×B1 + A2×B2 + ... + An×Bn) / (√(A1² + A2² + ... + An²) × √(B1² + B2² + ... + Bn²))
Using this formula, we can get the cosine of the angle between sentence A and sentence B:

cos θ = (1×1 + 2×2 + 2×2 + 1×1 + 1×1 + 1×2 + 0×1) / (√(1² + 2² + 2² + 1² + 1² + 1² + 0²) × √(1² + 2² + 2² + 1² + 1² + 2² + 1²)) = 13 / (√12 × √16) ≈ 0.938
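The same calculation as a small Python sketch (the cosine_similarity function is my own illustration, not a library call):

```python
import math

def cosine_similarity(v1, v2):
    # cos(theta) = sum(Ai*Bi) / (sqrt(sum(Ai^2)) * sqrt(sum(Bi^2)))
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

vector_a = [1, 2, 2, 1, 1, 1, 0]
vector_b = [1, 2, 2, 1, 1, 2, 1]

cos = cosine_similarity(vector_a, vector_b)
print(round(cos, 3))  # 0.938 -- an angle of roughly 20 degrees
```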
The closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are. This is what "cosine similarity" means. So sentence A and sentence B above are very similar; in fact, their angle is about 20.3 degrees.
From this, we get an algorithm for "finding similar articles":
(1) Using the TF-IDF algorithm, find the keywords of the two articles;
(2) Take a number of keywords from each article (say, 20), merge them into a single set, and compute each article's word frequency for the words in that set (to avoid bias from differences in article length, relative word frequencies can be used);
(3) Generate the word frequency vector of each article;
(4) Calculate the cosine similarity of the two vectors; the larger the value, the more similar the articles.
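Putting the four steps together, a minimal end-to-end sketch might look as follows. The top_keywords helper here is a stand-in that just takes the most frequent words; a real implementation would rank words by the TF-IDF scores from the previous post.

```python
import math
from collections import Counter

def top_keywords(text, n=20):
    # Stand-in for step (1): real code would rank words by TF-IDF.
    words = text.lower().split()
    return [w for w, _ in Counter(words).most_common(n)]

def article_similarity(text_a, text_b, n=20):
    # Step (2): merge each article's top keywords into one word set.
    word_set = sorted(set(top_keywords(text_a, n)) | set(top_keywords(text_b, n)))

    # Step (3): relative word frequency vectors over the merged set,
    # so that differences in article length do not dominate.
    def vector(text):
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return [counts[w] / total for w in word_set]

    va, vb = vector(text_a), vector(text_b)

    # Step (4): cosine similarity; the larger, the more similar.
    dot = sum(a * b for a, b in zip(va, vb))
    norm = math.sqrt(sum(a * a for a in va)) * math.sqrt(sum(b * b for b in vb))
    return dot / norm if norm else 0.0

print(article_similarity("I like watching TV and movies",
                         "I do not like watching TV or movies"))
```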
" Cosine similarity degree " is a very useful algorithm that can be used as long as it calculates the similarity of two vectors.
Next time, I want to talk about how to automatically generate a summary of an article on the basis of word frequency statistics.
(End)