A similarity measure quantifies how alike two individuals are: the smaller its value, the less similar the individuals; the larger its value, the more similar they are, that is, the smaller the difference between them.
To calculate the similarity between several texts or short messages, a good approach is to map the words of each text into a vector space, so that every text corresponds to a vector, and then to compare the vectors in order to obtain the similarity of the texts. The rest of this article describes one mature method in detail: cosine similarity in vector space.
Cosine similarity in vector space
Cosine similarity measures the difference between two individuals by the cosine of the angle between their two vectors in a vector space. The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; this is what is called "cosine similarity".
If the angle between two vectors A and B is very small, vector A can be said to be highly similar to vector B; in the extreme case, A and B coincide completely, as in the following figure:
As shown in Figure 2, vectors A and B can then be regarded as equal, that is, the texts they represent are exactly alike. The opposite case is when A and B form a large angle or point in opposite directions, as in the figure below:
As shown in Figure 3, when the angle between vectors A and B is very large, A has very low similarity to B; in other words, the texts they represent are essentially dissimilar. So can the similarity of two individuals be calculated from a function of the angle between their two vectors?
The theory of cosine similarity in vector space is a method for calculating the similarity of individuals that is built on exactly this idea. The detailed reasoning follows.
Recall the cosine formula. Its most basic form is the simple one taught in junior high school. For the angle θ in the right triangle of Figure 4, the cosine is:
$$\cos\theta = \frac{\text{adjacent side}}{\text{hypotenuse}} \qquad (1)$$
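But this applies only to right triangles. For a triangle that is not right-angled, the law of cosines is used instead (Figure 5).
Let a, b and c be the side lengths of the triangle, with c the side opposite the angle θ. The cosine of the angle between sides a and b is then:
$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab} \qquad (2)$$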
In vector terms, suppose two sides of the triangle are the vectors a = (x1, y1) and b = (x2, y2), as in Figure 6. The law of cosines can then be rewritten so that the cosine of the angle between vector a and vector b is:
$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}} \qquad (3)$$
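The step from the law of cosines to the vector form is not spelled out in the text. One way to fill it in, assuming the third side of the triangle is the difference vector c = a - b (an assumption about the missing figure), is the following short derivation:

```latex
\begin{aligned}
|a|^2 &= x_1^2 + y_1^2, \qquad |b|^2 = x_2^2 + y_2^2, \\
|c|^2 &= (x_1 - x_2)^2 + (y_1 - y_2)^2
       = |a|^2 + |b|^2 - 2\,(x_1 x_2 + y_1 y_2), \\
\cos\theta &= \frac{|a|^2 + |b|^2 - |c|^2}{2\,|a|\,|b|}
            = \frac{2\,(x_1 x_2 + y_1 y_2)}{2\,|a|\,|b|}
            = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}.
\end{aligned}
```

The squared terms cancel, leaving only the cross terms x1·x2 + y1·y2 in the numerator, which is the dot product of the two vectors.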
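This extends beyond two dimensions: if A and B are n-dimensional vectors, the same calculation still holds. Suppose A = (A_1, A_2, ..., A_n) and B = (B_1, B_2, ..., B_n); then the cosine of the angle θ between A and B is:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} = \frac{A \cdot B}{|A|\,|B|} \qquad (4)$$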
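The n-dimensional cosine can be sketched as a small Python function. This is a minimal illustration, not a reference implementation: the name `cosine_similarity` and the choice to raise an error for a zero vector (for which the angle is undefined) are assumptions made here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))        # numerator: sum of A_i * B_i
    norm_a = math.sqrt(sum(x * x for x in a))     # length of A
    norm_b = math.sqrt(sum(y * y for y in b))     # length of B
    if norm_a == 0 or norm_b == 0:
        # Assumption: a zero vector has no direction, so refuse to compare it.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2], [2, 4]))   # parallel vectors, value close to 1
print(cosine_similarity([1, 0], [0, 1]))   # perpendicular vectors, value 0
```

Because the result depends only on direction, scaling either vector by a positive factor leaves the value unchanged.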
The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; when the angle equals 0, the two vectors point in exactly the same direction. This is called "cosine similarity".
An example: using the cosine to compute text similarity
As an example, the theory above is now used to calculate the similarity of texts. For simplicity, start with single sentences.
Sentence A: This boot's size is big. That one's size is suitable.
Sentence B: This boot's size is not small; that one is more suitable.
How can the similarity of these two sentences be calculated?
The basic idea is that the more similar the words two sentences use, the more similar their content should be. We can therefore start from word frequencies and calculate the similarity from them.
Step 1: word segmentation.
Sentence A: this / boot / size / big. that / size / suitable.
Sentence B: this / boot / size / not / small, that / more / suitable.
Step 2: list all the words.
this, boot, size, big, that, suitable, not, small, more
Step 3: calculate the word frequencies.
Sentence A: this 1, boot 1, size 2, big 1, that 1, suitable 1, not 0, small 0, more 0.
Sentence B: this 1, boot 1, size 1, big 0, that 1, suitable 1, not 1, small 1, more 1.
Step 4: write the word frequency vectors.
Sentence A: (1,1,2,1,1,1,0,0,0)
Sentence B: (1,1,1,0,1,1,1,1,1)
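Steps two to four can be sketched in Python. The token lists below are segmented by hand and their English spellings are an illustrative assumption; real text would first go through a word segmenter. The helper names `build_vocabulary` and `frequency_vector` are likewise hypothetical.

```python
from collections import Counter

# Step 1 (done by hand here): the segmented words of each sentence.
tokens_a = ["this", "boot", "size", "big", "that", "size", "suitable"]
tokens_b = ["this", "boot", "size", "not", "small", "that", "more", "suitable"]


def build_vocabulary(*token_lists):
    """Step 2: list every distinct word once, in order of first appearance."""
    vocabulary = []
    seen = set()
    for tokens in token_lists:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


def frequency_vector(tokens, vocabulary):
    """Steps 3 and 4: count each vocabulary word and write the counts as a vector."""
    counts = Counter(tokens)          # missing words count as 0
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(tokens_a, tokens_b)
print(vocabulary)
print(frequency_vector(tokens_a, vocabulary))   # [1, 1, 2, 1, 1, 1, 0, 0, 0]
print(frequency_vector(tokens_b, vocabulary))   # [1, 1, 1, 0, 1, 1, 1, 1, 1]
```

Both vectors are built over the same vocabulary and in the same word order, which is what makes them comparable component by component.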
The question now becomes how to calculate the similarity of these two vectors. We can picture them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, the directions are the same and the segments coincide, which means the texts the two vectors represent are exactly alike; if the angle is 90 degrees, they form a right angle and the directions are completely dissimilar; if the angle is 180 degrees, the directions are opposite. We can therefore judge the similarity of the vectors by the size of the angle: the smaller the angle, the more similar they are.
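The three reference angles can be checked on small hand-picked vectors. The sketch below is only an illustration; the function name `angle_in_degrees` and the sample vectors are assumptions. Note that word frequency vectors never have negative components, so for them the angle always lies between 0 and 90 degrees.

```python
import math


def angle_in_degrees(a, b):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


print(angle_in_degrees([1, 0], [3, 0]))    # same direction
print(angle_in_degrees([1, 0], [0, 2]))    # right angle
print(angle_in_degrees([1, 0], [-1, 0]))   # opposite direction
```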
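Use formula (4) above to calculate the cosine of the angle between the two sentence vectors
Sentence A: (1,1,2,1,1,1,0,0,0)
and Sentence B: (1,1,1,0,1,1,1,1,1)
and so determine the similarity of the two sentences.
The calculation is as follows:
$$\cos\theta = \frac{1\cdot1 + 1\cdot1 + 2\cdot1 + 1\cdot0 + 1\cdot1 + 1\cdot1 + 0\cdot1 + 0\cdot1 + 0\cdot1}{\sqrt{1^2+1^2+2^2+1^2+1^2+1^2+0^2+0^2+0^2}\,\sqrt{1^2+1^2+1^2+0^2+1^2+1^2+1^2+1^2+1^2}} = \frac{6}{\sqrt{9}\,\sqrt{8}} \approx 0.71$$
The cosine of the angle comes out at about 0.71, which corresponds to an angle of 45 degrees, so sentence A and sentence B are fairly similar.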
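The arithmetic can be checked directly. The snippet below is a minimal sketch that evaluates formula (4) for the two sentence vectors listed above, keeping the intermediate values visible; the variable names are chosen here for illustration.

```python
import math

vector_a = [1, 1, 2, 1, 1, 1, 0, 0, 0]
vector_b = [1, 1, 1, 0, 1, 1, 1, 1, 1]

dot = sum(x * y for x, y in zip(vector_a, vector_b))   # 6
norm_a = math.sqrt(sum(x * x for x in vector_a))       # sqrt(9) = 3.0
norm_b = math.sqrt(sum(y * y for y in vector_b))       # sqrt(8)
cosine = dot / (norm_a * norm_b)
angle = math.degrees(math.acos(cosine))

print(f"cosine = {cosine:.4f}, angle = {angle:.1f} degrees")
# prints: cosine = 0.7071, angle = 45.0 degrees
```

Note that the component 2 in sentence A contributes 2 squared, that is 4, to the sum under the square root, so the length of that vector is 3.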
This gives the processing flow for calculating text similarity:
(1) Find the keywords of the two articles;
(2) Take several keywords from each article, merge them into one set, and calculate the frequency of each word in the set for each article;
(3) Generate the word frequency vector of each article;
(4) Calculate the cosine similarity of the two vectors; the larger the value, the more similar the articles.
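The four steps above can be sketched end to end. This is a rough illustration under stated assumptions: the article does not say how keywords are chosen, so the sketch simply takes the k most frequent tokens of each text; the names `top_keywords` and `text_similarity`, the default k = 20, and returning 0.0 for an empty text are all hypothetical choices. Input texts are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def top_keywords(tokens, k):
    """Pick the k most frequent tokens (a stand-in for real keyword extraction)."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def text_similarity(tokens_a, tokens_b, k=20):
    # (1)-(2) Take keywords from each text and merge them into one set.
    keywords = sorted(set(top_keywords(tokens_a, k)) | set(top_keywords(tokens_b, k)))
    # (2)-(3) Count each keyword in each text to get the two frequency vectors.
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vector_a = [counts_a[word] for word in keywords]
    vector_b = [counts_b[word] for word in keywords]
    # (4) Cosine similarity of the two vectors.
    dot = sum(x * y for x, y in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(x * x for x in vector_a))
    norm_b = math.sqrt(sum(y * y for y in vector_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is treated as sharing nothing
    return dot / (norm_a * norm_b)


print(text_similarity(["a", "b", "b", "c"], ["b", "c", "d"]))
```

Sorting the merged keyword set only fixes the order of the vector components; the cosine itself does not depend on that order.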
Reference article: http://blog.sina.com.cn/s/blog_4a6b27a30102vbr0.html