Many people have already written about this problem, for example in The Beauty of Mathematics, and Ruan Yifeng recently covered it on his blog. This article basically follows his ideas, though it is a more modest effort. To put it bluntly, it describes the same thing in our own words. Along the way we expand on a few points and fill in a large step that the original skips. Ruan Yifeng's original text: http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
Of course, although the problem is about comparing two articles, we are not foolish enough to illustrate similarity with two whole articles. For the sake of simplicity, we start with sentences.
Sentence A: Jay Chou is a singer, and also a cross
Sentence B: Jay Chou is not a cross, but a singer
How do we compare their similarity?
Step 1: Word Segmentation
Sentence A: Jay Chou / is / a / singer / also / is / a / cross (Note: assume that "cross" is also a word the segmenter can recognize.)
Sentence B: Jay Chou / not / is / a / cross / but / is / a / singer
Step 2: List all recognized words, without duplicates
Jay Chou, is, not, a, cross, singer, but, also
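Steps 1 and 2 can be sketched in Python. This is only a minimal sketch: it assumes the segmentation has already been done by hand (the token lists are typed in from Step 1), and the names `tokens_a`, `tokens_b`, and `build_vocabulary` are made up for illustration. The word order it produces differs from the list above, which does not matter as long as both sentences are later counted in the same order.

```python
def build_vocabulary(*token_lists):
    """Return every distinct token once, in first-seen order."""
    seen = {}
    for tokens in token_lists:
        for token in tokens:
            seen.setdefault(token, None)  # dicts keep insertion order
    return list(seen)


# Hand-segmented sentences (assumption: a real segmenter would produce these).
tokens_a = ["Jay Chou", "is", "a", "singer", "also", "is", "a", "cross"]
tokens_b = ["Jay Chou", "not", "is", "a", "cross", "but", "is", "a", "singer"]

vocabulary = build_vocabulary(tokens_a, tokens_b)
print(vocabulary)
```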
Step 3: Calculate Word Frequency (the number of times each word appears in a sentence)
Sentence A: Jay Chou 1, is 2, not 0, a 2, cross 1, singer 1, but 0, also 1
Sentence B: Jay Chou 1, is 2, not 1, a 2, cross 1, singer 1, but 1, also 0
Step 4: Construct Word Frequency Vectors
Sentence A: [1, 2, 0, 2, 1, 1, 0, 1]
Sentence B: [1, 2, 1, 2, 1, 1, 1, 0]
We constructed two multi-dimensional vectors. The value of each dimension is word frequency.
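Steps 3 and 4 amount to counting each listed word in each sentence. Here is a minimal sketch using `collections.Counter` from the standard library, assuming the hand-segmented token lists from Step 1 and the word order from Step 2. The names `VOCABULARY` and `frequency_vector` are made up for illustration.

```python
from collections import Counter

# Word order from Step 2 (assumption: both sentences share this order).
VOCABULARY = ["Jay Chou", "is", "not", "a", "cross", "singer", "but", "also"]


def frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]  # absent words count as 0


tokens_a = ["Jay Chou", "is", "a", "singer", "also", "is", "a", "cross"]
tokens_b = ["Jay Chou", "not", "is", "a", "cross", "but", "is", "a", "singer"]

vector_a = frequency_vector(tokens_a, VOCABULARY)
vector_b = frequency_vector(tokens_b, VOCABULARY)
print(vector_a)  # [1, 2, 0, 2, 1, 1, 0, 1]
print(vector_b)  # [1, 2, 1, 2, 1, 1, 1, 0]
```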
Okay. Now that the two multi-dimensional vectors are constructed, the similarity between the two sentences becomes the similarity between two vectors. Any problem becomes solvable once it is turned into a mathematical problem. :)
So how do we compare the similarity of two vectors? First, let's review some high school mathematics.
Picture the geometric representation of two two-dimensional vectors A and B drawn from the same origin, with θ the angle between them. If the angle is 0 degrees, the vectors point in the same direction and their line segments overlap. If the angle is 90 degrees, they form a right angle and their directions are completely different. If the angle is 180 degrees, they point in opposite directions. Therefore, we can judge the similarity of two vectors by the angle between them: the smaller the angle, the more similar they are.
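To make the angle idea concrete, here is a small sketch that measures the angle between two two-dimensional vectors with `math.atan2`. The helper name `angle_between_degrees` and the sample vectors are illustrative assumptions, and the sketch assumes neither vector is the zero vector.

```python
import math


def angle_between_degrees(a, b):
    """Angle between 2D vectors a and b, in degrees, within [0, 180]."""
    # Direction of each vector measured from the positive x axis.
    diff = abs(math.atan2(b[1], b[0]) - math.atan2(a[1], a[0]))
    if diff > math.pi:  # always take the smaller of the two angles
        diff = 2 * math.pi - diff
    return math.degrees(diff)


print(angle_between_degrees((1, 0), (2, 0)))   # same direction
print(angle_between_degrees((1, 0), (0, 3)))   # right angle
print(angle_between_degrees((1, 0), (-1, 0)))  # opposite direction
```

The length of each vector plays no role here; only the direction matters, which is why a short sentence and a long article can still be compared.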
Another piece of junior high school knowledge is the law of cosines (it should be junior high school).
Assume that vector A is [x1, y1] and vector B is [x2, y2].
Then the law of cosines can be rewritten in the following form:
cos θ = (x1x2 + y1y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))
The original text gives this result directly. That is a bit of a jump and feels abrupt, mainly because the derivation is omitted. The derivation is admittedly a small matter, but our goal is to actually solve the problem, so we fill it in below. Because I can't draw figures, I use computer notation directly, which is not very intuitive.
Here a and b are the lengths of vectors A and B, and c is the length of the third side of the triangle, from the tip of A to the tip of B.
cos θ = (a^2 + b^2 - c^2) / (2ab)
a^2 = x1^2 + y1^2
b^2 = x2^2 + y2^2
c^2 = (x2 - x1)^2 + (y2 - y1)^2 (this is not hard to understand)
=> cos θ = ((x1^2 + y1^2) + (x2^2 + y2^2) - ((x2 - x1)^2 + (y2 - y1)^2)) / (2 * sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))
=> cos θ = (x1^2 + y1^2 + x2^2 + y2^2 - x2^2 - x1^2 + 2x1x2 - y2^2 - y1^2 + 2y1y2) / (2 * sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))
=> cos θ = (2x1x2 + 2y1y2) / (2 * sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))
=> cos θ = (x1x2 + y1y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))
In this way, the result is derived.
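The derivation can be checked numerically. The sketch below computes cos θ once from the three side lengths (the law of cosines) and once from the coordinates alone. The function names and sample vectors are made up for illustration, and non-zero vectors are assumed.

```python
import math


def cos_from_side_lengths(x1, y1, x2, y2):
    """Law of cosines on the triangle with sides a, b, c."""
    a = math.hypot(x1, y1)            # length of vector A
    b = math.hypot(x2, y2)            # length of vector B
    c = math.hypot(x2 - x1, y2 - y1)  # side joining the two tips
    return (a**2 + b**2 - c**2) / (2 * a * b)


def cos_from_coordinates(x1, y1, x2, y2):
    """Last line of the derivation, written with coordinates only."""
    return (x1 * x2 + y1 * y2) / (math.hypot(x1, y1) * math.hypot(x2, y2))


for vectors in [(3, 4, 4, 3), (1, 0, 0, 2), (2, 1, -2, -1)]:
    print(vectors, cos_from_side_lengths(*vectors), cos_from_coordinates(*vectors))
```

Both columns should agree up to floating-point rounding, which is a quick sanity check rather than a proof.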
Now let's summarize the law of cosines in the case of two-dimensional vectors.
For two two-dimensional vectors [x1, y1] and [x2, y2], we have
cos θ = (x1x2 + y1y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))
If we rename [x1, y1] and [x2, y2] to [a1, a2] and [b1, b2], then
for two two-dimensional vectors [a1, a2] and [b1, b2] we have
cos θ = (a1b1 + a2b2) / (sqrt(a1^2 + a2^2) * sqrt(b1^2 + b2^2))
It has been proved mathematically that this way of computing the cosine also holds when it is extended to multi-dimensional vectors.
Now add one more dimension. Suppose there are two three-dimensional vectors, [a1, a2, a3] and [b1, b2, b3]. What does the formula look like?
cos θ = (a1b1 + a2b2 + a3b3) / (sqrt(a1^2 + a2^2 + a3^2) * sqrt(b1^2 + b2^2 + b3^2))
So what about n dimensions?
cos θ = (a1b1 + a2b2 + a3b3 + ... + anbn) / (sqrt(a1^2 + a2^2 + a3^2 + ... + an^2) * sqrt(b1^2 + b2^2 + b3^2 + ... + bn^2))
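Written with summation signs, it may be more intuitive and concise:
cos θ = Σ(ai * bi) / (sqrt(Σ ai^2) * sqrt(Σ bi^2)), with i running from 1 to n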
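The n-dimensional formula translates almost word for word into code. This is a minimal sketch in plain Python, not a tuned implementation. The name `cosine_similarity` and the error handling for mismatched lengths and zero vectors are my own assumptions, since the text does not discuss those cases.

```python
import math


def cosine_similarity(a, b):
    """cos θ between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))     # a1b1 + a2b2 + ... + anbn
    norm_a = math.sqrt(sum(x * x for x in a))  # sqrt(a1^2 + ... + an^2)
    norm_b = math.sqrt(sum(y * y for y in b))  # sqrt(b1^2 + ... + bn^2)
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction, close to 1
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # right angle, 0
```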
With this formula, the rest is easy. Our two sentences are
Sentence A: [1, 2, 0, 2, 1, 1, 0, 1]
Sentence B: [1, 2, 1, 2, 1, 1, 1, 0]
What is the cosine of the angle between them?
cos θ = (1*1 + 2*2 + 0*1 + 2*2 + 1*1 + 1*1 + 0*1 + 1*0) / (sqrt(1^2 + 2^2 + 0^2 + 2^2 + 1^2 + 1^2 + 0^2 + 1^2) * sqrt(1^2 + 2^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2 + 0^2))
=> cos θ = 11 / (sqrt(12) * sqrt(13)) ≈ 0.881
This is relatively high.
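The arithmetic can be reproduced step by step. The sketch below uses the vectors that follow from the Step 3 word counts, where Sentence B has "but" once and "also" zero times. The variable names are illustrative.

```python
import math

# Word order: Jay Chou, is, not, a, cross, singer, but, also
vector_a = [1, 2, 0, 2, 1, 1, 0, 1]
vector_b = [1, 2, 1, 2, 1, 1, 1, 0]

dot = sum(x * y for x, y in zip(vector_a, vector_b))  # numerator
squares_a = sum(x * x for x in vector_a)              # under the first sqrt
squares_b = sum(y * y for y in vector_b)              # under the second sqrt
cos_theta = dot / (math.sqrt(squares_a) * math.sqrt(squares_b))

print(dot, squares_a, squares_b)  # 11 12 13
print(round(cos_theta, 3))        # 0.881
```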