[How to compare similarity between two articles]

Source: Internet
Author: User

In fact, many people have written about this question: it is covered in The Beauty of Mathematics, and more recently on Ruan Yifeng's blog. This article basically follows his approach, just in smaller steps. To put it bluntly, it describes the same thing in my own words, expands on a few points along the way, and fills in one large jump in the reasoning. Ruan Yifeng's original text: http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html

Of course, although the question is about comparing two articles, we are not going to use two whole articles to illustrate similarity. For the sake of simplicity, we start with sentences.

Sentence A: Jay Chou is a singer, and also a cross.

Sentence B: Jay Chou is not a cross, but a singer.

How to compare similarity?

Step 1 Word Segmentation

 
Sentence A: Jay Chou/is/a/singer, also/is/a/cross (Note: assume that the word segmenter also recognizes "cross" as a word.)

Sentence B: Jay Chou/not/is/a/cross, but/is/a/singer

Step 2 Remove duplicates and list all recognized words

 
Jay Chou, is, not, a, cross, singer, but, also

Step 3 Calculate Word Frequency (the number of times each word appears in a sentence)

 
Sentence A: Jay Chou 1, is 2, not 0, a 2, cross 1, singer 1, but 0, also 1

Sentence B: Jay Chou 1, is 2, not 1, a 2, cross 1, singer 1, but 1, also 0

Step 4: Construct Word Frequency Vectors

 
Sentence A: [1, 2, 0, 2, 1, 1, 0, 1]

Sentence B: [1, 2, 1, 2, 1, 1, 1, 0]

We constructed two multi-dimensional vectors. The value of each dimension is word frequency.
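Steps 2 to 4 can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the sentences are already segmented (the token lists are typed in by hand), and the names build_vocabulary and frequency_vector are my own, not from the original text. Note that the vocabulary here is ordered by first appearance, so the dimensions come out in a different order than in the list above. That does not matter as long as both vectors use the same order.

```python
from collections import Counter

# Step 1 is assumed to be done already: these are the segmented sentences.
tokens_a = ["Jay Chou", "is", "a", "singer", "also", "is", "a", "cross"]
tokens_b = ["Jay Chou", "not", "is", "a", "cross", "but", "is", "a", "singer"]


def build_vocabulary(*token_lists):
    """Step 2: list every distinct word once, in order of first appearance."""
    vocabulary = []
    seen = set()
    for tokens in token_lists:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


def frequency_vector(tokens, vocabulary):
    """Steps 3 and 4: count each vocabulary word and lay the counts out as a vector."""
    counts = Counter(tokens)  # a Counter returns 0 for words it has never seen
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(tokens_a, tokens_b)
vector_a = frequency_vector(tokens_a, vocabulary)
vector_b = frequency_vector(tokens_b, vocabulary)

print(vocabulary)
print(vector_a)
print(vector_b)
```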

Okay. Now that the two multi-dimensional vectors are constructed, the similarity between the two sentences becomes the similarity between the two vectors. Any problem can be solved once it becomes a mathematical problem. :)

 

So how can we compare the similarity between two vectors? First, let's review some high school mathematics.

 

Consider the geometric representation of two two-dimensional vectors, A and B, where θ is the angle between them. If the angle is 0 degrees, the vectors point in the same direction and the line segments overlap. If the angle is 90 degrees, they form a right angle and the directions are completely different. If the angle is 180 degrees, they point in opposite directions. Therefore, we can judge the similarity of vectors by the angle: the smaller the angle, the more similar they are.
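To make the angle idea concrete, here is a minimal Python sketch. The function name angle_between_2d and the sample vectors are my own choices for illustration. It measures the direction of each vector with math.atan2, takes the difference, and also prints the cosine of that angle, which is close to 1 for small angles and falls to -1 for opposite directions.

```python
import math


def angle_between_2d(a, b):
    """Return the angle in degrees between two non-zero 2D vectors."""
    direction_a = math.atan2(a[1], a[0])  # direction of a, in radians
    direction_b = math.atan2(b[1], b[0])  # direction of b, in radians
    difference = abs(direction_a - direction_b)
    if difference > math.pi:
        # Two directions form two angles; keep the smaller one.
        difference = 2 * math.pi - difference
    return math.degrees(difference)


samples = [((1, 0), (2, 0)), ((1, 0), (1, 1)), ((1, 0), (0, 3)), ((1, 0), (-2, 0))]
for a, b in samples:
    theta = angle_between_2d(a, b)
    print(a, b, "angle:", round(theta, 1), "cosine:", round(math.cos(math.radians(theta)), 3))
```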

Another piece of junior high school knowledge: the law of cosines (it should be junior high school).

Assume that vector A is [x1, y1] and vector B is [x2, y2].

 

Then we can rewrite the law of cosines in the following form:

cos θ = (x1x2 + y1y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))

 

The original text gives this result directly. That is a bit of a jump and feels abrupt, mainly because the derivation is omitted. The derivation is indeed a small detail, but our goal is to work the problem through, so let's fill it in below. Because I can't draw pictures, I use computer notation directly, which is not very intuitive.

cos θ = (a^2 + b^2 - c^2) / (2ab)

Here a and b are the lengths of vectors A and B, and c is the length of the side joining their endpoints:

a^2 = x1^2 + y1^2
b^2 = x2^2 + y2^2
c^2 = (x2 - x1)^2 + (y2 - y1)^2 (this is not hard to understand)

=> cos θ = ((x1^2 + y1^2) + (x2^2 + y2^2) - ((x2 - x1)^2 + (y2 - y1)^2)) / (2 * sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))

=> cos θ = (x1^2 + y1^2 + x2^2 + y2^2 - x2^2 - x1^2 + 2x1x2 - y2^2 - y1^2 + 2y1y2) / (2 * sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))

=> cos θ = (2x1x2 + 2y1y2) / (2 * sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))

=> cos θ = (x1x2 + y1y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))

This gives us the result.
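As a sanity check on the derivation, the cosine can be computed both ways and compared. The sketch below is only an illustration: the two function names and the sample coordinates are my own, not from the original text.

```python
import math


def cos_by_law_of_cosines(x1, y1, x2, y2):
    """cos θ from the three side lengths a, b, c of the triangle."""
    a = math.hypot(x1, y1)            # length of vector A
    b = math.hypot(x2, y2)            # length of vector B
    c = math.hypot(x2 - x1, y2 - y1)  # distance between the two endpoints
    return (a ** 2 + b ** 2 - c ** 2) / (2 * a * b)


def cos_by_coordinates(x1, y1, x2, y2):
    """cos θ from the simplified coordinate form at the end of the derivation."""
    numerator = x1 * x2 + y1 * y2
    denominator = math.sqrt(x1 ** 2 + y1 ** 2) * math.sqrt(x2 ** 2 + y2 ** 2)
    return numerator / denominator


for point in [(3, 4, 4, 3), (1, 2, -3, 5), (2, 0, 0, 7)]:
    print(point, cos_by_law_of_cosines(*point), cos_by_coordinates(*point))
```

Up to floating-point rounding, both columns should agree for any pair of non-zero vectors.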

Now let's summarize the law of cosines in the case of two-dimensional vectors.

For two two-dimensional vectors [x1, y1] and [x2, y2], we have

 
cos θ = (x1x2 + y1y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))

If we rename [x1, y1] and [x2, y2] to [a1, a2] and [b1, b2], then

for two two-dimensional vectors [a1, a2] and [b1, b2], we have

 
cos θ = (a1b1 + a2b2) / (sqrt(a1^2 + a2^2) * sqrt(b1^2 + b2^2))

Mathematicians have proved that this formula for the cosine also holds when it is extended to multi-dimensional vectors.

Now let's add one more dimension. Suppose there are two three-dimensional vectors, [a1, a2, a3] and [b1, b2, b3]. What does the formula look like?

cos θ = (a1b1 + a2b2 + a3b3) / (sqrt(a1^2 + a2^2 + a3^2) * sqrt(b1^2 + b2^2 + b3^2))

So what about n dimensions?

 
cos θ = (a1b1 + a2b2 + a3b3 + ... + anbn) / (sqrt(a1^2 + a2^2 + a3^2 + ... + an^2) * sqrt(b1^2 + b2^2 + b3^2 + ... + bn^2))

Written with summation notation, this may be more intuitive and concise:

cos θ = Σ(ai * bi) / (sqrt(Σ ai^2) * sqrt(Σ bi^2)), with i running from 1 to n
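The n-dimensional formula translates almost word for word into code. Here is a minimal Python sketch. The function name cosine_similarity is my own, and raising an error for zero vectors is my own choice, since the formula would divide by zero there.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(ai * bi for ai, bi in zip(a, b))    # a1b1 + a2b2 + ... + anbn
    norm_a = math.sqrt(sum(ai * ai for ai in a))  # sqrt(a1^2 + ... + an^2)
    norm_b = math.sqrt(sum(bi * bi for bi in b))  # sqrt(b1^2 + ... + bn^2)
    if norm_a == 0 or norm_b == 0:
        raise ValueError("the cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([3, 4], [4, 3]))        # two dimensions
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # three dimensions, same direction
print(cosine_similarity([1, 0, 0, 0], [0, 0, 5, 0]))  # four dimensions, no overlap
```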

With this formula, the rest is easy. Our two sentences are

Sentence A: [1, 2, 0, 2, 1, 1, 0, 1]

Sentence B: [1, 2, 1, 2, 1, 1, 1, 0]

What is the cosine of their angle?

 
cos θ = (1*1 + 2*2 + 0*1 + 2*2 + 1*1 + 1*1 + 0*1 + 1*0) / (sqrt(1^2 + 2^2 + 0^2 + 2^2 + 1^2 + 1^2 + 0^2 + 1^2) * sqrt(1^2 + 2^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2 + 0^2))

=> cos θ = 11 / (sqrt(12) * sqrt(13))

=> cos θ ≈ 0.881

This is relatively high.
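For completeness, the whole calculation can also be run directly on the segmented sentences, without writing the vectors out by hand. The sketch below is one possible way to do it, not the only one: the function name sentence_similarity is my own, the token lists are typed in from Step 1, and both sentences are assumed to be non-empty. Counting straight from those tokens (sentence B contains "but" once and "also" not at all) gives 11 / (sqrt(12) * sqrt(13)).

```python
import math
from collections import Counter


def sentence_similarity(tokens_a, tokens_b):
    """Cosine similarity of two segmented, non-empty sentences."""
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    # A word missing from one sentence has frequency 0 there, so it adds
    # nothing to the numerator. Only shared words need to be multiplied.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    return dot / (norm_a * norm_b)


tokens_a = ["Jay Chou", "is", "a", "singer", "also", "is", "a", "cross"]
tokens_b = ["Jay Chou", "not", "is", "a", "cross", "but", "is", "a", "singer"]

print(round(sentence_similarity(tokens_a, tokens_b), 3))  # prints 0.881
```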

 
