The Beauty of Mathematics, Series 12: The Cosine Theorem and News Classification

Posted by: Wu Jun, Google researcher

The cosine theorem and news classification seem to have nothing to do with each other, but they are in fact closely related. Specifically, news classification relies heavily on the cosine theorem.

Google News classifies and organizes news automatically. News classification simply means putting similar articles into the same category. A computer does not understand news; it can only compute quickly. So we need to design an algorithm that calculates the similarity between any two news articles, and to do that we need a way to describe an article with a set of numbers.

Let's look at how to find a set of numbers, that is, a vector, to describe a news article. Recall the concept of TF/IDF introduced in "How to measure webpage relevance". For every content word in a news article, we can compute its term frequency/inverse document frequency (TF/IDF) value. It is not hard to see that content words related to the article's topic occur frequently and have large TF/IDF values. We arrange these values in the order in which the words appear in the vocabulary. For example, suppose the vocabulary contains 64,000 words, numbered as follows:

Word no.   Word
------------------
1          阿 (a)
2          啊 (ah)
3          阿斗 (A Dou)
4          阿姨 (auntie)
...
789        服装 (clothing)
...
64000      做作 (affected)

For a given news article, the TF/IDF values of these 64,000 words are:

Word no.   TF/IDF value
====================
1          0
2          0.0034
3          0
4          0.00052
5          0
...
789        0.034
...
64000      0.075

If a word in the vocabulary does not appear in the article, its value is zero. These 64,000 numbers form a 64,000-dimensional vector. We use this vector to represent the article and call it the article's feature vector. If the feature vectors of two articles are similar, their content is similar and they should be put in the same category, and vice versa.
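To make the construction concrete, here is a minimal Python sketch that builds such a feature vector. The four-word vocabulary, the document counts, and the function name `tfidf_vector` are made up for illustration, and IDF is assumed to be log(N / df); a real system would use the full 64,000-word vocabulary and statistics from its own corpus.

```python
import math
from collections import Counter

def tfidf_vector(doc_words, vocabulary, doc_freq, num_docs):
    """Return a dense TF/IDF vector whose components follow the vocabulary order."""
    counts = Counter(doc_words)
    total = len(doc_words)
    vector = []
    for word in vocabulary:
        # Term frequency: share of the article's words that are this word.
        tf = counts[word] / total if total else 0.0
        # Assumed IDF form: log(N / df); words with no corpus count get 0.
        df = doc_freq.get(word, 0)
        idf = math.log(num_docs / df) if df else 0.0
        vector.append(tf * idf)
    return vector

# Hypothetical toy data, not taken from the article.
vocabulary = ["stock", "market", "goal", "match"]
doc_freq = {"stock": 10, "market": 20, "goal": 5, "match": 50}
num_docs = 100

article = ["stock", "market", "stock", "rises"]
vec = tfidf_vector(article, vocabulary, doc_freq, num_docs)
print(vec)  # components for "goal" and "match" are zero
```

Words outside the vocabulary, such as "rises" here, count toward the article length but get no component of their own.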

Anyone who has studied vector algebra knows that a vector is a directed line segment in a multi-dimensional space. If two vectors point in the same direction, that is, if the angle between them is close to zero, the two vectors are similar. To determine whether two vectors point in the same direction, we use the cosine theorem to calculate the angle between them.

The cosine theorem is familiar to all of us. It describes the relationship between any angle of a triangle and its three sides. In other words, given the three sides of a triangle, we can use the cosine theorem to find each of its angles. Suppose the three sides of a triangle are a, b, and c, and the opposite angles are A, B, and C. Then the cosine of angle A is

$$\cos A = \frac{b^2 + c^2 - a^2}{2bc}$$

If we regard the two sides b and c of the triangle as two vectors, the formula above is equivalent to

$$\cos A = \frac{\langle b, c \rangle}{|b|\,|c|}$$
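The step from the side-length form to the vector form is short enough to write out. The following is a sketch, assuming the vectors b and c both start at the vertex of angle A, so that the third side is the difference b - c with length a, and writing |b| and |c| for the side lengths b and c:

```latex
\begin{align*}
a^2 &= |b - c|^2 = \langle b - c,\; b - c \rangle
     = |b|^2 + |c|^2 - 2\,\langle b, c \rangle \\[4pt]
\cos A &= \frac{|b|^2 + |c|^2 - a^2}{2\,|b|\,|c|}
        = \frac{|b|^2 + |c|^2 - \bigl(|b|^2 + |c|^2 - 2\,\langle b, c \rangle\bigr)}{2\,|b|\,|c|}
        = \frac{\langle b, c \rangle}{|b|\,|c|}
\end{align*}
```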

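The denominator is the product of the lengths of the two vectors b and c, and the numerator is their inner product. For example, if the vectors corresponding to news article X and news article Y are
x1, x2, ..., x64000 and
y1, y2, ..., y64000,
then the cosine of the angle between them is

$$\cos\theta = \frac{x_1 y_1 + x_2 y_2 + \cdots + x_{64000} y_{64000}}{\sqrt{x_1^2 + x_2^2 + \cdots + x_{64000}^2}\,\sqrt{y_1^2 + y_2^2 + \cdots + y_{64000}^2}}$$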
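The cosine of the angle between two such vectors is their inner product divided by the product of their lengths. A minimal Python sketch of that computation follows; the function name `cosine` and the short sample vectors are illustrative, and returning 0 for an all-zero vector is a convention chosen here, not something the article specifies.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_x * norm_y)

x = [0.0, 0.0034, 0.0, 0.00052]
y = [0.0, 0.0068, 0.0, 0.00104]  # same direction as x, twice as long
z = [0.02, 0.0, 0.01, 0.0]       # shares no nonzero component with x

print(cosine(x, y))  # very close to 1
print(cosine(x, z))  # exactly 0: no words in common
```

Because the result depends only on direction, a long article and a short one about the same topic can still score close to one.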

When the cosine of the angle between two news vectors equals one, the two articles are exact duplicates (this method can be used to remove duplicate web pages). When the cosine is close to one, the two articles are similar and can be put in the same category. The smaller the cosine, the less related the two articles are.
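One simple way to turn this rule into a grouping procedure is sketched below. This is not the method Google News uses; it is a greedy single pass under stated assumptions: vectors are stored sparsely as `{word number: TF/IDF value}` dictionaries, each group is represented by its first member, and the threshold of 0.8 is an arbitrary illustrative value.

```python
import math

def cosine_sparse(x, y):
    """Cosine for sparse vectors stored as {word_number: tfidf} dicts."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter dict
    dot = sum(value * y.get(key, 0.0) for key, value in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def group_articles(vectors, threshold=0.8):
    """Greedy pass: join the first group whose first member is similar enough."""
    groups = []  # each group is a list of article indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine_sparse(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found: start a new one
    return groups

# Hypothetical articles; word numbers follow the vocabulary numbering.
articles = [
    {2: 0.0034, 4: 0.00052, 789: 0.034},  # article 0
    {2: 0.0030, 4: 0.00050, 789: 0.030},  # article 1: nearly the same direction
    {17: 0.05, 64000: 0.075},             # article 2: unrelated words
]
groups = group_articles(articles)
print(groups)
```

Storing only the nonzero components matters in practice, since most of the 64,000 entries of any one article's vector are zero.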

When we learned the cosine theorem in middle school, we could hardly have imagined that it would be used to classify news. Once again, we see how useful mathematical tools are.
