Poster: Wu Jun, Google researcher
The cosine theorem and news classification seem to have nothing to do with each other, but they are closely related. Specifically, news classification relies heavily on the cosine theorem.
Google News classifies and organizes news automatically. News classification is nothing more than putting similar news articles into the same category. A computer does not understand news; it can only compute quickly. We therefore need to design an algorithm that calculates the similarity of any two news articles, and to do that we need a way to describe a news article with a set of numbers.
Let's look at how to find a set of numbers, that is, a vector, to describe a news article. Recall the concept of TF/IDF introduced in "How to measure webpage Relevance". For every content word in a news article, we can calculate its term frequency/inverse document frequency (TF/IDF) value. It is not hard to imagine that content words related to the article's topic occur frequently and therefore have large TF/IDF values. We arrange these TF/IDF values according to each word's position in the vocabulary. For example, suppose the vocabulary contains 64,000 words, numbered as follows:
Word no.   Word
------------------
1
2
3          Alibaba Cloud
4          Ayi
...
789        clothing
...
64000      homework
For a given news article, the TF/IDF values of these 64,000 words are
Word number TF/IDF Value
====================
1 0
2 0.0034
3 0
4 0.00052
5 0
...
789 0.034
...
64000 0.075
If a word in the vocabulary does not appear in the article, its value is zero. These 64,000 numbers form a 64,000-dimensional vector. We use this vector to represent the article and call it the article's feature vector. If the feature vectors of two articles are similar, their content is similar and they should be placed in the same category; if not, they belong in different categories.
Anyone who has studied vector algebra knows that a vector is a directed line segment in a multi-dimensional space. If two vectors point in the same direction, that is, if the angle between them is close to zero, the two vectors are similar. To determine whether two vectors point in the same direction, we use the cosine theorem to calculate the angle between them.
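The construction of a feature vector described above can be sketched in Python. This is a minimal illustration, not Google's implementation: it assumes TF is a word's count divided by the article's length and IDF is the logarithm of the total number of documents divided by the number of documents containing the word. The function name `tfidf_vector`, the four-word vocabulary, and the document counts are all made up for the example.

```python
import math
from collections import Counter


def tfidf_vector(doc_tokens, vocabulary, doc_freq, num_docs):
    """Return one TF/IDF value per vocabulary word, in vocabulary order.

    Assumed definitions (not confirmed by the article):
      TF  = occurrences of the word / number of tokens in the article
      IDF = log(num_docs / number of documents containing the word)
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    vector = []
    for word in vocabulary:
        if total == 0 or counts[word] == 0 or doc_freq.get(word, 0) == 0:
            # A vocabulary word that does not appear in the article gets zero.
            vector.append(0.0)
            continue
        tf = counts[word] / total
        idf = math.log(num_docs / doc_freq[word])
        vector.append(tf * idf)
    return vector


# Hypothetical toy data: a 4-word vocabulary instead of 64,000 words.
vocabulary = ["clothing", "fashion", "stock", "market"]
doc_freq = {"clothing": 10, "fashion": 20, "stock": 50, "market": 100}
num_docs = 1000

article = "fashion week shows new clothing and more clothing".split()
vec = tfidf_vector(article, vocabulary, doc_freq, num_docs)
print(vec)
```

With a real 64,000-word vocabulary most entries are zero, so a practical system would probably store only the nonzero entries; the dense list here is just the easiest form to read.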
The cosine theorem is familiar to all of us. It describes the relationship between any angle of a triangle and its three sides. In other words, given the three sides of a triangle, we can use the cosine theorem to find each of its angles. Assume that the three sides of a triangle are a, b, and c, and that the opposite angles are A, B, and C. Then the cosine of angle A is
$$\cos A = \frac{b^2 + c^2 - a^2}{2bc}$$
If we regard the two sides b and c of the triangle as two vectors, the above formula is equivalent to
$$\cos A = \frac{\langle b, c \rangle}{|b| \, |c|}$$
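A short derivation, given here as a sketch, shows why the triangle form and the vector form agree. It assumes that b and c are drawn as vectors from the vertex of angle A, so that the third side is the vector a = b - c; the bracket notation for the inner product is also a choice made for this sketch.

$$
\begin{aligned}
|a|^2 &= \langle b - c,\; b - c \rangle = |b|^2 + |c|^2 - 2\langle b, c \rangle \\
\cos A &= \frac{|b|^2 + |c|^2 - |a|^2}{2\,|b|\,|c|}
        = \frac{2\langle b, c \rangle}{2\,|b|\,|c|}
        = \frac{\langle b, c \rangle}{|b|\,|c|}
\end{aligned}
$$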
The denominator is the product of the lengths of the two vectors b and c, and the numerator is their inner product. For example, if the vectors corresponding to news X and news Y are
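x1, x2, ..., x64000 and
y1, y2, ..., y64000,
then the cosine of the angle between them is
$$\cos\theta = \frac{x_1 y_1 + x_2 y_2 + \cdots + x_{64000} y_{64000}}{\sqrt{x_1^2 + x_2^2 + \cdots + x_{64000}^2} \cdot \sqrt{y_1^2 + y_2^2 + \cdots + y_{64000}^2}}$$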
When the cosine of the angle between two news vectors equals one, the two articles are exact duplicates (this method can be used to remove duplicate webpages). When the cosine is close to one, the two articles are similar and can be placed in the same category. The smaller the cosine, the less related the two articles are.
When we learned the cosine theorem in middle school, it was hard to imagine that it could be used to classify news. Here, once again, we see how useful mathematical tools can be.
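The cosine comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names `cosine_similarity` and `describe_pair` are invented here, the 0.8 cutoff for "similar" is an arbitrary illustrative value (the article gives no thresholds), and returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))      # inner product (numerator)
    norm_x = math.sqrt(sum(a * a for a in x))   # length of x
    norm_y = math.sqrt(sum(b * b for b in y))   # length of y
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumed convention: an all-zero vector is unrelated to everything.
        return 0.0
    return dot / (norm_x * norm_y)


def describe_pair(x, y, similar_threshold=0.8, eps=1e-9):
    """Label a pair of articles; the threshold is illustrative only."""
    cos = cosine_similarity(x, y)
    if cos >= 1.0 - eps:
        return "duplicate"
    if cos >= similar_threshold:
        return "similar"
    return "unrelated"


print(describe_pair([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(describe_pair([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
print(describe_pair([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Because the cosine depends only on direction, two vectors that differ by a constant factor also score one. Any real threshold for "similar" would have to be tuned on actual news data.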