Original: http://www.google.com.hk/ggblog/googlechinablog/2006/07/12_4010.html
Google News gathers and categorizes news automatically. Classifying news means grouping similar articles into the same class. A computer cannot actually read the news; it can only compute, and do so quickly. So we need to design an algorithm that computes the similarity of any two news articles, and to do that we first need a way to describe an article with a set of numbers.
For every content word in a news article, we can compute its term frequency/inverse document frequency value (TF-IDF). It is not hard to see that the content words closely related to the article's topic occur frequently, so their TF-IDF values are large. We then arrange these TF-IDF values according to the words' positions in the vocabulary. Suppose, for example, the vocabulary contains 64,000 words:
Word number   Word
------------------
1             o
2             .
3             fools
4             aunt
...
789           clothing
...
64000         affectation
Suppose that in one particular news article the TF-IDF values of these 64,000 words are:

Word number   TF-IDF value
==========================
1             0
2             0.0034
3             0
4             0.00052
5             0
...
789           0.034
...
64000         0.075
If a word from the vocabulary does not appear in the article, its corresponding value is zero. These 64,000 numbers then form a 64,000-dimensional vector. We use this vector to represent the article and call it the article's feature vector. If the feature vectors of two news articles are similar, the corresponding articles are similar in content and should be grouped into the same class, and vice versa.
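The construction above can be sketched in a few lines of Python. This is only an illustration, not Google's actual implementation: the function and variable names (`tf_idf_vector`, `vocab`, `doc_freq`) are my own, and I assume a plain TF x IDF weighting with IDF = log(N / document frequency), which the article does not spell out.

```python
import math
from collections import Counter

def tf_idf_vector(doc_words, vocab, doc_freq, n_docs):
    """Map an article (a list of words) to a vector indexed by vocabulary position.

    vocab:    word -> position in the vocabulary (0 .. V-1)
    doc_freq: word -> number of corpus documents containing the word
    n_docs:   total number of documents in the corpus
    Words from the vocabulary that are absent from the article keep value 0,
    exactly as in the 64,000-entry example above.
    """
    counts = Counter(w for w in doc_words if w in vocab)
    total = sum(counts.values()) or 1          # guard against empty articles
    vec = [0.0] * len(vocab)
    for word, c in counts.items():
        tf = c / total                          # term frequency within the article
        idf = math.log(n_docs / doc_freq.get(word, 1))
        vec[vocab[word]] = tf * idf
    return vec
```

With a tiny three-word vocabulary, a word that appears often but in few documents gets the largest weight, and a vocabulary word missing from the article stays at zero.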
Anyone who has studied vector algebra knows that a vector is really a directed segment in multidimensional space. If two vectors point in the same direction, that is, the angle between them is close to zero, then the two vectors are similar. To judge whether two vectors point the same way, we use the law of cosines to compute the angle between them.
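The angle computation reduces to the standard cosine formula, cos(theta) = (a . b) / (|a| |b|): a value near 1 means the angle is near zero and the two articles are similar. A minimal sketch (the zero-vector guard is my own addition for articles with no vocabulary words):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) between two equal-length feature vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))      # a . b
    na = math.sqrt(sum(x * x for x in a))       # |a|
    nb = math.sqrt(sum(y * y for y in b))       # |b|
    if na == 0.0 or nb == 0.0:
        return 0.0                              # an all-zero vector matches nothing
    return dot / (na * nb)
```

Parallel vectors give a similarity of 1 (angle 0), orthogonal vectors give 0 (angle 90 degrees), regardless of the vectors' lengths.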