Google News (article) classification algorithm

Source: Internet
Author: User
Tags: idf

Original: http://www.google.com.hk/ggblog/googlechinablog/2006/07/12_4010.html

Google News is classified and organized automatically. Classifying news means putting similar articles into the same class. A computer cannot actually read news; it can only compute quickly. So we need to design an algorithm that computes the similarity of any two news articles, and to do that we need a way to describe an article with a set of numbers.

For all the content words in a news article, we can compute their term frequency/inverse document frequency (TF/IDF) values. As one would expect, content words related to the article's topic occur frequently, so their TF/IDF values are large. We arrange these TF/IDF values according to each word's position in the vocabulary table. For example, suppose the vocabulary has 64,000 words:

Word number   Chinese word
--------------------------
1             阿 (a)
2             啊 (ah)
3             阿斗 (A Dou, "fool")
4             阿姨 (aunt)
...
789           服装 (clothing)
...
64000         做作 (affectation)

For a given news article, the TF/IDF values of these 64,000 words are:

Word number   TF/IDF value
--------------------------
1             0
2             0.0034
3             0
4             0.00052
5             0
...
789           0.034
...
64000         0.075


If a word in the vocabulary does not appear in the article, its value is zero. These 64,000 numbers then form a 64,000-dimensional vector. We use this vector to represent the article and call it the article's feature vector. If the feature vectors of two articles are similar, their content is similar and they should be grouped into one class; if the vectors differ, the articles belong to different classes.
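To make the feature vector concrete, here is a minimal Python sketch. It is not Google's implementation. It assumes articles are already split into word lists, and it uses one common textbook form of the weighting: TF is a word's count divided by the article length, and IDF is log(N / df), where N is the number of articles and df is the number of articles containing the word. The function names `build_vocabulary` and `tfidf_vector` and the three-article corpus are made up for illustration.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Give every distinct word a fixed position (0-based) in the vocabulary."""
    words = sorted({word for doc in docs for word in doc})
    return {word: index for index, word in enumerate(words)}


def tfidf_vector(doc, docs, vocab):
    """Return a dense vector with one TF/IDF value per vocabulary word.

    Assumed weighting: tf = count / len(doc), idf = log(N / df).
    Words that do not occur in `doc` keep the value 0.
    """
    counts = Counter(doc)
    n_docs = len(docs)
    vector = [0.0] * len(vocab)
    for word, count in counts.items():
        if word not in vocab:
            continue  # word outside the vocabulary table: ignored
        tf = count / len(doc)
        df = sum(1 for other in docs if word in other)
        idf = math.log(n_docs / df)
        vector[vocab[word]] = tf * idf
    return vector


# Toy corpus (invented): each article is a list of content words.
docs = [
    ["stock", "market", "rises", "stock"],
    ["team", "wins", "match"],
    ["market", "falls"],
]
vocab = build_vocabulary(docs)
vectors = [tfidf_vector(doc, docs, vocab) for doc in docs]
```

Each entry of `vectors` has one slot per vocabulary word, so a real system with a 64,000-word vocabulary would produce 64,000-dimensional vectors that are mostly zeros. A sparse representation (storing only the non-zero entries) would be a natural choice there, though the dense list keeps this sketch simple.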

As anyone who has studied vector algebra knows, a vector is a directed line segment in multidimensional space. If two vectors point in the same direction, that is, the angle between them is close to 0, then the two vectors are similar. To determine whether two vectors point in the same direction, we use the law of cosines to compute the angle between them.
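The angle comparison can be sketched in Python as follows. The cosine of the angle between vectors a and b is their dot product divided by the product of their lengths. The function name `cosine_similarity` is a hypothetical choice, and returning 0.0 for an all-zero vector is an assumption of this sketch, since the angle is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an empty article as similar to nothing
    return dot / (norm_a * norm_b)


# Illustrative vectors (invented), four dimensions instead of 64,000.
article_a = [0.0, 0.0034, 0.0, 0.00052]
article_b = [0.0, 0.0068, 0.0, 0.00104]   # same direction as article_a
article_c = [0.02, 0.0, 0.01, 0.0]        # shares no words with article_a

same_topic = cosine_similarity(article_a, article_b)
different_topic = cosine_similarity(article_a, article_c)
```

Here `same_topic` comes out close to 1 because the two vectors point the same way even though one is twice as long, while `different_topic` is 0 because the articles share no words. TF/IDF values are never negative, so the cosine for two articles always lies between 0 and 1, and a larger value means a smaller angle.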
