Basics of Automatic Text Classification: Term Frequency Calculation Method

Source: Internet
Author: User

It is said that the number of documents on the Internet grows by 1 million every day. With growth that large, a search engine may take a month or more to visit your website. So if you optimize your webpage today, you may have to wait a month to see Google's response. This is the age of information explosion. When the Internet was new, directory navigation was enough to find the information we needed, and Yahoo succeeded by seizing that opportunity. Later, as the Internet became popular, the pace of information growth made directory navigation ineffective. Google seized this opportunity with its own search algorithm, which lets people find information without going through a directory. Google also succeeded. However, just as the Internet has not made us discard newspapers, directory navigation still plays a role. Google's personalized search service, for example, encourages you to use predefined search channels to make your results more relevant.
In other words, directory classification still exists in search, but it now serves the search engine rather than the end user directly, in the form of automatic classification based on document content.

There are many methods for automatic classification based on the document content. This article introduces the term frequency calculation method.

The basic idea of the vector space model is to treat a document as a vector whose components are term weights derived from word frequency. To reduce noise, the words are first processed in the following steps:

1. Perform word segmentation on the document to extract all the words (terms) it contains;
2. Remove terms that carry no meaning (stop words), such as the Chinese word for "yes";
3. Calculate the frequency of occurrence of each term;
4. Filter out very high-frequency and very low-frequency terms as needed (similar to discarding the highest and lowest scores in a variety show);
5. After these steps, assume that W terms remain in total, and assign each term a unique ID.
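
The five steps above can be sketched in Python as follows. This is only a minimal illustration under several assumptions that are not in the original: segmentation is a plain whitespace split (Chinese text would need a real word segmenter), the stop-word list is a tiny made-up one, and the names `tokenize`, `build_vocabulary`, `to_vector`, `min_count` and `max_count` are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "is", "of", "and"}

def tokenize(text):
    """Step 1: naive segmentation (lowercase, split on whitespace)."""
    return text.lower().split()

def build_vocabulary(documents, min_count=2, max_count=None):
    """Steps 2-5: drop stop words, count terms, filter by frequency,
    and give each surviving term a unique integer ID."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    kept = sorted(
        term for term, n in counts.items()
        if n >= min_count and (max_count is None or n <= max_count)
    )
    return {term: idx for idx, term in enumerate(kept)}

def to_vector(text, vocabulary):
    """Represent one document as a list of W raw term counts."""
    vector = [0] * len(vocabulary)
    for term in tokenize(text):
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog sat",
]
# Keep only terms that occur exactly twice across the collection.
vocab = build_vocabulary(docs, min_count=2, max_count=2)
vectors = [to_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Here W is `len(vocab)`. The thresholds are arbitrary; in practice they would be tuned to the size of the collection.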

After this preprocessing, the subsequent steps vary by algorithm, but they all rely on term weights. The weight of a term depends directly on how often it appears. Because we want to analyze thousands of documents, the frequency of a term within a single document is not enough to describe its weight. The weight must therefore also take the other documents into account.
Now let's state the problem abstractly:
1. Assume that the documents to be processed form a set D of objects;
2. A classification is a fuzzy description of a set A, where A is a subset of D;
3. The difficulty of classification is deciding which subset A (category) each object in D leans toward.
So the weight of a term should include the following three parts:
1. The frequency of the term in the current document;
2. The length of the document;
3. The frequency of the term across all documents, that is, its importance in the whole collection.
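
One common way to combine the three parts listed above is the TF-IDF weight. The article does not give a formula, so the following is an assumed, minimal sketch rather than the author's method: the in-document count is divided by the document length, then multiplied by the logarithm of the total number of documents over the number of documents containing the term. The function name `tf_idf_weights` and the sample data are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(tokenized_docs):
    """Return one {term: weight} dict per document.

    weight = (count in document / document length)
             * log(number of documents / documents containing the term)
    """
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        length = len(tokens)
        weights.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "dog"],
]
weights = tf_idf_weights(docs)
for w in weights:
    print(w)
```

With this weighting, a term that appears in every document gets weight 0, while a term confined to a few documents is weighted more heavily. Other normalizations and smoothing schemes exist; this is only one of them.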

If term frequencies can be obtained accurately and combined with statistical methods, document classification should become more accurate.
