Machine Learning: A Summary of Text Classification

Source: Internet
Author: User

Definition of Text Classification

Text classification is a popular research area and a fundamental problem in machine learning. There are many methods for text classification; some are easy to understand, while others seem quite complex, but the principles behind them are not actually hard to grasp. Today we will introduce text classification from a macro perspective; future blog posts will analyze the individual families of methods in detail, so stay tuned. I also hope experienced readers will offer suggestions; after all, I am still a beginner myself.

Text classification assigns an article to one or more of a set of existing categories. Two points deserve emphasis:

1. The categories must be determined in advance and will not change in the short term.

2. The category assigned to an article is not necessarily unique: a single article may belong to more than one category.

Text Classification Methods

1. Manually develop rules

The biggest drawbacks of this method are the manpower and cost it requires. It demands a great deal from the people involved, and it is hard to abstract the rules of a category from articles alone. Moreover, hand-written rules are too rigid to keep up with the evolution of language, so few people use this approach.

2. Statistical Methods

The basic idea of statistical learning is to let machines, like humans, summarize experience by observing a large number of similar documents, and then use that experience as the basis for future classification. Statistical learning requires a batch of documents accurately classified by humans as learning material (called the training set; note that having people classify a batch of documents costs far less than having them summarize accurate rules from those documents). The computer then mines rules from these documents that classify effectively. This process is called training, and the resulting rule set is called a classifier. After training, the classifier is used to classify documents the computer has never seen.

Statistical Learning Method

As mentioned above, this approach lets the computer learn from a well-classified training set by itself. However, a computer is not human and cannot understand an article the way a human does, which raises an important question for text classification: how to represent an article so that the computer can process it becomes a top priority. The semantic information in an article is hard to encode in a form a computer can recognize, so we settle for the next best thing: representing the document by the lower-level vocabulary information it contains. Practice has shown that this works well.

Furthermore, not only the words themselves but also the number of times each word appears matters for classification.
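As a minimal sketch of this word-count representation, assuming a naive whitespace tokenizer and a small hand-picked vocabulary (both purely illustrative choices):

```python
from collections import Counter

def doc_to_vector(text, vocabulary):
    """Represent a document as a term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[word] for word in vocabulary]

vocab = ["price", "engine", "goal", "match"]
doc = "the engine price and the engine design"
print(doc_to_vector(doc, vocab))  # [1, 2, 0, 0]
```

Words outside the vocabulary are simply dropped; real systems also normalize case, strip punctuation, and often weight counts (e.g., TF-IDF), but the core idea is just this mapping from text to a vector of counts.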

After deciding how to represent a document, we can discuss how to let the computer learn from documents, that is, the training mentioned above.

During training, each instance is called a sample: a document that has been manually classified and processed. The computer treats this data as absolutely correct and trustworthy. It then observes and learns from the samples to guess a plausible classification rule; in machine learning, this guessed rule is called a hypothesis. When a document needs to be classified, we use the hypothesis to judge which category it belongs to.

For example, deciding whether people consider a car a "good car" can be viewed as a classification problem. We can extract the features of a vehicle into vector form. In this problem, the dictionary vector might be: D = (price, top speed, appearance score, cost effectiveness, degree of scarcity)

Porsche: Vp = (2 million, 320, 9.5)

Corolla: Vt = (0.15 million, 220, 6.0, 8, 3)

Different people rate differently. Judging by cost effectiveness, the Corolla is the obvious first choice; if you care about speed, appearance, and prestige, you should of course choose the Porsche. This shows that even when the same classification problem is represented in the same form (the same document model), different conclusions may be drawn because different characteristics of the data are emphasized. Focusing on different aspects of the document data leads to methods with different principles and implementations; each method also makes its own assumptions and simplifications about the text classification problem, and these assumptions affect the final performance of the classifier built on that method.
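The car example can be written out in code. Note that the Porsche vector in the text lists only three of the five components, so the last two values below are purely illustrative assumptions, as is the choice to express price in units of 10,000:

```python
# Hypothetical feature vectors: (price in 10k, top speed, appearance, cost effectiveness, scarcity).
# The Porsche's last two components are NOT given in the text; they are invented for illustration.
cars = {
    "Porsche": (200.0, 320, 9.5, 3, 9),
    "Corolla": (15.0, 220, 6.0, 8, 3),
}

def best_by(cars, index):
    """Pick the car that maximizes a single feature dimension."""
    return max(cars, key=lambda name: cars[name][index])

TOP_SPEED, COST_EFFECTIVENESS = 1, 3
print(best_by(cars, COST_EFFECTIVENESS))  # Corolla
print(best_by(cars, TOP_SPEED))           # Porsche
```

The same vectors yield different "best" answers depending on which feature (or weighted combination of features) the decision rule emphasizes, which is exactly the point of the example.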

Common Classification Methods

Classification is arguably the most widely studied problem in machine learning, and many mature algorithms exist, for example decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least squares fitting, KNN, genetic algorithms, and maximum entropy. A few brief introductions follow; later blog posts will cover some of these methods in detail, so do come back and read them.

1. Rocchio Algorithm

The idea of this method is to average the vectors of all documents in a class into a new vector, which serves as the center (centroid) of that class. When a new document needs to be classified, we measure its similarity to each centroid, that is, compute the distance between the new document and the centroid, and decide from that distance whether the document belongs to the class. An improved Rocchio algorithm considers not only the center of all positive samples but also the center of all documents that do not belong to the class: the new document should be close to the positive-sample center and far from the negative-sample center.
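The basic (positive-samples-only) centroid idea can be sketched as follows; the toy vectors and class names are invented for illustration:

```python
def centroid(vectors):
    """Mean of a list of equal-length vectors: the class center."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def rocchio_classify(doc_vec, class_vectors):
    """Assign doc_vec to the class whose centroid is nearest."""
    centroids = {c: centroid(vs) for c, vs in class_vectors.items()}
    return min(centroids, key=lambda c: euclidean(doc_vec, centroids[c]))

training = {
    "sports": [[2, 0, 5], [3, 1, 4]],
    "finance": [[0, 6, 1], [1, 5, 0]],
}
print(rocchio_classify([2, 1, 4], training))  # sports
```

The improved variant mentioned above would subtract a weighted negative-sample centroid from the positive one before comparing distances.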

This algorithm has fatal defects:

First, it assumes that all documents of the same class cluster around a single center. That assumption has no real basis, and in practice it often does not hold.

Second, the algorithm treats the training data as absolutely correct, which cannot be guaranteed in many applications.

2. Naive Bayes Algorithm

The Bayesian approach focuses on the probability that a document belongs to a category. That probability is computed by combining the probabilities that each word in the document belongs to the category. To a certain extent, the probability that a word belongs to a category can be roughly estimated from the number of times the word appears in that category's training documents (word frequency information), which makes the whole computation feasible. When using the naive Bayes algorithm, the main task in the training phase is to estimate these probabilities.
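A rough sketch of this word-frequency estimation, assuming uniform class priors and add-one smoothing (a common but not mandatory choice); the toy documents are invented:

```python
import math
from collections import Counter

def train_nb(docs_by_class):
    """Estimate P(w|c) from word frequencies, with add-one smoothing."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d.split()}
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        model[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return model

def classify_nb(model, doc):
    """Score each class by the log-product of per-word probabilities
    (uniform priors assumed; out-of-vocabulary words are skipped)."""
    scores = {c: sum(math.log(p[w]) for w in doc.split() if w in p)
              for c, p in model.items()}
    return max(scores, key=scores.get)

model = train_nb({
    "sports": ["goal match team", "match win goal"],
    "finance": ["stock market price", "price stock rise"],
})
print(classify_nb(model, "goal match"))  # sports
```

Summing logs instead of multiplying raw probabilities avoids numeric underflow on long documents; the smoothing keeps a single unseen word from zeroing out a class score.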

Likewise, this method has some drawbacks:

First, the reason P(d|ci) can be expanded into a product of per-word probabilities is the assumption that the words in an article are mutually independent: the appearance of one word is not affected by the appearance of another. This is obviously untrue. Even without being linguists, we know that words exhibit clear "co-occurrence" relationships; the co-occurrence counts or frequencies may vary across topics, but the words cannot be mutually independent.

Second, estimating P(wi|ci) from the number of times a word appears in a category's training documents is accurate only when the number of training samples is very large. The need for many samples not only raises the cost of the manual classification work up front, but also places higher demands on storage and computing resources when the computer processes the data later.

3. KNN Algorithm

The KNN algorithm is different: in KNN, the training samples themselves carry the precise category information, so the classifiers it produces are also called instance-based classifiers, no matter which features represent the samples. The basic idea is this: given a new document, compute the similarity between its feature vector and the vector of every document in the training set, find the K training documents closest to the new document, and decide the new document's category from the categories of those K documents (note that this also means KNN has no real "training" phase). This decision rule overcomes the Rocchio algorithm's inability to handle linearly inseparable classes, and it also suits classification standards that may change at any time: simply deleting old training documents and adding new ones changes the classification criterion.
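A sketch of the basic idea, using cosine similarity as the similarity measure (one common choice) and majority voting among the K nearest neighbors; the toy vectors are invented:

```python
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc_vec, training, k=3):
    """training: list of (vector, label). Vote among the k most similar documents."""
    neighbors = sorted(training, key=lambda vl: cosine(doc_vec, vl[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [
    ([5, 0, 1], "sports"), ([4, 1, 0], "sports"), ([4, 0, 2], "sports"),
    ([0, 5, 2], "finance"), ([1, 4, 1], "finance"),
]
print(knn_classify([3, 1, 1], training, k=3))  # sports
```

Note that the loop over the whole training set at classification time is exactly the cost problem discussed next.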

The chief drawback of KNN is that classifying a new document requires comparing it with every existing training document, a computing cost that not every system can afford (for example, a text classification system with tens of thousands of classes, even with only 20 training samples per class, would need some 200,000 vector comparisons to classify a single new document!). KNN-based improvements such as the generalized instance set method try to solve this problem.

Feature Selection Method

You may find it strange that, having just discussed classification methods, we suddenly jump to feature selection. The reason is that feature selection plays a vital role in text classification and in machine learning generally. Good feature selection not only makes full use of the training data but also effectively trims the data volume, reducing the computing cost and helping to prevent overfitting. Common feature selection methods include mutual information, document frequency, information gain, and the chi-square test. Specific feature selection methods will be described in later blog posts; let's look forward to it together.
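Of these, document frequency is the simplest to sketch: keep words that occur in enough documents to be informative, but discard words that occur in nearly all of them (the thresholds below are illustrative, not standard values):

```python
from collections import Counter

def select_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Keep words appearing in at least min_df documents
    but in at most max_df_ratio of all documents."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d.split()))  # count each word once per document
    return sorted(w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio)

docs = ["the goal match", "the stock price", "the goal win", "the price rise"]
print(select_by_document_frequency(docs))  # ['goal', 'price']
```

Here "the" is dropped for appearing in every document, and one-off words are dropped as too rare to generalize from; only the mid-frequency, discriminative words survive.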

Today I have covered the definition, categories, and common algorithms of text classification. Future posts will explain the individual methods in detail.
