Text Classification in Practice


    • Text Classification in Practice
      • Classification Tasks
      • Algorithm Flow
        • Data Labeling
        • Feature Extraction
        • Feature Selection
        • Classifier
        • Training and Evaluation
      • Pitfalls
        • Word Segmentation
        • Feature Importance
        • Biased Training Set
        • Model Size Optimization
      • One More Thing ...
        • Term Extension
        • Distributed Representation
Classification Tasks

In practice there is substantial demand for text classification in real projects. The tasks fall mainly into the following two categories, and two examples are given for each.
Binary classification

Pornographic news detection. This is a binary classification problem on an imbalanced data set, because pornographic news items are far fewer than non-pornographic ones.

Detecting whether a query is medical. This one is search-related; remember the Putian hospital network incident? This kind of classification is very valuable.

Multi-class classification

Automatic product categorization. Merchants list new items every day, and deciding by hand whether each belongs under "Clothing" or "Digital" is not cost-effective.

Query domain classification. Routing a query to the right domain knowledge base is what powers the flashier, "smart" so-called "box computing" features.

Algorithm Flow

Data Labeling

Classification is supervised learning, so labeled data is required. Initially someone, such as a colleague in operations, has to hand-label some data. With this batch of seed data we can train a first classifier and use it to classify; from then on the work gets easier. Semi-supervised methods can continuously expand the labeled set, sending only the low-confidence items back for manual labeling. This half-manual, half-automatic approach greatly reduces the labeling workload and grows the labeled set quickly.
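The semi-automatic loop above can be sketched as follows. `train_fn` and the confidence threshold are hypothetical placeholders for whatever classifier and cutoff a real project uses:

```python
def self_training_round(labeled, unlabeled, train_fn, threshold=0.9):
    """One round of the semi-automatic labeling loop.

    labeled:   list of (text, label) seed pairs
    unlabeled: list of unlabeled texts
    train_fn:  stand-in for any classifier trainer; returns a
               function text -> (label, confidence)
    Returns the enlarged labeled set and the texts needing manual review.
    """
    classify = train_fn(labeled)
    newly_labeled, needs_review = [], []
    for text in unlabeled:
        label, confidence = classify(text)
        if confidence >= threshold:
            newly_labeled.append((text, label))   # trust the classifier
        else:
            needs_review.append(text)             # send to human annotators
    return labeled + newly_labeled, needs_review
```

Each round enlarges the seed set, so the next classifier is trained on more data; low-confidence items are the only ones a human ever sees.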

Feature Extraction

Classification is based on features, and extracting discriminative features from the data is a key step. The features we use are mainly bag-of-words, built as follows:

1. Word segmentation: jieba is good enough; the main effort is maintaining the dictionary.

2. N-grams: going up to tri-grams is usually sufficient.

3. Term position: mainly useful for query classification.

4. Feature weights: for short text, IDF alone is enough; use TF-IDF for long texts.
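The steps above can be sketched in a dependency-free way. A plain whitespace split stands in for jieba segmentation here, and the smoothed-IDF variant is one common choice, not necessarily the one the original project used:

```python
import math
from collections import Counter

def extract_features(doc, corpus, n=2):
    """Bag-of-words features with n-grams and TF-IDF weights.

    Whitespace tokenization stands in for jieba so the sketch
    stays self-contained.
    """
    def ngrams(tokens):
        grams = list(tokens)
        for k in range(2, n + 1):
            grams += ["_".join(tokens[i:i + k])
                      for i in range(len(tokens) - k + 1)]
        return grams

    docs_grams = [ngrams(d.split()) for d in corpus]
    doc_grams = ngrams(doc.split())
    tf = Counter(doc_grams)
    N = len(corpus)
    features = {}
    for g, count in tf.items():
        df = sum(g in dg for dg in docs_grams)      # document frequency
        idf = math.log((N + 1) / (df + 1)) + 1      # smoothed IDF
        features[g] = (count / len(doc_grams)) * idf  # TF-IDF weight
    return features
```

Terms that appear in every document get the lowest weight, while terms concentrated in a few documents score higher.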

Feature Selection

There are a great many words; should each one become a feature? No. Features with no discriminative power, which are actually counterproductive to classification, should be removed. The selection criteria are as follows:

1. Word frequency threshold: delete terms whose count in the whole corpus falls below a certain threshold.

2. Stop words: beyond the universally common filler words, each application scenario needs to maintain its own stop-word list. For example, "Samsung" is by no means a stop word in product categorization, but it can be one in pornographic news detection.

3. Feature filtering: for a fairly balanced sample set, the information gain ratio (IGR) can be used directly. IGR is not appropriate for imbalanced sample sets (such as porn detection); there, a filter based on the odds ratio works better.

For smoothing here, expected likelihood estimation (ELE) can be used. With these two filters, feature selection is largely taken care of.
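A common form of the odds-ratio filter is OR(t, c) = P(t|c)(1 − P(t|¬c)) / ((1 − P(t|c)) P(t|¬c)), estimated from per-class document counts; ELE smoothing adds 0.5 to every count so the ratio stays finite when a term never occurs in one class. A sketch under those assumptions:

```python
def odds_ratio(term_pos, pos_total, term_neg, neg_total, smooth=0.5):
    """Odds ratio of a term for the positive class, with ELE smoothing.

    term_pos / term_neg: documents in each class containing the term
    pos_total / neg_total: total documents in each class
    smooth=0.5 is the ELE (expected likelihood estimation) constant.
    """
    p_pos = (term_pos + smooth) / (pos_total + 2 * smooth)   # P(t | c)
    p_neg = (term_neg + smooth) / (neg_total + 2 * smooth)   # P(t | not c)
    return (p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg)
```

Terms with an odds ratio well above 1 are strong indicators of the positive class; values near 1 carry no signal and can be dropped.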

Classifier

First, in a shallow machine-learning system (as opposed to deep learning with feature learning), feature engineering often matters more than the choice of classifier. For text classification, my experimental results show discriminative models are the more reliable choice; the following two give similar results, so pick either. For bag-of-words text tasks I personally used to choose Max Entropy.

1. SVM

2. Max Entropy

For the classic spam classification problem, Naive Bayes gives results inferior to the two discriminative models above.

Some people also use decision-tree-based methods, but I have not tried them on sparse feature vectors and do not know how well they work there.

Training and Evaluation

When training the classifier, we use 10-fold cross-validation to prevent a single validation set from introducing bias.
The evaluation metrics are the usual recall, precision, and F1 score; for multi-class problems a confusion matrix is also useful.
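The metrics above are straightforward to compute from the true and predicted labels; a minimal sketch:

```python
from collections import Counter

def evaluate(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, plus a confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    confusion = Counter(pairs)  # (true, predicted) -> count
    return precision, recall, f1, confusion
```

For multi-class evaluation, call it once per class and inspect the confusion matrix to see which classes are being mixed up.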

Pitfalls

Word Segmentation

Pragmatically, word segmentation mainly comes down to maintaining the dictionary; the contribution of an HMM or CRF to the final result is nowhere near as visible as the dictionary's. Depending on the application, some extra work may also be needed. For example, for phone models, rule-based segmentation can keep "Mi 4" as a single token instead of splitting it into "Mi" and "4". Small tricks like this for tuning segmentation granularity depend on your application scenario.
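The rule-based merge can be a simple post-processing pass over the segmenter's output. The brand list here is a hypothetical example; a real system maintains its own:

```python
import re

def merge_models(tokens, brands=("Mi", "iPhone")):
    """Glue a brand token to a following model number so
    ["Mi", "4"] becomes ["Mi4"]. `brands` is illustrative only."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens) and tokens[i] in brands
                and re.fullmatch(r"\d+\w*", tokens[i + 1])):
            out.append(tokens[i] + tokens[i + 1])  # merge brand + model
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Running it after segmentation keeps the model name as one feature rather than two uninformative fragments.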

Feature Importance

Besides TF-IDF, different application scenarios place different demands on feature importance. For example, query classification boosts the weight of nouns and proper nouns, whereas in pornographic news detection adjectives are just as important as nouns.

Biased Training Set

Classifiers assume the classes are roughly evenly distributed; if one class has far more data than the others, the classifier tends to push the other classes' data into the larger class in exchange for the smallest average error. In this situation our practice is to train with different sampling ratios and evaluate each on the test set, choosing the ratio with the best results.
Another common method is MetaCost-style cost weighting, for example multiplying the cost of misclassifying a non-pornographic item by 10.
Both approaches aim at the same goal; the first is easier to operate.
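The ratio sweep needs a resampling helper; one minimal sketch that downsamples the majority class to a chosen multiple of the rest:

```python
import random

def downsample(samples, majority_label, ratio, seed=0):
    """Keep at most `ratio` majority samples per minority sample.

    samples: list of (text, label) pairs; seed fixes the draw
    so sweep results are reproducible.
    """
    rng = random.Random(seed)
    majority = [s for s in samples if s[1] == majority_label]
    rest = [s for s in samples if s[1] != majority_label]
    keep = min(len(majority), int(ratio * len(rest)))
    return rng.sample(majority, keep) + rest
```

Sweeping `ratio` over, say, 1 to 10 and evaluating each resulting classifier on a held-out test set implements the first approach above.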

Model Size Optimization

Sometimes there are constraints on model size, for example when the model must be pushed to devices, consuming bandwidth and device memory; even in the cloud, the overall system may have memory requirements you are asked to optimize for. There are two main methods:

A. Feature selection. Increase the strength of feature selection, for example keeping the top 500 features instead of the original top 1000. This is analogous to principal component analysis: it reduces the information the model carries, so it is lossy.

B. Prune the model. Remove the features whose weights in the final model are especially small or zero, then retrain a new model on the remaining features; this is effectively another round of feature selection. Increasing the regularization factor also helps push weights toward zero.

One More Thing ...

Term Extension

This matters especially for queries: some queries are so short that recall is low. Expanding the query with synonyms, near-synonyms, and co-occurring terms raises recall. Term extension is mainly based on features of this type.
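The expansion step itself is a simple lookup; the `synonyms` map here is a hypothetical stand-in for whatever thesaurus or co-occurrence resource a real system builds:

```python
def expand_query(tokens, synonyms):
    """Expand a short query with synonyms / co-occurring terms.

    synonyms: dict mapping a term to its expansion list
    (illustrative placeholder for a real thesaurus).
    """
    expanded = list(tokens)
    for t in tokens:
        for s in synonyms.get(t, []):
            if s not in expanded:          # avoid duplicate terms
                expanded.append(s)
    return expanded
```

The expanded token list then feeds the same feature extraction as the original query, so more documents can match.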

Distributed Representation

The feature vectors used above give each term (or term combination) its own slot in the vector, and this method achieves good results. But it has a fatal weakness: it merely memorizes combinations of words and does not really represent their meaning. For example, the distance between the vectors for two synonyms of "moon" is exactly the same as the distance between "moon" and "cat", which does not match reality. For text classification to handle this, a distributed representation is needed. There are two main families:

A. Topic-model-based: pLSA, LDA, etc.

B. DNN-based: word2vec, etc.
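The weakness of one-hot slots is easy to demonstrate: every pair of distinct words is exactly the same distance apart, whatever the words mean. A minimal illustration:

```python
import math

def one_hot_distance(vocab, w1, w2):
    """Euclidean distance between one-hot vectors over `vocab`.

    Any two distinct words come out exactly sqrt(2) apart,
    regardless of their semantic similarity.
    """
    v1 = [1.0 if w == w1 else 0.0 for w in vocab]
    v2 = [1.0 if w == w2 else 0.0 for w in vocab]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
```

Distributed representations fix this by placing related words near each other in a dense vector space, so the distance actually reflects meaning.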
