Approximate steps for Chinese text categorization


The text classification problem: given a document P (possibly with a title T), assign the document to one or more of N categories.
Text classification applications: spam detection, sentiment analysis, and other common tasks.
Text classification settings: mainly binary classification, multi-class classification, and multi-label classification.
Text classification methods: traditional machine learning methods (naive Bayes, SVM, etc.) and deep learning methods (fastText, TextCNN, etc.).
The text classification pipeline is mainly divided into text preprocessing, text feature extraction, classification model construction, and so on. Compared with English text classification, the preprocessing of Chinese text is the key technology.

First, Chinese word segmentation: for Chinese text classification, a key technology is Chinese word segmentation. Word granularity is far better than character granularity as the feature unit: most classification algorithms do not consider word-order information, so character granularity loses too much n-gram information. Below is a brief summary of Chinese word segmentation techniques: word segmentation based on string matching, word segmentation based on understanding, and statistics-based word segmentation. For details, see: Chinese word segmentation principles and an introduction to segmentation tools.

1, word segmentation based on string matching:
Process: this is dictionary-based Chinese word segmentation. The core idea is to first build a unified dictionary; when a sentence needs to be segmented, it is split into candidate pieces, and each piece is matched against the dictionary. If a piece is in the dictionary, segmentation succeeds; otherwise, splitting and matching continue until they succeed (see the sketch after this item).
Core: the dictionary, the splitting rules, and the matching order are the core components.
Analysis: the advantages are speed (time complexity stays at O(n)), simple implementation, and fair accuracy; however, it handles ambiguity and out-of-vocabulary words poorly.
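
As an illustration, here is a minimal sketch of forward maximum matching, the simplest string-matching strategy; the toy dictionary and example sentence are assumptions for illustration, not from the original post. Note how the greedy longest match produces a wrong split, showing the ambiguity problem mentioned above.

```python
# Forward maximum matching (FMM): a minimal dictionary-based segmenter.
DICTIONARY = {"研究", "研究生", "生命", "的", "起源"}  # toy dictionary
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def fmm_segment(sentence: str) -> list[str]:
    """Greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest window first, then shrink until a match (or a single char).
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in DICTIONARY:
                tokens.append(piece)
                i += size
                break
    return tokens

# Greedy matching picks 研究生 ("graduate student") instead of 研究/生命,
# illustrating the ambiguity weakness of pure string matching.
print(fmm_segment("研究生命的起源"))  # ['研究生', '命', '的', '起源']
```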

2, word segmentation based on understanding:

This method has the computer simulate human understanding of the sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis alongside segmentation, using syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntactic-semantic analysis subsystem, and a general control component. Under the coordination of the control component, the segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguity; that is, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize the various kinds of linguistic information into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.

3, statistics-based word segmentation:
Process: the statistical view treats word segmentation as a probability-maximization problem: given a corpus, count how often adjacent characters co-occur as words, and choose the split whose words have the highest overall probability. A complete, representative corpus is therefore very important (a toy sketch follows below).
The main statistical models are: the n-gram language model, the hidden Markov model (HMM), the maximum entropy model (ME), conditional random fields (CRF), and so on.
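
As a toy illustration of the probability-maximization idea, the sketch below chooses the split with the highest total log-probability under a made-up unigram model; real systems estimate n-gram/HMM/CRF models from a large corpus.

```python
import math

# Toy unigram probabilities (made up); a real system estimates these from a corpus.
WORD_PROB = {"研究": 0.05, "研究生": 0.01, "生命": 0.04, "的": 0.1, "起源": 0.03}

def best_segmentation(sentence: str) -> list[str]:
    """Dynamic programming: best[i] = highest-scoring segmentation of sentence[:i]."""
    n = len(sentence)
    best = [(0.0, [])] + [(-math.inf, []) for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):        # consider words up to 4 chars
            word = sentence[j:i]
            prob = WORD_PROB.get(word, 1e-8 if len(word) == 1 else None)
            if prob is None:
                continue                         # unknown multi-char span
            score = best[j][0] + math.log(prob)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [word])
    return best[n][1]

# Unlike greedy matching, probability maximization recovers the intended split.
print(best_segmentation("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```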

Second, text preprocessing:
1, word segmentation: for Chinese tasks, word segmentation is necessary; jieba is generally used and is the industry leader.
2, stop-word removal: build a stop-word dictionary; current stop-word lists contain around 2,000 entries, mainly adverbs, adjectives, and some conjunctions. Maintaining a stop-word list is in effect a feature-extraction process and is essentially part of feature selection.
3, part-of-speech tagging: determine each word's part of speech (verb, noun, adjective, adverb, ...) after segmentation; this can be obtained by setting parameters when segmenting with jieba, as in the sketch below.
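
A short sketch of the three preprocessing steps using jieba; the example sentence and the three-word stop list are illustrative assumptions.

```python
# Requires: pip install jieba
import jieba
import jieba.posseg as pseg  # segmentation with part-of-speech tags

STOP_WORDS = {"的", "了", "是"}  # tiny made-up stop list; real lists have ~2,000 entries

text = "我爱自然语言处理"

# 1) Word segmentation
tokens = jieba.lcut(text)
print(tokens)

# 2) Stop-word removal
tokens = [t for t in tokens if t not in STOP_WORDS]
print(tokens)

# 3) Part-of-speech tagging: pseg.cut yields (word, flag) pairs,
#    e.g. 'v' for verbs, 'n' for nouns.
print([(w.word, w.flag) for w in pseg.cut(text)])
```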

Third, text feature engineering: the core of text classification is how to extract from the text the key features that capture its character, and to model the mapping between features and categories. Feature engineering is therefore important and can be broken into the following parts:

1, features based on the bag-of-words model: a bag of words built from single words (unigrams) can reach tens of thousands of dimensions; if bigrams and trigrams are also considered, the bag-of-words size can reach hundreds of thousands, so feature representations based on the bag-of-words model are usually extremely sparse.
(1) There are three ways to weight a bag of words:

Naive version: ignore word frequency; mark 1 at the corresponding position if the word occurs, otherwise 0;
Term frequency (TF): assume that the more often a word appears in a text, the more important it is, so its weight is larger;
Word importance: use TF-IDF to characterize a word's importance. TF-IDF reflects a compromise: TF holds that the more times a word appears in a document, the more important it may be, though not always (e.g., stop words such as "is"); IDF holds that the fewer documents a word appears in, the more important it is, though again not always (e.g., rare but meaningless words). A sketch of all three follows.
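
A minimal sketch of the three weighting schemes using scikit-learn; the pre-segmented toy corpus is an assumption for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Pre-segmented toy documents, words joined by spaces.
docs = ["我 爱 自然 语言 处理", "自然 语言 处理 很 有趣", "我 爱 机器 学习"]
pattern = r"(?u)\b\w+\b"  # keep single-character Chinese tokens

vectorizers = {
    "binary": CountVectorizer(binary=True, token_pattern=pattern),  # naive 0/1
    "tf": CountVectorizer(token_pattern=pattern),                   # term frequency
    "tfidf": TfidfVectorizer(token_pattern=pattern),                # TF-IDF
}
for name, vec in vectorizers.items():
    X = vec.fit_transform(docs)   # sparse (n_docs, vocab_size) matrix
    print(name, X.shape, vec.get_feature_names_out())
```
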
(2) Advantages and disadvantages:

Advantages: the bag-of-words model is simple and intuitive, and it can usually learn some mapping between keywords and categories.
Disadvantages: it loses the order in which words appear in the text; words are treated as mere symbols, ignoring semantic relations between them (for example, the Chinese synonyms "麦克风" and "话筒" are different words, but both mean "microphone");


2, features based on embeddings: compute the text representation from word vectors (mainly for short text), as sketched after this list.

Averaging: take the sum (or average) of the word vectors of a short text as the text's vector;
Network features: feed the text into a pre-trained neural network model and use the last layer's hidden vector as the text representation;
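
A minimal sketch of the averaging approach; the 4-dimensional word vectors are made up and stand in for pre-trained embeddings such as word2vec or fastText vectors.

```python
import numpy as np

# Toy 4-dimensional word vectors; in practice these come from a
# pre-trained model such as word2vec or fastText.
EMBEDDINGS = {
    "自然": np.array([0.1, 0.3, -0.2, 0.5]),
    "语言": np.array([0.0, 0.2, 0.1, -0.1]),
    "处理": np.array([0.4, -0.3, 0.2, 0.0]),
}
DIM = 4

def text_vector(tokens: list[str]) -> np.ndarray:
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

print(text_vector(["自然", "语言", "处理"]))
```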


3, features extracted by a neural network model: the advantage of a neural network is that the model can be trained and tested end to end, using its nonlinearity and large number of parameters to learn features without manual feature extraction. CNNs are good at capturing key local information in text, while RNNs are good at capturing contextual information (taking word order into account) and have some memory capability.

4, features extracted from the task itself: these are designed for the specific task; by observing and getting a feel for the data, we may find useful features. Sometimes these manual features greatly improve the final classification result. For example, in a positive/negative review classification task, the number of negation words contained in a negative review is a strong one-dimensional feature (a toy sketch follows below).
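
A toy sketch of such a hand-crafted feature; the negation-word list is an illustrative assumption, not from the original post.

```python
# Toy task-specific feature: count negation words in a segmented review.
NEGATION_WORDS = {"不", "没有", "别", "无", "非"}  # made-up negation list

def negation_count(tokens: list[str]) -> int:
    """One-dimensional manual feature: number of negation words."""
    return sum(1 for t in tokens if t in NEGATION_WORDS)

print(negation_count(["这", "部", "电影", "不", "好看", "没有", "亮点"]))  # 2
```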

5, feature fusion: when the feature dimension is high and the data pattern is complex, a nonlinear model is recommended (such as the popular GBDT or XGBoost); when the feature dimension is low and the data pattern is simple, a simple linear model (such as logistic regression, LR) is recommended.

6, topic features:
LDA (document topics): assume a document collection has T topics and that a document may belong to one or more of them. The LDA model computes the probability that each document belongs to each topic, yielding a D×T matrix. LDA features perform well on tasks such as document tagging.
LSI (latent semantics of documents): compute the latent semantics of documents by decomposing the document-term-frequency matrix; it is somewhat similar to LDA, and both are latent features of documents. A sketch of both follows below.
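
A minimal sketch of both topic features using gensim; the token lists and the choice of two topics are toy assumptions.

```python
# Requires: pip install gensim
from gensim import corpora, models

# Pre-segmented toy documents (token lists).
texts = [["自然", "语言", "处理"], ["机器", "学习", "模型"], ["语言", "模型", "学习"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]   # document-term counts

# LDA: each document becomes a distribution over T topics (a D x T matrix).
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print([lda.get_document_topics(bow) for bow in corpus])

# LSI: latent semantics via SVD of the (TF-IDF weighted) document-term matrix.
tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], num_topics=2, id2word=dictionary)
print([lsi[bow] for bow in tfidf[corpus]])
```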

Fourth, text classification with traditional machine learning methods: this part is not the focus; any traditional machine learning classification algorithm can be used. Common choices are the naive Bayes (NB) model, the random forest model (RF), the SVM classification model, the KNN classification model, and neural network classification models.
A word on the Bayesian model, since industry uses it to identify spam; for details see the reference "Text classification with naive Bayes". A minimal sketch follows below.
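
A minimal sketch of such a spam classifier with scikit-learn's multinomial naive Bayes; the pre-segmented texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy pre-segmented spam/ham examples; texts and labels are made up.
texts = ["免费 中奖 点击 链接", "今晚 开会 请 准时", "点击 领取 免费 奖品", "项目 进度 汇报"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

clf = make_pipeline(
    TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"),  # keep single-char tokens
    MultinomialNB(),
)
clf.fit(texts, labels)
print(clf.predict(["免费 领取 奖品"]))  # expected: [1]
```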

Fifth, deep learning text classification models:
1, fastText model: fastText comes from the paper that word2vec author Mikolov published in July 2016 after moving to Facebook: "Bag of Tricks for Efficient Text Classification".
Principle: all the word vectors in the sentence are averaged (in a sense this can be understood as a special CNN with only one average-pooling layer) and then fed directly into a softmax layer. A usage sketch follows below.
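
A minimal usage sketch with the open-source fasttext package; the file name, data, and hyperparameters are illustrative assumptions.

```python
# Requires: pip install fasttext
# train.txt (made-up example), one sample per line, words space-separated:
#   __label__spam 免费 中奖 点击 链接
#   __label__ham 今晚 开会 请 准时
import fasttext

model = fasttext.train_supervised("train.txt", epoch=10, wordNgrams=2)
print(model.predict("免费 领取 奖品"))  # returns (labels, probabilities)
```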

2, TextCNN: uses a CNN to extract key n-gram-like information from the sentence.
Improvement: the network structure of fastText completely ignores word-order information, while TextCNN extracts key n-gram-like features from the sentence; a minimal sketch follows below.
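
A minimal TextCNN sketch in PyTorch; the hyperparameters (filter sizes 2/3/4, 100 filters per size, dropout 0.5) are common choices assumed for illustration, not values from the original post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 kernel_sizes=(2, 3, 4), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One conv per kernel size; each filter spans the full embedding dim,
        # so a size-k kernel looks at k consecutive words (an n-gram window).
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (k, embed_dim)) for k in kernel_sizes
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                      # x: (batch, seq_len) word ids
        x = self.embedding(x).unsqueeze(1)     # (batch, 1, seq_len, embed_dim)
        pooled = []
        for conv in self.convs:
            h = F.relu(conv(x)).squeeze(3)     # (batch, filters, seq_len-k+1)
            pooled.append(F.max_pool1d(h, h.size(2)).squeeze(2))  # max over time
        out = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(out)                    # logits over classes

model = TextCNN(vocab_size=5000)
logits = model(torch.randint(0, 5000, (8, 20)))  # batch of 8, length-20 texts
print(logits.shape)                              # torch.Size([8, 2])
```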

3, TextRNN:
Model: a bidirectional RNN (in practice a bidirectional LSTM) can, in a sense, be understood as capturing variable-length, bidirectional "n-gram" information (sketch below).
Improvement: one of CNN's biggest problems is the fixed receptive field set by filter_size: on the one hand it cannot model longer sequence information, and on the other hand tuning filter_size is very cumbersome.
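
A minimal TextRNN (bidirectional LSTM) sketch in PyTorch, with illustrative sizes.

```python
import torch
import torch.nn as nn

class TextRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden * 2, num_classes)  # forward + backward states

    def forward(self, x):                        # x: (batch, seq_len) word ids
        h, _ = self.lstm(self.embedding(x))      # (batch, seq_len, 2*hidden)
        return self.fc(h[:, -1, :])              # classify from the last time step

model = TextRNN(vocab_size=5000)
print(model(torch.randint(0, 5000, (8, 20))).shape)  # torch.Size([8, 2])
```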

4, TextRNN + Attention:
Improvement: the attention mechanism is a common way of modeling long-range dependencies in natural language processing; it gives a very intuitive view of each word's contribution to the result and is essentially standard in seq2seq models. Text classification can, in a sense, be understood as a special kind of seq2seq, so the attention mechanism has recently been introduced into it; a sketch follows below.
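
A sketch of one simple attention-pooling formulation over the RNN's hidden states; the linear scoring function is an assumption for illustration, as many attention variants exist.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Score each time step, softmax the scores, and take a weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one relevance score per time step

    def forward(self, h):                                      # h: (batch, seq_len, dim)
        weights = F.softmax(self.scorer(h).squeeze(2), dim=1)  # (batch, seq_len)
        return (weights.unsqueeze(2) * h).sum(dim=1)           # (batch, dim)

pool = AttentionPool(dim=256)
states = torch.randn(8, 20, 256)   # e.g., BiLSTM outputs
print(pool(states).shape)          # torch.Size([8, 256])
```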

5, TextRCNN (TextRNN + CNN):
Model: the RCNN of the paper below first builds contextualized word representations with a bidirectional RNN, then applies max-pooling over time, combining TextRNN's context modeling with TextCNN-style pooling.

Paper: Recurrent Convolutional Neural Networks for Text Classification

6, deep learning experience:
The model is clearly not the most important thing: good model design matters for good results and is the focus of academic attention, but in actual use the model accounts for a relatively small share of the workload. Although five kinds of CNN/RNN models and their variants were introduced above, in practice CNN alone is enough to achieve good results on text classification; in our experiments RCNN improved accuracy by only about 1%, which is not very significant. The best practice is to first tune the overall task to its best with the TextCNN model, and then try to improve the model.

Understand your data: although a big advantage of deep learning is that it removes the need for cumbersome, inefficient manual feature engineering, if you just treat the model as a black box you will often find yourself questioning everything. Be sure to understand your data, and remember that intuition about the data is always important, whether you use a traditional approach or deep learning. Pay attention to bad-case analysis: understand whether your data is suitable and why the model gets cases wrong.

Hyperparameter tuning: see the Zhihu column post on deep learning network tuning tips: https://zhuanlan.zhihu.com/p/24720954?utm_source=zhihu&utm_medium=social

Be sure to use dropout: there are only two cases where you can skip it: the amount of data is very small, or you use a better regularization method, such as batch normalization (BN). In practice we tried different dropout rates, and 0.5 worked best, so if your computing resources are limited, the default of 0.5 is a good choice.

Softmax loss is not always the answer: it depends on your data. If your task is multi-class with non-mutually-exclusive categories, try training multiple binary classifiers, i.e., define the problem as multi-label rather than multi-class; after this adjustment our accuracy increased by more than 1% (sketch below).
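
A sketch of the multi-label setup in PyTorch: one sigmoid per class with binary cross-entropy, instead of a single softmax over mutually exclusive classes; shapes and data are made up.

```python
import torch
import torch.nn as nn

num_classes = 4
logits = torch.randn(8, num_classes)  # model outputs for a batch of 8
targets = torch.randint(0, 2, (8, num_classes)).float()  # several labels per sample

loss = nn.BCEWithLogitsLoss()(logits, targets)  # one binary classifier per class
preds = (torch.sigmoid(logits) > 0.5).int()     # independent per-class decisions
print(loss.item(), preds.shape)
```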

Class imbalance: it is basically a conclusion validated in many scenarios that if your loss is dominated by a subset of the categories, the effect on the overall result is mostly negative. It is recommended to try a bootstrap-like method to resample, or to adjust the sample weights in the loss, as sketched below.
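
A sketch of one such adjustment in PyTorch, using inverse-frequency class weights in the loss (a weighting approach rather than bootstrap resampling); the class counts are made up.

```python
import torch
import torch.nn as nn

class_counts = torch.tensor([900.0, 80.0, 20.0])  # made-up imbalanced counts
weights = class_counts.sum() / (len(class_counts) * class_counts)  # inverse frequency

criterion = nn.CrossEntropyLoss(weight=weights)   # rare classes weigh more
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(criterion(logits, labels).item())
```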

Avoid training oscillation: by default, be sure to increase the randomness of sampling so that the data distribution is as close to i.i.d. as possible; the default shuffle mechanism can make the training results more stable. If the trained model is still volatile, consider adjusting the learning rate or mini_batch_size.

