I have worked on several text mining projects, such as web page classification, microblog sentiment analysis, and user comment mining. I also packaged libsvm and wrote the text classification tool tmsvm. Here I summarize some earlier articles on text classification.
1. Basic Knowledge
1.1 Samples
Text classification is supervised learning, so samples must be prepared first. Determine the sample labels and sample counts according to business requirements. A sample label is an integer. For binary classification with SVM, labels are generally set to -1 and 1; in Naive Bayes they are generally 0 and 1. These values are not fixed; the label convention depends on the nature of the algorithm.
The samples below are examples: 1 is the positive class and -1 is the negative class. (For ease of display, some text from an instant-messaging tool is used here; some of the conversations are invented and not real.)
Table 1.1 Example of training samples

Tag | Sample
 1  | If you want to buy products, please add my qq61517891 to contact me to buy!
 1  | Contact qq11211_282
 1  | If you need to order, please add my QQ
-1  | How is the experience of the Sony Ericsson phone this month?
-1  | Sorry, this is the cheapest price.
-1  | The one with three items is sold at a high price.
1.2 Feature Selection
The best-known feature extraction method in text classification is the vector space model (VSM), which converts samples into vectors. To achieve this conversion, two tasks are needed: determining the feature set and extracting the features.
1.2.1 determine the feature set
A feature set is essentially a dictionary, and each word in it also needs a number.
Generally, the words occurring in all samples can be extracted to form the dictionary, and the word numbers can be assigned arbitrarily; by default, all words carry the same weight. How do we extract meaningful words from the samples? The most common method is to use a word segmentation tool. For example, "If you want to buy products, please add qq61517891 to contact me to buy!" can be segmented into "want^buy^products^please^add^me^qq61517891^contact^me^buy^!", where "^" separates the words. Common segmentation tools include ICTCLAS (C++) and IKsegment (Java).
The figure below is a typical flowchart for generating the dictionary from samples.
Figure 1.1 flowchart of extracting dictionary from samples
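The dictionary-generation flow above can be sketched in code as follows (a minimal illustration; the pre-segmented sample strings and the "^" separator follow the example above, but the sample text is made up, not real ICTCLAS/IKsegment output):

```python
# Build a word -> id dictionary from pre-segmented samples.
# Words are separated by "^", mirroring the segmentation example above.
def build_dictionary(segmented_samples):
    dictionary = {}
    for sample in segmented_samples:
        for word in sample.split("^"):
            word = word.strip()
            if word and word not in dictionary:
                dictionary[word] = len(dictionary)  # assign the next free id
    return dictionary

samples = [
    "please^add^qq61517891^contact^me^to^buy",
    "please^add^my^qq^to^order",
]
vocab = build_dictionary(samples)
```

By default every word that appears in any sample gets an entry; feature selection (Section 1.2.2) later prunes this dictionary.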
1.2.2 Feature Selection
Depending on the business, the dictionary size in text classification ranges from tens of thousands to tens of millions of entries. Such a high dimensionality may cause the curse of dimensionality, so we need a way to select some representative features from the large feature set without hurting classification performance (according to results in the literature, feature selection can even improve classification performance to a certain extent). Feature selection means choosing representative words from the feature set. How do we measure how representative a word is? Common measures include word frequency, the chi-square statistic, and information gain; the chi-square statistic is among the better methods.
The following are several articles about feature selection:
1. Differences between feature selection and feature weight calculation: http://www.blogjava.net/zhenandaci/archive/2009/04/19/266388.html
2. Information gain as a feature selection method: http://www.blogjava.net/zhenandaci/archive/2009/03/24/261701.html
3. The chi-square test as a feature selection algorithm: http://www.blogjava.net/zhenandaci/archive/2008/08/31/225966.html
1.2.3 Feature Extraction
Feature extraction is another way to counter the curse of dimensionality. Like feature selection it reduces dimensionality, but it takes a more sophisticated approach: topic modeling is used to map the high-dimensional space to a low-dimensional space. For details, see Section 2.2 Feature Extraction.
1.3 calculate feature weight
How can we convert a sample to a vector?
First, here is a flowchart:
Figure 1.2 feature weight calculation process
Process:
1) First, perform word segmentation on the sample to extract all words.
2) Based on the generated dictionary, if a dictionary word appears in the sample, fill in that word's frequency at the corresponding position.
3) Normalize the generated Vectors
The method shown above is relatively simple: the feature weight is just the word frequency. Currently the commonly used feature weight schemes are TF*IDF and TF*RF. For details, see Section 2.3 on feature weights.
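The word-frequency weighting and normalization just described, with the global IDF factor added on top, can be sketched as follows (the tiny corpus and vocabulary are made up for illustration; tmsvm's actual implementation may differ):

```python
import math

def tf_vector(words, vocab):
    """Raw term-frequency vector over the dictionary positions."""
    vec = [0.0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

def l2_normalize(vec):
    """Scale the vector to unit length (step 3 of the process above)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def idf(vocab, docs):
    """idf(t) = log(N / df(t)), the usual global factor in TF*IDF."""
    n = len(docs)
    out = [0.0] * len(vocab)
    for w, i in vocab.items():
        df = sum(1 for d in docs if w in d)
        out[i] = math.log(n / df) if df else 0.0
    return out

docs = [["buy", "qq", "buy"], ["order", "qq"], ["price", "high"]]
vocab = {"buy": 0, "qq": 1, "order": 2, "price": 3, "high": 4}
tf = tf_vector(docs[0], vocab)              # [2, 1, 0, 0, 0]
idfs = idf(vocab, docs)
tfidf = l2_normalize([t * g for t, g in zip(tf, idfs)])
```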
1.4 model training and Prediction
After converting the text into a vector, most of the work is actually done. The next step is to use algorithms for training and prediction.
There are many algorithms for text classification today; common ones include Naive Bayes, SVM, KNN, and logistic regression. SVM enjoys a great reputation in industry and academia, but as far as I know SVM is not used much in my company, while logistic regression is common, because it is relatively simple and supports parallel training. Most importantly, it is simple and reliable.
As for what these algorithms are, I will not discuss them here: there is plenty of literature about them on the web, and many open-source implementations you can use directly.
Materials and procedures:
1. http://blog.163.com/jiayouweijiewj@126/blog/static/17123217720113115027394/ analyzes in detail how Naive Bayes is implemented in Mahout.
2. http://www.csie.ntu.edu.tw/~cjlin/libsvm is an open-source tool for SVM training and prediction; you can use it directly after downloading, and the author's documentation is detailed.
3. http://blog.pluskid.org/?page_id=683
4. https://code.google.com/p/tmsvm/ tmsvm is a program I wrote earlier for text classification with SVM; it covers the entire text classification process.
1.5 Further reading:
Text classification technology has been studied for many years, so there are a lot of related materials. You can further read the following materials.
1. http://www.blogjava.net/zhenandaci/category/31868.html?show=all is an entry-level series on text classification; it is quite detailed.
2. "Research on Several Key Issues in Text Mining": this book is thin, but it goes deep, discussing several key issues in text mining.
2. Discussion of Several Issues
2.1 Feature Selection
Feature selection selects representative words from the dictionary according to some weighting formula. There are many common feature selection methods: CHI (chi-square), mutual information, and information gain; TF and IDF can also be used for feature selection. Many people have run extensive experiments on this question, and the CHI method comes out best, so it is the one used in the system (tmsvm). Both Wikipedia and the literature explain feature selection in detail.
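The CHI score for a term/class pair can be computed from a 2x2 contingency table. Below is a minimal sketch of the standard chi-square formula (the toy counts are made up, and this is not necessarily tmsvm's exact implementation):

```python
def chi_square(A, B, C, D):
    """CHI statistic for a term/class pair from a 2x2 contingency table:
    A: docs in the class containing the term
    B: docs outside the class containing the term
    C: docs in the class without the term
    D: docs outside the class without the term
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

# Toy counts: the term appears in 8 of 10 positive docs and 1 of 10 negative docs.
score = chi_square(A=8, B=1, C=2, D=9)
```

Terms are then ranked by their highest CHI score over the classes, and only the top-ranked terms are kept in the dictionary.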
2.2 Feature Extraction
Feature extraction and feature selection both aim at dimensionality reduction. Feature selection picks representative words from the dictionary, while feature extraction uses a mapping from the high-dimensional space to a low-dimensional space to reduce dimensionality. The most common feature extraction method is Latent Semantic Analysis (LSA), which is also a form of topic modeling; commonly used topic modeling methods include LSA, pLSA, and LDA. I previously used LSA.
Assume the original term-document matrix X has M terms and N documents, and the j-th column d_j is the vector of the j-th document. After SVD decomposition X = U·S·V^T, keep only the first K singular values and reorganize the document feature vectors; each document's feature vector is then reduced from the original M dimensions to K dimensions. A new document d can be mapped into the U space as d' = S_K^(-1)·U_K^T·d. There is actually another mapping method as well, but in experiments the former mapping worked better. LSA is also described in detail on Wikipedia.
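A minimal sketch of the SVD-based reduction and folding-in described above, using a made-up 4x3 term-document matrix (assuming the standard LSA folding-in formula d' = S_K^(-1)·U_K^T·d):

```python
import numpy as np

# Toy term-document matrix X: M=4 terms, N=3 documents (columns).
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                  # keep the first k singular values
Uk, Sk = U[:, :k], np.diag(s[:k])

# Documents in the reduced space: each column drops from M=4 dims to k=2.
docs_k = Sk @ Vt[:k, :]

# Fold a new M-dim document into the U space: d' = Sk^{-1} Uk^T d
d_new = np.array([1.0, 0.0, 1.0, 0.0])
d_mapped = np.linalg.inv(Sk) @ Uk.T @ d_new
```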
This system uses LSA for classification as a method called local relevancy weighted LSI. The main steps are as follows:
* Model Training
① Training the initial classifier C0
② Predict training samples and generate the initial score
③ Document feature vector Transformation
④ Set the threshold and select the Top N document as the local LSA Region
⑤ Perform SVD decomposition on the local term/document matrix to obtain the U, S, and V matrices
⑥ Map other training samples to the U Space
⑦ Train on all transformed training samples to obtain the LSA classifier
* Model Prediction
① Obtain the initial score using C0 Prediction
② Document feature vector Transformation
③ Map to u Space
④ Use the LSA model to predict the score
2.3 feature weight calculation
The general formula for the feature weight in a document feature vector is w(i, j) = local(i, j) × global(i), where w(i, j) is the weight of the i-th term in the j-th document vector. local(i, j) is called the local factor and is related to how many times the term appears in that document. global(i) is called the global factor of the term and is related to how the term appears across the whole training set. The familiar formulas can generally be converted to this form; for example, in the most commonly used scheme TF*IDF, TF is the local factor and IDF the global factor. Therefore we can compute each term's global factor while building the dictionary, store the value in the dictionary, and call it directly when computing feature weights.
The specific flowchart is as follows:
Figure 2.3 feature weight calculation process
Which feature weight calculation method works best in classification? TF*IDF? TF*IDF is the most commonly used in the literature, but its effect is not guaranteed. Some people have worked on this question: Man Lan of the National University of Singapore published articles at ACM and AAAI on this issue, and Zhi-Hong Deng also made a systematic comparison of various feature weight methods. The final conclusion is that TF*IDF is not the best; the simplest TF performs quite well, while some supervised methods such as TF*CHI perform poorly. Later, Man Lan published a journal paper on term weighting methods, describing them comprehensively and meticulously, and demonstrating the advantages of the proposed TF*RF method from many angles.
2.4 tmsvm model training and prediction process
Training process: automatic SVM model training for text, including libsvm/liblinear package selection, word segmentation, dictionary generation, feature selection, SVM parameter selection, and SVM model training. See the figure below.
Figure 2.4 training process of tmsvm Model
Model Prediction Process:
Figure 2.5 simultaneous prediction process for multiple models
Model results:
The model returns two results: label and score. The label is the predicted class label, and the score indicates the degree to which the sample belongs to that class: the larger the score, the higher the confidence. The specific calculation follows a formula in which K is the number of classes supporting the prediction, N the total number of classes, and Si the scores of the supporting classes. Returning a score matters for information filtering: because the training sample is unbalanced and not randomly sampled, the precision obtained from the predicted label alone is low, so threshold control is required.
2.5 SVM Parameter Selection
The two most important parameters in libsvm are C and gamma. C is the penalty coefficient, i.e., the tolerance for error: the higher C, the less error is tolerated, and a C too large or too small both hurt generalization. Gamma is a parameter that comes with the RBF kernel once RBF is selected as the kernel function; it implicitly determines the distribution of the data after mapping into the new feature space. The larger gamma, the fewer the support vectors; the smaller gamma, the more support vectors. The number of support vectors affects training and prediction speed. Chih-Jen Lin gives a detailed introduction on his homepage.
The C parameter of liblinear is also very important.
Therefore, the system uses 5-fold cross-validation and performs a grid search for C and gamma over a certain range. For details on grid search, refer to the paper and the grid.py source file in the tools folder of libsvm. Grid search can find the globally optimal parameters.
To speed up the SVM parameter search, two granularities of search are used: coarse-grained and fine-grained. The difference between them is the search step. A coarse-grained search uses a large step to locate, within a large range, the general region where the optimum lies; a fine-grained search then uses a small step to find the specific parameters within that small range.
For large sample files the above is still time-consuming. To further improve efficiency while preserving global optimality, first run a coarse-grained search on a subset of the samples, then run a fine-grained search on the full sample set within the optimal region obtained.
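The coarse-then-fine search can be sketched as follows (the cross-validation function here is a stand-in with an artificial peak; a real run would train libsvm with each (C, gamma) pair over 5 folds):

```python
import itertools

def grid_search(evaluate, c_values, g_values):
    """Return the (C, gamma) pair with the highest evaluation score."""
    return max(itertools.product(c_values, g_values),
               key=lambda p: evaluate(*p))

def refine(center, step, n=2):
    """Fine grid around a coarse optimum: center * 2**(i*step) for small i."""
    return [center * 2 ** (i * step) for i in range(-n, n + 1)]

# Placeholder for 5-fold cross-validation accuracy; artificial peak
# near C=8, gamma=0.5 purely for demonstration.
def cv_accuracy(c, g):
    return -((c - 8.0) ** 2 / 64.0 + (g - 0.5) ** 2)

coarse_c = [2.0 ** i for i in range(-5, 16, 4)]   # large step: coarse search
coarse_g = [2.0 ** i for i in range(-15, 4, 4)]
c0, g0 = grid_search(cv_accuracy, coarse_c, coarse_g)

fine_c = refine(c0, step=0.25)                    # small step: fine search
fine_g = refine(g0, step=0.25)
c1, g1 = grid_search(cv_accuracy, fine_c, fine_g)
```

In the two-stage variant described above, the coarse pass would run on a subset of the samples and only the fine pass on the full set.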
2.6 parallelization of SVM Parameter Selection
SVM training takes a long time, especially the search for suitable parameters, so it is natural to ask whether the parameter search can be parallelized. The method I used previously was to parallelize the parameter selection while training each single parameter set serially on one machine. I have put the training method on my blog and will not paste it here.
2.7 multi-classification policies of libsvm and liblinear
The multi-classification strategy of libsvm is one-against-one. There are K*(K-1)/2 binary classifiers in total. Traverse them: if the classifier for classes i and j outputs a value greater than 0, class i gets one vote; otherwise class j gets one vote. The class with the most votes is the final class.
The strategy of liblinear is one-against-rest. There are K binary classifiers in total; the class whose classifier outputs the maximum value is chosen as the final predicted label.
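The two strategies can be sketched as follows (the decision values are fixed toy numbers standing in for trained binary classifiers):

```python
from itertools import combinations

def one_against_one_predict(classifiers, k):
    """libsvm-style voting over K*(K-1)/2 pairwise classifiers.
    classifiers[(i, j)]() returns > 0 for class i, otherwise class j wins."""
    votes = [0] * k
    for i, j in combinations(range(k), 2):
        if classifiers[(i, j)]() > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    return votes.index(max(votes))

def one_against_rest_predict(scores):
    """liblinear-style: K classifiers, pick the class with the max value."""
    return scores.index(max(scores))

# Toy 3-class example with fixed decision values for illustration.
decision = {(0, 1): lambda: -0.4,   # favors class 1
            (0, 2): lambda: 0.7,    # favors class 0
            (1, 2): lambda: 0.9}    # favors class 1
label_ovo = one_against_one_predict(decision, k=3)   # class 1 wins 2 votes
label_ovr = one_against_rest_predict([0.2, 1.3, -0.5])
```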
2.8 Impact of repeated samples on the SVM model
How does repeated samples affect the SVM model?
I did an experiment to see the impact of repeated samples.
The original training set contains 1000 positive samples and 2000 negative samples. Duplicating the positive samples (x2) then gives a set with 2000 positive and 2000 negative samples. Both models were tested on a set containing 4494 positive and 24206 negative samples. The final results are as follows:
Figure 2.8 impact of duplicate samples on results
Judging by the F value, the model without repeated samples is slightly higher than the one with repeated samples (two decimal places are kept in the figure, but the difference does not exceed 0.5%). In terms of precision, the model with repeated samples is slightly higher than the one without.
Then I put both models on a set containing tens of millions of samples for testing. Specific metrics could not be measured there, but the impression was that keeping the repeated samples performed slightly better.
Specific analysis:
1. Repeating a sample multiple times is equivalent to increasing that sample's weight (SVM has a weighted-instance variant). Real samples inevitably contain some mislabeled ones; if the same text appears in both the positive and the negative class, the duplicated copies in the positive class offset the negative impact of the mislabeled negative samples. In SVM classification, penalty functions are used to control such outliers anyway.
2. Keeping repeated samples increases the sample count. For libsvm the training complexity is roughly O(Nsv^3); if a sample becomes a support vector, all of its duplicates become support vectors too. Moreover, to select appropriate parameters for an SVM model with an RBF kernel, a suitable cost C and RBF parameter gamma must be chosen; with 9 candidate values for each, 9*9 = 81 parameter sets must be tried, and with 5-fold cross-validation under each set, 81*5 = 405 rounds of training and testing are required. If each round takes 2 minutes (once the sample reaches the 100-thousand scale, libsvm training time is measured in minutes), the total is 405*2/60 = 13.5 hours, so training a good SVM model is not easy. Removing duplicate samples is therefore of great benefit to training efficiency.
2.9 Classification applications and information filtering
What has the greatest impact on the final effect of classification applications and information filtering? The classification algorithm? Dictionary size? Feature selection? Model parameters? All of these affect the final filtering effect, but the factor with the biggest impact is the sampling of the training set.
Machine-learning classification algorithms are generally based on the assumption that the training set and the test set share the same distribution, so that a classifier trained on the training set will perform comparably on the test set.
However, the data for information filtering generally comes from the Internet, and Internet data can hardly be sampled randomly. As shown in the figure: generally, information filtering and other Internet-facing applications must involve P (positive, the samples the user cares about) and N (negative, the samples the user does not care about or is not interested in). Ideally P is exactly what interests users, while N is the entire network minus P. Obviously N is infinitely large, and it is difficult to estimate its true distribution, i.e., it cannot be sampled randomly.
Figure 2.9 sample distribution of Region 1
The same holds for web page classification, which also faces the whole Internet. Web page classification applications generally use the Yahoo! directory, or web pages organized by specialized websites, as the initial training samples.
For information filtering, the samples of interest are generally easy to sample randomly. However, it is hard to select the normal (negative) samples relative to the ones of interest, and the normal samples have a significant impact on network-wide test results. I did an experiment:
First, there was a dataset of 50 thousand samples: 25 thousand positive and 25 thousand negative. The negative samples here were bad samples previously found using the keyword method. 40 thousand samples were used for training and 10 thousand for testing. The trained model reached over 97% in cross-validation. On the test sample, with a threshold of 0.9, the recall reached 93% and the precision 96%.
Then the model was tested on a set containing tens of millions of records, with the threshold set to 0.9. A total of 3 million suspected violation samples were found: far too many recalled, and the precision was very low.
Then I replaced the normal samples: 30 thousand samples were randomly sampled from the tens of millions; after verification the positive samples among them were removed, leaving about 20 thousand as negative training samples, and the model was retrained.
Putting the new model on the tens-of-millions sample with the threshold set to 0.9, a total of 150 thousand suspected samples were found, with precision around 70%. So random selection of the normal samples is equally important for classification.
Here is a small example:
The left figure uses P and N to train the model. In the right figure there is an unknown class C; according to the learned model it would be classified as P, but in fact it does not belong to P. In general, a threshold can be used to control this situation.
Figure 2.9 Classification 2 used for Information Filtering
2.10 SVM solves the problem of sample skew
So-called unbalanced data means that the numbers of samples in the two classes (or among multiple classes) involved in classification differ greatly. For example, if the positive class has 10,000 samples while the negative class has only 100, obvious problems arise. Look at the figure below:
Figure 2.10 sample skew example
Square points are the negative class. H, H1, and H2 are the classification boundaries computed from the given samples. Because there are very few negative samples, some true negative samples are missing; if the two gray square points were also given, the computed boundaries would be H', H1', and H2', clearly different from the previous result. In fact, the more negative sample points are given, the more likely some of them appear near the gray points, and the closer the computed result gets to the true classification boundary. Because of the skew, however, the large number of positive samples pushes the boundary toward the negative class, which hurts the accuracy of the result.
For specific solutions, please refer to the articles on my blog; I will not post them here.
2.11 others
There are many more issues in text classification. I had wanted to write about how to classify short texts (such as queries), how to use Wikipedia knowledge to enhance text classification, and how to use unlabeled samples to improve the classification effect. There is not much time now; if time permits, I will continue to write about them in depth.