http://blog.csdn.net/abcjennifer/article/details/23615947
While writing a text mining paper I could not find a unified benchmark, so I had to run my own experiments. If you know of published classification results on 20newsgroups or another commonly used public dataset (preferably results over all classes; whether all or only part of the features are used does not matter), please leave a comment and let me know. Many thanks!
Now to the main topic. The 20newsgroups website provides 3 datasets; here we use the original one, 20news-19997.tar.gz.
The work breaks down into the following steps:
- Load the data set
- Extract features
- Classification
Note: the official SciPy (scikit-learn) site has an example for reference, but it looks a bit messy and has bugs. In this article we go through it block by block. Environment: Python 2.7 + SciPy (scikit-learn)
1. Load the data set
Download the dataset 20news-19997.tar.gz, unzip it into the scikit_learn_data folder, and load the data; see the code comments.
[Python]
- # first extract the 20 newsgroups dataset to /scikit_learn_data
- from sklearn.datasets import fetch_20newsgroups
- # all categories
- # newsgroup_train = fetch_20newsgroups(subset='train')
- # part of the categories
- categories = ['comp.graphics',
-               'comp.os.ms-windows.misc',
-               'comp.sys.ibm.pc.hardware',
-               'comp.sys.mac.hardware',
-               'comp.windows.x']
- newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
You can check that the data loaded correctly:
[Python]
- # print category names
- from pprint import pprint
- pprint(list(newsgroup_train.target_names))
Result:
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
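If you would rather read the unzipped folder by hand, here is a minimal sketch. It assumes the archive unpacks into one sub-folder per newsgroup with one file per post; the function name load_newsgroup_folder and the tiny stand-in folder built below are hypothetical and are not part of scikit-learn or of the real dataset.

```python
import os
import tempfile


def load_newsgroup_folder(root, categories=None):
    """Read <root>/<category>/<file> into parallel lists of texts and labels."""
    names = sorted(d for d in os.listdir(root)
                   if os.path.isdir(os.path.join(root, d)))
    if categories is not None:
        names = [n for n in names if n in categories]
    data, target = [], []
    for label, name in enumerate(names):
        folder = os.path.join(root, name)
        for fname in sorted(os.listdir(folder)):
            # latin-1 never fails to decode, which suits old Usenet posts
            with open(os.path.join(folder, fname), encoding="latin-1") as f:
                data.append(f.read())
            target.append(label)
    return data, target, names


# Tiny stand-in for the unpacked archive (made-up files, not real posts).
root = tempfile.mkdtemp()
for group, text in [("comp.graphics", "about rendering"),
                    ("comp.windows.x", "about x11"),
                    ("sci.space", "about orbits")]:
    os.makedirs(os.path.join(root, group))
    with open(os.path.join(root, group, "1"), "w") as f:
        f.write(text)

data, target, target_names = load_newsgroup_folder(
    root, categories=["comp.graphics", "comp.windows.x"])
print(target_names)
```

The three return values play the same role as the data, target and target_names fields used in this article, so the rest of the pipeline does not care which loader produced them.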
2. Extract features
The newsgroup_train we just loaded consists of raw documents. We want to extract features from them, that is, word frequencies and the like, using fit_transform.
Method 1. HashingVectorizer, which lets you specify the number of features
[Python]
- # newsgroup_train.data holds the original documents; we need to extract
- # feature vectors in order to model the text data
- from sklearn.feature_extraction.text import HashingVectorizer
- # the test split is needed below, so load it with the same categories
- newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
- vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
-                                n_features=10000)
- fea_train = vectorizer.fit_transform(newsgroup_train.data)
- # HashingVectorizer is stateless, so fit_transform here behaves like transform
- fea_test = vectorizer.fit_transform(newsgroups_test.data)
- # returns feature matrix 'fea_train' of shape [n_samples, n_features]
- print 'Size of fea_train: ' + repr(fea_train.shape)
- print 'Size of fea_train: ' + repr(fea_test.shape)  # label kept as in the original; this is fea_test
- # 11314 documents, 130107 vectors for all categories
- print 'The average feature sparsity is {0:.3f}%'.format(
-     fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100)
Result:
Size of fea_train: (2936, 10000)
Size of fea_train: (1955, 10000)
The average feature sparsity is 1.002%
Because we keep only 10,000 feature dimensions, the share of non-zero entries is not that low. In fact, counting with TfidfVectorizer yields tens of thousands of features; over the entire sample set I counted more than 130,000 dimensions, which gives a fairly sparse matrix.
**************************************************************************************************************************
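To make the idea behind Method 1 concrete, here is a minimal pure-Python sketch of the hashing trick, followed by the same non-zero percentage computed above. It is only an illustration: hash_vectorize is a made-up helper, the two documents are toy data, and crc32 is a stand-in for the hash function scikit-learn actually uses.

```python
import re
import zlib


def hash_vectorize(doc, n_features=10000):
    """Map a document to {column index: count} with the hashing trick."""
    row = {}
    for token in re.findall(r"[a-z0-9]+", doc.lower()):
        # crc32 is deterministic across runs, unlike the built-in hash() for str
        col = zlib.crc32(token.encode("utf-8")) % n_features
        row[col] = row.get(col, 0) + 1
    return row


# Toy documents, not taken from 20newsgroups.
docs = ["the window manager draws the window",
        "graphics card driver for the mac"]
n_features = 16
rows = [hash_vectorize(d, n_features) for d in docs]

# Same quantity as in the article: non-zero cells over all cells, in percent.
nnz = sum(len(r) for r in rows)
density = nnz / float(len(rows) * n_features) * 100
print("The average feature sparsity is {0:.3f}%".format(density))
```

Because the column index depends only on the token and on n_features, train and test documents always land in the same columns without any shared vocabulary. The price is that different words can collide in one column, which happens more often as n_features gets smaller.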
As the code comment above notes, TF-IDF extracts a different number of feature dimensions on train and on test, so how do we make them the same? There are two ways:
Method 2. CountVectorizer + TfidfTransformer: let the two CountVectorizer instances share a vocabulary:
[Python]
- #----------------------------------------------------
- # CountVectorizer + TfidfTransformer
- print '*************************\nCountVectorizer+TfidfTransformer\n*************************'
- from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
- count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
- counts_train = count_v1.fit_transform(newsgroup_train.data)
- print "the shape of train is " + repr(counts_train.shape)
- # reuse the vocabulary learned on train so that test gets the same columns
- count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
- counts_test = count_v2.fit_transform(newsgroups_test.data)
- print "the shape of test is " + repr(counts_test.shape)
- tfidftransformer = TfidfTransformer()
- tfidf_train = tfidftransformer.fit(counts_train).transform(counts_train)
- # note: this refits the IDF weights on the test counts;
- # tfidftransformer.transform(counts_test) alone would reuse the train IDF
- tfidf_test = tfidftransformer.fit(counts_test).transform(counts_test)
Results:
*************************
CountVectorizer+TfidfTransformer
*************************
the shape of train is (2936, 66433)
the shape of test is (1955, 66433)
Method 3. TfidfVectorizer: let the two TfidfVectorizer instances share a vocabulary:
[Python]
- # TfidfVectorizer
- print '*************************\nTfidfVectorizer\n*************************'
- from sklearn.feature_extraction.text import TfidfVectorizer
- tv = TfidfVectorizer(sublinear_tf=True,
-                      max_df=0.5,
-                      stop_words='english')
- tfidf_train_2 = tv.fit_transform(newsgroup_train.data)
- # second vectorizer reuses the train vocabulary
- tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
- tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
- print "the shape of train is " + repr(tfidf_train_2.shape)
- print "the shape of test is " + repr(tfidf_test_2.shape)
- analyze = tv.build_analyzer()
- tv.get_feature_names()  # statistical features/terms
Results:
*************************
TfidfVectorizer
*************************
the shape of train is (2936, 66433)
the shape of test is (1955, 66433)
In addition, sklearn ships a function that returns the features already extracted, fetch_20newsgroups_vectorized.
Method 4. fetch_20newsgroups_vectorized. This method cannot pick out the features of just a few classes, though; it only returns the features of all 20 classes at once:
[Python]
- print '*************************\nfetch_20newsgroups_vectorized\n*************************'
- from sklearn.datasets import fetch_20newsgroups_vectorized
- tfidf_train_3 = fetch_20newsgroups_vectorized(subset='train')
- tfidf_test_3 = fetch_20newsgroups_vectorized(subset='test')
- print "the shape of train is " + repr(tfidf_train_3.data.shape)
- print "the shape of test is " + repr(tfidf_test_3.data.shape)
Results:
*************************
fetch_20newsgroups_vectorized
*************************
the shape of train is (11314, 130107)
the shape of test is (7532, 130107)
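The shared-vocabulary idea behind Methods 2 and 3 can be sketched in plain Python. This is a rough illustration under my own assumptions, not scikit-learn's code: the helper names and the toy documents are made up, the idf formula is one common smoothed variant, and row normalization is skipped.

```python
import math
import re


def tokenize(doc):
    return re.findall(r"[a-z0-9]+", doc.lower())


def fit_vocabulary(docs):
    """Assign one column index per distinct training term."""
    terms = sorted({t for d in docs for t in tokenize(d)})
    return {t: i for i, t in enumerate(terms)}


def count_matrix(docs, vocabulary):
    """Dense term counts; terms outside the vocabulary are ignored."""
    rows = []
    for d in docs:
        row = [0] * len(vocabulary)
        for t in tokenize(d):
            if t in vocabulary:
                row[vocabulary[t]] += 1
        rows.append(row)
    return rows


def fit_idf(counts):
    """Smoothed idf per column: ln((1 + n) / (1 + df)) + 1."""
    n = len(counts)
    n_cols = len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_cols)]
    return [math.log((1.0 + n) / (1.0 + d)) + 1.0 for d in df]


def apply_idf(counts, idf):
    return [[c * w for c, w in zip(row, idf)] for row in counts]


# Toy documents, not taken from 20newsgroups.
train_docs = ["graphics card driver", "mac hardware driver"]
test_docs = ["new graphics hardware", "window manager"]

vocab = fit_vocabulary(train_docs)           # learned on train only
counts_train = count_matrix(train_docs, vocab)
counts_test = count_matrix(test_docs, vocab)  # same columns as train
idf = fit_idf(counts_train)                   # idf learned on train only
tfidf_train = apply_idf(counts_train, idf)
tfidf_test = apply_idf(counts_test, idf)

print("train columns: %d, test columns: %d"
      % (len(tfidf_train[0]), len(tfidf_test[0])))
```

Words that appear only in the test documents ("new", "window", "manager") are simply dropped, which is why both matrices end up with the same number of columns. This sketch applies the train idf to the test counts, which is a design choice of the sketch.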
3. Classification
3.1 Multinomial Naive Bayes Classifier
See the code and comments; no further explanation.
[Python]
- ######################################################
- # Multinomial Naive Bayes Classifier
- print '*************************\nNaive Bayes\n*************************'
- from sklearn.naive_bayes import MultinomialNB
- from sklearn import metrics
- newsgroups_test = fetch_20newsgroups(subset='test',
-                                      categories=categories)
- fea_test = vectorizer.fit_transform(newsgroups_test.data)
- # create the Multinomial Naive Bayesian Classifier
- clf = MultinomialNB(alpha=0.01)
- clf.fit(fea_train, newsgroup_train.target)
- pred = clf.predict(fea_test)
- calculate_result(newsgroups_test.target, pred)
- # notice here that f1_score is not equal to 2*precision*recall/(precision+recall)
- # because the m_precision and m_recall we get are averaged, whereas metrics.f1_score() calculates
- # a weighted average, i.e., it takes the number of samples of each class into consideration.
Note the last 3 comment lines, which explain why F1 ≠ 2 * (precision * recall) / (precision + recall).
Here the function calculate_result computes F1:
[Python]
- def calculate_result(actual, pred):
-     # with the old scikit-learn used here, multiclass scores default to a weighted
-     # average; newer releases require an explicit average= argument
-     m_precision = metrics.precision_score(actual, pred)
-     m_recall = metrics.recall_score(actual, pred)
-     print 'predict info:'
-     print 'precision:{0:.3f}'.format(m_precision)
-     print 'recall:{0:0.3f}'.format(m_recall)
-     print 'f1-score:{0:.3f}'.format(metrics.f1_score(actual, pred))
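The point about F1 is easy to check by hand. The sketch below uses made-up labels and plain Python (no scikit-learn) to compute support-weighted precision, recall and F1, and compares the weighted F1 with the harmonic mean of the two averaged scores. The helper names are my own.

```python
from collections import Counter


def per_class_scores(actual, pred):
    """Return {label: (precision, recall, f1)} from one-vs-rest counts."""
    scores = {}
    for c in sorted(set(actual) | set(pred)):
        tp = sum(1 for a, p in zip(actual, pred) if a == c and p == c)
        fp = sum(1 for a, p in zip(actual, pred) if a != c and p == c)
        fn = sum(1 for a, p in zip(actual, pred) if a == c and p != c)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def weighted_scores(actual, pred):
    """Average the per-class scores, weighting each class by its support."""
    support = Counter(actual)
    n = float(len(actual))
    scores = per_class_scores(actual, pred)
    return tuple(sum(support[c] * s[i] for c, s in scores.items()) / n
                 for i in range(3))


# Made-up labels: class 0 has 4 samples, class 1 has 2.
actual = [0, 0, 0, 0, 1, 1]
pred = [0, 0, 1, 1, 1, 1]

precision, recall, f1 = weighted_scores(actual, pred)
harmonic = 2 * precision * recall / (precision + recall)
print("weighted precision: {0:.3f}".format(precision))
print("weighted recall:    {0:.3f}".format(recall))
print("weighted f1:        {0:.3f}".format(f1))
print("2*p*r/(p+r):        {0:.3f}".format(harmonic))
```

Here both classes have a per-class F1 of 2/3, so the weighted F1 is 2/3, while the harmonic mean of the averaged precision (5/6) and recall (2/3) is 20/27. Averaging first and combining afterwards is not the same as combining per class and averaging afterwards.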
3.2 KNN:
[Python]
- ######################################################
- # KNN Classifier
- from sklearn.neighbors import KNeighborsClassifier
- print '*************************\nKNN\n*************************'
- knnclf = KNeighborsClassifier()  # default with k=5
- knnclf.fit(fea_train, newsgroup_train.target)
- pred = knnclf.predict(fea_test)
- calculate_result(newsgroups_test.target, pred)
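For readers who want to see what a k-nearest-neighbours vote does, here is a tiny pure-Python sketch with Euclidean distance and an unweighted majority vote. It is a toy under my own assumptions: the sparse rows are made-up dicts, the function names are hypothetical, and nothing here is scikit-learn's implementation.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Distance between two sparse vectors stored as {index: value} dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def knn_predict(train_rows, train_labels, query, k=5):
    """Majority vote among the k training rows closest to the query."""
    order = sorted(range(len(train_rows)),
                   key=lambda i: euclidean(train_rows[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up sparse feature rows: label 0 lives on column 0, label 1 on column 1.
train_rows = [{0: 1.0}, {0: 0.9, 1: 0.1}, {0: 0.8},
              {1: 1.0}, {1: 0.9, 0: 0.1}, {1: 0.8}]
train_labels = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_rows, train_labels, {0: 0.95}, k=3)
print("predicted label: %d" % prediction)
```

Every prediction scans the whole training set, so this brute-force version only makes sense for small examples.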
3.3 SVM:
[Python]
- ######################################################
- # SVM Classifier
- from sklearn.svm import SVC
- print '*************************\nSVM\n*************************'
- svclf = SVC(kernel='linear')  # default is 'rbf'
- svclf.fit(fea_train, newsgroup_train.target)
- pred = svclf.predict(fea_test)
- calculate_result(newsgroups_test.target, pred)
Results:
*************************
Naive Bayes
*************************
Predict info:
precision:0.764
recall:0.759
f1-score:0.760
*************************
KNN
*************************
Predict info:
precision:0.642
recall:0.635
f1-score:0.636
*************************
SVM
*************************
Predict info:
precision:0.777
recall:0.774
f1-score:0.774
4. Clustering
[Python]
- ######################################################
- # KMeans Cluster
- from sklearn.cluster import KMeans
- print '*************************\nKMeans\n*************************'
- pred = KMeans(n_clusters=5)
- pred.fit(fea_test)
- calculate_result(newsgroups_test.target, pred.labels_)
Results:
*************************
KMeans
*************************
Predict info:
precision:0.264
recall:0.226
f1-score:0.213
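One thing to keep in mind when reading these numbers: the cluster ids that KMeans assigns are arbitrary, so cluster 0 need not correspond to class 0. The sketch below is not what the code above does; it is a small illustration, with made-up labels and hypothetical helper names, of relabeling each cluster with its majority class before scoring.

```python
from collections import Counter


def map_clusters_to_classes(actual, cluster_ids):
    """Relabel each cluster with the most common true class among its members."""
    members = {}
    for a, c in zip(actual, cluster_ids):
        members.setdefault(c, []).append(a)
    return {c: Counter(labels).most_common(1)[0][0]
            for c, labels in members.items()}


def accuracy(actual, pred):
    return sum(1 for a, p in zip(actual, pred) if a == p) / float(len(actual))


# Made-up example: a perfect clustering whose ids are a permutation of the classes.
actual = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 1]

raw = accuracy(actual, cluster_ids)
mapping = map_clusters_to_classes(actual, cluster_ids)
relabeled = [mapping[c] for c in cluster_ids]
mapped = accuracy(actual, relabeled)

print("agreement with raw cluster ids: {0:.3f}".format(raw))
print("agreement after majority mapping: {0:.3f}".format(mapped))
```

The majority mapping uses the true labels, so the second number is a purity-style score for inspecting the clusters rather than a fair comparison with the supervised classifiers.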
All the code for this article can be downloaded from the original post.
The accuracy looks rather low... Let's use all the features instead. The results are as follows:
*************************
Naive Bayes
*************************
Predict info:
precision:0.771
recall:0.770
f1-score:0.769
*************************
KNN
*************************
Predict info:
precision:0.652
recall:0.645
f1-score:0.645
*************************
SVM
*************************
Predict info:
precision:0.819
recall:0.816
f1-score:0.816
*************************
KMeans
*************************
Predict info:
precision:0.289
recall:0.313
f1-score:0.266
More Python learning material will keep being added, so stay tuned to this blog and to Rachel Zhang on Sina Weibo.
Apply Scikit-learn to do text categorization