Applying scikit-learn to text categorization

Last Update:2015-01-06 Source: Internet

Author: User

Tags svm pprint


http://blog.csdn.net/abcjennifer/article/details/23615947

I could not find a unified benchmark in the text-mining papers, so I had to run my own programs. If you know of published classification results on 20newsgroups or another good public dataset (preferably results for all classes; whether all or only part of the features are used does not matter), please leave a message and tell me what the current benchmark is. Many thanks!

Now to the main topic. The 20newsgroups website provides three datasets; here we use the most original one, 20news-19997.tar.gz.

The work is divided into the following steps:

1. Load the dataset
2. Extract features
3. Classification
3.1 Naive Bayes
3.2 KNN
3.3 SVM
4. Clustering

Note: the official SciPy (scikit-learn) site has an example for reference, but it looks a bit messy and has bugs. In this article we go through it block by block.

Environment: Python 2.7 + SciPy (scikit-learn)

1. Load the dataset

Download the dataset 20news-19997.tar.gz, unzip it into the scikit_learn_data folder, and load the data; see the code comments.

# first extract the 20 newsgroups dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
# all categories
# newsgroup_train = fetch_20newsgroups(subset='train')
# part of the categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)

You can check that the data loaded correctly:

# print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))

Result:
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']

2. Extract features

The newsgroup_train we just loaded holds raw documents. We need to extract features from them, that is, word frequencies and the like, using the fit_transform method.

Method 1. HashingVectorizer, with a specified number of features

# newsgroup_train.data holds the original documents; we need to extract
# feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer
# newsgroups_test is used below, so load the test subset here
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
# non_negative=True is the older API; scikit-learn 0.21 and later
# replaced it with alternate_sign=False
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroups_test.data)
# returns the feature matrix 'fea_train' of shape [n_samples, n_features]
print('Size of fea_train: ' + repr(fea_train.shape))
print('Size of fea_test: ' + repr(fea_test.shape))
# 11314 documents, 130107 vectors for all categories
print('The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100))

Result:
Size of fea_train: (2936, 10000)
Size of fea_test: (1955, 10000)
The average feature sparsity is 1.002%

Because we keep only 10,000 hashed features, i.e. a 10,000-dimensional feature vector, this figure (the share of non-zero entries) is not that low. With TfidfVectorizer you actually get tens of thousands of features; counted over the entire sample I got more than 130,000 dimensions, which gives a fairly sparse matrix.
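To make the idea behind HashingVectorizer more concrete, here is a minimal pure-Python sketch of the hashing trick and of the sparsity figure printed above. It is an illustration only: hash_vectorize, the 16-column width and the use of zlib.crc32 are assumptions made for this example, and scikit-learn's own implementation uses a different hash function and more options.

```python
import zlib


def hash_vectorize(doc, n_features=16):
    """Map a document to a fixed-length count vector with the hashing trick."""
    vec = [0] * n_features
    for token in doc.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash() for str
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vec[index] += 1
    return vec


# Toy documents, not taken from the 20newsgroups data
docs = ["the graphics card renders the image",
        "windows drivers for the graphics card"]
matrix = [hash_vectorize(d) for d in docs]

# Share of non-zero entries, the quantity the article prints as "sparsity"
nnz = sum(1 for row in matrix for value in row if value != 0)
density = nnz / float(len(matrix) * len(matrix[0])) * 100

print("shape: ({0}, {1})".format(len(matrix), len(matrix[0])))
print("non-zero entries: {0:.3f}%".format(density))
```

Because the column index comes from a hash rather than from a learned vocabulary, the width of the matrix is fixed in advance and is the same for train and test. The price is that different words can collide in one column.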

As the comment in the code above suggests, TF-IDF extracts a different number of feature dimensions on train and on test, so how do we make them the same? There are two ways:

Method 2. CountVectorizer + TfidfTransformer

Let the two CountVectorizers share a vocabulary:

#----------------------------------------------------
# method 1: CountVectorizer + TfidfTransformer
print('*************************\nCountVectorizer+TfidfTransformer\n*************************')
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
counts_train = count_v1.fit_transform(newsgroup_train.data)
print('The shape of train is ' + repr(counts_train.shape))
# reuse the vocabulary learned on train so both matrices have the same columns
count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
counts_test = count_v2.fit_transform(newsgroups_test.data)
print('The shape of test is ' + repr(counts_test.shape))
tfidftransformer = TfidfTransformer()
tfidf_train = tfidftransformer.fit(counts_train).transform(counts_train)
# note: this refits the IDF weights on the test counts;
# tfidftransformer.transform(counts_test) would reuse the training weights
tfidf_test = tfidftransformer.fit(counts_test).transform(counts_test)
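The idea of sharing a vocabulary, as in the code above, can be sketched without scikit-learn. This is a simplified illustration under stated assumptions: build_vocabulary and count_vectorize are hypothetical helpers written for this example, tokenization is a plain whitespace split, and no stop words or max_df filtering are applied.

```python
def build_vocabulary(docs):
    """Assign a column index to every distinct token seen in docs."""
    vocabulary = {}
    for doc in docs:
        for token in doc.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def count_vectorize(docs, vocabulary):
    """Count tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.lower().split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


# Toy documents, not taken from the 20newsgroups data
train_docs = ["graphics card driver", "mac hardware card"]
test_docs = ["new graphics driver for mac", "windows card"]

vocabulary = build_vocabulary(train_docs)              # learned on train only
counts_train = count_vectorize(train_docs, vocabulary)
counts_test = count_vectorize(test_docs, vocabulary)   # reuses the train vocabulary

print("train shape: ({0}, {1})".format(len(counts_train), len(counts_train[0])))
print("test shape: ({0}, {1})".format(len(counts_test), len(counts_test[0])))
```

Words that appear only in the test documents ("new", "for", "windows" here) have no column and are dropped, which is why both matrices end up with the same number of columns.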

Results:
*************************
CountVectorizer+TfidfTransformer
*************************
The shape of train is (2936, 66433)
The shape of test is (1955, 66433)

Method 3. TfidfVectorizer

Let the two TfidfVectorizers share a vocabulary:

# method 2: TfidfVectorizer
print('*************************\nTfidfVectorizer\n*************************')
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(sublinear_tf=True,
                     max_df=0.5,
                     stop_words='english')
tfidf_train_2 = tv.fit_transform(newsgroup_train.data)
# reuse the vocabulary learned on train for the test vectorizer
tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
print('The shape of train is ' + repr(tfidf_train_2.shape))
print('The shape of test is ' + repr(tfidf_test_2.shape))
analyze = tv.build_analyzer()
# the extracted features/terms; scikit-learn 1.0 and later
# name this method get_feature_names_out()
tv.get_feature_names()

Results:
*************************
TfidfVectorizer
*************************
The shape of train is (2936, 66433)
The shape of test is (1955, 66433)

In addition, sklearn ships a packaged feature-extraction function, fetch_20newsgroups_vectorized.

Method 4. fetch_20newsgroups_vectorized

This method cannot pick out the features of only a few classes; it returns the features of all 20 classes:

print('*************************\nfetch_20newsgroups_vectorized\n*************************')
from sklearn.datasets import fetch_20newsgroups_vectorized
tfidf_train_3 = fetch_20newsgroups_vectorized(subset='train')
tfidf_test_3 = fetch_20newsgroups_vectorized(subset='test')
print('The shape of train is ' + repr(tfidf_train_3.data.shape))
print('The shape of test is ' + repr(tfidf_test_3.data.shape))

Results:
*************************
fetch_20newsgroups_vectorized
*************************
The shape of train is (11314, 130107)
The shape of test is (7532, 130107)

3. Classification

3.1 Multinomial Naive Bayes classifier

See the code and comments; no further explanation is needed.

######################################################
# Multinomial Naive Bayes classifier
print('*************************\nNaive Bayes\n*************************')
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
fea_test = vectorizer.fit_transform(newsgroups_test.data)
# create the multinomial naive Bayes classifier
clf = MultinomialNB(alpha=0.01)
clf.fit(fea_train, newsgroup_train.target)
pred = clf.predict(fea_test)
# calculate_result is listed further down; define it before this call when running as a script
calculate_result(newsgroups_test.target, pred)
# Notice that f1_score is not equal to 2*precision*recall/(precision+recall),
# because the m_precision and m_recall we get are averaged, while metrics.f1_score()
# calculates a weighted average, i.e. it takes the number of samples in each class into account.

Note the last three comment lines, which explain why F1 ≠ 2 * (precision * recall) / (precision + recall).

The function calculate_result computes the precision, recall and F1 score:


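def calculate_result(actual, pred):
    # average='weighted' was the default for multiclass targets in older
    # scikit-learn releases; it is passed explicitly so newer releases behave the same
    m_precision = metrics.precision_score(actual, pred, average='weighted')
    m_recall = metrics.recall_score(actual, pred, average='weighted')
    print('predict info:')
    print('precision:{0:.3f}'.format(m_precision))
    print('recall:{0:0.3f}'.format(m_recall))
    print('f1-score:{0:.3f}'.format(metrics.f1_score(actual, pred, average='weighted')))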
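The point about averaged scores can be checked by hand. The following is a minimal sketch in plain Python, assuming support-weighted averaging as described in the comments above; per_class_scores, weighted_average and the toy labels are made up for this illustration and are not part of scikit-learn.

```python
def per_class_scores(actual, pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    scores = {}
    for label in sorted(set(actual)):
        tp = sum(1 for a, p in zip(actual, pred) if a == label and p == label)
        fp = sum(1 for a, p in zip(actual, pred) if a != label and p == label)
        fn = sum(1 for a, p in zip(actual, pred) if a == label and p != label)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


def weighted_average(scores, index):
    """Average one score over the classes, weighted by class support."""
    total = float(sum(s[3] for s in scores.values()))
    return sum(s[index] * s[3] for s in scores.values()) / total


# Toy labels for three classes, not taken from the experiments in this article
actual = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
pred = [0, 0, 1, 2, 1, 1, 2, 2, 0, 2]

scores = per_class_scores(actual, pred)
m_precision = weighted_average(scores, 0)
m_recall = weighted_average(scores, 1)
m_f1 = weighted_average(scores, 2)  # average of the per-class F1 values
f1_from_averages = 2 * m_precision * m_recall / (m_precision + m_recall)

print("precision:{0:.3f}".format(m_precision))
print("recall:{0:.3f}".format(m_recall))
print("weighted f1:{0:.3f}".format(m_f1))
print("2PR/(P+R) from averaged P and R:{0:.3f}".format(f1_from_averages))
```

On this toy data the weighted F1 is about 0.689, while plugging the averaged precision and recall (both 0.700) into the F1 formula gives 0.700. F1 is computed per class first and averaged afterwards, so the two numbers generally differ.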

3.2 KNN:


######################################################
# KNN classifier
from sklearn.neighbors import KNeighborsClassifier
print('*************************\nKNN\n*************************')
knnclf = KNeighborsClassifier()  # default is k=5
knnclf.fit(fea_train, newsgroup_train.target)
pred = knnclf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)

3.3 SVM:


######################################################
# SVM classifier
from sklearn.svm import SVC
print('*************************\nSVM\n*************************')
svclf = SVC(kernel='linear')  # the default kernel is 'rbf'
svclf.fit(fea_train, newsgroup_train.target)
pred = svclf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)

Results:

*************************
Naive Bayes
*************************
predict info:
precision:0.764
recall:0.759
f1-score:0.760
*************************
KNN
*************************
predict info:
precision:0.642
recall:0.635
f1-score:0.636
*************************
SVM
*************************
predict info:
precision:0.777
recall:0.774
f1-score:0.774

4. Clustering


######################################################
# KMeans clustering
from sklearn.cluster import KMeans
print('*************************\nKMeans\n*************************')
pred = KMeans(n_clusters=5)
pred.fit(fea_test)
calculate_result(newsgroups_test.target, pred.labels_)

Results:

*************************
KMeans
*************************
predict info:
precision:0.264
recall:0.226
f1-score:0.213
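One thing to keep in mind when reading these clustering scores: the cluster ids that KMeans assigns are arbitrary, so cluster 0 need not correspond to class 0. The sketch below shows one possible workaround, relabeling each cluster with the majority true class of its members before scoring. It is an assumption-laden illustration and not the evaluation used in this article; map_clusters_to_classes and the toy labels are made up for the example.

```python
from collections import Counter


def map_clusters_to_classes(actual, cluster_ids):
    """Relabel each cluster with the most common true class among its members."""
    members = {}
    for label, cluster in zip(actual, cluster_ids):
        members.setdefault(cluster, Counter())[label] += 1
    mapping = dict((cluster, counts.most_common(1)[0][0])
                   for cluster, counts in members.items())
    return [mapping[cluster] for cluster in cluster_ids]


# Toy data: the clusters are mostly right, but their ids are permuted
actual = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1, 1]

raw_accuracy = sum(a == c for a, c in zip(actual, cluster_ids)) / float(len(actual))
mapped = map_clusters_to_classes(actual, cluster_ids)
mapped_accuracy = sum(a == m for a, m in zip(actual, mapped)) / float(len(actual))

print("accuracy with raw cluster ids: {0:.3f}".format(raw_accuracy))
print("accuracy after majority mapping: {0:.3f}".format(mapped_accuracy))
```

Majority mapping uses the true labels, so it gives an optimistic figure (it is essentially cluster purity). Permutation-invariant scores such as sklearn.metrics.adjusted_rand_score are another option for comparing a clustering with known classes.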

All the code for this article can be downloaded here.

The accuracy looks rather low... Let's use all the features instead. The results are as follows:

*************************
Naive Bayes
*************************
predict info:
precision:0.771
recall:0.770
f1-score:0.769
*************************
KNN
*************************
predict info:
precision:0.652
recall:0.645
f1-score:0.645
*************************
SVM
*************************
predict info:
precision:0.819
recall:0.816
f1-score:0.816
*************************
KMeans
*************************
predict info:
precision:0.289
recall:0.313
f1-score:0.266

More Python learning material will follow, so stay tuned to this blog and to Rachel Zhang on Sina Weibo.
