Applying scikit-learn to text categorization



While working on a text-mining paper I could not find a unified benchmark and had to run my own procedures. If anyone knows of published classification results on 20newsgroups or other useful public datasets (preferably per-class results; whether all or only part of the features were used does not matter), please leave a message and let me know of the benchmark. Many thanks.

Now, on to the main content. The 20newsgroups website provides three datasets; here we use the most original one, 20news-19997.tar.gz.


It is divided into the following steps:

1. Load the dataset
2. Extract features
3. Classification (Naive Bayes, KNN, SVM)
4. Clustering

Note: the scikit-learn official site has a reference for this, but it is a bit messy and contains bugs; in this article we go through it piece by piece.

Environment: Python 2.7 + scikit-learn (SciPy stack)
1. Load the dataset
Download 20news-19997.tar.gz, unzip it into the scikit_learn_data folder, and load the data; see the code comments.

# first extract the 20news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
# all categories
#newsgroup_train = fetch_20newsgroups(subset='train')
# part of the categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)


You can check whether the data loaded correctly:
# print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))

Result:
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
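Besides the category names, you can also glance at the sample size and the integer labels; a minimal sketch using the standard fetch_20newsgroups bunch fields:

# number of raw documents loaded for the 5 chosen categories
print len(newsgroup_train.data)  # 2936
# integer labels, one per document, indexing into target_names
print newsgroup_train.target[:10]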

2. Extract features: the newsgroup_train we just loaded contains raw documents; we need to extract feature vectors (word frequencies and the like) from them with fit_transform.
Method 1. HashingVectorizer, specifying the number of features
# newsgroup_train.data holds the original documents, but we need to
# extract feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer

# the test split is needed below; load it with the same categories
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
# HashingVectorizer is stateless (no vocabulary is fitted), so transforming
# train and test separately still yields the same 10000-dimensional space
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroups_test.data)

# the returned feature matrix 'fea_train' is [n_samples, n_features]
print 'Size of fea_train: ' + repr(fea_train.shape)
print 'Size of fea_test: ' + repr(fea_test.shape)
# 11314 documents, 130107 terms when all 20 categories are used
print 'The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100)

Result:
Size of fea_train: (2936, 10000)
Size of fea_test: (1955, 10000)
The average feature sparsity is 1.002%
Because we keep only 10,000 features (a 10,000-dimensional space), the sparsity is not that low, actually. Counting with TfidfVectorizer instead yields tens of thousands of features; over the whole corpus I counted about 130k dimensions, which is a fairly sparse matrix.
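You can verify that figure yourself by vectorizing the full training set; a rough sketch (this fetches all 20 categories, and the variable names are my own):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# all 20 categories, default tokenizer, no feature cap
all_train = fetch_20newsgroups(subset='train')
fea_all = TfidfVectorizer().fit_transform(all_train.data)
print fea_all.shape  # about 130k features on the standard training split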

**************************************************************************

The code comment above already points at the problem: TF-IDF features extracted independently on train and test end up with different dimensions. How do we make them the same? There are two ways, shown below as Methods 2 and 3.
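To see the mismatch concretely, fit a separate vectorizer on each split: the two vocabularies (and hence column counts) will differ, so a classifier trained on one matrix cannot consume the other. A quick illustration (tv_a and tv_b are throwaway names of mine):

from sklearn.feature_extraction.text import TfidfVectorizer

tv_a = TfidfVectorizer(stop_words='english')
tv_b = TfidfVectorizer(stop_words='english')
# fitted independently, the two vocabularies differ in size
print tv_a.fit_transform(newsgroup_train.data).shape
print tv_b.fit_transform(newsgroups_test.data).shape  # different n_features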



Method 2. CountVectorizer + TfidfTransformer

Let the two CountVectorizers share the same vocabulary:

#----------------------------------------------------
# Method 2: CountVectorizer + TfidfTransformer
print '*************************\nCountVectorizer+TfidfTransformer\n*************************'
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
counts_train = count_v1.fit_transform(newsgroup_train.data)
print "The shape of train is " + repr(counts_train.shape)

# reuse the training vocabulary so test gets the same feature space
count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
counts_test = count_v2.fit_transform(newsgroups_test.data)
print "The shape of test is " + repr(counts_test.shape)

tfidftransformer = TfidfTransformer()

# fit the IDF weights on the training counts, then apply them to both
# sets (fitting a second transformer on the test counts would compute
# different, inconsistent IDF weights)
tfidf_train = tfidftransformer.fit_transform(counts_train)
tfidf_test = tfidftransformer.transform(counts_test)

Results:
*************************
CountVectorizer+TfidfTransformer
*************************
The shape of train is (2936, 66433)
The shape of test is (1955, 66433)




Method 3. TfidfVectorizer

Let the two TfidfVectorizers share the same vocabulary:
# Method 3: TfidfVectorizer
print '*************************\nTfidfVectorizer\n*************************'
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(sublinear_tf=True,
                     max_df=0.5,
                     stop_words='english')
tfidf_train_2 = tv.fit_transform(newsgroup_train.data)
# share the training vocabulary with the test-set vectorizer
tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
print "The shape of train is " + repr(tfidf_train_2.shape)
print "The shape of test is " + repr(tfidf_test_2.shape)
analyze = tv.build_analyzer()
tv.get_feature_names()  # statistical features/terms


Results:
*************************
TfidfVectorizer
*************************
The shape of train is (2936, 66433)
The shape of test is (1955, 66433)
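You can also skip the second vectorizer and simply reuse the fitted one; this is the more common idiom, and it additionally keeps the IDF weights learned from the training set (the shared-vocabulary variant above recomputes IDF on the test documents). A minimal sketch (tfidf_test_3 is my own name):

# reuse the vectorizer fitted on train: same vocabulary AND train IDF
tfidf_test_3 = tv.transform(newsgroups_test.data)
print "The shape of test is " + repr(tfidf_test_3.shape)  # (1955, 66433)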
In addition, the sklearn package ships a ready-made vectorized loader for this dataset, fetch_20newsgroups_vectorized.



Method 4. fetch_20newsgroups_vectorized

However, this method cannot pick out just a few classes; it returns the features for all 20 classes at once.
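A minimal usage sketch, assuming the standard loader (it returns an already-vectorized bunch whose .data attribute is a sparse matrix):

from sklearn.datasets import fetch_20newsgroups_vectorized

# always covers all 20 categories; no categories argument is accepted
vec_train = fetch_20newsgroups_vectorized(subset='train')
print 'Size of fea_train: ' + repr(vec_train.data.shape)  # about (11314, 130107)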