Applying scikit-learn to text categorization



While working on a text-mining paper I could not find a unified benchmark and had to run my own procedures. If anyone knows of published classification results on 20newsgroups or other useful public datasets (preferably per-class results; whether all or only part of the features were used does not matter), please leave a message and let me know of the benchmark. Many thanks.

Now, on to the main content. The 20newsgroups website provides three datasets; here we use the most original one, 20news-19997.tar.gz.


It is divided into the following steps:

1. Load the dataset
2. Extract features
3. Classification (Naive Bayes, KNN, SVM)
4. Clustering

Note: the scikit-learn official site has a reference for this, but it is a bit messy and contains bugs; in this article we go through it piece by piece.

Environment: Python 2.7 + scikit-learn (SciPy stack)
1. Load the dataset
Download 20news-19997.tar.gz, unzip it into the scikit_learn_data folder, and load the data; see the code comments.

# first extract the 20news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
# all categories
#newsgroup_train = fetch_20newsgroups(subset='train')
# part of the categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)


You can check whether the data loaded correctly:
# print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))

Result:
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
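Besides the category names, you can also glance at the sample size and the integer labels; a minimal sketch using the standard fetch_20newsgroups bunch fields:

# number of raw documents loaded for the 5 chosen categories
print len(newsgroup_train.data)  # 2936
# integer labels, one per document, indexing into target_names
print newsgroup_train.target[:10]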

2. Extract features: the newsgroup_train we just loaded contains raw documents; we need to extract feature vectors (word frequencies and the like) from them with fit_transform.
Method 1. HashingVectorizer, specifying the number of features
# newsgroup_train.data holds the original documents, but we need to
# extract feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer

# the test split is needed below; load it with the same categories
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
# HashingVectorizer is stateless (no vocabulary is fitted), so transforming
# train and test separately still yields the same 10000-dimensional space
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroups_test.data)

# the returned feature matrix 'fea_train' is [n_samples, n_features]
print 'Size of fea_train: ' + repr(fea_train.shape)
print 'Size of fea_test: ' + repr(fea_test.shape)
# 11314 documents, 130107 terms when all 20 categories are used
print 'The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100)

Result:
Size of fea_train: (2936, 10000)
Size of fea_test: (1955, 10000)
The average feature sparsity is 1.002%
Because we keep only 10,000 features (a 10,000-dimensional space), the sparsity is not that low, actually. Counting with TfidfVectorizer instead yields tens of thousands of features; over the whole corpus I counted about 130k dimensions, which is a fairly sparse matrix.
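You can verify that figure yourself by vectorizing the full training set; a rough sketch (this fetches all 20 categories, and the variable names are my own):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# all 20 categories, default tokenizer, no feature cap
all_train = fetch_20newsgroups(subset='train')
fea_all = TfidfVectorizer().fit_transform(all_train.data)
print fea_all.shape  # about 130k features on the standard training split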

**************************************************************************

The code comment above already points at the problem: TF-IDF features extracted independently on train and test end up with different dimensions. How do we make them the same? There are two ways, shown below as Methods 2 and 3.
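To see the mismatch concretely, fit a separate vectorizer on each split: the two vocabularies (and hence column counts) will differ, so a classifier trained on one matrix cannot consume the other. A quick illustration (tv_a and tv_b are throwaway names of mine):

from sklearn.feature_extraction.text import TfidfVectorizer

tv_a = TfidfVectorizer(stop_words='english')
tv_b = TfidfVectorizer(stop_words='english')
# fitted independently, the two vocabularies differ in size
print tv_a.fit_transform(newsgroup_train.data).shape
print tv_b.fit_transform(newsgroups_test.data).shape  # different n_features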



Method 2. CountVectorizer + TfidfTransformer

Let the two CountVectorizers share the same vocabulary:

#----------------------------------------------------
# Method 2: CountVectorizer + TfidfTransformer
print '*************************\nCountVectorizer+TfidfTransformer\n*************************'
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
counts_train = count_v1.fit_transform(newsgroup_train.data)
print "The shape of train is " + repr(counts_train.shape)

# reuse the training vocabulary so test gets the same feature space
count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
counts_test = count_v2.fit_transform(newsgroups_test.data)
print "The shape of test is " + repr(counts_test.shape)

tfidftransformer = TfidfTransformer()

# fit the IDF weights on the training counts, then apply them to both
# sets (fitting a second transformer on the test counts would compute
# different, inconsistent IDF weights)
tfidf_train = tfidftransformer.fit_transform(counts_train)
tfidf_test = tfidftransformer.transform(counts_test)

Results:
*************************
CountVectorizer+TfidfTransformer
*************************
The shape of train is (2936, 66433)
The shape of test is (1955, 66433)




Method 3. TfidfVectorizer

Let the two TfidfVectorizers share the same vocabulary:
# Method 3: TfidfVectorizer
print '*************************\nTfidfVectorizer\n*************************'
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(sublinear_tf=True,
                     max_df=0.5,
                     stop_words='english')
tfidf_train_2 = tv.fit_transform(newsgroup_train.data)
# share the training vocabulary with the test-set vectorizer
tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
print "The shape of train is " + repr(tfidf_train_2.shape)
print "The shape of test is " + repr(tfidf_test_2.shape)
analyze = tv.build_analyzer()
tv.get_feature_names()  # statistical features/terms


Results:
*************************
TfidfVectorizer
*************************
The shape of train is (2936, 66433)
The shape of test is (1955, 66433)
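You can also skip the second vectorizer and simply reuse the fitted one; this is the more common idiom, and it additionally keeps the IDF weights learned from the training set (the shared-vocabulary variant above recomputes IDF on the test documents). A minimal sketch (tfidf_test_3 is my own name):

# reuse the vectorizer fitted on train: same vocabulary AND train IDF
tfidf_test_3 = tv.transform(newsgroups_test.data)
print "The shape of test is " + repr(tfidf_test_3.shape)  # (1955, 66433)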
In addition, the sklearn package ships a ready-made vectorized loader for this dataset, fetch_20newsgroups_vectorized.



Method 4. fetch_20newsgroups_vectorized

However, this method cannot pick out just a few classes; it returns the features for all 20 classes at once.
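A minimal usage sketch, assuming the standard loader (it returns an already-vectorized bunch whose .data attribute is a sparse matrix):

from sklearn.datasets import fetch_20newsgroups_vectorized

# always covers all 20 categories; no categories argument is accepted
vec_train = fetch_20newsgroups_vectorized(subset='train')
print 'Size of fea_train: ' + repr(vec_train.data.shape)  # about (11314, 130107)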