While working on a text-mining paper I could not find a unified benchmark, so I had to run all the procedures myself. If you know of a useful public classification benchmark on 20newsgroups or another public dataset (results over all classes preferred, though a subset of classes or features is fine too), please leave a message to let me know; many thanks.
Now, back to the topic. The 20newsgroups website provides three datasets; here we use the most original one, 20news-19997.tar.gz.
It is divided into the following steps:
1. load the dataset
2. extract features
3. classify (Naive Bayes, KNN, SVM)
4. cluster
Note: the scikit-learn official site has reference material for this, but it looks a bit messy and contains bugs, so in this article we go through it step by step.
Environment: Python 2.7 + scikit-learn (SciPy stack)
1. Load the dataset
Download 20news-19997.tar.gz, unzip it into the scikit_learn_data folder, and load the data; see the code comments.
#first extract the 20news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
#all categories
#newsgroup_train = fetch_20newsgroups(subset='train')
#part categories
categories = ['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
You can check whether the data loaded correctly:
#print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
Result: ['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
2. Extract features: the newsgroup_train we just loaded is a collection of raw documents; we need to extract features from them (word frequencies and the like) using fit_transform.
Method 1. HashingVectorizer: specify the number of features
#newsgroup_train.data is the original documents, but we need to extract the
#feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer
#the test set was not loaded above, so fetch it the same way
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
#(in newer scikit-learn, non_negative was replaced by alternate_sign=False)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroups_test.data)
#returned feature vectors have shape [n_samples, n_features]
print 'Size of fea_train:' + repr(fea_train.shape)
print 'Size of fea_test:' + repr(fea_test.shape)
#11314 documents, 130107 vectors for all categories
print 'The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100)
Result: Size of fea_train: (2936, 10000)
Size of fea_test: (1955, 10000)
The average feature sparsity is 1.002%
Because we only keep 10,000 words (a 10000-dimensional feature space), the matrix is not that sparse. In fact, counting with TfidfVectorizer yields tens of thousands of features; on the full sample I got about 130,000 dimensions, a fairly sparse matrix.
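To see what HashingVectorizer does under the hood, here is a minimal pure-Python sketch of the hashing trick. The function name hash_features and the use of zlib.crc32 are illustrative assumptions, not scikit-learn internals (the real implementation also uses a signed hash to reduce collision bias, and normalizes rows):

```python
import zlib

def hash_features(tokens, n_features=16):
    # Hashing trick sketch: each token is hashed into one of
    # n_features buckets, and we count hits per bucket.
    # Collisions are possible but dimensionality stays fixed.
    vec = [0] * n_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[idx] += 1
    return vec

doc = "the quick brown fox jumps over the lazy dog".split()
v = hash_features(doc)
print(sum(v))   # total token count is preserved: 9
print(len(v))   # dimensionality is fixed at n_features: 16
```

No vocabulary is stored anywhere, which is why HashingVectorizer needs no fitting state and transforms train and test into the same number of dimensions automatically.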
**************************************************************************************************************************
As the code comments above note, extracting TF-IDF features separately on train and test yields different feature dimensions. So how do you make them the same? There are two ways of doing this:
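Before the two methods, a toy illustration of the mismatch and the vocabulary-sharing fix. The corpus and variable names here are made up for demonstration:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["graphics card driver", "mac hardware issue"]
test_docs = ["windows driver update crash"]

# fit independently: each vectorizer learns its own vocabulary,
# so the resulting dimensions differ
v_train = CountVectorizer().fit(train_docs)
v_test = CountVectorizer().fit(test_docs)
print(len(v_train.vocabulary_), len(v_test.vocabulary_))  # 6 4

# fix: reuse the train vocabulary on the test set, so test vectors
# land in the same 6-dimensional space
v_shared = CountVectorizer(vocabulary=v_train.vocabulary_)
X_test = v_shared.fit_transform(test_docs)
print(X_test.shape)  # (1, 6)
```

Both methods below are variations on this vocabulary-sharing idea.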
Method 2. CountVectorizer + TfidfTransformer
Let the two CountVectorizers share the same vocabulary:
#----------------------------------------------------
#method 1: CountVectorizer + TfidfTransformer
print '*************************\nCountVectorizer+TfidfTransformer\n*************************'
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
counts_train = count_v1.fit_transform(newsgroup_train.data)
print "The shape of train is " + repr(counts_train.shape)
#reuse the train vocabulary on the test set
count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
counts_test = count_v2.fit_transform(newsgroups_test.data)
print "The shape of test is " + repr(counts_test.shape)
tfidftransformer = TfidfTransformer()
tfidf_train = tfidftransformer.fit(counts_train).transform(counts_train)
tfidf_test = tfidftransformer.fit(counts_test).transform(counts_test)
Result: *************************
CountVectorizer+TfidfTransformer
*************************
The shape of train is (2936, 66433)
The shape of test is (1955, 66433)
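As a sanity check on what TfidfTransformer computes, here is a minimal stdlib sketch of smoothed tf-idf weighting, idf = ln((1+n)/(1+df)) + 1, in the same spirit as scikit-learn's default smooth_idf but without the L2 row normalization; the function and corpus are illustrative:

```python
import math

def tfidf(docs):
    # docs: list of token lists. Returns a dense weight matrix
    # with columns ordered by the sorted vocabulary.
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}          # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}  # smoothed idf
    return [[d.count(t) * idf[t] for t in vocab] for d in docs]

docs = [["apple", "banana"], ["apple", "apple", "cherry"]]
W = tfidf(docs)
# "apple" appears in every document, so its idf is ln(3/3)+1 = 1
print(W[0][0])  # 1.0
print(W[1][0])  # 2.0
```

Terms that occur in every document get the minimum weight, while rarer terms are boosted, which is the whole point of the idf factor.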
Method 3. TfidfVectorizer
Let the two TfidfVectorizers share the same vocabulary:
#method 2: TfidfVectorizer
print '*************************\nTfidfVectorizer\n*************************'
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(sublinear_tf=True,
                     max_df=0.5,
                     stop_words='english')
tfidf_train_2 = tv.fit_transform(newsgroup_train.data)
#reuse the train vocabulary on the test set
tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
print "The shape of train is " + repr(tfidf_train_2.shape)
print "The shape of test is " + repr(tfidf_test_2.shape)
analyze = tv.build_analyzer()
tv.get_feature_names()  #statistical features/terms
Result: *************************
TfidfVectorizer
*************************
The shape of train is (2936, 66433)
The shape of test is (1955, 66433)
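The build_analyzer() call above returns the vectorizer's full preprocessing pipeline (lowercasing, tokenization, stop-word removal) as a plain callable, which is handy for inspecting how a string gets tokenized. A small illustration on a made-up two-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus purely for illustration
tv = TfidfVectorizer(stop_words='english')
tv.fit(["The graphics card crashed", "the driver crashed again"])

# build_analyzer() returns the preprocessing + tokenization callable
analyze = tv.build_analyzer()
print(analyze("The Graphics CARD"))  # ['graphics', 'card']
```

Note that "The" is dropped by the English stop-word list and everything is lowercased before tokens are matched against the vocabulary.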
In addition, sklearn ships a pre-vectorized version of this dataset: fetch_20newsgroups_vectorized.
Method 4. fetch_20newsgroups_vectorized
However, this method cannot extract features for just a few chosen classes; it only returns the features of all 20 classes at once: