While writing a text-mining paper I could not find a unified benchmark and had to run the programs myself. If any reader knows of published classification results on 20newsgroups or another commonly used dataset (ideally results over all classes, though results on a subset of classes or features are welcome too), please leave a message pointing to the benchmark. Many thanks.
Now, on to the main text. Three versions of the 20newsgroups dataset are available on its website; here we use the most original one, 20news-19997.tar.gz.
The work is divided into the following steps:

1. Loading the dataset
2. Extracting features
3. Classification (Naive Bayes, KNN, SVM)
4. Clustering

Note: the official scikit-learn site has a reference example for this, but it is somewhat messy and contains bugs. In this article we go through it block by block.
Environment: Python 2.7 + scipy (scikit-learn)
1. Loading data sets
Download 20news-19997.tar.gz, extract it into the scikit_learn_data folder, and load the data; see the code comments.

```python
# first extract the 20 newsgroups dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups

# all categories
# newsgroup_train = fetch_20newsgroups(subset='train')

# part of the categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
```
We can check whether the data loaded correctly:

```python
# print the category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
```
Result:

```
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
```
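To make the structure of what fetch_20newsgroups returns concrete without downloading anything, here is a minimal sketch using a hand-built sklearn Bunch with the same fields (data, target, target_names). The two documents and their labels below are invented for illustration; the real ones come from the dataset.

```python
from sklearn.utils import Bunch

# Illustrative toy mirroring the structure of the fetched dataset;
# the documents and labels here are made up, not real 20newsgroups data.
toy = Bunch(
    data=["OpenGL texture mapping question",
          "my X11 window manager keeps crashing"],
    target=[0, 1],  # each entry is an index into target_names
    target_names=["comp.graphics", "comp.windows.x"],
)

# map each document's numeric label back to its category name
for doc, label in zip(toy.data, toy.target):
    print(toy.target_names[label] + " -> " + doc)
```

The real newsgroup_train behaves the same way: newsgroup_train.target holds one integer label per document, and newsgroup_train.target_names maps those integers back to category names.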
2. Extracting features: the newsgroup_train we just loaded is a collection of documents; we need to extract features from it, i.e. word frequencies and the like, using fit_transform.
Method 1. HashingVectorizer, specifying the number of features.
```python
# newsgroup_train.data holds the original documents, but we need to
# extract feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer

# the test set is loaded the same way as the training set
newsgroup_test = fetch_20newsgroups(subset='test', categories=categories)

# note: in newer scikit-learn versions, non_negative=True has been
# replaced by alternate_sign=False
vectorizer = HashingVectorizer(stop_words='english',
                               non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroup_test.data)

# the returned feature matrices have shape [n_samples, n_features]
print 'Size of fea_train: ' + repr(fea_train.shape)
print 'Size of fea_test: ' + repr(fea_test.shape)
# 11314 documents, 130107 vectors for all categories
print 'The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100)
```
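As a self-contained illustration of the hashing trick that needs no download, here is a tiny invented corpus run through HashingVectorizer. This sketch assumes a recent scikit-learn, where alternate_sign=False plays the role of the older non_negative=True; norm=None is used so the raw hashed counts are kept instead of being normalized.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# toy corpus, invented for illustration only
docs = ["the quick brown fox jumps over the lazy dog",
        "graphics cards render polygons quickly"]

# alternate_sign=False keeps the counts non-negative (the newer
# replacement for non_negative=True); norm=None skips normalization
vectorizer = HashingVectorizer(stop_words='english',
                               alternate_sign=False,
                               norm=None,
                               n_features=32)
X = vectorizer.transform(docs)  # hashing is stateless: no fit needed

print('shape: ' + repr(X.shape))
sparsity = X.nnz / float(X.shape[0] * X.shape[1]) * 100
print('sparsity: {0:.1f}%'.format(sparsity))
```

Because the vectorizer is stateless (each word is mapped to a column by a fixed hash function), transforming train and test sets separately is safe here, unlike with vocabulary-based vectorizers, where the vocabulary must be fitted on the training set only.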