Sklearn for text categorization


While working on a text-mining paper I could not find a unified benchmark, so I had to run the programs myself. If any reader knows of published benchmark results for 20newsgroups or other common text-classification datasets (ideally per-class classification results, using all or a subset of the features), please leave a message to let me know. Many thanks.

Now, on to the main topic. Three versions of the 20newsgroups dataset are available on its website; here we use the most original one, 20news-19997.tar.gz.


The work is divided into the following steps:

1. Loading the dataset
2. Feature extraction
3. Classification (Naive Bayes, KNN, SVM)
4. Clustering

Note: the official scikit-learn site has a reference example, but it looks a bit messy and contains bugs, so this article walks through it block by block.
Environment: Python 2.7 + scikit-learn (which builds on SciPy)
1. Loading the dataset
Download 20news-19997.tar.gz, unzip it into the scikit_learn_data folder, and load the data as below (see the code comments). Note that the test split is loaded here as well, since the feature-extraction code in step 2 uses it.

#first extract the 20 news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
#all categories
#newsgroup_train = fetch_20newsgroups(subset='train')
#part categories
categories = ['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
#the test split is needed later for feature extraction and evaluation
newsgroup_test = fetch_20newsgroups(subset='test', categories=categories)

You can check that the data loaded correctly:

#print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))

Result:
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
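Beyond the category names, a couple of extra sanity checks can be useful. The snippet below is a small illustrative addition of my own, not from the original article; it relies only on the data and target attributes that fetch_20newsgroups returns:

#extra sanity checks (illustrative addition, not from the original article)
print len(newsgroup_train.data)       #number of training documents
print newsgroup_train.target[:10]     #integer labels of the first 10 documents
print newsgroup_train.data[0][:200]   #first 200 characters of the first document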

2. Extracting features
The newsgroup_train we just loaded is a collection of documents; we need to extract feature vectors from it (word frequencies and the like) using fit_transform.
Method 1. HashingVectorizer, specifying the number of features
#newsgroup_train.data is the original documents, but we need to extract
#feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
#HashingVectorizer is stateless, so calling fit_transform on the test set
#is safe here (no vocabulary is learned from the training data)
fea_test = vectorizer.fit_transform(newsgroup_test.data)

#return feature vector 'fea_train' [n_samples, n_features]
print 'Size of fea_train: ' + repr(fea_train.shape)
print 'Size of fea_test: ' + repr(fea_test.shape)
#11314 documents, 130107 vectors for all categories
print 'The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100)
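The excerpt ends here, before the classification steps listed in the overview. As a bridge to the Naive Bayes step, here is a minimal sketch of my own (not from the original article) that trains a Multinomial Naive Bayes classifier on the features extracted above; the alpha value is an arbitrary smoothing choice for illustration:

#minimal Naive Bayes sketch (illustrative; not from the original article)
import numpy as np
from sklearn.naive_bayes import MultinomialNB

#MultinomialNB requires non-negative features, which is why the vectorizer
#above was created with non_negative=True
clf = MultinomialNB(alpha=0.01)   #alpha is an arbitrary smoothing value
clf.fit(fea_train, newsgroup_train.target)
pred = clf.predict(fea_test)

#simple accuracy on the test split
print 'NB accuracy: ' + repr(np.mean(pred == newsgroup_test.target))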
