Applying scikit-learn to text categorization


http://blog.csdn.net/abcjennifer/article/details/23615947

I could not find a unified benchmark in text-mining papers, so I had to run the experiments myself. If any reader knows of published classification benchmarks on 20newsgroups or other public datasets (ideally results over all classes; whether all features or only a subset were used does not matter), please leave a message to let me know. Many thanks!

Now, on to the text. The 20newsgroups website provides three datasets; here we use the most original one, 20news-19997.tar.gz.

It is divided into the following processes:

    • Load the data set
    • Extract features
    • Classification
      • Naive Bayes
      • KNN
      • SVM
    • Clustering
Note: the official scikit-learn documentation can serve as a reference, but it looks a bit messy and has some bugs, so this article walks through it block by block.

Environment: Python 2.7 + SciPy (scikit-learn)

1. Load the data set

Download the dataset 20news-19997.tar.gz, unzip it into the scikit_learn_data folder, and load the data; see the code comments.
# first extract the 20 newsgroups dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
# all categories
# newsgroup_train = fetch_20newsgroups(subset='train')
# part of the categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)


You can check whether the data loaded correctly:
# print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))

Result:

['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']

2. Extract features

The newsgroup_train we just loaded is a collection of documents; we now want to extract features from it, i.e. word-frequency statistics and the like, using the fit_transform method.

Method 1. HashingVectorizer, specifying the number of features:
# newsgroup_train.data holds the original documents; we need to extract
# feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroups_test.data)
# return feature vector 'fea_train' [n_samples, n_features]
print 'Size of fea_train: ' + repr(fea_train.shape)
print 'Size of fea_test: ' + repr(fea_test.shape)
# 11314 documents, 130107 vectors for all categories
print 'The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100)

Result:

Size of fea_train: (2936, 10000)
Size of fea_test: (1955, 10000)
The average feature sparsity is 1.002%

Because we keep only 10,000 words, i.e. a 10000-dimensional feature space, the sparsity is not that low. In fact, using TfidfVectorizer we can get far more features; over the whole sample I counted roughly 130,000 dimensions, which gives a very sparse matrix.
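To sanity-check the claim about the full dimensionality, here is a minimal sketch (not from the original post; it assumes newsgroup_train is loaded as above, and the exact count depends on tokenization and stop-word settings) that fits a vocabulary-based CountVectorizer and prints the size of its learned vocabulary:

# sketch: how many distinct terms a full vocabulary-based vectorizer would produce
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
cv.fit(newsgroup_train.data)
print 'vocabulary size: %d' % len(cv.vocabulary_)   # tens of thousands of terms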

As the code comments above suggest, with TF-IDF the feature dimensions extracted from the train and test sets come out different. How do we make them the same? There are two ways of doing this (Methods 2 and 3 below):
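For reference (this is not from the original post), the usual scikit-learn pattern for keeping train and test dimensions aligned is to fit one vectorizer on the training data and then call transform, not fit_transform, on the test data; the two methods below achieve the same result by sharing the learned vocabulary explicitly. A minimal sketch, where the names tfv, X_train and X_test are only for illustration:

# sketch: fit the vectorizer on train, reuse it (transform only) on test,
# so both matrices share the same vocabulary and number of columns
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = tfv.fit_transform(newsgroup_train.data)   # learns the vocabulary
X_test = tfv.transform(newsgroups_test.data)        # reuses that vocabulary
print 'train shape: ' + repr(X_train.shape)
print 'test shape: ' + repr(X_test.shape)            # same number of columns as train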

Method 2. CountVectorizer + TfidfTransformer, letting the two CountVectorizers share a vocabulary:
# ----------------------------------------------------
# method 2: CountVectorizer + TfidfTransformer
print '*************************\nCountVectorizer+TfidfTransformer\n*************************'
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
counts_train = count_v1.fit_transform(newsgroup_train.data)
print "the shape of train is " + repr(counts_train.shape)
count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
counts_test = count_v2.fit_transform(newsgroups_test.data)
print "the shape of test is " + repr(counts_test.shape)
tfidftransformer = TfidfTransformer()
tfidf_train = tfidftransformer.fit(counts_train).transform(counts_train)
tfidf_test = tfidftransformer.fit(counts_test).transform(counts_test)

Results:

*************************
CountVectorizer+TfidfTransformer
*************************
the shape of train is (2936, 66433)
the shape of test is (1955, 66433)

Method 3. TfidfVectorizer, letting the two TfidfVectorizers share a vocabulary:
# method 3: TfidfVectorizer
print '*************************\nTfidfVectorizer\n*************************'
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(sublinear_tf=True,
                     max_df=0.5,
                     stop_words='english')
tfidf_train_2 = tv.fit_transform(newsgroup_train.data)
tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
print "the shape of train is " + repr(tfidf_train_2.shape)
print "the shape of test is " + repr(tfidf_test_2.shape)
analyze = tv.build_analyzer()
tv.get_feature_names()  # statistical features/terms


Results:

*************************
TfidfVectorizer
*************************
the shape of train is (2936, 66433)
the shape of test is (1955, 66433)

In addition, sklearn ships a pre-vectorized version of this feature extraction, fetch_20newsgroups_vectorized.

Method 4. fetch_20newsgroups_vectorized; however, this method cannot pick out the features of just a few classes, it only returns the features of all 20 classes at once:
print '*************************\nfetch_20newsgroups_vectorized\n*************************'
from sklearn.datasets import fetch_20newsgroups_vectorized
tfidf_train_3 = fetch_20newsgroups_vectorized(subset='train')
tfidf_test_3 = fetch_20newsgroups_vectorized(subset='test')
print "the shape of train is " + repr(tfidf_train_3.data.shape)
print "the shape of test is " + repr(tfidf_test_3.data.shape)


Results:

*************************
fetch_20newsgroups_vectorized
*************************
the shape of train is (11314, 130107)
the shape of test is (7532, 130107)

3. Classification

3.1 Multinomial Naive Bayes classifier

See the code and comments; no further explanation needed.
######################################################
# Multinomial Naive Bayes Classifier
print '*************************\nNaive Bayes\n*************************'
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
fea_test = vectorizer.fit_transform(newsgroups_test.data)
# create the Multinomial Naive Bayes classifier
clf = MultinomialNB(alpha=0.01)
clf.fit(fea_train, newsgroup_train.target)
pred = clf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)
# notice here that f1_score is not equal to 2*precision*recall/(precision+recall),
# because the m_precision and m_recall we get are averaged, whereas metrics.f1_score()
# calculates a weighted average, i.e. it takes the size of each class into consideration

Note the last three lines of comments: why is F1 not equal to 2 * precision * recall / (precision + recall)?

The helper function calculate_result computes precision, recall, and F1:

def calculate_result(actual, pred):
    m_precision = metrics.precision_score(actual, pred)
    m_recall = metrics.recall_score(actual, pred)
    print 'predict info:'
    print 'precision:{0:.3f}'.format(m_precision)
    print 'recall:{0:0.3f}'.format(m_recall)
    print 'f1-score:{0:.3f}'.format(metrics.f1_score(actual, pred))
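To make the comment above concrete, here is a small sketch (not from the original code; it assumes newsgroups_test and pred from the Naive Bayes run above, and passes average='weighted' explicitly so it also works on multiclass targets in recent scikit-learn versions): the weighted-average F1 reported by metrics.f1_score is generally not the harmonic mean of the already-averaged precision and recall.

# sketch: weighted-average F1 vs. F1 computed from averaged precision/recall
from sklearn import metrics
p_avg = metrics.precision_score(newsgroups_test.target, pred, average='weighted')
r_avg = metrics.recall_score(newsgroups_test.target, pred, average='weighted')
f1_weighted = metrics.f1_score(newsgroups_test.target, pred, average='weighted')
f1_from_averages = 2 * p_avg * r_avg / (p_avg + r_avg)
print 'weighted-average F1:  {0:.3f}'.format(f1_weighted)
print 'F1 from averaged P/R: {0:.3f}'.format(f1_from_averages)
# the two values are close but generally not equal, because per-class averaging
# and taking the harmonic mean do not commute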



3.2 KNN:

######################################################
# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
print '*************************\nKNN\n*************************'
knnclf = KNeighborsClassifier()  # default with k=5
knnclf.fit(fea_train, newsgroup_train.target)
pred = knnclf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)
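The default k=5 is not necessarily the best choice for this data; if you want to experiment, the number of neighbors is set through the n_neighbors parameter (the value 10 below is arbitrary, purely for illustration):

# sketch: try a different (untuned) number of neighbors
knnclf = KNeighborsClassifier(n_neighbors=10)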



3.3 SVM:

######################################################
# SVM Classifier
from sklearn.svm import SVC
print '*************************\nSVM\n*************************'
svclf = SVC(kernel='linear')  # the default kernel is 'rbf'
svclf.fit(fea_train, newsgroup_train.target)
pred = svclf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)
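A side note not in the original post: for high-dimensional sparse text features, scikit-learn's LinearSVC usually trains much faster than SVC with a linear kernel and exposes the same fit/predict interface, so it can be swapped in directly. A hedged sketch with default parameters:

# sketch: a faster linear SVM for sparse text features
from sklearn.svm import LinearSVC
svclf = LinearSVC()
svclf.fit(fea_train, newsgroup_train.target)
pred = svclf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)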

Results:

*************************

Naive Bayes
*************************
Predict info:
precision:0.764
recall:0.759
f1-score:0.760
*************************
KNN
*************************
Predict info:
precision:0.642
recall:0.635
f1-score:0.636
*************************
SVM
*************************
Predict info:
precision:0.777
recall:0.774
f1-score:0.774

4. Clustering

######################################################
# KMeans Cluster
from sklearn.cluster import KMeans
print '*************************\nKMeans\n*************************'
pred = KMeans(n_clusters=5)
pred.fit(fea_test)
calculate_result(newsgroups_test.target, pred.labels_)



Results:

*************************
KMeans
*************************
Predict info:
precision:0.264
recall:0.226
f1-score:0.213
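One reason these numbers look so poor is that KMeans assigns arbitrary cluster IDs that need not line up with the true category labels, so precision/recall/F1 computed directly against the labels understate the clustering quality. A sketch (not from the original post) of two label-invariant alternatives from sklearn.metrics, assuming the fitted KMeans object pred from above:

# sketch: evaluate clustering with metrics that ignore the arbitrary cluster numbering
from sklearn import metrics
print 'adjusted Rand index: {0:.3f}'.format(
    metrics.adjusted_rand_score(newsgroups_test.target, pred.labels_))
print 'normalized mutual information: {0:.3f}'.format(
    metrics.normalized_mutual_info_score(newsgroups_test.target, pred.labels_))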

All of the code for this article can be downloaded: here

It seems that the accuracy is quite low... Let's try using all the features instead. The results are as follows:

*************************
Naive Bayes
*************************
Predict info:
precision:0.771
recall:0.770
f1-score:0.769
*************************
KNN
*************************
Predict info:
precision:0.652
recall:0.645
f1-score:0.645
*************************
SVM
*************************
Predict info:
precision:0.819
recall:0.816
f1-score:0.816
*************************
KMeans
*************************
Predict info:
precision:0.289
recall:0.313
f1-score:0.266

More Python learning material will continue to be posted, so stay tuned to this blog and to Rachel Zhang on Sina Weibo.

