Learning Notes: a scikit-learn text classification example

Source: Internet
Author: User
Tags: scikit-learn, text clustering, idf

# -*- coding: utf-8 -*-
"""Text classification"""
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print(len(twenty_train.data))
print(len(twenty_train.filenames))

# Step 1: count the occurrences of each word
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)
print(count_vect.vocabulary_.get('algorithm'))

# Term frequencies only (IDF weighting switched off)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

# Step 2: TF-IDF values
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

# Step 3: train a multinomial naive Bayes classifier
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

# Step 4: predict the categories of two new documents
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
# transform only: the new documents must reuse the statistics fitted on the training set
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


The script classifies the 2,257 training documents that fetch_20newsgroups returns for the four selected categories. Two kinds of features are computed:

    1. Word counts: the number of occurrences of each word in each document.

    2. TF-IDF weights: TF is the number of occurrences of a word in a document divided by the total number of words in that document. IDF is the logarithm of the total number of documents divided by the number of documents containing the word. The value used here is TF * IDF; the larger it is, the more important (more relevant) the word is to that document.
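The textbook definition above can be worked through by hand. The following is a minimal sketch in plain Python on a made-up three-document corpus; the corpus and the helper names `tf`, `idf`, and `tf_idf` are illustrative assumptions, not part of the original notes. Note that scikit-learn's TfidfTransformer does not use exactly this formula: by default it applies a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each document vector, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# Tiny made-up corpus (an assumption for illustration only)
docs = [
    "god is love",
    "opengl on the gpu is fast",
    "the gpu renders graphics",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def tf(term, tokens):
    """Occurrences of term in the document divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Log of (total documents / documents containing the term)."""
    return math.log(n_docs / df[term])

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

# Score every word of the first document
scores = {t: tf_idf(t, tokenized[0]) for t in tokenized[0]}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("%-5s %.3f" % (term, score))
```

In this toy corpus "is" appears in two of the three documents, so it scores lower than "god" and "love", which each appear in only one.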

Concrete procedure:

    1. Count the occurrences of each word.

    2. Compute the TF-IDF values from the counts.

    3. Train the model (multinomial naive Bayes) on the TF-IDF features.

    4. Predict the categories of two new documents.
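The four steps above can also be chained into a single object with scikit-learn's Pipeline, which removes the risk of calling fit on new data by mistake. This is a minimal sketch, not the notes' own code: the four training sentences, their labels, and the names `train_docs`, `label_names`, and `text_clf` are made up so that the example runs without downloading the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hypothetical training set (stands in for twenty_train.data / .target)
train_docs = [
    "god faith church prayer",
    "church bible god christ",
    "opengl gpu rendering graphics",
    "graphics image rendering pixel",
]
train_labels = [0, 0, 1, 1]
label_names = ["soc.religion.christian", "comp.graphics"]

# Steps 1-3: count words, compute TF-IDF, train the classifier
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_docs, train_labels)

# Step 4: predict; the pipeline only calls transform on the new documents
docs_new = ["God is love", "OpenGL on the GPU is fast"]
predicted = text_clf.predict(docs_new)
for doc, label in zip(docs_new, predicted):
    print("%r => %s" % (doc, label_names[label]))
```

With the real dataset, the same pipeline could be fitted on twenty_train.data and twenty_train.target instead of the toy lists.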

Results:

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics



