# -*- coding: utf-8 -*-
"""Text classification"""
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print(len(twenty_train.data))
print(len(twenty_train.filenames))

# Count the occurrences of each word
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)
print(count_vect.vocabulary_.get('algorithm'))

# Term frequencies only (no IDF weighting)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

# TF-IDF values
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

# Train the naive Bayes classifier
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

# Predict the categories of two new documents
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
# transform only: reuse the IDF weights fitted on the training data
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
The script classifies the 2,257 training documents loaded by fetch_20newsgroups.
CountVectorizer counts the occurrences of each word.
The counts are then converted into TF-IDF statistics. TF (term frequency) is the number of occurrences of a word in a document divided by the total number of words in that document. IDF (inverse document frequency) is the logarithm of the total number of documents divided by the number of documents containing the word. The value used here is TF * IDF: the larger the value, the more important, or the more relevant, the word is to the document.
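To make the definition concrete, here is a minimal sketch that computes TF * IDF by hand, exactly as described above. The three-document corpus and the helper name `tf_idf` are made up for illustration. Note that scikit-learn's TfidfTransformer does not use this textbook formula verbatim: by default it smooths the IDF term and L2-normalizes each document vector, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the 20 Newsgroups data.
docs = [
    "god is love",
    "opengl on the gpu is fast",
    "the gpu renders graphics",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(word, tokens):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    tf = tokens.count(word) / len(tokens)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf


# Score every word of the second document and list the highest scores first.
scores = {word: tf_idf(word, tokenized[1]) for word in tokenized[1]}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

Words that occur in only one document (such as "opengl") score higher than words shared with other documents (such as "gpu" or "is"), which is the sense in which a larger value marks a more distinctive word.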
Concrete procedure of the example:
First, the number of occurrences of each word is counted.
Then the TF-IDF values are computed.
These values are fed into the model for training.
Finally, the categories of two new documents are predicted.
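The four steps above can also be chained into a single object. The following is a minimal sketch, assuming scikit-learn's Pipeline class, which the original script does not use. The tiny training set is made up so that the example runs without downloading 20 Newsgroups, and the step names "vect", "tfidf" and "clf" are arbitrary labels.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set standing in for twenty_train.data / .target.
train_docs = [
    "god is love and faith",
    "the church teaches faith in god",
    "opengl renders graphics on the gpu",
    "the gpu makes graphics rendering fast",
]
train_labels = [
    "soc.religion.christian",
    "soc.religion.christian",
    "comp.graphics",
    "comp.graphics",
]

# Count words, weight them with TF-IDF, then train naive Bayes, in one object.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_docs, train_labels)

# predict() only applies the fitted vocabulary and IDF weights to the new
# documents; nothing is refitted on them.
docs_new = ["God is love", "OpenGL on the GPU is fast"]
predicted = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predicted):
    print(f"{doc!r} => {category}")
```

One advantage of this design is that the vectorizer and transformer are fitted only during fit(), so new documents cannot accidentally be passed through fit_transform.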
Results:
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
"Learning Notes" Scikit-learn text clustering instances