Applying "Machine Learning System Design": text classification with Scikit-learn (Part 1)

Source: Internet
Author: User
Tags: idf, nltk

Objective:

This series records my thinking and practice while studying "Machine Learning System Design" (by Willi Richert). Using Python, the book presents the machine learning workflow step by step, from data processing, to feature engineering, to model selection. The source code and data sets used in the book have been uploaded to my resources: http://download.csdn.net/detail/solomon1558/8971649

Chapter 3 matches related texts using the bag-of-words model plus k-means clustering. This article mainly covers text preprocessing, which involves tokenization, data cleaning, computing TF-IDF values, and so on.

1. Counting words

Experiment with a simple data set that consists of 5 documents:

01.txt  This is a toy post about machine learning. Actually, it contains not much interesting stuff.

02.txt  Imaging databases provide storage capabilities.

03.txt  Most imaging databases safe images permanently.

04.txt  Imaging databases store data.

05.txt  Imaging databases store data. Imaging databases store data. Imaging databases store data.
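To follow along, you can first write these five posts to disk (a minimal setup sketch; the directory and file names are assumptions chosen to match the DIR used in the code below):

import os

TOY_DIR = "../data/toy"          # assumed location, matching DIR in the listings below
os.makedirs(TOY_DIR, exist_ok=True)

toy_posts = [
    "This is a toy post about machine learning. Actually, it contains not much interesting stuff.",
    "Imaging databases provide storage capabilities.",
    "Most imaging databases safe images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. Imaging databases store data.",
]
for i, text in enumerate(toy_posts, start=1):
    with open(os.path.join(TOY_DIR, "%02d.txt" % i), "w") as f:
        f.write(text)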

In this small data set we want to find the document that is closest to a new document, "imaging databases". In order to convert the raw text into feature data that a clustering algorithm can use, we first use the bag-of-words method to measure the similarity between texts and eventually generate a feature vector for each text.

The bag-of-words method is based on simple word-frequency statistics: count how often each word occurs in each post and express the result as a vector, i.e. vectorization. Scikit-learn's CountVectorizer does this word counting efficiently; the relevant functions and classes are imported from the sklearn package:

import os
from sklearn.feature_extraction.text import CountVectorizer

DIR = r"../data/toy"   # directory holding the toy posts
posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]
vectorizer = CountVectorizer(min_df=1)
X_train = vectorizer.fit_transform(posts)

Assuming the training texts are stored in the directory DIR, we pass the data set to CountVectorizer. The parameter min_df determines how CountVectorizer treats words that are used infrequently (minimum document frequency). When min_df is an integer, all words that appear in fewer than that many documents are dropped; when it is a fraction, all words that appear in less than that fraction of the whole data set are dropped.
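For example (a small illustrative sketch, not from the book):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "apple banana", "apple cherry", "apple"]

# min_df as an integer: a word must appear in at least 2 documents
cv_int = CountVectorizer(min_df=2)
cv_int.fit(docs)
print(cv_int.get_feature_names())    # ['apple', 'banana'] -- 'cherry' occurs in only 1 document

# min_df as a fraction: a word must appear in at least 75% of the documents
cv_frac = CountVectorizer(min_df=0.75)
cv_frac.fit(docs)
print(cv_frac.get_feature_names())   # ['apple']

(Newer scikit-learn versions name this method get_feature_names_out.)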

We need to fit the vectorizer on the whole data set we want to vectorize, so that it knows in advance which words to expect:

X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))
print(vectorizer.get_feature_names())

The program output is as follows; the 5 documents contain 25 distinct words:

#samples: 5, #features: 25

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

To vectorize a new document:

#a New Post "Imaging Databases"New_post_vec = Vectorizer.transform ([new_post])

To compute similarity between the count vectors we need their full arrays, obtained through the member function toarray(). The norm() function then computes the Euclidean norm of the difference between the new document's vector and each training document's vector (i.e. their Euclidean distance), which we use to measure their similarity.

import sys
import scipy as sp

# ------- calculate raw distances between new and old posts and record the shortest one -------
def dist_raw(v1, v2):
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())

best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    print("=== Post %i with dist = %.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist = %.2f" % (best_i, best_dist))

=== Post 0 with dist = 4.00: This is a toy post about machine learning. Actually, it contains not much interesting stuff.

=== Post 1 with dist = 1.73: Imaging databases provide storage capabilities.

=== Post 2 with dist = 2.00: Most imaging databases safe images permanently.

=== Post 3 with dist = 1.41: Imaging databases store data.

=== Post 4 with dist = 5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.

Best post is 3 with dist = 1.41

The results show that document 3 is most similar to the new document. However, document 4 has the same content as document 3, just repeated 3 times; it should therefore be just as similar to the new document as document 3 is.

# ------- case study: why are post 3 and post 4 different? -------
print(X_train.getrow(3).toarray())
print(X_train.getrow(4).toarray())

[[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]

[[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]

2. Text preprocessing

2.1 Word frequency vector normalization

We extend the dist_raw function from the previous section into dist_norm, which computes the distance between normalized vectors (each vector divided by its own norm):

def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

=== Post 0 with dist = 1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.

=== Post 1 with dist = 0.86: Imaging databases provide storage capabilities.

=== Post 2 with dist = 0.92: Most imaging databases safe images permanently.

=== Post 3 with dist = 0.77: Imaging databases store data.

=== Post 4 with dist = 0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.

Best post is 3 with dist = 0.77

After normalizing the word-frequency vectors, documents 3 and 4 are equally similar to the new document. From the viewpoint of word-frequency statistics, this is more correct.
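This can be checked directly: with dist_norm the distance between post 3 and post 4 themselves drops to zero (a quick sanity check using the vectors computed above):

post3_vec = X_train.getrow(3)
post4_vec = X_train.getrow(4)
print(dist_raw(post3_vec, post4_vec))    # > 0, the raw count vectors differ
print(dist_norm(post3_vec, post4_vec))   # 0.0, the normalized vectors are identical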

2.2 Removing stop words

The words "the" and "of" in the text often appear in a variety of different texts, which are called deactivation words. Removing a stop word is a common step in text processing because the stop Word does not help to differentiate text. There is a simple parameter stop_words () in the Countvectorizer to complete the task:

vectorizer = CountVectorizer(min_df=1, stop_words='english')
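If you want to see which words are filtered out, the vectorizer exposes the effective list through its get_stop_words() method (a small sketch; the exact size of the built-in English list depends on the scikit-learn version):

stop_words = vectorizer.get_stop_words()
print(len(stop_words))            # a few hundred common English words
print(sorted(stop_words)[:10])    # e.g. 'a', 'about', 'above', ...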

2.3 Stemming

In order to group together words that are semantically similar but differ in form, we need a function that reduces words to their stem. The Natural Language Toolkit (NLTK) provides an easy-to-use stemmer that we can embed into CountVectorizer.
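Before wiring it into the vectorizer, it is worth trying the stemmer interactively (typical results are shown in the comments; the exact output can vary slightly with the NLTK version):

import nltk.stem

s = nltk.stem.SnowballStemmer('english')
print(s.stem("graphics"))      # 'graphic'
print(s.stem("imaging"))       # 'imag'
print(s.stem("image"))         # 'imag'
print(s.stem("imagination"))   # 'imagin'
print(s.stem("buys"))          # 'buy'
print(s.stem("buying"))        # 'buy'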

We need to stem the posts before they are fed into CountVectorizer. The class provides several hooks for customizing the preprocessing and tokenization stages: the preprocessor and the tokenizer can be passed to the constructor as parameters. We do not want to put the stemmer into either of them, because we would then have to do the tokenization and normalization ourselves. Instead, we override the build_analyzer method:

import nltk.stem

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')

Follow the steps below to process each post:

(1) In the preprocessing phase, the original document becomes lowercase (this is done in the parent class);

(2) Extract all individual words in the tokenization stage;

(3) Convert each word into its stemmed form.
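Putting this together on the toy posts (a sketch reusing the posts from section 1; the feature count should drop below the original 25 because stop words are removed and variants such as "imaging"/"images" collapse into the stem "imag"):

vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))
print(vectorizer.get_feature_names())   # contains 'imag' instead of 'imaging'/'images'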

3. Calculate TF-IDF

So far, by counting words we have extracted compact feature vectors from noisy text. The value of each feature is the number of occurrences of the corresponding word, under the implicit assumption that a larger value means the word is more important to that text. But in the training texts, different words contribute differently to distinguishing texts.

This is addressed by counting term frequencies per text while discounting the weight of words that appear in many texts. In other words, a word should be given a high weight when it appears often in a particular text but is rarely seen elsewhere.

This is exactly what term frequency-inverse document frequency (TF-IDF) does: TF is the counting part, and IDF accounts for the discount. A simple implementation is as follows:

import scipy as sp

def tfidf(t, d, D):
    tf = float(d.count(t)) / sum(d.count(w) for w in set(d))
    idf = sp.log(float(len(D)) / (len([doc for doc in D if t in doc])))
    return tf * idf
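A quick sanity check on three tiny token lists (here each "document" is just a Python list of words; the values follow directly from the formula above):

a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
D = [a, abb, abc]

print(tfidf("a", a, D))     # 0.0   -- 'a' occurs in every document, so it carries no information
print(tfidf("b", abb, D))   # ~0.27 -- 'b' is frequent in abb but occurs in only 2 of 3 documents
print(tfidf("c", abc, D))   # ~0.37 -- 'c' occurs only in abc, so it receives the largest weight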

In practice, Scikit-learn has already wrapped this algorithm in TfidfVectorizer (which inherits from CountVectorizer). With it, the document vectors we obtain no longer contain raw word counts but the TF-IDF value of each word.

Code Listing:

import os
import sys
import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer

DIR = r"../data/toy"
posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]
new_post = "imaging databases"

import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

# vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')

from sklearn.feature_extraction.text import TfidfVectorizer

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
print(vectorizer)

X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))

new_post_vec = vectorizer.transform([new_post])
print(new_post_vec, type(new_post_vec))
print(new_post_vec.toarray())
print(vectorizer.get_feature_names())

def dist_raw(v1, v2):
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())

def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

dist = dist_norm
best_dist = sys.maxsize
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist(post_vec, new_post_vec)
    print("=== Post %i with dist = %.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist = %.2f" % (best_i, best_dist))

4. Summary

The steps included in the text preprocessing process are summarized as follows:

(1) Tokenize the text;

(2) Throw away words that occur too frequently and do not help distinguish relevant documents;

(3) Throw away words that occur so rarely that they are unlikely to appear in future posts;

(4) Count the remaining words;

(5) Consider the whole corpus and compute TF-IDF values from the word-frequency statistics.

Through this process, we convert a pile of noisy text into a concise feature representation. However, although the bag-of-words model and its extensions are simple and effective, there are still some drawbacks to be aware of:

(1) It does not cover relationships between words. With the vectorization method above, the texts "car hits wall" and "wall hits car" have the same feature vector.

(2) It cannot capture negation. For example, "I will eat ice cream" and "I will not eat ice cream" mean the opposite, yet their feature vectors are very similar. This is easy to fix: count not only single words (unigrams) but also pairs of words (bigrams) or triples of words (trigrams), as in the sketch after this list.

(3) It fails completely on misspelled words.
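As noted in point (2), one remedy is to count pairs of neighboring words in addition to single words. Scikit-learn supports this directly through the ngram_range parameter of CountVectorizer and TfidfVectorizer; a minimal sketch:

from sklearn.feature_extraction.text import CountVectorizer

# count unigrams and bigrams, so that 'not eat' becomes a feature of its own
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X = bigram_vectorizer.fit_transform(["I will eat ice cream",
                                     "I will not eat ice cream"])
print(bigram_vectorizer.get_feature_names())
# includes single words plus pairs such as 'not eat', 'eat ice', 'ice cream',
# which lets the two posts be distinguished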

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission.
