"Machine learning Experiment" uses naive Bayes to classify text

Tags: tf-idf

Introduction

Naive Bayes is a simple but powerful probabilistic model built on Bayes' theorem: it determines the probability that an object belongs to a given class from the probabilities of its individual features. The method rests on the assumption that all features are independent of one another, i.e. the value of any one feature tells us nothing about the values of the others.
Although this conditional independence assumption is rarely fully satisfied, or even tenable, in many applications, the simplified Bayesian classifier still achieves good classification accuracy in practice. Training the model amounts to computing the relevant conditional probabilities, which can be estimated from the frequency with which each feature occurs within each class.
One of the most successful applications of naive Bayes is natural language processing, where the data are annotated text documents that can serve as a training set for a machine learning algorithm.
In this section we use the naive Bayes method to classify text: we train a naive Bayes classifier on a set of text documents labelled with their categories and then predict the categories of unseen instances. The same approach can be used, for example, as a spam filter.
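To make the idea concrete, here is a minimal sketch (not taken from the original article) of how a naive Bayes text classifier estimates conditional probabilities from word frequencies; the toy spam/ham training data and the smoothing value are invented purely for illustration:

from collections import Counter
from math import log

# toy labelled corpus (invented for illustration)
train = [('spam', 'buy cheap pills'), ('spam', 'cheap pills online'),
         ('ham', 'meeting schedule today'), ('ham', 'project meeting notes')]

doc_counts = Counter(label for label, text in train)
word_counts = {label: Counter() for label in doc_counts}
for label, text in train:
    word_counts[label].update(text.split())
vocab = set(w for c in word_counts for w in word_counts[c])

def log_score(text, c, alpha=1.0):
    # log P(c) + sum of log P(word | c), estimated from word frequencies
    # per class, with Laplace (alpha) smoothing for unseen words
    total = sum(word_counts[c].values())
    s = log(doc_counts[c] / float(sum(doc_counts.values())))
    for w in text.split():
        s += log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
    return s

print max(doc_counts, key=lambda c: log_score('cheap pills for the meeting', c))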

Data set

The data for this experiment is a collection of newsgroup messages obtained through scikit-learn.
The dataset consists of about 19,000 messages covering 20 different topics, including politics, sports, science, and more.
The dataset is divided into a training part and a test part, with the split based on messages posted before or after a specific date.

There are two ways to load data:

  1. sklearn.datasets.fetch_20newsgroups returns a list of raw texts that can be passed to a text feature extractor such as sklearn.feature_extraction.text.CountVectorizer.
  2. sklearn.datasets.fetch_20newsgroups_vectorized returns ready-to-use features directly, so no further feature extraction is needed.
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')
print news.keys()
print type(news.data), type(news.target), type(news.target_names)
print news.target_names
print len(news.data)
print len(news.target)

Printing information:

['DESCR', 'data', 'target', 'target_names', 'filenames']

print news.data[0]
print news.target[0], news.target_names[news.target[0]]

The news text itself is omitted here; its target value is 10, and the corresponding category name is rec.sport.hockey.
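For the second loading interface mentioned above, a minimal sketch of its use (with default parameters, shown only as an illustration) could be:

from sklearn.datasets import fetch_20newsgroups_vectorized

# this interface returns pre-extracted features: news_vec.data is already
# a sparse (documents x terms) matrix, so no feature-extraction step is needed
news_vec = fetch_20newsgroups_vectorized(subset='all')
print news_vec.data.shape
print len(news_vec.target)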

Preprocessing of data

Machine learning algorithms can only work on numerical data: they expect fixed-length numeric feature vectors rather than raw text files of variable length, so our next step is to convert the text dataset into a numeric one.
At the moment we have only one feature, the text content of each news message, so we need a way to turn a piece of text into a set of meaningful numeric features.
Intuitively, we can look at the distinct strings (more precisely, tokens) occurring in the texts of each class and describe the frequency distribution of those tokens for each category. The sklearn.feature_extraction.text module provides useful tools for building numeric feature vectors from text documents.

Dividing training and testing data

Before doing the conversion we need to split the data into training and test sets. Since the loaded data already appears in random order, we can simply take the first 75% as training data and the remaining 25% as test data:

SPLIT_PERC = 0.75
split_size = int(len(news.data) * SPLIT_PERC)
X_train = news.data[:split_size]
X_test = news.data[split_size:]
Y_train = news.target[:split_size]
Y_test = news.target[split_size:]

Alternatively, sklearn.datasets.fetch_20newsgroups lets you select the training or test portion through the subset parameter; the predefined training set has 11,314 documents, about 60% of the whole collection, and the test set holds the remaining 40%. This split can be obtained with:

news_train = fetch_20newsgroups(subset='train')
news_test = fetch_20newsgroups(subset='test')
X_train = news_train.data
X_test = news_test.data
Y_train = news_train.target
Y_test = news_test.target

Bag-of-words representation

The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model a text (a paragraph or a document) is treated as an unordered collection of words, ignoring grammar and even word order.
The bag-of-words model is used in several text classification methods; when traditional Bayesian classification is applied to text, the conditional independence assumption in Bayes leads naturally to the bag-of-words model.
Scikit-learn provides utilities for extracting numeric features from text content in the most common ways, namely:

  • tokenizing: splitting text into tokens, for example on whitespace and punctuation (Chinese text additionally requires word segmentation), and assigning an integer ID to each possible token
  • counting: counting how often each token occurs in each text
  • normalizing and weighting: down-weighting tokens that appear in most samples/documents

The strategy of turning a collection of text files into numeric feature vectors through tokenizing, counting, and normalizing is called the bag of words.

Under this strategy, features and samples are defined as follows:
the frequency of each individual token is treated as a feature, whether normalized or not;
the vector of the frequencies of all tokens in a given document is treated as a multivariate sample.
A corpus of texts can therefore be represented as a matrix in which each row stands for a document and each column stands for a token that appears in the corpus.

A text can thus be characterized by the occurrence frequencies of its words, completely ignoring the relative positions of words in the text, which is consistent with the conditional independence assumption of naive Bayes.
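As a small illustration (not part of the original article), the following sketch shows how CountVectorizer turns two short, made-up documents into such a document-by-term count matrix:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the dog ate my homework']
vect = CountVectorizer()
X = vect.fit_transform(docs)   # rows = documents, columns = vocabulary terms
print vect.get_feature_names() # the tokens that define the columns
print X.toarray()              # word order is lost, only counts remain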

Sparse Nature

Most documents use only a small subset of all the words in the corpus, so the resulting matrix contains many feature values equal to 0 (often more than 99% of them).
For example, a collection of 10,000 short texts (such as emails) might draw on a total vocabulary of 100,000 words, while each individual document uses only 100 to 1,000 unique words.
To store such a matrix in memory, and to keep matrix/vector algebra fast, a sparse representation such as those provided by the scipy.sparse package is typically used.
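As a rough check (illustrative only, assuming news has been loaded as in the earlier section), one can vectorize a slice of the news data and look at how sparse the resulting matrix actually is:

from sklearn.feature_extraction.text import CountVectorizer

X_sample = CountVectorizer().fit_transform(news.data[:1000])
print type(X_sample)   # a scipy.sparse matrix, not a dense array
density = float(X_sample.nnz) / (X_sample.shape[0] * X_sample.shape[1])
print 'non-zero entries: %d, density: %.4f' % (X_sample.nnz, density)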

Interface for Text feature extraction

sklearn.feature_extraction.text provides the following tools for constructing feature vectors:

  • feature_extraction.text.CountVectorizer([...]): convert a collection of text documents to a matrix of token counts
  • feature_extraction.text.HashingVectorizer([...]): convert a collection of text documents to a matrix of token occurrences
  • feature_extraction.text.TfidfTransformer([...]): transform a count matrix to a normalized tf or tf-idf representation
  • feature_extraction.text.TfidfVectorizer([...]): convert a collection of raw documents to a matrix of tf-idf features

Explanation:

  • The CountVectorizer method builds a dictionary of words; each word becomes one numeric feature of the feature vector, and each element is the number of times that particular word occurs in the text.
  • The HashingVectorizer method implements a hashing function that maps tokens to feature indices, and then computes counts as CountVectorizer does.
  • TfidfVectorizer uses a more advanced weighting scheme called term frequency-inverse document frequency (tf-idf). This is a statistic that measures how important a word is within a text or corpus. Intuitively, it looks for words that are frequent in the current document while comparing against how frequent the word is across the whole corpus. This normalizes the result and prevents words that are frequent everywhere from dominating the characterization of an instance (for example, "a" and "and" are very frequent in English but say little about the content of a text); see the small sketch after this list.
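To illustrate the effect of the tf-idf weighting (a made-up example, not from the original article), compare the raw counts with the tf-idf weights for a word that appears in every document:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the cat sat on the mat',
        'the dog chased the cat',
        'the bird flew away']
print CountVectorizer().fit_transform(docs).toarray()
# 'the' occurs in every document, so its idf (and hence its tf-idf weight)
# is lower than that of words confined to a single document, such as 'bird'
print TfidfVectorizer().fit_transform(docs).toarray().round(2)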

Constructing a naive Bayes classifier

Because we use word occurrence counts as features, we can describe this feature with a multinomial distribution. We use the MultinomialNB class from the sklearn.naive_bayes module to build the classifier.
We use the Pipeline class to build compound classifiers that combine a vectorizer with a classifier.

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

# nbc means naive Bayes classifier
nbc_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
nbc_2 = Pipeline([
    ('vect', HashingVectorizer(non_negative=True)),
    ('clf', MultinomialNB()),
])
nbc_3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
nbcs = [nbc_1, nbc_2, nbc_3]

Cross-validation

Below we design a cross-validation function to test the performance of the classifier:

from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem
import numpy as np

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross validation iterator of K folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    print "Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores))

Split the training data into 5 folds and print the cross-validation scores:

for nbc in nbcs:
    evaluate_cross_validation(nbc, X_train, Y_train, 5)

The output is:

[0.82589483 0.83473266 0.8272205 0.84136103 0.83377542]
Mean score:0.833 (+/-0.003)
[0.76358816 0.72337605 0.72293416 0.74370305 0.74977896]
Mean score:0.741 (+/-0.008)
[0.84975696 0.83517455 0.82545294 0.83870968 0.84615385]
Mean score:0.839 (+/-0.004)

These results show that feature extraction with CountVectorizer and TfidfVectorizer performs better than with HashingVectorizer.

Optimizing feature extraction to improve classification performance

Next, we use a regular expression to parse the text and extract tokens.

Optimizing the token-extraction pattern

TfidfVectorizer has a token_pattern parameter that specifies the rule used to extract tokens.
The default regular expression is ur"\b\w\w+\b", which matches words between word boundaries and counts the underscore as a word character; a better pattern can also take hyphens and dots into account.
The new regular expression is ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b".

nbc_4 = Pipeline([
    ('vect', TfidfVectorizer(
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])
evaluate_cross_validation(nbc_4, X_train, Y_train, 5)

[0.86478126 0.85461776 0.84489616 0.85505966 0.85234306]
Mean score:0.854 (+/-0.003)

This score, 0.854, is an improvement over the previous 0.839.

Optimizing the stop-word parameter

TfidfVectorizer has a stop_words parameter that specifies a list of words to omit from the tokens, for example high-frequency words that provide no prior information about any particular topic.

def get_stop_words():
    result = set()
    # the stop-word file name was lost in the original text;
    # 'stopwords_en.txt' is used here only as a placeholder
    for line in open('stopwords_en.txt', 'r').readlines():
        result.add(line.strip())
    return result

stop_words = get_stop_words()
nbc_5 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])
evaluate_cross_validation(nbc_5, X_train, Y_train, 5)

[0.88731772 0.88731772 0.878038 0.88466637 0.88107869]
Mean score:0.884 (+/-0.002)

The score was also raised to 0.884.

Optimizing the alpha parameter of the Bayes classifier

MultinomialNB has an alpha parameter, a smoothing parameter that defaults to 1.0; here we set it to 0.01.

nbc_6 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])
evaluate_cross_validation(nbc_6, X_train, Y_train, 5)

[0.91073796 0.92532037 0.91604065 0.91294741 0.91202476]
Mean score:0.915 (+/-0.003)

The score improves further, to 0.915.

Evaluating classifier Performance

Having obtained good classifier parameters through cross-validation, we can now evaluate the classifier on the test data.

from sklearn import metrics

nbc_6.fit(X_train, Y_train)
print "Accuracy on training set:"
print nbc_6.score(X_train, Y_train)
print "Accuracy on testing set:"
print nbc_6.score(X_test, Y_test)
y_predict = nbc_6.predict(X_test)
print "Classification Report:"
print metrics.classification_report(Y_test, y_predict)
print "Confusion Matrix:"
print metrics.confusion_matrix(Y_test, y_predict)

Only the accuracy figures are shown here:

Accuracy on Training set:
0.997701962171
Accuracy on testing set:
0.846919808816

Resources

Wikipedia: bag-of-words model

When reprinting, please credit the author, Jason Ding, and the source.
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search for "jasonding1354" on Baidu to find my blog

"Machine learning Experiment" uses naive Bayes to classify text

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.