Improving the Naive Bayes Algorithm Step by Step



Introduction

If your understanding of Naive Bayes is still in its infancy, that is, you understand the basic principles and assumptions but have not yet implemented product-level code, this article will help you improve the original Naive Bayes algorithm step by step. Along the way you will see some unreasonable aspects and limitations of the Naive Bayes assumptions, and understand why these assumptions, while simplifying the algorithm, make the final classification result worse; improvements are then proposed to address each of these problems.

 

 

 

Naive Bayes

Source: Machine Learning by Tom M. Mitchell

Symbols and terminology

Assume that the instance X to be classified can be described by a set of attribute values (a1, a2, ..., an). The target category set is V = {v1, v2, ..., vk}.

Algorithm Derivation

The goal of Bayes classification is to obtain the most probable target value given the instance's attribute values. The expression is as follows:

v_MAP = argmax_{vj in V} P(vj | a1, a2, ..., an)

By Bayes' theorem, this expression can be rewritten as

v_MAP = argmax_{vj in V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an) = argmax_{vj in V} P(a1, a2, ..., an | vj) P(vj)

since the denominator is the same for every vj and can be dropped.

Naive Bayes hypothesis

Naive Bayes hypothesis: the attribute values of instance X are conditionally independent of each other given the target value.

According to the multiplication theorem of probability, this assumption gives

P(a1, a2, ..., an | vj) = Π_i P(ai | vj)

Then the Naive Bayes expression is

v_NB = argmax_{vj in V} P(vj) Π_i P(ai | vj)
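
To make the decision rule concrete, here is a minimal sketch in Python. The attribute values, classes, and probability tables are hypothetical placeholders chosen only for illustration; in practice they would be estimated from training data.

    import math

    # Illustrative probability tables (not taken from any real dataset).
    priors = {"spam": 0.4, "ham": 0.6}                      # P(vj)
    cond = {                                                # P(ai | vj)
        "spam": {"free": 0.30, "meeting": 0.05, "offer": 0.20},
        "ham":  {"free": 0.05, "meeting": 0.25, "offer": 0.02},
    }

    def naive_bayes(attributes):
        # v_NB = argmax_vj P(vj) * prod_i P(ai | vj); computed in log space to avoid underflow.
        scores = {}
        for v, p_v in priors.items():
            scores[v] = math.log(p_v) + sum(math.log(cond[v][a]) for a in attributes)
        return max(scores, key=scores.get)

    print(naive_bayes(["free", "offer"]))                   # -> "spam" with these toy numbers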

 

Multinomial Naive Bayes (MNB)

Symbols and terminology

Multinomial Naive Bayes treats a document as a sequence of words and builds a multinomial model over the words in the document. Assume that the set of target categories is fixed. Each category can then be described by a set of multinomial parameters: for a class c there are parameters θc1, θc2, ..., θcn (n is the size of the dictionary formed by all words), where θci represents the probability of word i appearing in class c. Because these parameters form a probability distribution over the dictionary, they must sum to 1:

Σ_i θci = 1

Formula Derivation

The probability of a document given a class is the product of the likelihoods of all the word features appearing in that class. The formula is as follows:

P(d | θc) ∝ Π_i (θci)^fi

Here fi is the number of times word i appears in document d. We can now use the Bayes formula and the MAP hypothesis mentioned in the Naive Bayes explanation above to perform the derivation step by step.

Bayes formula:

P(θc | d) = P(d | θc) P(θc) / P(d)

Applying the MAP hypothesis and taking the logarithm:

l_MNB(d) = argmax_c [ log P(θc) + Σ_i fi log θci ]

In the formula above, wci = log θci is the weight of word i in category c. How to estimate this weight has been a focus of research, because it determines the performance of the Naive Bayes classifier.

 

Selecting θci

Source: heckerman, D. (1995). A tutorial on learning with Bayesian Networks (Technical Report MSR-TR-95-06). Microsoft Research

Based on this paper, a simple estimation formula is given:

θci = (Nci + αi) / (Nc + α)

Here Nci is the number of times word i appears in the documents of category c, Nc is the total number of word occurrences in the documents of category c, αi is the smoothing coefficient for word i, and α is the sum of the smoothing coefficients of all words. The common choice is Laplace smoothing: every αi is set to 1, so that α equals the dictionary size n.
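
As a small sketch (the toy corpus and class names below are invented purely for illustration), the smoothed estimate can be computed directly from word counts:

    from collections import Counter

    def estimate_theta(docs_by_class, vocabulary, alpha_i=1.0):
        # theta_ci = (N_ci + alpha_i) / (N_c + alpha); Laplace smoothing when alpha_i = 1.
        alpha = alpha_i * len(vocabulary)                   # sum of all smoothing coefficients
        theta = {}
        for c, docs in docs_by_class.items():
            counts = Counter(w for doc in docs for w in doc)            # N_ci
            total = sum(counts.values())                                # N_c
            theta[c] = {w: (counts[w] + alpha_i) / (total + alpha) for w in vocabulary}
        return theta

    # Toy training set: two classes, a handful of word-tokenized documents.
    docs_by_class = {
        "sports": [["ball", "goal", "goal"], ["ball", "team"]],
        "tech":   [["code", "ball"], ["code", "code", "cpu"]],
    }
    vocab = {"ball", "goal", "team", "code", "cpu"}
    theta = estimate_theta(docs_by_class, vocab)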

 

Substituting this estimate into the previous MAP expression gives

l_MNB(d) = argmax_c [ log P(θc) + Σ_i fi log( (Nci + αi) / (Nc + α) ) ]

Here P(θc) is the prior probability of class c, which can be estimated in the same way as the word probabilities. It is worth noting, however, that in the preceding formula the class prior is not dominant relative to the word likelihood (the sum on the right of the formula). We can therefore use a uniform prior and drop the term.

This gives the final multinomial Naive Bayes (MNB) formula:

l_MNB(d) = argmax_c Σ_i fi log( (Nci + αi) / (Nc + α) )

where fi is the number of times word i appears in the document d being classified.
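
Continuing the sketch above (still with the invented toy corpus), the final MNB rule with a uniform, dropped prior is just a sum of weighted log probabilities:

    import math
    from collections import Counter

    def mnb_classify(document, theta):
        # l_MNB(d) = argmax_c sum_i f_i * log(theta_ci)
        freqs = Counter(document)                                       # f_i
        scores = {c: sum(f * math.log(t[w]) for w, f in freqs.items() if w in t)
                  for c, t in theta.items()}
        return max(scores, key=scores.get)

    print(mnb_classify(["ball", "ball", "goal"], theta))    # -> "sports" with the toy counts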

 

Complement Naive Bayes (CNB)

Complement Naive Bayes (CNB) is so named because it estimates its parameters from the complement of each class.

Consider the following question: assume the training dataset is skewed, that is, the number of documents in each category of the training dataset differs. How does this affect the classification results?

If category c has only a few training documents, the estimated weights of words belonging to category c are unreliable and tend to be low, which biases the classification results toward categories with more documents. To reduce this effect, a method that estimates the parameters from the complement set of each class was proposed; it is referred to as CNB.

Whereas MNB accumulates word counts over the documents of class c itself, CNB accumulates them over all classes other than c. The rationale is that when there are many classes, the complement of each class contains a relatively large number of documents, and the sizes of these complements are similar across classes; this reduces the influence of the per-category document counts on the classification result.

Therefore, the CNB parameters are estimated as

θ~ci = (N~ci + αi) / (N~c + α)

where N~ci is the number of times word i appears in documents that do not belong to category c, and N~c is the total number of word occurrences in those documents.

Therefore, the CNB formula is

l_CNB(d) = argmax_c [ - Σ_i fi log θ~ci ]

The negative sign appears because the weight of word i in category c and its weight in the complement of c point in opposite directions: a document should be assigned to the category whose complement it matches least, so we take the complement weights with a minus sign in front and still choose the argmax.
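
A minimal CNB sketch along the same lines (again using the invented toy data structures from earlier, not production code): the only changes from MNB are that the counts for class c are gathered from every class except c, and the score is negated.

    import math
    from collections import Counter

    def estimate_complement_theta(docs_by_class, vocabulary, alpha_i=1.0):
        # theta_~ci = (N_~ci + alpha_i) / (N_~c + alpha): counts from all classes other than c.
        alpha = alpha_i * len(vocabulary)
        theta_comp = {}
        for c in docs_by_class:
            counts = Counter(w
                             for other, docs in docs_by_class.items() if other != c
                             for doc in docs for w in doc)
            total = sum(counts.values())
            theta_comp[c] = {w: (counts[w] + alpha_i) / (total + alpha) for w in vocabulary}
        return theta_comp

    def cnb_classify(document, theta_comp):
        # l_CNB(d) = argmax_c [ - sum_i f_i * log(theta_~ci) ]
        freqs = Counter(document)
        scores = {c: -sum(f * math.log(t[w]) for w, f in freqs.items() if w in t)
                  for c, t in theta_comp.items()}
        return max(scores, key=scores.get)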

 

 

Weight-normalized Complement Naive Bayes (WCNB)

Weight-normalized Complement Naive Bayes (WCNB) adds weight normalization on top of CNB, as the name suggests.

Naive Bayes assumes that words are independent of each other. Although this simplifies the calculation, the assumption rarely holds in practice.

Suppose some words are rarely seen apart, for example "San Francisco": because "San" and "Francisco" always appear together, they contribute twice as much to the class weights as a single occurrence of either word would. In other words, the evidence for "San Francisco" is double-counted, which makes the classifier inaccurate. For example, if "Boston" appears five times in a document and "San Francisco" appears three times, MNB prefers to attribute the document to the San Francisco class (six word occurrences) rather than the Boston class (five occurrences).

One way to solve the problem is to normalize the weights, rewriting them as

w'ci = wci / Σ_k |wck|

where wci = log θ~ci is the CNB weight of word i in category c.
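
A sketch of this normalization step, reusing the complement estimates from the CNB sketch above (all names are illustrative):

    import math
    from collections import Counter

    def normalized_weights(theta_comp):
        # w_ci = log(theta_~ci); w'_ci = w_ci / sum_k |w_ck| so every class's weight vector
        # has the same total magnitude and repeated, correlated words count for less.
        weights = {}
        for c, t in theta_comp.items():
            raw = {w: math.log(p) for w, p in t.items()}
            norm = sum(abs(v) for v in raw.values())
            weights[c] = {w: v / norm for w, v in raw.items()}
        return weights

    def wcnb_classify(document, weights):
        freqs = Counter(document)
        scores = {c: -sum(f * wc[w] for w, f in freqs.items() if w in wc)
                  for c, wc in weights.items()}
        return max(scores, key=scores.get)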

 

So far, we have improved the algorithm's formulas to reduce the influence of some unreasonable assumptions and bring the results closer to the actual situation. Now we can improve the algorithm further from another angle: text modeling.

1. TFIDF

If you have used Lucene, you may have come across a concept called TFIDF. This model takes into account the proportion of documents in the entire dataset that contain a given word, so it evaluates how important a word is to a document in a collection or corpus more accurately than the raw term frequency (TF) alone. We therefore replace the raw word counts in the text model with TFIDF values.

Here we recommend using the TFIDF calculation method from Lucene:

1. TF = sqrt(freq)

2. IDF = 1 + log(numDocs / (1 + docFreq))

3. TFIDF = TF * IDF
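
A small helper following the three formulas above (a sketch that assumes the Lucene-style definitions listed here rather than calling any particular library):

    import math

    def tfidf(freq, num_docs, doc_freq):
        # TF = sqrt(freq); IDF = 1 + log(numDocs / (1 + docFreq)); TFIDF = TF * IDF
        tf = math.sqrt(freq)
        idf = 1.0 + math.log(num_docs / (1.0 + doc_freq))
        return tf * idf

    # Example: a word occurring 3 times in this document and in 10 of 1000 documents overall.
    print(tfidf(freq=3, num_docs=1000, doc_freq=10))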

 

2. Length norm

In most documents, word occurrences are internally related: if a word appears once in a document, it has a high probability of appearing again. MNB simply assumes that word occurrences are independent of each other, so the longer the document, the greater the impact of this dependency. Therefore, similarly to the weight normalization above, we can normalize by document length to reduce the impact of this dependency. The formula is

f'i = fi / sqrt( Σ_k fk^2 )
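
A minimal sketch of this length normalization (the tokenized document is an invented example):

    import math
    from collections import Counter

    def length_normalized_counts(document):
        # f'_i = f_i / sqrt(sum_k f_k^2): scale the count vector to unit length so that
        # long documents do not dominate simply because they contain more words.
        freqs = Counter(document)
        norm = math.sqrt(sum(f * f for f in freqs.values()))
        return {w: f / norm for w, f in freqs.items()}

    print(length_normalized_counts(["ball", "ball", "goal"]))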

 

 

Summary

In summary, Naive Bayes can be improved in multiple ways depending on the perspective and the focus. When selecting a model, first consider the size of the text dataset, the document length, the number of categories and their distribution, and then choose an appropriate algorithm. For concrete product-level code, see the Naive Bayes section of the Apache Mahout project, which implements both CNB and TFIDF.

 

References:

Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003).

Heckerman, D. (1995). A Tutorial on Learning with Bayesian Networks (Technical Report MSR-TR-95-06). Microsoft Research.
