Implementation and analysis of naive Bayesian algorithm based on MapReduce


1. The Naive Bayes classifier
1.1 Formula
      • Naive Bayes is a probabilistic classifier
      • The probability that document d belongs to category c is computed as follows (multinomial model):

            P(c|d) ∝ P(c) · ∏_{1≤k≤n_d} P(t_k|c)

      • n_d is the length of the document (number of tokens)
      • P(t_k|c) is the probability that term t_k occurs in documents of category c, i.e. a unigram language model of category-c documents
      • P(t_k|c) measures how much evidence t_k contributes that c is the correct category
      • P(c) is the prior probability of category c
      • If a document's terms provide no evidence about which category it belongs to, we simply pick the category with the highest prior P(c)
1.2 Category with maximum a posteriori probability

§ The objective of Naive Bayes classification is to find the "best" category

§ The best category is the one with the maximum a posteriori (MAP) probability:

      c_map = argmax_{c∈C} P̂(c|d) = argmax_{c∈C} P̂(c) · ∏_{1≤k≤n_d} P̂(t_k|c)

(the hats denote estimates from the training data)

1.3 Logarithmic computation

§ The product of many small probabilities causes floating-point underflow

§ Because log(xy) = log(x) + log(y), the product can be turned into a sum by taking logarithms

§ Because log is a monotone function, the category with the highest score does not change

Therefore, in practice one computes:

      c_map = argmax_{c∈C} [ log P̂(c) + Σ_{1≤k≤n_d} log P̂(t_k|c) ]
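The log-space scoring above can be sketched in a few lines. This is a minimal illustration, not the article's code; the class priors and term probabilities below are made-up values:

```python
import math

def log_score(log_prior, cond_probs, doc_terms):
    """Score a document for one class in log space:
    log P(c) + sum over tokens of log P(t|c)."""
    return log_prior + sum(math.log(cond_probs[t]) for t in doc_terms)

# Hypothetical smoothed parameters for two classes.
params = {
    "spam": (math.log(0.4), {"free": 0.05, "meeting": 0.01}),
    "ham":  (math.log(0.6), {"free": 0.01, "meeting": 0.05}),
}

doc = ["free", "free", "meeting"]
# Pick the class with the highest log score (argmax is unchanged by log).
best = max(params, key=lambda c: log_score(*params[c], doc))
```

Summing logarithms keeps the score in a comfortable floating-point range even for documents with thousands of tokens, where the raw product would underflow to 0.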

1.4 The zero-probability problem

If a term never appears in any training document of a category, its estimated probability for that category is p = 0, and any document containing that term then gets probability 0 for the category.

That is, once a single zero probability occurs, no other evidence can change the decision.

Workaround: add-one (Laplace) smoothing.

§ Before smoothing:

      P̂(t|c) = T_ct / Σ_{t'∈V} T_ct'

§ After smoothing: add 1 to every count

      P̂(t|c) = (T_ct + 1) / (Σ_{t'∈V} T_ct' + B)

§ B is the number of distinct terms (here, the vocabulary size |V| = B)

1.5 Two common models

Two independence assumptions of the Naive Bayes model should be mentioned: positional independence and conditional independence.

The two common models are the multinomial model and the Bernoulli model. The former uses the number of occurrences of each term; the latter only records whether a term occurs or not, i.e. Boolean 0/1 features.
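The difference between the two models comes down to how a document is turned into features. A small sketch (the document and vocabulary here are illustrative):

```python
from collections import Counter

def multinomial_features(tokens):
    """Multinomial model: the count of each term in the document."""
    return Counter(tokens)

def bernoulli_features(tokens, vocab):
    """Bernoulli model: a 0/1 presence indicator for every vocabulary
    term, including terms that do NOT occur in the document."""
    present = set(tokens)
    return {t: int(t in present) for t in vocab}

doc = "chinese chinese tokyo".split()
vocab = ["chinese", "tokyo", "japan"]
# Multinomial counts "chinese" twice; Bernoulli records japan as 0,
# so the Bernoulli model also uses the *absence* of terms as evidence.
```

Note the asymmetry: the multinomial model ignores terms that do not occur, while the Bernoulli model explicitly multiplies in a probability for each absent vocabulary term.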

1.6 Algorithm Process

Training process: count documents per category to estimate the priors P̂(c), and count term occurrences per category to estimate the conditional probabilities P̂(t|c) (with add-one smoothing).

Test/application/classification: for each category, sum log P̂(c) and the log P̂(t_k|c) of the document's terms, and output the category with the highest score.
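The whole train-then-classify process can be sketched as follows. This is a single-machine illustration of the multinomial model with add-one smoothing, not the article's MapReduce code:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (category, tokens) pairs. Returns the priors,
    per-category term counts, and the vocabulary (multinomial model)."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for c, tokens in docs:
        class_docs[c] += 1
        term_counts[c].update(tokens)
    n = sum(class_docs.values())
    priors = {c: cnt / n for c, cnt in class_docs.items()}
    vocab = {t for counts in term_counts.values() for t in counts}
    return priors, term_counts, vocab

def classify(priors, term_counts, vocab, tokens):
    """Pick argmax_c [log P(c) + sum log P(t|c)], with add-one smoothing:
    P(t|c) = (T_ct + 1) / (total count in c + B), B = vocabulary size."""
    b = len(vocab)
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(term_counts[c].values())
        score = math.log(prior) + sum(
            math.log((term_counts[c].get(t, 0) + 1) / (total + b))
            for t in tokens)
        if score > best_score:
            best, best_score = c, score
    return best
```

Unseen terms get the smoothed probability 1/(total + B) rather than 0, which is exactly the fix from §1.4.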

1.7 Example

Step 1: parameter estimation:

Step 2: classification:

The classifier therefore assigns the test document d5 to the class c = China, because the three occurrences of the positive indicator Chinese in d5 outweigh the negative indicators Japan and Tokyo.
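The figures with the actual numbers are missing from this copy of the page. The names that survive (d5, Chinese, Japan, Tokyo) match the standard worked example from Manning et al.'s Introduction to Information Retrieval; assuming that training set (three China documents, one non-China document, vocabulary size B = 6), the numbers can be checked exactly:

```python
from fractions import Fraction as F

# Assumed training data (standard textbook example):
#   China:     "Chinese Beijing Chinese", "Chinese Chinese Shanghai",
#              "Chinese Macao"            (8 tokens total)
#   not-China: "Tokyo Japan Chinese"      (3 tokens total)
# Vocabulary size B = 6.
prior_c, prior_not = F(3, 4), F(1, 4)

# Add-one-smoothed conditional probabilities, (T_ct + 1) / (total + B):
p_chinese_c = F(5 + 1, 8 + 6)                              # 3/7
p_tokyo_c = p_japan_c = F(0 + 1, 8 + 6)                    # 1/14
p_chinese_not = p_tokyo_not = p_japan_not = F(1 + 1, 3 + 6)  # 2/9

# Test document d5 = "Chinese Chinese Chinese Tokyo Japan"
score_c   = prior_c   * p_chinese_c**3   * p_tokyo_c   * p_japan_c
score_not = prior_not * p_chinese_not**3 * p_tokyo_not * p_japan_not
# score_c ≈ 0.0003 > score_not ≈ 0.0001, so d5 is classified as China.
```

Exact rational arithmetic via `fractions` avoids any rounding question; in a real classifier one would use the log-space sums from §1.3 instead.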

2. Parallel implementation based on MapReduce

There are two stages: first train to obtain the classifier, then predict.

File input format: each line represents one document, in the format: category name + file name + text content.

2.1 Training Stage

The training stage computes two sets of probabilities: [1] the prior probability of each category, and [2] the conditional probability of each term (word) in each category.

The multinomial model is used for the conditional probabilities.

This is done with two MapReduce jobs; the pseudocode is as follows:

These two pieces of pseudocode are from Dongxicheng's blog; the original address is http://dongxicheng.org/data-mining/naive-bayes-in-hadoop/
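The pseudocode itself did not survive in this copy of the page. As a hedged sketch of what the two jobs typically look like (Hadoop-Streaming-style mappers and reducers over the line format described above; this is not the blog's original pseudocode):

```python
from collections import defaultdict

def prior_mapper(line):
    """Job 1 map: emit (category, 1) once per document line.
    The input line format is: category \t filename \t text."""
    category, _filename, _text = line.split("\t", 2)
    yield category, 1

def term_mapper(line):
    """Job 2 map: emit ((category, term), 1) once per token."""
    category, _filename, text = line.split("\t", 2)
    for term in text.split():
        yield (category, term), 1

def sum_reducer(pairs):
    """Shared reduce: sum the counts for each key. Job 1's output gives
    the category priors; job 2's gives the smoothed conditionals."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)
```

Both jobs are plain word-count patterns, which is why Naive Bayes training parallelizes so naturally: the mappers only see one line at a time, and the reducers only add.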

2.2 Test phase

Load the model produced by the training stage into memory, compute the probability of the document under each category, and output the category with the greatest probability.

3. Analysis of the MapReduce implementation

Test data: Sogou Lab, http://www.sogou.com/labs/resources.html?v=1

The first step is to convert all the documents into the required text format, where one line represents one news article.

Training set: 75,000 news articles; test set: 5,000 news articles.

The final measured accuracy was 82%.

Reference: Dongxicheng http://dongxicheng.org/data-mining/naive-bayes-in-hadoop/

Simandou XIAOP

Source: http://www.cnblogs.com/panweishadow/

For non-commercial purposes, you are free to reprint, but please retain the original author information and article link URL.

