Implementation and analysis of naive Bayesian algorithm based on MapReduce


1. The Naive Bayes classifier
1.1 Formula
      • Naive Bayes is a probabilistic classifier
      • The probability that document d belongs to category c is computed as follows (multinomial model):

            P(c|d) ∝ P(c) · ∏_{1≤k≤n_d} P(t_k|c)

      • n_d is the length of the document (number of tokens)
      • P(t_k|c) is the probability that term t_k occurs in documents of category c, i.e. a unigram language model of category-c documents
      • P(t_k|c) measures how much evidence t_k contributes that c is the correct category
      • P(c) is the prior probability of category c
      • If a document's terms provide no evidence about which category it belongs to, we simply pick the category with the highest prior P(c)
1.2 Category with maximum a posteriori probability

§ The objective of Naive Bayes classification is to find the "best" category

§ The best category is the one with the maximum a posteriori (MAP) probability:

      c_map = argmax_{c∈C} P̂(c|d) = argmax_{c∈C} P̂(c) · ∏_{1≤k≤n_d} P̂(t_k|c)

(the hats denote estimates from the training data)

1.3 Logarithmic computation

§ The product of many small probabilities causes floating-point underflow

§ Because log(xy) = log(x) + log(y), the product can be turned into a sum by taking logarithms

§ Because log is a monotone function, the category with the highest score does not change

Therefore, in practice one computes:

      c_map = argmax_{c∈C} [ log P̂(c) + Σ_{1≤k≤n_d} log P̂(t_k|c) ]
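The log-space scoring above can be sketched in a few lines. This is a minimal illustration, not the article's code; the class priors and term probabilities below are made-up values:

```python
import math

def log_score(log_prior, cond_probs, doc_terms):
    """Score a document for one class in log space:
    log P(c) + sum over tokens of log P(t|c)."""
    return log_prior + sum(math.log(cond_probs[t]) for t in doc_terms)

# Hypothetical smoothed parameters for two classes.
params = {
    "spam": (math.log(0.4), {"free": 0.05, "meeting": 0.01}),
    "ham":  (math.log(0.6), {"free": 0.01, "meeting": 0.05}),
}

doc = ["free", "free", "meeting"]
# Pick the class with the highest log score (argmax is unchanged by log).
best = max(params, key=lambda c: log_score(*params[c], doc))
```

Summing logarithms keeps the score in a comfortable floating-point range even for documents with thousands of tokens, where the raw product would underflow to 0.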

1.4 The zero-probability problem

If a term never appears in any training document of a category, its estimated probability for that category is p = 0, and any document containing that term then gets probability 0 for the category.

That is, once a single zero probability occurs, no other evidence can change the decision.

Workaround: add-one (Laplace) smoothing.

§ Before smoothing:

      P̂(t|c) = T_ct / Σ_{t'∈V} T_ct'

§ After smoothing: add 1 to every count

      P̂(t|c) = (T_ct + 1) / (Σ_{t'∈V} T_ct' + B)

§ B is the number of distinct terms (here, the vocabulary size |V| = B)

1.5 Two common models

Two independence assumptions of the Naive Bayes model should be mentioned: positional independence and conditional independence.

The two common models are the multinomial model and the Bernoulli model. The former uses the number of occurrences of each term; the latter only records whether a term occurs or not, i.e. Boolean 0/1 features.
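The difference between the two models comes down to how a document is turned into features. A small sketch (the document and vocabulary here are illustrative):

```python
from collections import Counter

def multinomial_features(tokens):
    """Multinomial model: the count of each term in the document."""
    return Counter(tokens)

def bernoulli_features(tokens, vocab):
    """Bernoulli model: a 0/1 presence indicator for every vocabulary
    term, including terms that do NOT occur in the document."""
    present = set(tokens)
    return {t: int(t in present) for t in vocab}

doc = "chinese chinese tokyo".split()
vocab = ["chinese", "tokyo", "japan"]
# Multinomial counts "chinese" twice; Bernoulli records japan as 0,
# so the Bernoulli model also uses the *absence* of terms as evidence.
```

Note the asymmetry: the multinomial model ignores terms that do not occur, while the Bernoulli model explicitly multiplies in a probability for each absent vocabulary term.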

1.6 Algorithm Process

Training process: count documents per category to estimate the priors P̂(c), and count term occurrences per category to estimate the conditional probabilities P̂(t|c) (with add-one smoothing).

Test/application/classification: for each category, sum log P̂(c) and the log P̂(t_k|c) of the document's terms, and output the category with the highest score.
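The whole train-then-classify process can be sketched as follows. This is a single-machine illustration of the multinomial model with add-one smoothing, not the article's MapReduce code:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (category, tokens) pairs. Returns the priors,
    per-category term counts, and the vocabulary (multinomial model)."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for c, tokens in docs:
        class_docs[c] += 1
        term_counts[c].update(tokens)
    n = sum(class_docs.values())
    priors = {c: cnt / n for c, cnt in class_docs.items()}
    vocab = {t for counts in term_counts.values() for t in counts}
    return priors, term_counts, vocab

def classify(priors, term_counts, vocab, tokens):
    """Pick argmax_c [log P(c) + sum log P(t|c)], with add-one smoothing:
    P(t|c) = (T_ct + 1) / (total count in c + B), B = vocabulary size."""
    b = len(vocab)
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(term_counts[c].values())
        score = math.log(prior) + sum(
            math.log((term_counts[c].get(t, 0) + 1) / (total + b))
            for t in tokens)
        if score > best_score:
            best, best_score = c, score
    return best
```

Unseen terms get the smoothed probability 1/(total + B) rather than 0, which is exactly the fix from §1.4.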

1.7 Example

Step 1: parameter estimation:

Step 2: classification:

The classifier therefore assigns the test document d5 to the class c = China, because the three occurrences of the positive indicator Chinese in d5 outweigh the negative indicators Japan and Tokyo.
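The figures with the actual numbers are missing from this copy of the page. The names that survive (d5, Chinese, Japan, Tokyo) match the standard worked example from Manning et al.'s Introduction to Information Retrieval; assuming that training set (three China documents, one non-China document, vocabulary size B = 6), the numbers can be checked exactly:

```python
from fractions import Fraction as F

# Assumed training data (standard textbook example):
#   China:     "Chinese Beijing Chinese", "Chinese Chinese Shanghai",
#              "Chinese Macao"            (8 tokens total)
#   not-China: "Tokyo Japan Chinese"      (3 tokens total)
# Vocabulary size B = 6.
prior_c, prior_not = F(3, 4), F(1, 4)

# Add-one-smoothed conditional probabilities, (T_ct + 1) / (total + B):
p_chinese_c = F(5 + 1, 8 + 6)                              # 3/7
p_tokyo_c = p_japan_c = F(0 + 1, 8 + 6)                    # 1/14
p_chinese_not = p_tokyo_not = p_japan_not = F(1 + 1, 3 + 6)  # 2/9

# Test document d5 = "Chinese Chinese Chinese Tokyo Japan"
score_c   = prior_c   * p_chinese_c**3   * p_tokyo_c   * p_japan_c
score_not = prior_not * p_chinese_not**3 * p_tokyo_not * p_japan_not
# score_c ≈ 0.0003 > score_not ≈ 0.0001, so d5 is classified as China.
```

Exact rational arithmetic via `fractions` avoids any rounding question; in a real classifier one would use the log-space sums from §1.3 instead.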

2. Parallel implementation based on MapReduce

There are two stages: first train to obtain the classifier, then predict.

File input format: each line represents one document, in the format: category name + file name + text content.

2.1 Training Stage

The training stage computes two sets of probabilities: [1] the prior probability of each category, and [2] the conditional probability of each term (word) in each category.

The multinomial model is used for the conditional probabilities.

This is done with two MapReduce jobs; the pseudocode is as follows:

These two pieces of pseudocode are from Dongxicheng's blog; the original address is http://dongxicheng.org/data-mining/naive-bayes-in-hadoop/
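The pseudocode itself did not survive in this copy of the page. As a hedged sketch of what the two jobs typically look like (Hadoop-Streaming-style mappers and reducers over the line format described above; this is not the blog's original pseudocode):

```python
from collections import defaultdict

def prior_mapper(line):
    """Job 1 map: emit (category, 1) once per document line.
    The input line format is: category \t filename \t text."""
    category, _filename, _text = line.split("\t", 2)
    yield category, 1

def term_mapper(line):
    """Job 2 map: emit ((category, term), 1) once per token."""
    category, _filename, text = line.split("\t", 2)
    for term in text.split():
        yield (category, term), 1

def sum_reducer(pairs):
    """Shared reduce: sum the counts for each key. Job 1's output gives
    the category priors; job 2's gives the smoothed conditionals."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)
```

Both jobs are plain word-count patterns, which is why Naive Bayes training parallelizes so naturally: the mappers only see one line at a time, and the reducers only add.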

2.2 Test phase

Load the model produced by the training stage into memory, compute the probability of the document under each category, and output the category with the greatest probability.

3. Analysis of the MapReduce implementation

Test data: Sogou Lab, http://www.sogou.com/labs/resources.html?v=1

The first step is to convert all the documents into the required text format, where one line represents one news article.

Training set: 75,000 news articles; test set: 5,000 news articles.

The final measured accuracy was 82%.

Reference: Dongxicheng http://dongxicheng.org/data-mining/naive-bayes-in-hadoop/

Simandou XIAOP

Source: http://www.cnblogs.com/panweishadow/

For non-commercial purposes, you are free to reprint, but please retain the original author information and article link URL.

