Mining Algorithm (1): The Naive Bayes Algorithm


Original: http://www.blogchong.com/post/NaiveBayes.html

1 Document Description

This document is an introduction to and analysis of the naive Bayes algorithm, explained in detail with an application example.

The concept and procedure of naive Bayes have been written about countless times; the reason for writing this up is simply to organize the material and keep it as a memo. The example section, however, is described in detail, because the examples found online tend to be simple and omit the full working.

The last part is an extension of naive Bayes; it only sketches the process, which involves Chinese word segmentation and the TF-IDF algorithm. The details will be filled in when time permits.

2 Algorithm Introduction

2.1 Bayes' theorem

(1) The problem: given one conditional probability, how do we obtain the conditional probability with the two events exchanged? That is, knowing P(A|B), how do we obtain P(B|A)?

(2) Conditional probability: P(A|B) denotes the probability that event A occurs given that event B has occurred; it is called the conditional probability of A given B.

(3) Basic formula: P(A|B) = P(AB) / P(B), where P(AB) is the probability that A and B occur simultaneously.

(4) Bayes' theorem: P(B|A) = P(A|B) × P(B) / P(A). (A quick numerical check follows.)
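
To see the theorem in action, here is a minimal numerical check in Python. The probability values below are made-up illustrative numbers, not anything from this article; P(A) is obtained by total probability and P(B|A) then follows from Bayes' theorem.

# Quick numerical check of Bayes' theorem with made-up probabilities.
p_b = 0.30              # P(B), arbitrary illustrative value
p_a_given_b = 0.60      # P(A|B)
p_a_given_not_b = 0.20  # P(A|~B)
# Total probability: P(A) = P(A|B)P(B) + P(A|~B)P(~B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)      # 0.18 / 0.32 = 0.5625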

2.2 Naive Bayes algorithm

2.2.1 The idea behind naive Bayes

For a given item to be classified, compute the probability of each category conditioned on that item; the category with the largest probability is the one the item is assigned to.

So the fundamental idea of the naive Bayes classifier is: using the class (prior) probabilities and the conditional probabilities, compute the posterior probability of each category, compare them, and assign the item to the category with the largest posterior.

2.2.2 Naive Bayes classification process

(1) Let x = {a1, a2, ..., am} be an item to be classified, where each aj is a feature attribute of x.

(2) Let the category set be C = {y1, y2, ..., yn}.

(3) Compute P(y1|x), P(y2|x), ..., P(yn|x). These are the posterior probabilities.

(4) If P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, then x belongs to yk.

The key point is computing the conditional probabilities used in step (3):

1) Collect a set of items whose categories are already known; this is called the training sample set.

2) Estimate from the training data the conditional probability of each feature attribute under each category:

P(a1|y1), P(a2|y1), ..., P(am|y1); P(a1|y2), P(a2|y2), ..., P(am|y2); ...; P(a1|yn), P(a2|yn), ..., P(am|yn)

The posterior probability is then obtained from these conditional probabilities together with the class (prior) probabilities.

3) Assuming the feature attributes are conditionally independent given the class, Bayes' theorem gives:

P(yi|x) = P(x|yi)P(yi) / P(x) // in practice only P(x|yi)P(yi) is computed, because P(x) is the same for every category

4) Since the denominator is constant for all categories, it suffices to maximize the numerator. And because the feature attributes are conditionally independent given the class:

P(x|yi)P(yi) = P(a1|yi)P(a2|yi) ... P(am|yi)P(yi) = P(yi)∏j P(aj|yi) // the posterior score is the prior times the product of the conditional probabilities (a minimal scoring sketch follows)
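
To make the scoring rule concrete, here is a minimal Python sketch (not from the original post) that multiplies a prior by the per-feature conditional probabilities and takes the argmax. The probability tables are placeholders for whatever the training stage produced.

# Score each class by P(y) * prod_j P(a_j|y) and take the argmax.
# The probability tables below are placeholders, not values from this article.
priors = {"y1": 0.5, "y2": 0.5}
cond = {                                  # cond[y][a] = P(feature value a | class y)
    "y1": {"a1": 0.2, "a2": 0.7},
    "y2": {"a1": 0.6, "a2": 0.1},
}

def classify(features, priors, cond):
    scores = {}
    for y, p_y in priors.items():
        score = p_y
        for a in features:
            score *= cond[y].get(a, 0.0)  # P(a|y); 0 if never seen for this class
        scores[y] = score                 # unnormalized posterior (P(x) dropped)
    return max(scores, key=scores.get), scores

print(classify(["a1", "a2"], priors, cond))
# y1: 0.5*0.2*0.7 = 0.07; y2: 0.5*0.6*0.1 = 0.03 -> "y1"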

2.2.3 Naive Bayes implementation process

As you can see, the entire naive Bayes classification is divided into three stages:

First stage -- preparation phase:

The task of this stage is to make the necessary preparations for naive Bayes classification. The main work is to determine the feature attributes according to the concrete situation, divide each feature attribute into suitable value ranges, and manually label a portion of the items to be classified, forming the training sample set. The input of this stage is all the data to be classified; the output is the feature attributes and the training samples.

This is the only stage of the whole naive Bayes classification that must be completed manually, and its quality has an important influence on the whole process: the quality of the classifier is determined to a great extent by the choice of feature attributes, how the feature attributes are divided, and the quality of the training samples.

Second stage -- classifier training phase:

The task of this stage is to generate the classifier. The main work is to compute the frequency of each category in the training samples and the conditional probability estimate of each feature attribute division under each category, and to record the results.

The input is the feature attributes and the training samples; the output is the classifier. This stage is mechanical and, according to the formulas discussed above, can be completed automatically by a program; a minimal sketch of this counting step follows.
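
The training stage really is just counting. A minimal sketch, assuming a small list of already-labeled samples (placeholders, not real data): class frequencies give the priors, and per-class feature-value frequencies give the conditional probability estimates, which could then be fed to a scoring function like the one sketched in 2.2.2.

from collections import defaultdict

# Placeholder labeled training samples: (list of feature values, class label)
samples = [
    (["a00", "a10"], "y1"),
    (["a00", "a12"], "y2"),
    (["a01", "a12"], "y2"),
]

class_count = defaultdict(int)                          # class -> number of samples
feature_count = defaultdict(lambda: defaultdict(int))   # class -> feature value -> count
for features, label in samples:
    class_count[label] += 1
    for f in features:
        feature_count[label][f] += 1

total = len(samples)
priors = {y: n / total for y, n in class_count.items()}             # P(y)
cond = {y: {f: c / class_count[y] for f, c in fc.items()}           # P(f|y)
        for y, fc in feature_count.items()}
print(priors)   # {'y1': 0.333..., 'y2': 0.666...}
print(cond)     # e.g. cond['y2']['a12'] == 1.0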

Third stage -- application phase:

The task of this stage is to classify the items to be classified using the classifier. Its input is the classifier and the items to be classified; its output is the mapping between items and categories. This stage is also mechanical and is completed by the program.

2.3 Supplement: conditional probabilities for continuous attributes

When computing the conditional probability of a feature attribute, if the attribute takes continuous rather than discrete values, it is usually assumed to follow a normal (Gaussian) distribution, and the conditional probability can be computed as:

P(ak|yi) = g(ak, μ, σ), where g(x, μ, σ) = 1 / (√(2π) σ) × exp(−(x − μ)² / (2σ²))

Note:

(1) μ is the mean of the feature attribute's values within class yi;

(2) σ² is the variance of the feature attribute's values within class yi;

(3) P(ak|yi) is the resulting conditional probability. (A small code sketch follows.)
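
For a continuous attribute, the density above can be evaluated directly. A minimal sketch, assuming the per-class mean and variance have already been estimated from the training samples (the numbers here are purely illustrative):

import math

def gaussian_cond_prob(x, mu, sigma2):
    # P(a_k = x | y_i) under a normal distribution with class mean mu and variance sigma2.
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Example: feature value 1.7 for a class whose training values have mean 1.6, variance 0.04.
# Note this is a density, so it can legitimately exceed 1.
print(gaussian_cond_prob(1.7, mu=1.6, sigma2=0.04))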

2.4 Algorithm Analysis

The naive Bayes classifier is an important method of supervised classification. Although it is simple, it still accounts for a large share of practical applications today, and it has the following characteristics:

(1) It is easy to construct: estimating the model parameters requires no complicated iterative solving, so the method suits large-scale data sets;

(2) It is easy to understand, so even users unfamiliar with classification techniques can operate the method;

(3) It classifies well: even when it is not the best classification method for a given application, it is usually a robust choice.

3 Application Example

// Patient classification example

3.1 Example Description

We have a number of case records; from the symptoms and occupation, infer the type of illness a patient has.

Case table (symptom, occupation, disease, count):

Sneeze    Nurse    Cold        18
Sneeze    Teacher  Cold        12
Sneeze    Farmer   Cold         7
Sneeze    Worker   Cold         3
Sneeze    Nurse    Allergy      1
Sneeze    Farmer   Allergy     13
Sneeze    Worker   Allergy     16
Sneeze    Farmer   Concussion   3
Sneeze    Worker   Concussion   7
Headache  Worker   Cold        10
Headache  Teacher  Cold        12
Headache  Farmer   Cold         5
Headache  Nurse    Cold        13
Headache  Worker   Allergy      8
Headache  Nurse    Allergy      2
Headache  Farmer   Allergy     10
Headache  Teacher  Concussion   1
Headache  Farmer   Concussion   4
Headache  Worker   Concussion  15

Total number of people: 160

Classification task:

Suppose a new patient has the symptom "sneeze" and the occupation "worker". Which type of illness is most likely?

3.2 Naive Bayes classification process

3.2.1 Preparation phase

Determine the feature attributes:

The analysis is as follows: the diseases form the category set, the patient to be diagnosed is the item to be classified (x), and symptom and occupation are the feature attributes.

The feature attributes are assumed to be conditionally independent of each other.

Obtain the training samples:

The case table above is the training sample set.

3.2.2 Classifier Training Phase

Compute P(yi) for each category:

Category division:

C = {C0: cold; C1: allergy; C2: concussion} // counts: C0 = 80, C1 = 50, C2 = 30

The class probabilities (priors) are therefore:

P(C0) = 80/160 = 1/2; P(C1) = 50/160 = 5/16; P(C2) = 30/160 = 3/16

Compute the conditional probability of every feature attribute value under each category:

Feature attributes:

Symptom: a0 = {a00: sneeze; a01: headache}

Occupation: a1 = {a10: nurse; a11: farmer; a12: worker; a13: teacher}

Symptom conditional probabilities:

P(a00|C0) = 1/2; P(a01|C0) = 1/2

P(a00|C1) = 3/5; P(a01|C1) = 2/5

P(a00|C2) = 1/3; P(a01|C2) = 2/3

Occupation conditional probabilities:

P(a10|C0) = 31/80; P(a11|C0) = 12/80; P(a12|C0) = 13/80; P(a13|C0) = 24/80

P(a10|C1) = 3/50; P(a11|C1) = 23/50; P(a12|C1) = 24/50; P(a13|C1) = 0

P(a10|C2) = 0; P(a11|C2) = 7/30; P(a12|C2) = 22/30; P(a13|C2) = 1/30

3.2.3 Application Phase

Apply the naive Bayes classifier:

The patient's feature attributes:

a00 (sneeze); a12 (worker)

Classifier computation (unnormalized posterior probabilities):

C0 (cold): P(C0)P(x|C0) = P(C0)P(a00|C0)P(a12|C0) = (1/2) × (1/2) × (13/80) = 13/320 = 0.040625

C1 (allergy): P(C1)P(x|C1) = P(C1)P(a00|C1)P(a12|C1) = (5/16) × (3/5) × (24/50) = 9/100 = 0.09

C2 (concussion): P(C2)P(x|C2) = P(C2)P(a00|C2)P(a12|C2) = (3/16) × (1/3) × (22/30) = 11/240 ≈ 0.0458

Conclusion:

The posterior for allergy (C1) is the largest, so the classifier assigns this patient to the allergy class. (The full calculation is reproduced in the sketch below.)
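
The whole worked example fits in a few lines of Python. This sketch is not part of the original post; it builds the counts from the case table in 3.1, estimates the priors and conditional probabilities, scores the patient (sneeze, worker), and reproduces 13/320, 9/100 and 11/240.

from collections import defaultdict

# (symptom, occupation, disease, count), copied from the case table in 3.1
data = [
    ("sneeze", "nurse", "cold", 18), ("sneeze", "teacher", "cold", 12),
    ("sneeze", "farmer", "cold", 7), ("sneeze", "worker", "cold", 3),
    ("sneeze", "nurse", "allergy", 1), ("sneeze", "farmer", "allergy", 13),
    ("sneeze", "worker", "allergy", 16), ("sneeze", "farmer", "concussion", 3),
    ("sneeze", "worker", "concussion", 7), ("headache", "worker", "cold", 10),
    ("headache", "teacher", "cold", 12), ("headache", "farmer", "cold", 5),
    ("headache", "nurse", "cold", 13), ("headache", "worker", "allergy", 8),
    ("headache", "nurse", "allergy", 2), ("headache", "farmer", "allergy", 10),
    ("headache", "teacher", "concussion", 1), ("headache", "farmer", "concussion", 4),
    ("headache", "worker", "concussion", 15),
]

total = sum(n for *_, n in data)          # 160 patients in all
class_count = defaultdict(int)            # disease -> count
feat_count = defaultdict(int)             # (disease, feature value) -> count
for symptom, job, disease, n in data:
    class_count[disease] += n
    feat_count[(disease, symptom)] += n
    feat_count[(disease, job)] += n

def score(disease, features):
    # Unnormalized posterior: P(disease) * prod P(feature | disease)
    s = class_count[disease] / total
    for f in features:
        s *= feat_count[(disease, f)] / class_count[disease]
    return s

for disease in class_count:
    print(disease, score(disease, ["sneeze", "worker"]))
# cold 0.040625, allergy 0.09, concussion 0.04583... -> allergy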

3.3 Example Additions

(1) The original example had only 6 samples; the author expanded it to 160, so the constructed counts may contain errors. The example is only meant to show how to classify with a naive Bayes classifier;

(2) Through this example, the reader should grasp how a naive Bayes classifier is used in practice;

(3) The larger the training sample and the more reasonable the attribute division, the more accurate the classification;

(4) The attribute-value frequencies are obtained by simple counting over the training data;

(5) Other well-known applications: account (fraud) detection, gender classification, and spam filtering.

4 MapReduce Implementation

If a naive Bayes classifier implemented with MapReduce is used for a case like the one above, the process is relatively simple: a single MapReduce job over the training samples yields the classifier. In reality, however, the samples provided are often far messier, and the feature attributes may not even be defined yet.

Take text classification as an example: the feature attributes of a text sample are individual words. An English sample can simply be split on whitespace, but a Chinese sample requires Chinese word segmentation, which is itself a complex process.

4.1 Text Categorization Process

Take Weibo (microblog) posts as the samples:

Chinese word segmentation can be handled with the ICTCLAS segmenter;

The weight of a feature word wk is judged by its TF-IDF value (TF-IDF algorithm).

(1) The first MapReduce job counts, over the training samples (Weibo posts): the total number of documents, the number of documents containing each feature word wk (the documents are split with the ICTCLAS Chinese segmenter), and the TF value (term frequency) of each feature word in each class;

It outputs several values:

1) the TF value of feature word wk in class cj; // TF = occurrences of the feature word / total number of words in the document

2) the number of documents in class cj in which wk appears;

3) the total number of documents in the training sample in which wk appears;

4) the total number of documents in the training sample;

(2) The second MapReduce job computes the TF-IDF values (TF-IDF = TFik × log(N/nk), where N is the total number of documents and nk is the number of documents containing wk) from the output of the first job; concretely, the map side computes IDF and the reduce side computes TF × IDF;

It outputs:

1) the TF-IDF value of each feature word in each class;

(3) The third MapReduce job takes the document to be classified (a Weibo post) as input, extracts its feature words, computes the posterior probability of each class using the TF-IDF values from the second job, and assigns the class with the maximum posterior. A single-machine sketch of the TF-IDF computation follows.
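
The three jobs are easiest to understand on one machine first. The sketch below is plain Python, not the author's MapReduce code: it computes TF and IDF over a tiny corpus of already-segmented documents (the toy documents and names are illustrative) and combines them as TF-IDF = TF × log(N/nk).

import math
from collections import Counter

# Toy corpus of already-segmented documents (word segmentation, e.g. with ICTCLAS
# for Chinese text, is assumed to have happened upstream).
segmented_docs = [
    ["cloud", "compute", "hadoop"],
    ["hadoop", "mapreduce", "sample"],
    ["naive", "bayes", "sample", "sample"],
]

N = len(segmented_docs)          # total number of documents
doc_freq = Counter()             # n_k: number of documents containing word k
for doc in segmented_docs:
    doc_freq.update(set(doc))

def tfidf(doc):
    # TF-IDF for each word in one document: (count / doc length) * log(N / n_k)
    counts = Counter(doc)
    return {w: (c / len(doc)) * math.log(N / doc_freq[w]) for w, c in counts.items()}

for doc in segmented_docs:
    print(tfidf(doc))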

4.2 Implementation Supplement

(1) In actual processing, the granularity of the ICTCLAS segmentation must be controlled so that feature words are extracted accurately (and interfering words eliminated);

(2) The TF-IDF algorithm can be improved and adapted in practical applications;

(3) By relying on MapReduce and Hadoop to process large batches of samples, training can be completed quickly.

5 Document Summary

This document is mainly a memo, but it would be nice if it proves useful to others as well. Interested readers can look into ICTCLAS Chinese word segmentation and the TF-IDF algorithm used in text classification.

In addition, the author's level is limited, so corrections of any problems in the text are welcome. Let's explore and learn together!
