I. The mathematical basis: Bayes' theorem
We all know the mathematical form of conditional probability:

P(A|B) = P(A∩B) / P(B)

that is, the probability that A occurs given that B occurs equals the probability that A and B occur at the same time, divided by the probability of B.
From this formula the Bayesian formula is obtained:

P(B|A) = P(A|B) · P(B) / P(A)

Bayes' law is a law about the conditional (and marginal) probabilities of random events A and B. In general, the probability that event A occurs given event B is not the same as the probability that event B occurs given event A, and Bayes' law describes the relationship between the two.
Going further, the Bayesian formula can be generalized. Suppose the occurrence of event A is determined by a set of mutually exclusive events (B1, B2, B3, ..., Bn) that together cover all cases; then the total probability formula for event A is:

P(A) = P(A|B1)·P(B1) + P(A|B2)·P(B2) + ... + P(A|Bn)·P(Bn)
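Combining the two: when B1, B2, ..., Bn form such a complete set of mutually exclusive events, Bayes' law can be written with the total probability formula as the denominator (a standard identity, stated here for reference; it is the form used for classification in the next section):

P(Bi|A) = P(A|Bi)·P(Bi) / (P(A|B1)·P(B1) + P(A|B2)·P(B2) + ... + P(A|Bn)·P(Bn))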
II. Naive Bayesian classification
Naive Bayesian classification is a very simple classification algorithm. Its underlying idea: for a given item to be classified, compute the probability of each category occurring under the condition of this item, and assign the item to the category for which this probability is largest.
Suppose V = (v1, v2, v3, ..., vn) is an item to be classified, where each vj is one of V's feature attributes;
B = (B1, B2, B3, ..., Bn) is the set of categories, where each Bi is one specific category.
To decide which category in B the item V belongs to, compute P(Bi|V) for every category, that is, determine which of B1, B2, B3, ..., Bn is most probable given V. The result is the category Bi for which P(Bi|V) is largest.
The question is therefore converted to: for each category in the set, what is the probability of that category given the item to be classified? These probabilities can be computed using Bayes' law. After transformation:

P(Bi|V) = P(V|Bi) · P(Bi) / P(V)

Since P(V) is the same for every category, only the numerators need to be compared; and under the naive assumption that the feature attributes are conditionally independent given the category,

P(V|Bi) = P(v1|Bi) · P(v2|Bi) · ... · P(vn|Bi)
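To illustrate the decision rule, here is a minimal self-contained Scala sketch (the names are hypothetical, and this is not MLlib code, which is covered in the next section). It classifies a 0/1 feature vector by comparing log P(Bi) plus the summed log conditional probabilities of the active features; dropping P(V) does not change the argmax:

object NaiveBayesSketch {
  // Hypothetical model: a log prior per category and, per category,
  // a log conditional probability per feature.
  case class Model(logPrior: Map[String, Double],
                   logCond: Map[String, Array[Double]])

  // Pick the category maximizing log P(Bi) + sum over active features j
  // of log P(vj | Bi). For simplicity only features with value 1 contribute.
  def classify(model: Model, v: Array[Int]): String =
    model.logPrior.keys.maxBy { cat =>
      val cond = model.logCond(cat)
      model.logPrior(cat) + v.indices.filter(j => v(j) == 1).map(j => cond(j)).sum
    }
}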
III. The corresponding MLlib API
1. The Bayesian classification companion object NaiveBayes; prototype:
object NaiveBayes extends scala.AnyRef with scala.Serializable {
  def train(input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]): org.apache.spark.mllib.classification.NaiveBayesModel = { /* compiled code */ }
  def train(input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], lambda: scala.Double): org.apache.spark.mllib.classification.NaiveBayesModel = { /* compiled code */ }
}
It mainly defines the train methods for training a Bayesian classification model, where input is the RDD of training samples and lambda is the smoothing-factor parameter.
2. The train method. It is a static method of the NaiveBayes companion object; based on the given parameters it creates a naive Bayesian classifier instance and executes its run method to train the model.
3. The naive Bayesian classification class NaiveBayes; prototype:
class NaiveBayes private (private var lambda: scala.Double) extends scala.AnyRef with scala.Serializable with org.apache.spark.Logging {
  def this() = { /* compiled code */ }
  def setLambda(lambda: scala.Double): org.apache.spark.mllib.classification.NaiveBayes = { /* compiled code */ }
  def run(data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]): org.apache.spark.mllib.classification.NaiveBayesModel = { /* compiled code */ }
}
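As a usage sketch (assuming an RDD[LabeledPoint] named input is already loaded; the lambda values here are arbitrary), both entry points lead to the same training routine:

// Static train methods on the companion object
val modelA = NaiveBayes.train(input)               // default smoothing, lambda = 1.0
val modelB = NaiveBayes.train(input, lambda = 0.5)
// Equivalent builder-style call on the NaiveBayes class itself
val modelC = new NaiveBayes().setLambda(0.5).run(input)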
4. The run method. This method mainly computes the prior probabilities and the conditional probabilities. First, all sample data are aggregated by label key, so the features of samples carrying the same label are combined; this yields the statistics for every label (its sample count and the sum of its features). From these label statistics, pi(i) and theta(i)(j) are then calculated. Finally, a Bayesian model is generated from the category label list, the class prior probabilities, and the conditional probabilities of each feature under each category. A simplified sketch of this computation follows the two thetaLogDenom cases below.
The prior probability, with the logarithm applied:

pi(i) = log(P(yi)) = log((count of class i + smoothing factor) / (total count + number of classes × smoothing factor))
The conditional probability of each feature attribute, with the logarithm applied:

theta(i)(j) = log(P(aj|yi)) = log(sumTermFreqs(j) + smoothing factor) - thetaLogDenom
where theta(i)(j) is the log conditional probability of feature j under category i, sumTermFreqs(j) is the number of occurrences of feature j in category i, and thetaLogDenom generally falls into two cases, as follows:
1. Multinomial model

thetaLogDenom = log(sumTermFreqs.values.sum + numFeatures × lambda)

where sumTermFreqs.values.sum is the total feature count of category i, numFeatures is the number of features, and lambda is the smoothing factor.
2. Bernoulli model

thetaLogDenom = log(n + 2.0 × lambda)

where n is the number of samples in category i.
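Here is a simplified sketch of the computation just described (plain Scala over pre-aggregated statistics; names such as LabelStat and computePiTheta are our own, not MLlib's internal code):

// Per label: the sample count n and the per-feature sums.
case class LabelStat(label: Double, n: Long, sumTermFreqs: Array[Double])

def computePiTheta(stats: Seq[LabelStat], lambda: Double,
                   bernoulli: Boolean): Seq[(Double, Double, Array[Double])] = {
  val numLabels   = stats.length
  val numDocs     = stats.map(_.n).sum
  val numFeatures = stats.head.sumTermFreqs.length
  val piLogDenom  = math.log(numDocs + numLabels * lambda)
  stats.map { s =>
    // pi(i) = log((count of class i + lambda) / (total + numLabels * lambda))
    val pi = math.log(s.n + lambda) - piLogDenom
    // thetaLogDenom: Bernoulli vs. multinomial, as in the two cases above
    val thetaLogDenom =
      if (bernoulli) math.log(s.n + 2.0 * lambda)
      else math.log(s.sumTermFreqs.sum + numFeatures * lambda)
    // theta(i)(j) = log(sumTermFreqs(j) + lambda) - thetaLogDenom
    val theta = s.sumTermFreqs.map(f => math.log(f + lambda) - thetaLogDenom)
    (s.label, pi, theta)
  }
}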
5. aggregated: the statistics aggregated over all samples, i.e., for each category, the number of samples and the sum of each feature value.
6. pi: the natural-logarithm values of the prior probability of each category.
7. theta: the (log) conditional probability values of each feature under each category.
8. predict: using the model's prior probabilities and conditional probabilities, compute the score of the sample under each category and take the category with the largest value as the sample's class; a sketch follows item 9 below.
9. The Bayesian classification model NaiveBayesModel contains the parameters: the category label list (labels), the category prior probabilities (pi), and the conditional probabilities of each feature under each category (theta).
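As an illustration of items 8 and 9, a hedged sketch of the predict computation over those three parameters (plain Scala arrays; predictSketch is a hypothetical name, and MLlib's real predict operates on its own vector and matrix types):

def predictSketch(labels: Array[Double], pi: Array[Double],
                  theta: Array[Array[Double]],
                  features: Array[Double]): Double = {
  // Score for category i (in log space): pi(i) + theta(i) . features
  val scores = labels.indices.map { i =>
    pi(i) + theta(i).zip(features).map { case (t, f) => t * f }.sum
  }
  // Take the category whose score is largest
  labels(scores.indices.maxBy(i => scores(i)))
}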
IV. Usage examples
1. Sample data (each line has the format label,feature1 feature2 feature3):
0,1 0 0
0,2 0 0
1,0 1 0
1,0 2 0
2,0 0 1
2,0 0 2
2. Code (reading the data with MLlib's own loader):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object Bayes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BayesDemo").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the sample data using MLlib's own loading method
    val data = MLUtils.loadLabeledPoints(sc, "D://bayes.txt")
    // Train the Bayesian model
    val model = NaiveBayes.train(data, 1.0)
    // model.labels.foreach(println)
    // model.pi.foreach(println)
    val test = Vectors.dense(0.0, 0.0, 100.0)
    val res = model.predict(test)
    println(res) // the output is 2.0
  }
}
3. Code (manual parsing, with a train/test split and accuracy calculation):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object Bayes {
  def main(args: Array[String]): Unit = {
    // Create the Spark context
    val conf = new SparkConf().setAppName("BayesDemo").setMaster("local")
    val sc = new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)
    // Read the sample data
    val data = sc.textFile("D://bayes.txt")
    val demo = data.map { line =>
      // Split each line into the label and the feature vector
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble,
        Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }
    // Divide the samples into training data and test data
    val sp = demo.randomSplit(Array(0.6, 0.4), seed = 11L)
    val train = sp(0)   // training data
    val testing = sp(1) // test data
    // Build and train the Bayesian classification model
    val model = NaiveBayes.train(train, lambda = 1.0)
    // Predict on the test samples
    val pre = testing.map(p => (model.predict(p.features), p.label))
    // Validate the model
    val prin = pre.take(20)
    println("prediction" + "\t" + "label")
    for (i <- 0 until prin.length) {
      println(prin(i)._1 + "\t" + prin(i)._2)
    }
    // Accuracy: the fraction of test samples predicted correctly
    val accuracy = 1.0 * pre.filter(x => x._1 == x._2).count() / testing.count()
    println(accuracy)
  }
}