I. The mathematical basis: Bayes' theorem
We all know the mathematical form of conditional probability:

P(A|B) = P(A∩B) / P(B)

that is, the probability that A occurs given that B occurs equals the probability that A and B occur at the same time, divided by the probability of B.
From this formula the Bayesian formula is obtained:

P(B|A) = P(A|B) · P(B) / P(A)

Bayes' law is a law about the conditional (and marginal) probabilities of random events A and B. In general, the probability that event A occurs given event B is not the same as the probability that event B occurs given event A, and Bayes' law describes the relationship between the two.
Going further, the Bayesian formula can be generalized. Suppose the occurrence of event A is determined by a set of mutually exclusive events (B1, B2, B3, ..., Bn) that together cover all cases; then the total probability formula for event A is:

P(A) = P(A|B1)·P(B1) + P(A|B2)·P(B2) + ... + P(A|Bn)·P(Bn)
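Combining the two: when B1, B2, ..., Bn form such a complete set of mutually exclusive events, Bayes' law can be written with the total probability formula as the denominator (a standard identity, stated here for reference; it is the form used for classification in the next section):

P(Bi|A) = P(A|Bi)·P(Bi) / (P(A|B1)·P(B1) + P(A|B2)·P(B2) + ... + P(A|Bn)·P(Bn))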
II. Naive Bayesian classification
Naive Bayesian classification is a very simple classification algorithm. Its underlying idea: for a given item to be classified, compute the probability of each category occurring under the condition of this item, and assign the item to the category for which this probability is largest.
Suppose V = (v1, v2, v3, ..., vn) is an item to be classified, where each vj is one of V's feature attributes;
B = (B1, B2, B3, ..., Bn) is the set of categories, where each Bi is one specific category.
To decide which category in B the item V belongs to, compute P(Bi|V) for every category, that is, determine which of B1, B2, B3, ..., Bn is most probable given V. The result is the category Bi for which P(Bi|V) is largest.
The question is therefore converted to: for each category in the set, what is the probability of that category given the item to be classified? These probabilities can be computed using Bayes' law. After transformation:

P(Bi|V) = P(V|Bi) · P(Bi) / P(V)

Since P(V) is the same for every category, only the numerators need to be compared; and under the naive assumption that the feature attributes are conditionally independent given the category,

P(V|Bi) = P(v1|Bi) · P(v2|Bi) · ... · P(vn|Bi)
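To illustrate the decision rule, here is a minimal self-contained Scala sketch (the names are hypothetical, and this is not MLlib code, which is covered in the next section). It classifies a 0/1 feature vector by comparing log P(Bi) plus the summed log conditional probabilities of the active features; dropping P(V) does not change the argmax:

object NaiveBayesSketch {
  // Hypothetical model: a log prior per category and, per category,
  // a log conditional probability per feature.
  case class Model(logPrior: Map[String, Double],
                   logCond: Map[String, Array[Double]])

  // Pick the category maximizing log P(Bi) + sum over active features j
  // of log P(vj | Bi). For simplicity only features with value 1 contribute.
  def classify(model: Model, v: Array[Int]): String =
    model.logPrior.keys.maxBy { cat =>
      val cond = model.logCond(cat)
      model.logPrior(cat) + v.indices.filter(j => v(j) == 1).map(j => cond(j)).sum
    }
}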
III. The corresponding MLlib API
1. The Bayesian classification companion object NaiveBayes; prototype:
object NaiveBayes extends scala.AnyRef with scala.Serializable {
  def train(input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]): org.apache.spark.mllib.classification.NaiveBayesModel = { /* compiled code */ }
  def train(input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], lambda: scala.Double): org.apache.spark.mllib.classification.NaiveBayesModel = { /* compiled code */ }
}
It mainly defines the train methods for training a Bayesian classification model, where input is the RDD of training samples and lambda is the smoothing-factor parameter.
2. The train method. It is a static method of the NaiveBayes companion object; based on the given parameters it creates a naive Bayesian classifier instance and executes its run method to train the model.
3. The naive Bayesian classification class NaiveBayes; prototype:
class NaiveBayes private (private var lambda: scala.Double) extends scala.AnyRef with scala.Serializable with org.apache.spark.Logging {
  def this() = { /* compiled code */ }
  def setLambda(lambda: scala.Double): org.apache.spark.mllib.classification.NaiveBayes = { /* compiled code */ }
  def run(data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]): org.apache.spark.mllib.classification.NaiveBayesModel = { /* compiled code */ }
}
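As a usage sketch (assuming an RDD[LabeledPoint] named input is already loaded; the lambda values here are arbitrary), both entry points lead to the same training routine:

// Static train methods on the companion object
val modelA = NaiveBayes.train(input)               // default smoothing, lambda = 1.0
val modelB = NaiveBayes.train(input, lambda = 0.5)
// Equivalent builder-style call on the NaiveBayes class itself
val modelC = new NaiveBayes().setLambda(0.5).run(input)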
4. The run method. This method mainly computes the prior probabilities and the conditional probabilities. First, all sample data are aggregated by label key, so the features of samples carrying the same label are combined; this yields the statistics for every label (its sample count and the sum of its features). From these label statistics, pi(i) and theta(i)(j) are then calculated. Finally, a Bayesian model is generated from the category label list, the class prior probabilities, and the conditional probabilities of each feature under each category. A simplified sketch of this computation follows the two thetaLogDenom cases below.
The prior probability, with the logarithm applied:

pi(i) = log(P(yi)) = log((count of class i + smoothing factor) / (total count + number of classes × smoothing factor))
The conditional probability of each feature attribute, with the logarithm applied:

theta(i)(j) = log(P(aj|yi)) = log(sumTermFreqs(j) + smoothing factor) - thetaLogDenom
where theta(i)(j) is the log conditional probability of feature j under category i, sumTermFreqs(j) is the number of occurrences of feature j in category i, and thetaLogDenom generally falls into two cases, as follows:
1. Multinomial model

thetaLogDenom = log(sumTermFreqs.values.sum + numFeatures × lambda)

where sumTermFreqs.values.sum is the total feature count of category i, numFeatures is the number of features, and lambda is the smoothing factor.
2. Bernoulli model

thetaLogDenom = log(n + 2.0 × lambda)

where n is the number of samples in category i.
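Here is a simplified sketch of the computation just described (plain Scala over pre-aggregated statistics; names such as LabelStat and computePiTheta are our own, not MLlib's internal code):

// Per label: the sample count n and the per-feature sums.
case class LabelStat(label: Double, n: Long, sumTermFreqs: Array[Double])

def computePiTheta(stats: Seq[LabelStat], lambda: Double,
                   bernoulli: Boolean): Seq[(Double, Double, Array[Double])] = {
  val numLabels   = stats.length
  val numDocs     = stats.map(_.n).sum
  val numFeatures = stats.head.sumTermFreqs.length
  val piLogDenom  = math.log(numDocs + numLabels * lambda)
  stats.map { s =>
    // pi(i) = log((count of class i + lambda) / (total + numLabels * lambda))
    val pi = math.log(s.n + lambda) - piLogDenom
    // thetaLogDenom: Bernoulli vs. multinomial, as in the two cases above
    val thetaLogDenom =
      if (bernoulli) math.log(s.n + 2.0 * lambda)
      else math.log(s.sumTermFreqs.sum + numFeatures * lambda)
    // theta(i)(j) = log(sumTermFreqs(j) + lambda) - thetaLogDenom
    val theta = s.sumTermFreqs.map(f => math.log(f + lambda) - thetaLogDenom)
    (s.label, pi, theta)
  }
}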
5. aggregated: the statistics aggregated over all samples, i.e., for each category, the number of samples and the sum of each feature value.
6. pi: the natural-logarithm values of the prior probability of each category.
7. theta: the (log) conditional probability values of each feature under each category.
8. predict: using the model's prior probabilities and conditional probabilities, compute the score of the sample under each category and take the category with the largest value as the sample's class; a sketch follows item 9 below.
9. The Bayesian classification model NaiveBayesModel contains the parameters: the category label list (labels), the category prior probabilities (pi), and the conditional probabilities of each feature under each category (theta).
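As an illustration of items 8 and 9, a hedged sketch of the predict computation over those three parameters (plain Scala arrays; predictSketch is a hypothetical name, and MLlib's real predict operates on its own vector and matrix types):

def predictSketch(labels: Array[Double], pi: Array[Double],
                  theta: Array[Array[Double]],
                  features: Array[Double]): Double = {
  // Score for category i (in log space): pi(i) + theta(i) . features
  val scores = labels.indices.map { i =>
    pi(i) + theta(i).zip(features).map { case (t, f) => t * f }.sum
  }
  // Take the category whose score is largest
  labels(scores.indices.maxBy(i => scores(i)))
}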
IV. Usage examples
1. Sample data (each line has the format label,feature1 feature2 feature3):
0,1 0 0
0,2 0 0
1,0 1 0
1,0 2 0
2,0 0 1
2,0 0 2
2. Code (reading the data with MLlib's own loader):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object Bayes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BayesDemo").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the sample data using MLlib's own loading method
    val data = MLUtils.loadLabeledPoints(sc, "D://bayes.txt")
    // Train the Bayesian model
    val model = NaiveBayes.train(data, 1.0)
    // model.labels.foreach(println)
    // model.pi.foreach(println)
    val test = Vectors.dense(0.0, 0.0, 100.0)
    val res = model.predict(test)
    println(res) // the output is 2.0
  }
}
3. Code (manual parsing, with a train/test split and accuracy calculation):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object Bayes {
  def main(args: Array[String]): Unit = {
    // Create the Spark context
    val conf = new SparkConf().setAppName("BayesDemo").setMaster("local")
    val sc = new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)
    // Read the sample data
    val data = sc.textFile("D://bayes.txt")
    val demo = data.map { line =>
      // Split each line into the label and the feature vector
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble,
        Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }
    // Divide the samples into training data and test data
    val sp = demo.randomSplit(Array(0.6, 0.4), seed = 11L)
    val train = sp(0)   // training data
    val testing = sp(1) // test data
    // Build and train the Bayesian classification model
    val model = NaiveBayes.train(train, lambda = 1.0)
    // Predict on the test samples
    val pre = testing.map(p => (model.predict(p.features), p.label))
    // Validate the model
    val prin = pre.take(20)
    println("prediction" + "\t" + "label")
    for (i <- 0 until prin.length) {
      println(prin(i)._1 + "\t" + prin(i)._2)
    }
    // Accuracy: the fraction of test samples predicted correctly
    val accuracy = 1.0 * pre.filter(x => x._1 == x._2).count() / testing.count()
    println(accuracy)
  }
}