Naive Bayes Classification with Spark MLlib


    (i) Understanding naive Bayes classification

The naive Bayes method is a classification technique based on Bayes' theorem and the assumption of conditional independence between features. In simple terms, a naive Bayes classifier assumes that each feature of a sample is unrelated to every other feature. For example, if a fruit is red, round, and about 4 inches in diameter, it can be judged to be an apple. Even though these characteristics may depend on one another, or some may be determined by others, the naive Bayes classifier treats each of them as contributing independently to the probability that the fruit is an apple. Despite these naive and simplistic assumptions, naive Bayes classifiers still achieve quite good results in many complex real-world situations. One advantage of the naive Bayes classifier is that it needs only a small amount of training data to estimate the necessary parameters (for discrete variables, the prior probabilities and class-conditional probabilities; for continuous variables, the mean and variance of each variable).
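Formally, for a record with features x1, …, xn and class label y, the conditional-independence assumption lets Bayes' theorem factor as

P(y | x1, …, xn) ∝ P(y) × P(x1 | y) × … × P(xn | y)

and the classifier predicts the class y that maximizes this product. The worked example below applies exactly this rule.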

Worked example:

[The original post embeds a small training table here (attributes: Home Owner, Marital Status, Annual Income; class label: Yes/No); the image is not available.]

From that data set, the prior probabilities, the class-conditional probabilities of each discrete attribute, and the parameters of the class-conditional distribution of the continuous attribute (sample mean and variance) are as follows:

Prior probabilities: P(Yes) = 0.3; P(No) = 0.7

P(Home Owner = Yes | No) = 3/7
P(Home Owner = No | No) = 4/7
P(Home Owner = Yes | Yes) = 0
P(Home Owner = No | Yes) = 1

P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0

Annual Income:
if class = No: sample mean = 110, sample variance = 2975
if class = Yes: sample mean = 90, sample variance = 25

--"waiting to be predicted record: x={= no, Marital status = married, Annual income =120k}

P (no) *p (with room = No | No) *p (marital status = married | No) *p (annual Income =120k| No) =0.7*4/7*4/7*0.0072=0.0024

P (yes) *p (with room = No | Yes) *p (marital status = married | Yes) *p (annual Income =120k| Yes) =0.3*1*0*1.2*10-9=0

Since 0.0024 is greater than 0, the record is categorized as No.
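To make the arithmetic concrete, here is a minimal Scala sketch (not part of the original post; gaussianDensity is a small helper defined here) that reproduces the hand computation:

// Normal probability density, parameterized by mean and variance
def gaussianDensity(x: Double, mean: Double, variance: Double): Double =
  math.exp(-math.pow(x - mean, 2) / (2 * variance)) / math.sqrt(2 * math.Pi * variance)

// Class-conditional densities for Annual Income = 120K
val pIncomeGivenNo  = gaussianDensity(120, 110, 2975) // ~0.0072
val pIncomeGivenYes = gaussianDensity(120, 90, 25)    // ~1.2e-9

// Likelihoods P(X | class): products of the class-conditional probabilities
val pXGivenNo  = (4.0 / 7) * (4.0 / 7) * pIncomeGivenNo // ~0.0024
val pXGivenYes = 1.0 * 0.0 * pIncomeGivenYes            // = 0

// Unnormalized posteriors: prior * likelihood
println(0.7 * pXGivenNo)  // ~0.0017 -> classified as No
println(0.3 * pXGivenYes) // 0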

As the example shows, if the class-conditional probability of any attribute is 0, the posterior score of the entire class becomes 0. Estimating class-conditional probabilities from raw record fractions alone is therefore too fragile, especially when training samples are few and attributes are many. The remedy is the m-estimate of conditional probability:
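P(xi = a | y) = (nc + m·p) / (n + m)

where n is the number of training records of class y, nc is the number of those records with xi = a, p is a prior estimate of the probability, and m (the "equivalent sample size") controls how much weight p carries relative to the observed data. Because m·p > 0, no conditional probability collapses to exactly 0.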

    (ii) Implementing a naive Bayes application with Spark MLlib

1. Download the data set: http://www.kaggle.com/c/stumbleupon/data (train.tsv and test.tsv)

2. Data set preprocessing

1. Remove the header line: sed 1d train.tsv > train_nohead.tsv

2. Clean out interfering values and normalize malformed entries to obtain the training data set (in the code below, quotes are stripped, missing values marked "?" become 0, and negative features are clamped to 0):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val orig_file = sc.textFile("train_nohead.tsv") // load the data set
// println(orig_file.first())

/* The naive Bayes model requires non-negative feature values:
   the default multinomial model treats each feature as a count. */
val ndata_file = orig_file.map(_.split("\t")).map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.length - 1).toInt
  // Keep only the numeric features; fill missing values ("?") with 0
  val feature = trimmed.slice(4, r.length - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d) // convert negative features to 0
  LabeledPoint(label, Vectors.dense(feature))
}

3. Train the naive Bayes model and evaluate it (accuracy, PR curve, ROC curve)

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val model_nb = NaiveBayes.train(ndata_file)

/* Accuracy of the naive Bayes classifier */
val correct_nb = ndata_file.map { point =>
  if (model_nb.predict(point.features) == point.label) 1 else 0
}.sum() / ndata_file.count()
// 0.5803921568627451

/* Precision-recall (PR) curve and ROC curve */
val metricsNB = Seq(model_nb).map { model =>
  val scoreAndLabels = ndata_file.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
metricsNB.foreach { case (m, pr, roc) =>
  println(f"$m, area under PR: ${pr * 100.0}%2.4f%%, area under ROC: ${roc * 100.0}%2.4f%%")
}
// NaiveBayesModel, area under PR: 68.0851%, area under ROC: 58.3559%

4. Model tuning

1. Change the feature selection: use the text feature in column 3 with 1-of-k encoding

/* New features: 1-of-k encode the text feature in column 3 */
val categories = orig_file.map(_.split("\t")).map(r => r(3)).distinct.collect.zipWithIndex.toMap

val dataNB = orig_file.map(_.split("\t")).map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.length - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](categories.size)
  categoryFeatures(categoryIdx) = 1.0
  LabeledPoint(label, Vectors.dense(categoryFeatures))
}

/* Train naive Bayes on the new features */
val model_nb = NaiveBayes.train(dataNB)

/* Accuracy of the classification results */
val correct_nb = dataNB.map { point =>
  if (model_nb.predict(point.features) == point.label) 1 else 0
}.sum() / dataNB.count()
// 0.6096010818120352

/* PR curve and ROC curve */
val metricsNB = Seq(model_nb).map { model =>
  val scoreAndLabels = dataNB.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
metricsNB.foreach { case (m, pr, roc) =>
  println(f"$m, area under PR: ${pr * 100.0}%2.4f%%, area under ROC: ${roc * 100.0}%2.4f%%")
}
// NaiveBayesModel, area under PR: 74.0522%, area under ROC: 60.5138%
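The 1-of-k text feature alone lifts the area under PR from about 68% to 74%. A natural follow-up (my extension, not in the original post) is to concatenate the category indicator with the cleaned numeric features; a minimal sketch, reusing the categories map and the cleaning logic from above:

// Combine the 1-of-k category vector with the non-negative numeric features
val dataCombined = orig_file.map(_.split("\t")).map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.length - 1).toInt
  val categoryFeatures = Array.ofDim[Double](categories.size)
  categoryFeatures(categories(r(3))) = 1.0
  val numericFeatures = trimmed.slice(4, r.length - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d) // naive Bayes needs non-negative values
  LabeledPoint(label, Vectors.dense(categoryFeatures ++ numericFeatures))
}
val model_combined = NaiveBayes.train(dataCombined)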

2. Tune the lambda (smoothing) parameter; the effect is not obvious:

      

import org.apache.spark.rdd.RDD

/* Train naive Bayes with a given lambda (additive smoothing) */
def trainNBWithParams(input: RDD[LabeledPoint], lambda: Double) = {
  val nb = new NaiveBayes
  nb.setLambda(lambda)
  nb.run(input)
}

val nbResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainNBWithParams(dataNB, param)
  val scoreAndLabels = dataNB.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param lambda", metrics.areaUnderROC)
}
nbResults.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
/* Results:

0.001 lambda, AUC = 60.51%
0.01 lambda, AUC = 60.51%
0.1 lambda, AUC = 60.51%
1.0 lambda, AUC = 60.51%
10.0 lambda, AUC = 60.51%

*/
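The AUC is identical for every lambda; since the evaluation uses hard 0/1 predictions, this suggests the predicted labels themselves do not change, as smoothing of this magnitude barely moves the estimated probabilities on a data set of this size. One further knob, not explored in the original (my addition; requires Spark 1.4+): NaiveBayes.train also accepts a model type, and since the 1-of-k features are binary, a Bernoulli model is a natural variant to try:

// "multinomial" is the default; "bernoulli" treats each feature as a 0/1 indicator
val model_bernoulli = NaiveBayes.train(dataNB, 1.0, "bernoulli")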

    References:

http://blog.csdn.net/han_xiaoyang/article/details/50629608

Machine Learning with Spark (book)

