1. Preface
Naive Bayes (naive Bayesian) is a simple multi-class classification algorithm whose premise is that the features are independent of one another. Training mainly consists of computing, for each feature, the conditional probability of that feature given each label. The conditional probabilities obtained from training are then used for prediction.
The version of Spark I am using is 1.3.0, and its naive Bayes implementation is multinomial NB. As of the time of writing, the newest release, Spark 1.6.0, contains both multinomial naive Bayes and Bernoulli naive Bayes. Whichever variant is supported, we know that Bayesian algorithms are generally used for document classification.
2. Application of naive Bayes
Naive Bayes classification is a very simple classification algorithm. It is called naive because the idea behind it really is simple. The ideological foundation of naive Bayes is this: for an item to be classified, compute the conditional probability of each category given the item, and assign the item to the category for which that probability is largest. In layman's terms: you see a black man on the street, and I ask you to guess where he comes from; you will probably guess Africa. Why? Because the proportion of Africans is highest among black people. Of course, he may also be American or Asian, but with no other information available, we choose the category with the largest conditional probability. That is the ideological foundation of naive Bayes.
3. Bayes principle
The formal definition of Naive Bayes classification is as follows:
(1) Let x = {a1, a2, ..., am} be an item to be classified, where each aj is a feature attribute of x.
(2) There is a set of categories C = {y1, y2, ..., yn}.
(3) Calculate P(y1|x), P(y2|x), ..., P(yn|x).
(4) If P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, then x belongs to category yk.
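To make step (4) concrete, here is a minimal Scala sketch of the decision rule; the category names and probability values are hypothetical, purely for illustration:

// Pick the category whose conditional probability P(yi|x) is largest.
def classify(posteriors: Map[String, Double]): String =
  posteriors.maxBy(_._2)._1

// Hypothetical posteriors for one item x:
println(classify(Map("y1" -> 0.2, "y2" -> 0.7, "y3" -> 0.1))) // prints y2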
So the key now is how to calculate the conditional probabilities in step (3). We can do it as follows:
(1) Find a set of items whose classifications are already known; this set is called the training sample set.
(2) From statistics on the training set, obtain the conditional probability estimates of each feature attribute under each category. That is, compute
P(a1|y1), P(a2|y1), ..., P(am|y1); P(a1|y2), P(a2|y2), ..., P(am|y2); ...; P(a1|yn), P(a2|yn), ..., P(am|yn).
(3) If the feature attributes are conditionally independent, then according to Bayes' theorem we have the following derivation:
P(yi|x) = P(x|yi) P(yi) / P(x)
Because the denominator is constant for all categories, we only need to maximize the numerator. And because the feature attributes are conditionally independent, we have:
P(x|yi) P(yi) = P(a1|yi) P(a2|yi) ... P(am|yi) P(yi) = P(yi) ∏ P(aj|yi), where the product runs over j = 1, ..., m.
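As a toy numeric illustration of the final formula (all probabilities below are made up), suppose there are two categories and two feature attributes; each category is scored by the numerator P(yi) P(a1|yi) P(a2|yi), and the category with the larger score wins:

// Made-up priors and conditional probabilities for two categories and two features.
val priors = Map("y1" -> 0.4, "y2" -> 0.6)
val conditionals = Map(
  "y1" -> Array(0.8, 0.3), // P(a1|y1), P(a2|y1)
  "y2" -> Array(0.1, 0.5)  // P(a1|y2), P(a2|y2)
)
// Score each category by P(yi) * product of P(aj|yi); the denominator P(x)
// is ignored because it is the same for every category.
val scores = priors.map { case (y, p) => y -> p * conditionals(y).product }
// y1: 0.4 * 0.8 * 0.3 = 0.096; y2: 0.6 * 0.1 * 0.5 = 0.030
val predicted = scores.maxBy(_._2)._1 // "y1"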
Based on the above analysis, the naive Bayes classification process can be summarized as follows (for the time being, validation is not considered). As you can see, the whole process is divided into three stages:
The first stage is the preparation stage. The task of this stage is to make the necessary preparations for naive Bayes classification: determine the feature attributes according to the specific problem, divide each feature attribute appropriately, and then manually classify a portion of the items to form a training sample set. The input of this stage is all the data to be classified, and the output is the feature attributes and the training samples. This is the only stage in the whole naive Bayes classification process that must be completed manually, and its quality has an important influence on the whole process: the quality of the classifier is determined to a great extent by the feature attributes, the division of the feature attributes, and the quality of the training samples.
The second stage is the classifier training stage. The task of this stage is to generate the classifier: compute the frequency of each category in the training samples and the conditional probability estimate of each feature attribute division under each category, and record the results. The input is the feature attributes and the training samples, and the output is the classifier. This stage is mechanical: according to the formulas discussed above, it can be completed automatically by a program.
The third stage is the application stage. The task of this stage is to classify new items using the classifier. The input is the classifier and the items to be classified, and the output is the mapping from each item to its category. This stage is also mechanical and is completed by a program.
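As a concrete illustration of stages two and three, here is a small from-scratch Scala sketch for discrete feature attributes. This is not MLlib's implementation; all names are illustrative, and Laplace smoothing with lambda = 1.0 is assumed so that unseen feature values do not receive probability zero.

case class Sample(label: Int, features: Array[Int])

// Stage two: estimate P(yi) and P(aj = v | yi) by counting over the training set.
def trainNaiveBayes(data: Seq[Sample], numClasses: Int, numValues: Int)
    : (Array[Double], Array[Array[Array[Double]]]) = {
  val lambda = 1.0
  val numFeatures = data.head.features.length
  // Class priors P(yi) from class frequencies, with smoothing.
  val priors = Array.tabulate(numClasses) { c =>
    (data.count(_.label == c) + lambda) / (data.size + lambda * numClasses)
  }
  // Conditional probabilities P(aj = v | yi) from counts, with smoothing.
  val cond = Array.tabulate(numClasses, numFeatures, numValues) { (c, j, v) =>
    val inClass = data.filter(_.label == c)
    (inClass.count(_.features(j) == v) + lambda) /
      (inClass.size + lambda * numValues)
  }
  (priors, cond)
}

// Stage three: classify x by maximizing P(yi) * product of P(aj | yi).
def predict(priors: Array[Double], cond: Array[Array[Array[Double]]],
            x: Array[Int]): Int =
  priors.indices.maxBy { c =>
    priors(c) * x.indices.map(j => cond(c)(j)(x(j))).product
  }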
4. Example
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)

val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0)
// In Spark 1.4 and later: NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

// Save and load the model.
model.save(sc, "myModelPath")
val sameModel = NaiveBayesModel.load(sc, "myModelPath")

sc.stop()
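For reference, each line of sample_naive_bayes_data.txt in the Spark distribution has the form "label,f1 f2 f3" (for example, "0,1 0 0"), which is what the parsing code above expects: the label before the comma, and space-separated feature values after it.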