Bayes' Rule
Machine learning task: determine the best hypothesis in a hypothesis space H, given the training data D. One way to define the best hypothesis is as the most probable hypothesis, given the data D and prior knowledge of the probabilities of the various hypotheses in H. Bayes' theorem provides a way to calculate the probability of a hypothesis from its prior probability, the probability of observing different data under that hypothesis, and the observed data itself.
Prior Probability and Posterior Probability
P(A) denotes the initial probability that hypothesis A holds before any training data has been seen. P(A) is called the prior probability of A. It reflects background knowledge about the chance that A is a correct hypothesis. Without such prior knowledge, you can simply assign the same prior probability to every candidate hypothesis. P(B) denotes the prior probability of the training data B. P(B | A) denotes the probability of observing B given that A holds. In machine learning we are interested in P(A | B), the probability that A holds given the observed data B, which is called the posterior probability of A.
Bayes' Formula
Bayes' formula provides a way to calculate the posterior probability P(A | B) from the prior probabilities P(A) and P(B) and from P(B | A).
Bayes' theorem is expressed by the following formula:
P(A | B) = P(B | A) P(A) / P(B)
P(A | B) increases as P(A) and P(B | A) increase, and decreases as P(B) increases. In other words, the more likely B is to be observed independently of A, the less support B provides for A.
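To make the formula concrete, here is a minimal Scala sketch that computes a posterior from a prior and two likelihoods. The object name BayesRuleSketch, the helper names, and all of the numbers are assumptions made up for illustration; none of them come from the original example. P(B) is obtained from the law of total probability over A and not-A.

```scala
object BayesRuleSketch {
  // Posterior P(A | B) from the prior P(A), the likelihood P(B | A) and the evidence P(B).
  def posterior(priorA: Double, likelihoodBGivenA: Double, evidenceB: Double): Double =
    likelihoodBGivenA * priorA / evidenceB

  // P(B) by the law of total probability over A and not-A.
  def evidence(priorA: Double, likelihoodBGivenA: Double, likelihoodBGivenNotA: Double): Double =
    likelihoodBGivenA * priorA + likelihoodBGivenNotA * (1.0 - priorA)

  def main(args: Array[String]): Unit = {
    // Hypothetical numbers, chosen only for illustration.
    val pA = 0.01          // prior P(A)
    val pBGivenA = 0.9     // likelihood P(B | A)
    val pBGivenNotA = 0.05 // likelihood P(B | not A)

    val pB = evidence(pA, pBGivenA, pBGivenNotA)
    println(s"P(B) = $pB")
    println(s"P(A | B) = ${posterior(pA, pBGivenA, pB)}")
  }
}
```

With these made-up numbers the posterior comes out at roughly 0.15, even though P(B | A) is 0.9, because the prior P(A) is small and B is also fairly likely without A.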
Naive Bayes
The naive Bayes algorithm applies Bayes' formula to classification under the assumption that the features are independent of each other given the class. See 70173402
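The independence assumption means that the score of a class is simply its prior multiplied by the per-feature likelihoods. The following is a minimal sketch of that idea in plain Scala, not Spark's implementation. The object name NaiveBayesSketch, the helper names, and the probabilities are hypothetical. Logarithms are summed instead of multiplying probabilities, a common way to avoid numerical underflow.

```scala
object NaiveBayesSketch {
  // Log-score of one class: log P(class) + sum of log P(feature_i | class).
  def logScore(prior: Double, featureLikelihoods: Seq[Double]): Double =
    math.log(prior) + featureLikelihoods.map(x => math.log(x)).sum

  // Index of the class with the highest log-score.
  def classify(priors: Seq[Double], likelihoods: Seq[Seq[Double]]): Int = {
    val scores = priors.zip(likelihoods).map { case (p, ls) => logScore(p, ls) }
    scores.indexOf(scores.max)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical two-class problem with two observed features.
    val priors = Seq(0.6, 0.4)
    // likelihoods(c)(i) = P(feature_i | class c) for the observed feature values.
    val likelihoods = Seq(Seq(0.1, 0.2), Seq(0.5, 0.3))
    println(s"Predicted class index: ${classify(priors, likelihoods)}")
  }
}
```

Here class 0 scores 0.6 * 0.1 * 0.2 = 0.012 and class 1 scores 0.4 * 0.5 * 0.3 = 0.06, so class 1 is chosen even though its prior is lower.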
The official Spark NaiveBayes example code is as follows:
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object NaiveBayesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("NaiveBayesDemo").master("local")
      .config("spark.sql.warehouse.dir", "C:\\study\\sparktest")
      .getOrCreate()

    // Load the data stored in LIBSVM format as a DataFrame.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3), seed = 1234L)

    // Train a NaiveBayes model.
    val model = new NaiveBayes().fit(trainingData)

    // Select example rows to display.
    val predictions = model.transform(testData)
    predictions.show()

    // Select (prediction, true label) and compute the test accuracy.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test set accuracy = $accuracy")

    spark.stop()
  }
}
The output is as follows:
18/10/24 11:50:06 INFO SparkContext: Starting job: collectAsMap at MulticlassMetrics.scala:48
+-----+--------------------+--------------------+-----------+----------+
|label|            features|       rawPrediction|probability|prediction|
+-----+--------------------+--------------------+-----------+----------+
|  0.0|     (692,[,97,12...|[-173678.60946628...|  [1.0,0.0]|       0.0|
|  0.0|   (692,[100,99,1...|[-178107.24302988...|  [1.0,0.0]|       0.0|
|  0.0|(692,[100,101,102...|[-100020.80519087...|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[-183521.85526462...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-183004.12461660...|  [1.0,0.0]|       0.0|
|  0.0|(692,[128,129,130...|[-246722.96394714...|  [1.0,0.0]|       0.0|
|  0.0|(692,[152,153,154...|[-208696.01108598...|  [1.0,0.0]|       0.0|
|  0.0|(692,[153,154,155...|[-261509.59951302...|  [1.0,0.0]|       0.0|
|  0.0|(692,[154,155,156...|[-217654.71748256...|  [1.0,0.0]|       0.0|
|  0.0|(692,[181,182,183...|[-155287.07585335...|  [1.0,0.0]|       0.0|
|  1.0|(692,[99,100,101,...|[-145981.83877498...|  [0.0,1.0]|       1.0|
|  1.0|(692,[100,101,102...|[-147685.13694275...|  [0.0,1.0]|       1.0|
|  1.0|(692,[123,124,125...|[-139521.98499849...|  [0.0,1.0]|       1.0|
|  1.0|(692,[124,125,126...|[-129375.46702012...|  [0.0,1.0]|       1.0|
|  1.0|(692,[126,127,128...|[-145809.08230799...|  [0.0,1.0]|       1.0|
|  1.0|(692,[127,128,129...|[-132670.15737290...|  [0.0,1.0]|       1.0|
|  1.0|(692,[128,129,130...|[-100206.72054749...|  [0.0,1.0]|       1.0|
|  1.0|(692,[129,130,131...|[-129639.09694930...|  [0.0,1.0]|       1.0|
|  1.0|(692,[129,130,131...|[-143628.65574273...|  [0.0,1.0]|       1.0|
|  1.0|(692,[129,130,131...|[-129238.74023248...|  [0.0,1.0]|       1.0|
+-----+--------------------+--------------------+-----------+----------+
only showing top 20 rows

18/10/24 11:50:06 INFO DAGScheduler: Job 6 finished: countByValue at MulticlassMetrics.scala:42, took 0.157446 s
Test set accuracy = 1.0
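The probability column shows exactly 1.0 and 0.0 while rawPrediction holds very large negative numbers. One plausible reading, assuming rawPrediction contains unnormalized per-class log-scores, is that the probabilities are obtained by exponentiating and normalizing those scores. The sketch below illustrates that normalization in plain Scala. The object name LogScoreNormalization and the input values are hypothetical, since the second score of each row is truncated in the output above.

```scala
object LogScoreNormalization {
  // Turn per-class log-scores into probabilities that sum to 1.
  // Subtracting the maximum first keeps math.exp from underflowing for the best class.
  def normalize(logScores: Seq[Double]): Seq[Double] = {
    val maxLog = logScores.max
    val scaled = logScores.map(s => math.exp(s - maxLog))
    val total = scaled.sum
    scaled.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical log-scores for two classes; only the magnitude resembles the output above.
    println(normalize(Seq(-173678.6, -175000.0)))
    // A much smaller gap leaves both classes with visible probability mass.
    println(normalize(Seq(-2.0, -3.0)))
  }
}
```

When the gap between the two log-scores is in the hundreds or thousands, the smaller term underflows to zero after exponentiation, which is consistent with the [1.0,0.0] and [0.0,1.0] values shown.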
Bayes' theorem, naive Bayes, and running the official Spark MLlib NaiveBayes example