1. Preparation:
(1) Prior probability: a probability obtained from past experience and analysis, i.e. the probability known before any new evidence is observed; in the total probability formula it plays the role of reasoning "from cause to effect".
(2) Posterior probability: the probability re-evaluated after the "result" information has been obtained; it is usually a conditional probability (although not every conditional probability is a posterior probability). In Bayes' formula it plays the role of reasoning "from effect back to cause".
For example: a batch of parts is processed by two machines, a and b; a processes 60% of them and b processes 40%. The probability that a part processed by a is defective is 0.1, and the probability that a part processed by b is defective is 0.15. Finding the probability that a randomly chosen part is defective uses the prior probabilities; knowing that a part is defective and asking for the probability that it was processed by a (or by b) asks for a posterior probability.
(3) Total probability formula: let E be a random experiment, and let B1, B2, ..., Bn be mutually exclusive events of E with P(Bi) > 0 (i = 1, 2, ..., n) and B1 ∪ B2 ∪ ... ∪ Bn = S. If A is an event of E, then
P(A) = P(B1)P(A|B1) + P(B2)P(A|B2) + ... + P(Bn)P(A|Bn)
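As a quick numeric check of the total probability formula, here is a minimal Python sketch using the parts example above (the dictionary layout and variable names are just for illustration):

```python
# Parts example: machine a processes 60% of the parts with defect rate 0.1,
# machine b processes 40% with defect rate 0.15.
p_machine = {"a": 0.6, "b": 0.4}          # P(Bi): prior probability of each machine
p_defect_given = {"a": 0.10, "b": 0.15}   # P(A|Bi): defect rate of each machine

# Total probability formula: P(A) = sum over i of P(Bi) * P(A|Bi)
p_defect = sum(p_machine[m] * p_defect_given[m] for m in p_machine)
print(round(p_defect, 4))  # 0.6*0.1 + 0.4*0.15 = 0.12
```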
(4) Bayes' formula: let E be a random experiment, and let B1, B2, ..., Bn be mutually exclusive events of E with P(Bi) > 0 (i = 1, 2, ..., n) and B1 ∪ B2 ∪ ... ∪ Bn = S. If A is an event of E with P(A) > 0, then
P(Bi|A) = P(Bi)P(A|Bi) / (P(B1)P(A|B1) + P(B2)P(A|B2) + ... + P(Bn)P(A|Bn))
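Continuing the same example, a minimal sketch of Bayes' formula: given that a part is defective, the probability that it was processed by a (again, the names and layout are only illustrative):

```python
# Same parts example as above.
p_machine = {"a": 0.6, "b": 0.4}          # P(Bi)
p_defect_given = {"a": 0.10, "b": 0.15}   # P(A|Bi)

# Denominator: total probability P(A)
p_defect = sum(p_machine[m] * p_defect_given[m] for m in p_machine)

# Bayes' formula: P(Bi|A) = P(Bi) * P(A|Bi) / P(A)
posterior = {m: p_machine[m] * p_defect_given[m] / p_defect for m in p_machine}
print(posterior)  # both machines end up at 0.5, since 0.06/0.12 = 0.5
```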
(5) Conditional probability formula: P(A|B) = P(AB) / P(B)
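For instance, in the parts example above, P(defective and processed by a) = 0.6 * 0.1 = 0.06 and P(processed by a) = 0.6, so P(defective | processed by a) = 0.06 / 0.6 = 0.1, which matches the defect rate stated for a.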
(6) Maximum likelihood estimation: in machine learning, maximum likelihood estimation corresponds to minimizing the empirical risk (with the log loss). The general procedure (for a discrete distribution) is: write down the likelihood function (the joint probability of the sample), which is a function of the parameter to be estimated; take its logarithm; differentiate with respect to the parameter; set the derivative equal to 0 and solve. The resulting value is the maximum likelihood estimate of the parameter.
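A minimal sketch of this procedure for a Bernoulli (coin-flip) sample, with made-up observations; setting the derivative of the log-likelihood to zero gives the closed form p_hat = (number of ones) / N, and the grid search below only confirms that numerically:

```python
import math

# Made-up Bernoulli sample, for illustration only
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n_ones, n = sum(sample), len(sample)

def log_likelihood(p):
    # log of the joint probability of the sample as a function of the parameter p
    return n_ones * math.log(p) + (n - n_ones) * math.log(1 - p)

# Closed form from d/dp log-likelihood = 0:  p_hat = n_ones / n
p_hat = n_ones / n

# Numeric check: the closed form matches the best candidate on a grid
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_likelihood)
print(p_hat, best)  # 0.7 0.7
```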
Note on empirical risk: to measure how good or bad a model is, a loss function is introduced; common loss functions include the 0-1 loss, the squared loss, the absolute loss, and the log loss. The risk function (expected risk) is the expectation of the loss function, taken over the joint distribution of the data. This theoretical joint distribution cannot be obtained, so the expectation can only be estimated from the sample; this leads to the empirical risk, which is the average loss over the sample. By the law of large numbers, as the sample size tends to infinity, the empirical risk converges to the expected risk.
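A minimal sketch of the empirical risk as the average loss over a sample, using the 0-1 loss; the labels and predictions below are made up purely for illustration:

```python
# 0-1 loss: 1 if the prediction is wrong, 0 if it is right
def zero_one_loss(y_true, y_pred):
    return 0 if y_true == y_pred else 1

# Made-up labels and model predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Empirical risk = average loss over the sample
emp_risk = sum(zero_one_loss(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(emp_risk)  # 1/6, about 0.167
```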
2. Naive Bayesian algorithm
(1) Idea: the "naive" in naive Bayes refers to the assumption that the components of the input vector (X1, X2, ..., Xn) are conditionally independent of each other given the class, so that P(X1=x1, X2=x2, ..., Xn=xn | Y=ck) = P(X1=x1|Y=ck) P(X2=x2|Y=ck) ... P(Xn=xn|Y=ck). On top of this, the algorithm is based on Bayes' theorem: for a given training data set, it first learns the joint probability distribution using the conditional independence assumption, and then, for a given input vector, uses Bayes' formula to output the class label with the largest posterior probability.
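A tiny numeric illustration of this assumption, with made-up per-feature conditional probabilities for a single class:

```python
# Assumed per-feature conditional probabilities P(X^(j)=x^(j) | Y=c) for one class c
# and a 3-dimensional input; the numbers are invented for illustration.
p_feature_given_class = [0.5, 0.2, 0.8]

# Under the conditional independence assumption, the class-conditional probability
# of the whole vector is the product of the per-feature conditionals.
p_joint_given_class = 1.0
for p in p_feature_given_class:
    p_joint_given_class *= p
print(p_joint_given_class)  # 0.5 * 0.2 * 0.8 = 0.08
```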
(2) Details: the following spells out, step by step, how naive Bayes computes the class of an input vector x.
<1> to classify an input vector x, compute the probability of each class value y given x; the value of y with the largest probability is the predicted class of x. The probability in question is P(Y=ck | X=x).
<2> use the conditional probability formula to derive Bayes' formula (this step is not strictly necessary; I just prefer deriving it to memorizing it).
From the conditional probability formula: P(Y=ck | X=x) = P(Y=ck, X=x) / P(X=x) = P(X=x | Y=ck) P(Y=ck) / P(X=x)
Using the total probability formula to replace P(X=x):
P(Y=ck | X=x) = P(X=x | Y=ck) P(Y=ck) / Σ_k P(X=x | Y=ck) P(Y=ck)
<3> because of the "naivety" of naive Bayes, the features are conditionally independent of each other given the class, so the following factorization holds:
P(X=x | Y=ck) = ∏_{j=1..n} P(X^(j)=x^(j) | Y=ck)
<4> substituting the formula in <3> into the Bayes formula in <2> gives:
P(Y=ck | X=x) = P(Y=ck) ∏_{j=1..n} P(X^(j)=x^(j) | Y=ck) / Σ_k [ P(Y=ck) ∏_{j=1..n} P(X^(j)=x^(j) | Y=ck) ]
<5> the denominator of this expression is the same for every class: for the given input vector x it sums over all values ck (k = 1, 2, ..., K) of Y, so whichever class is being scored, all of c1, ..., cK appear in it. Only the numerator affects the relative size of P(Y=ck | X=x), so we get
y = argmax_{ck} P(Y=ck) ∏_{j=1..n} P(X^(j)=x^(j) | Y=ck)
Note: argmax means taking the ck for which the expression is largest.
<6> with <5> the naive Bayes procedure is essentially complete, but it has not yet been said how P(Y=ck) and P(X^(j)=x^(j) | Y=ck) are obtained. Both are estimated from the training data by maximum likelihood, i.e. by the following formulas:
P(Y=ck) = Σ_{i=1..N} I(y_i=ck) / N
P(X^(j)=x^(j) | Y=ck) = Σ_{i=1..N} I(x_i^(j)=x^(j), y_i=ck) / Σ_{i=1..N} I(y_i=ck)
where I(·) is the indicator function. In practice these probabilities are very quick to compute; the example problem below shows how the two kinds of probabilities are worked out. The derivation of these estimators is not repeated here (I am not entirely clear on it myself, but it is similar to maximum likelihood estimation for a binomial distribution). A minimal code sketch of the whole procedure follows.
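Below is a minimal end-to-end sketch of steps <1>-<6> on a tiny made-up categorical data set; the feature values, labels, and function names are invented for illustration, and no smoothing is applied:

```python
from collections import Counter, defaultdict

# Tiny made-up training set: each entry is ((x1, x2), y)
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "mild"), "yes"),
]

# Maximum likelihood estimate of the prior: P(Y=ck) = count(y=ck) / N
n = len(train)
class_counts = Counter(y for _, y in train)
prior = {c: class_counts[c] / n for c in class_counts}

# Maximum likelihood estimate of the conditionals:
# P(X^(j)=v | Y=ck) = count(x^(j)=v and y=ck) / count(y=ck)
cond_counts = defaultdict(Counter)  # keyed by (feature index, class)
for x, y in train:
    for j, v in enumerate(x):
        cond_counts[(j, y)][v] += 1

def cond_prob(j, v, c):
    return cond_counts[(j, c)][v] / class_counts[c]

# Classification: y = argmax over ck of P(Y=ck) * prod over j of P(X^(j)=x^(j) | Y=ck)
def predict(x):
    scores = {}
    for c in prior:
        score = prior[c]
        for j, v in enumerate(x):
            score *= cond_prob(j, v, c)
        scores[c] = score
    return max(scores, key=scores.get), scores

print(predict(("sunny", "hot")))  # 'no' wins: 0.4*1.0*0.5 = 0.2 vs 0.6*(1/3)*(1/3) ~ 0.067
```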
3. Example problem: an exercise that ties the steps above together (posted directly as images in the original).