Machine Learning --- Naive Bayes Classifier
Naive Bayes classifiers are a family of simple and fast classification algorithms. There are many articles about them on the Internet; this one is relatively good: 60140664. Here I will sort the topic out as I understand it.
In machine learning we often need to solve classification problems. That is, given a sample's feature values (feature1, feature2, ..., featuren), we want to know which category label the sample belongs to (label1, label2, ..., labeln). In other words, we want to know the conditional probability P(label|features) for each possible label, so that we can decide which class the sample belongs to. For example: suppose a dataset has 2 categories (labels); if a new sample appears and P(label1|features) > P(label2|features), then we can conclude that the sample's label is label1.
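As a tiny illustration of this decision rule (the probabilities below are made-up numbers, not computed from any data):

```python
# Made-up posterior probabilities for one new sample.
posteriors = {"label1": 0.7, "label2": 0.3}

# Choose the label with the largest conditional probability P(label|features).
predicted = max(posteriors, key=posteriors.get)
print(predicted)  # -> label1, because P(label1|features) > P(label2|features)
```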
So how is P(label|features) computed? If the data are discrete and we could observe the whole population, P(label|features) could be counted directly. But when the population is very large we cannot inspect every record; we can only take a random sample and use it to estimate the population. And what if the features take continuous values? In either case, we estimate P(label|features) by learning from a training set {(features1, label1), (features2, label2), ..., (featuresn, labeln)}.
There are three common estimation methods: maximum likelihood estimation (MLE), Bayesian estimation, and maximum a posteriori estimation (MAP). (Figure excerpted from: https://www.cnblogs.com/little-YTMM/p/5399532.html)
(i) Maximum likelihood estimation (MLE)
This belongs to the frequentist school, which holds that there is a single true value of the parameter θ.
If we sample from the population, we assume the sampled data follow some distribution (for example, a normal distribution), but we do not know the distribution's parameter θ (for example, the mean and standard deviation). Maximum likelihood estimation looks for the parameter θ that is most likely to have produced the observed sample data, that is, θ_MLE = argmax_θ P(D|θ). Because the likelihood is a product of probabilities, we usually take the logarithm of the likelihood function, which turns the product into a sum; we then differentiate, set the derivative to 0, and the extremum point gives the parameter value we are looking for.
(Note: this means that although "Bayes" is part of the name, a naive Bayes model can be estimated with maximum likelihood rather than with Bayesian methods.)
However, maximum likelihood estimation is only reliable when the amount of data is large. If the amount of data is small, the result is likely to be biased. For example, suppose you toss a fair coin 10 times and get 7 heads and 3 tails (assuming P(Head) follows a beta distribution). The likelihood is then maximized at x = 0.7. Can we therefore say P(Head) = 0.7? Clearly not, because we all know P(Head) = P(Tail) = 0.5. If we tossed the coin 1000 times the result would be much more accurate, but often we cannot run that many experiments. The remedy for this is Bayesian estimation.
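Here is a minimal sketch of that coin example, assuming a Bernoulli likelihood and using a simple grid search just for illustration (the closed-form MLE is of course 7/10):

```python
import numpy as np

# 10 tosses: 7 heads (1) and 3 tails (0).
tosses = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

def log_likelihood(p, data):
    # Log-likelihood of the data under a Bernoulli model with parameter p.
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Evaluate a grid of candidate parameters and keep the one that maximizes the likelihood.
grid = np.linspace(0.01, 0.99, 99)
p_mle = grid[np.argmax([log_likelihood(p, tosses) for p in grid])]
print(p_mle)  # ~0.7 -- the MLE simply follows the small sample
```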
(ii) Bayesian estimation
This belongs to the Bayesian school, which holds that θ is a random variable that follows some probability distribution.
Again we sample from the population and assume the sampled data follow some distribution; in addition, from previous experience we know the probability distribution of the parameter θ (the prior probability P(θ)). Based on Bayes' theorem, by learning the conditional probability distribution P(D|θ) we can compute the posterior distribution P(θ|D). By considering all possible values of θ when predicting a new sample, we obtain the best prediction.
Take the coin-tossing example above again: we still assume that P(Head) follows a beta distribution, and we know that each toss follows a Bernoulli (two-point) distribution (the coin lands either heads or tails, so the prior is centred on P(θ) = 0.5). By calculation we can see that P(Head) is distributed roughly between 0.2 and 1 (see this article: http://www.360doc.com/content/17/1002/23/31429017_691875200.shtml), with the beta posterior peaking around x = 0.6. By adding the prior probability we obtain a more reasonable estimate. This shows that Bayesian estimation can be used when the data are few or sparse.
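Here is a minimal sketch of the Bayesian update for the same coin, assuming a Beta(5, 5) prior centred on 0.5 (the prior parameters are my own choice for illustration, not taken from the linked article):

```python
import numpy as np
from scipy.stats import beta

heads, tails = 7, 3
a_prior, b_prior = 5, 5   # assumed Beta(5, 5) prior, centred on P(Head) = 0.5

# The beta prior is conjugate to the binomial likelihood, so the posterior is
# simply Beta(a_prior + heads, b_prior + tails).
posterior = beta(a_prior + heads, b_prior + tails)

grid = np.linspace(0.01, 0.99, 99)
posterior_mode = grid[np.argmax(posterior.pdf(grid))]
print(posterior_mode)  # ~0.61 -- pulled toward 0.5 by the prior, instead of the MLE 0.7
```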
However, while Bayesian estimation solves the small-data problem, it brings a new one. Because Bayesian estimation lets the parameter θ follow a full probability distribution, computing the entire posterior becomes very complex. For convenience, it was proposed not to work out the whole posterior distribution P(θ|D), but instead, following an idea similar to maximum likelihood estimation, to find only the maximum of the posterior probability. This simple and effective method is called maximum a posteriori estimation.
(iii) Maximum a posteriori estimation (MAP)
Maximum a posteriori estimation is similar to maximum likelihood estimation, except that the prior probability P(θ) is added, which acts like a penalty term (regularization) and reduces the bias caused by a small sample.
(Note: if the prior probability P(θ) is uniform, maximum a posteriori estimation is equivalent to maximum likelihood estimation.)
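A small sketch of the MAP estimate for the coin, again assuming a Beta(5, 5) prior and using the closed-form mode of the beta posterior:

```python
heads, tails = 7, 3
a, b = 5, 5   # assumed Beta(5, 5) prior; Beta(1, 1) would be a uniform prior

# MAP estimate = mode of the posterior Beta(a + heads, b + tails).
p_map = (a + heads - 1) / (a + heads + b + tails - 2)
print(p_map)  # ~0.611; with the uniform prior a = b = 1 this reduces to 7/10, the MLE
```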
Let me now derive the whole classification procedure based on maximum a posteriori estimation:
According to Bayes' theorem, substituting the dataset's features and label, we get: P(label|features) = P(features|label) P(label) / P(features).
Because P(features) is a constant, the formula can be rewritten as: P(label|features) ∝ P(features|label) P(label). (∝ means "proportional to".)
Naive Bayes assumes that the features are mutually independent, so P(features|label) = P(feature1|label) × P(feature2|label) × ... × P(featuren|label). (Note: naive Bayes is called "naive" precisely because it assumes that the features are independent of each other.)
At this point, the formula can be written as: P(label|features) ∝ P(label) × P(feature1|label) × P(feature2|label) × ... × P(featuren|label).
That is: P(label|features) ∝ P(label) ∏_i P(feature_i|label).
In other words: we need to learn the prior probability P(label) and the conditional probability distribution P(features|label), obtain the joint distribution P(features, label), and then derive the posterior distribution P(label|features).
"Learning" here means estimating these distributions. The distributions of P(label) and P(features|label) can be estimated with maximum likelihood estimation, and the best estimate of P(label|features) is then: label* = argmax_label P(label) ∏_i P(feature_i|label).
In other words, the naive Bayes classifier assigns the sample to the class with the largest posterior probability P(label|features).
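Here is a from-scratch sketch of that decision rule on a toy categorical data set (the data and feature names are made up):

```python
from collections import Counter, defaultdict

# Toy training set: each sample is (features, label); the data are made up.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

label_counts = Counter(label for _, label in train)        # counts for P(label)
feature_counts = defaultdict(Counter)                      # counts for P(feature_i|label)
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[(label, i)][value] += 1

def predict(features):
    # argmax over labels of P(label) * prod_i P(feature_i|label)
    best_label, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / len(train)                         # prior P(label)
        for i, value in enumerate(features):
            score *= feature_counts[(label, i)][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(("rainy", "mild")))  # -> "yes"
```

Notice that P("rainy"|"no") comes out as 0 here because "rainy" never appears with label "no" in the training set; that is exactly the zero-frequency problem the note below addresses.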
(Note: a feature value of a new sample may never have appeared in the training set, so its conditional probability P(feature|label) becomes 0, which forces P(label|features) to be 0 as well. This is obviously wrong. The solution is to introduce Laplace smoothing. For example, when classifying text we treat word frequencies as features and estimate the conditional probability P(word|label) by maximum likelihood: within one class (label), the count of the word divided by the total count of all words. With Laplace smoothing, the estimate of P(word|label) becomes: (count of the word in that class + 1) / (total word count in that class + number of distinct words). When the training set is large enough this barely affects the result, and it solves the zero-frequency problem.)
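A minimal sketch of Laplace smoothing for the word-count estimate described above (the vocabulary and counts are made up):

```python
# Made-up word counts for one class (label = "spam").
word_counts = {"free": 3, "offer": 2, "meeting": 0}   # "meeting" never appears in this class
vocab_size = len(word_counts)
total = sum(word_counts.values())

# Unsmoothed MLE: P(word|label) = count(word, label) / total words in the class.
mle = {w: c / total for w, c in word_counts.items()}                    # "meeting" -> 0.0

# Laplace smoothing: add 1 to every count, add the vocabulary size to the denominator.
smoothed = {w: (c + 1) / (total + vocab_size) for w, c in word_counts.items()}
print(mle["meeting"], smoothed["meeting"])   # 0.0 vs 0.125: the zero probability is gone
```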
Depending on the assumption made about the distribution of P(features|label), naive Bayes classifiers come in different types; the following are three common ones (a code sketch follows the list below):
1. Gaussian naive Bayes (Gaussian Naive Bayes) --- assumes the features are continuous values that follow a Gaussian distribution within each class. Formula: P(feature_i|label) = 1 / sqrt(2π·σ_label²) · exp(−(feature_i − μ_label)² / (2·σ_label²)).
2. Multinomial naive Bayes (Multinomial Naive Bayes) --- assumes the feature vectors (e.g. word counts) are generated by a multinomial distribution. Formula: P(feature_i|label) = (N_label,i + α) / (N_label + α·n), where N_label,i is the count of feature i in class label, N_label is the total count over all features in that class, n is the number of features, and α is the smoothing parameter.
3. Bernoulli naive Bayes (Bernoulli Naive Bayes) --- assumes each feature is an independent boolean (binary) variable. Formula: P(feature_i|label) = P(i|label)·feature_i + (1 − P(i|label))·(1 − feature_i).
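Here is a usage sketch of the three variants with scikit-learn, on synthetic data and default parameters (the data-generating choices are mine, purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)                             # binary labels

X_cont = rng.normal(loc=y[:, None], size=(100, 3))           # continuous features -> GaussianNB
X_counts = rng.poisson(lam=y[:, None] + 1, size=(100, 3))    # count features -> MultinomialNB
X_bin = (X_counts > 1).astype(int)                           # binary features -> BernoulliNB

for model, X in [(GaussianNB(), X_cont), (MultinomialNB(), X_counts), (BernoulliNB(), X_bin)]:
    print(type(model).__name__, model.fit(X, y).score(X, y))  # training accuracy of each variant
```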
Advantages: 1. Training and prediction are very fast (because the features are assumed independent, each conditional distribution P(feature|label) can be estimated separately as a one-dimensional distribution)
2. Easy to interpret
3. Few tunable parameters
4. Although the assumption that the features are independent rarely holds in practice, the classifier often still performs well.
Disadvantages: 1. Because the naive Bayes classifier makes such strong assumptions about the data distribution, its predictions are usually worse than those of more complex models.
Suitable for: 1. Data where the categories are clearly separated
2. Data sets with very high dimensionality
3. Providing a fast, rough baseline solution to a classification problem
Classic applications: document classification and spam filtering
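As a sketch of the spam-filtering use case, here is a tiny scikit-learn pipeline combining a bag-of-words representation with multinomial naive Bayes (the messages and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy spam-filtering example; the messages and labels are made up.
messages = ["win a free prize now", "free offer click now",
            "meeting at noon tomorrow", "project update and meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each message into word counts; MultinomialNB classifies them.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["free prize meeting"]))   # predicted class for a new message
```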