Naive Bayesian classification
Naive Bayesian classification is a very simple classification algorithm. It is called "naive" because the idea behind it really is simple (and because, as the derivation later assumes, it treats the feature attributes as conditionally independent). The underlying idea is this: for a given item to be classified, compute the probability of each category given that this item has been observed, and assign the item to the category whose probability is largest. In layman's terms: suppose you see a dark-skinned person on the street and are asked to guess where that person comes from. You would most likely guess Africa. Why? Because, among dark-skinned people, Africans make up the largest share. The person might of course be American or Asian, but with no other information available we choose the category with the largest conditional probability. That is the idea underlying naive Bayes.
The formal definition of Naive Bayes classification is as follows:
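1. Let $x = \{a_1, a_2, \ldots, a_m\}$ be an item to be classified, where each $a$ is a feature attribute of $x$.
2. There is a category set $C = \{y_1, y_2, \ldots, y_n\}$.
3. Compute $P(y_1 \mid x), P(y_2 \mid x), \ldots, P(y_n \mid x)$.
4. If $P(y_k \mid x) = \max\{P(y_1 \mid x), P(y_2 \mid x), \ldots, P(y_n \mid x)\}$, then $x \in y_k$.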
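As a minimal sketch of the decision rule in step 4, assuming the probabilities from step 3 are already available, the classification is just an argmax over the categories. The category names and probability values below are hypothetical and serve only to illustrate the rule.

```python
# Hypothetical posteriors P(y_k | x) for a single item x; the values are made up.
posteriors = {"y1": 0.2, "y2": 0.7, "y3": 0.1}


def classify(posteriors):
    """Return the category whose posterior probability is largest."""
    return max(posteriors, key=posteriors.get)


predicted = classify(posteriors)
print(predicted)  # y2
```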
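So the key now is how to calculate each of the conditional probabilities in step 3. We can do this as follows:
1. Find a set of items whose classification is already known. This set is called the training sample set.
2. From the training samples, estimate the conditional probability of each feature attribute under each category, that is, $P(a_1 \mid y_1), P(a_2 \mid y_1), \ldots, P(a_m \mid y_1);\ P(a_1 \mid y_2), P(a_2 \mid y_2), \ldots, P(a_m \mid y_2);\ \ldots;\ P(a_1 \mid y_n), P(a_2 \mid y_n), \ldots, P(a_m \mid y_n)$.
3. If the feature attributes are conditionally independent, then Bayes' theorem gives the following derivation:

$$P(y_i \mid x) = \frac{P(x \mid y_i)\,P(y_i)}{P(x)}$$

Because the denominator is constant for all categories, we only need to maximize the numerator. And because the feature attributes are conditionally independent, we have:

$$P(x \mid y_i)\,P(y_i) = P(a_1 \mid y_i)\,P(a_2 \mid y_i) \cdots P(a_m \mid y_i)\,P(y_i) = P(y_i) \prod_{j=1}^{m} P(a_j \mid y_i)$$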
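The point about the denominator can be checked numerically. The sketch below uses made-up estimates for two categories and an item with two feature attributes. It assumes the categories are mutually exclusive and exhaustive, so that P(x) can be obtained by summing the numerators. All names and numbers are hypothetical.

```python
from math import prod

# Hypothetical estimates for an item x = (a1, a2); the values are made up.
priors = {"y1": 0.6, "y2": 0.4}                     # P(y_i)
likelihoods = {"y1": [0.5, 0.2], "y2": [0.9, 0.5]}  # P(a_j | y_i) for the observed a_j


def numerator(category):
    """P(y_i) * product over j of P(a_j | y_i): the numerator of Bayes' theorem."""
    return priors[category] * prod(likelihoods[category])


scores = {c: numerator(c) for c in priors}

# P(x) is the same for every category (assumes the categories partition all cases).
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_score, best_by_posterior)  # y2 y2
```

Dividing every score by the same positive constant cannot change which one is largest, which is why a classifier can skip computing P(x) altogether.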
The whole naive Bayesian classification process can be divided into three stages:
The first stage is the preparation stage. Its task is to make the necessary preparations for naive Bayesian classification: determine the feature attributes according to the specific problem, divide each feature attribute appropriately, and then manually classify a portion of the items to form the training sample set. The input of this stage is all the data to be classified, and the output is the feature attributes and the training samples. This is the only stage in the whole process that must be completed manually, and its quality has an important influence on everything that follows: the quality of the classifier is determined to a great extent by the feature attributes, the way they are divided, and the quality of the training samples.
The second stage is the classifier training stage. Its task is to generate the classifier: compute the frequency of each category in the training samples and the conditional probability estimate of each feature attribute division under each category, and record the results. The input is the feature attributes and the training samples, and the output is the classifier. This stage is mechanical and can be completed automatically by a program using the formulas discussed above.
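A minimal sketch of this training stage, assuming discrete feature attributes and plain relative-frequency estimates, might look like the following. The training data, the feature values, and the function name `train` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training set of (feature tuple, category) pairs; values are made up.
training = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "mild"), "yes"),
]


def train(samples):
    """Estimate P(y) and P(a_j = v | y) by relative frequency."""
    class_counts = Counter(label for _, label in samples)
    # feature_counts[label][j][v]: samples of `label` whose j-th feature equals v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in samples:
        for j, value in enumerate(features):
            feature_counts[label][j][value] += 1

    total = len(samples)
    priors = {y: n / total for y, n in class_counts.items()}
    conditionals = {
        y: {
            j: {v: c / class_counts[y] for v, c in counts.items()}
            for j, counts in per_feature.items()
        }
        for y, per_feature in feature_counts.items()
    }
    return priors, conditionals


priors, conditionals = train(training)
print(priors["yes"])                    # 0.6
print(conditionals["yes"][0]["rainy"])  # 2 of the 3 "yes" samples are rainy
```

Note that "rainy" never occurs under category "no" in this toy data, so it has no entry there at all. That is exactly the zero-frequency case that needs separate handling.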
The third stage is the application stage. Its task is to classify items using the classifier. The input is the classifier and the items to be classified, and the output is the mapping between the items and the categories. This stage is also mechanical and is completed by a program.
Another issue that needs to be discussed is what happens when $P(a \mid y) = 0$. This occurs when some division of a feature attribute never appears under a category in the training samples, and it greatly reduces the quality of the classifier, because a single zero factor makes the whole product zero. To solve this problem, Laplace calibration (also known as Laplace smoothing or add-one smoothing) is introduced. The idea is very simple: add 1 to the count of every division under each category. If the number of training samples is sufficiently large, this does not affect the results, and it resolves the awkward zero-frequency situation described above.
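As a rough illustration of the add-one idea, the sketch below estimates the conditional probabilities of one feature attribute within one category. It assumes the full list of possible divisions is known in advance. The function name `smoothed_conditional` and the data are hypothetical.

```python
from collections import Counter


def smoothed_conditional(values_in_category, possible_values):
    """Estimate P(a = v | y) after adding 1 to the count of every division.

    values_in_category: observed values of one feature among the training
        samples of one category.
    possible_values: every division (value) the feature can take.
    """
    counts = Counter(values_in_category)
    # Each division receives one extra count, so the total grows by their number.
    total = len(values_in_category) + len(possible_values)
    return {v: (counts[v] + 1) / total for v in possible_values}


# Hypothetical data: "rainy" never occurs in this category.
observed = ["sunny", "sunny", "overcast"]
probs = smoothed_conditional(observed, ["sunny", "overcast", "rainy"])
print(probs["rainy"])  # 1/6 instead of 0
```

With only three samples the correction visibly shifts the estimates (for example, "sunny" drops from 2/3 to 1/2). With a large training set the added counts become negligible, which matches the remark above.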
Reference: Algorithm Grocer, "Classification Algorithms: Naive Bayesian Classification"