Algorithm Grocer: Classification Algorithms, Part 1 -- Naive Bayesian Classification

0. Foreword

I have always been fond of algorithms. In my view, algorithms are the essence of human wisdom, and they contain an incomparable beauty. Every time an algorithm is applied in practice and solves a real problem, it brings me a kind of pleasure I find nowhere else.

I have always wanted to blog about algorithms, and I did write a couple of scattered posts, but perhaps because they were thin compared with my engineering articles, they did not arouse much interest. Now, facing graduation and the job hunt, I have decided to review my algorithm knowledge to improve my chances, and to take the opportunity to write a series of articles about algorithms. My main motive is to reinforce my own review: if I can turn what I review into articles written in my own words, I will surely master the material more solidly than by simply reading books, and the writing itself will provoke further thinking. If interested readers gain something from it as well, so much the better.

I have named this series "Algorithm Grocer", because its major characteristic is being "miscellaneous". I will not specifically discuss stacks, linked lists, binary trees, searching, sorting, or any of the other basics that every data structures textbook covers. Instead I will start from a topic, such as probabilistic algorithms, classification algorithms, NP problems, or genetic algorithms, and expand from there, possibly touching on algorithms and data structures, discrete mathematics, probability theory, statistics, operations research, data mining, formal languages and automata, and many other areas, so the content is structured like a grocery store. Of course, I will do my best to keep the content "miscellaneous but not messy".

1.1. Abstract

Bayesian classification is a generic term for a class of classification algorithms that are all based on Bayes' theorem, hence the collective name. As the first article on classification algorithms, this one introduces the classification problem and gives it a formal definition. It then introduces Bayes' theorem, the foundation of Bayesian classification. Finally, the simplest kind of Bayesian classification, naive Bayesian classification, is discussed through an example.

1.2. An overview of the classification problem

No one is a stranger to the classification problem; it is no exaggeration to say that each of us performs classification operations every day without realizing it. For example, when you see a stranger, your brain subconsciously judges whether the person is male or female; and when you walk down the street and remark to a friend, "that person looks rich" or "that one is a non-mainstream type", you are in fact performing a classification operation.

From a mathematical point of view, the classification problem can be defined as follows:

Given a set of categories C = {y1, y2, …, yn} and a set of items I = {x1, x2, …, xm}, determine the mapping rule y = f(x) such that for any xi ∈ I there is one and only one yj ∈ C with yj = f(xi). (Fuzzy sets from fuzzy mathematics are not considered here.)

Here C is called the category set, each of whose elements is a category; I is called the item set, each of whose elements is an item to be classified; and f is called the classifier. The task of a classification algorithm is to construct the classifier f.

It should be emphasized that classification problems are usually solved empirically: in general there is not enough information to construct a 100% correct mapping rule, so instead we learn from empirical data to achieve correct classification in a probabilistic sense. A classifier therefore cannot necessarily map every item exactly to its true category, and its quality depends on many factors, such as the classifier construction method, the characteristics of the data to be classified, and the number of training samples.

For example, a doctor diagnosing a patient is a typical classification process. No doctor can see a disease directly; the condition can only be inferred from the patient's symptoms and various laboratory test results. The doctor thus acts like a classifier, and the doctor's diagnostic accuracy is closely related to how he was educated (the construction method), how prominent the patient's symptoms are (the characteristics of the data to be classified), and how much experience he has (the number of training samples).

1.3. The foundation of Bayesian classification: Bayes' theorem

Every time I mention Bayes' theorem, I feel a kind of reverence, not because the theorem is deep, but because it is particularly useful. It solves a problem often encountered in real life: given the conditional probability of one event under another, how do we obtain the probability with the two events swapped? That is, knowing P(A|B), how do we obtain P(B|A)? Let us first explain what a conditional probability is:

P(A|B) denotes the probability that event A occurs given that event B has already occurred, called the conditional probability of A given B. Its basic formula is P(A|B) = P(AB) / P(B).

Bayes' theorem is useful because we often meet this situation in life: we can easily obtain P(A|B) directly, while P(B|A) is hard to obtain directly, yet P(B|A) is what we actually care about. Bayes' theorem opens the road from P(A|B) to P(B|A).

Bayes' theorem is stated below without proof:

P(B|A) = P(A|B) P(B) / P(A)
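As a quick sanity check of the theorem, the following sketch plugs in some made-up probabilities (all numbers here are assumptions chosen only for illustration):

```python
# Numerical illustration of Bayes' theorem: given P(A|B), P(B) and P(A),
# recover P(B|A). All probabilities below are made up for illustration.

def bayes(p_a_given_b, p_b, p_a):
    """P(B|A) = P(A|B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

print(round(bayes(0.9, 0.2, 0.3), 6))  # 0.9 * 0.2 / 0.3 = 0.6
```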

1.4. Principle and flow of naive Bayesian classification

1.4.1. The principle of naive Bayesian classification

Naive Bayesian classification is a very simple classification algorithm. It is called naive because its idea really is very naive. The foundation of naive Bayes is this: for a given item to be classified, compute the probability of each category conditioned on this item, and assign the item to whichever category has the largest such probability. In layman's terms: if you see a black person on the street and I ask you to guess where he comes from, you will most likely guess Africa. Why? Because the proportion of Africans among black people is the highest. He might of course be American or Asian, but with no other information available, we choose the category with the largest conditional probability. That is the ideological foundation of naive Bayes.

The formal definition of Naive Bayes classification is as follows:

1. Let x = {a1, a2, …, am} be an item to be classified, where each a is a feature attribute of x.

2. Let C = {y1, y2, …, yn} be the set of categories.

3. Compute P(y1|x), P(y2|x), …, P(yn|x).

4. If P(yk|x) = max{P(y1|x), P(y2|x), …, P(yn|x)}, then x ∈ yk.

So the key now is how to compute the conditional probabilities in step 3. We can proceed as follows:

1. Find a set of items whose classifications are known; this set is called the training sample set.

2. From statistics on the training samples, obtain the conditional probability estimates of each feature attribute under each category, that is: P(a1|y1), P(a2|y1), …, P(am|y1); P(a1|y2), P(a2|y2), …, P(am|y2); …; P(a1|yn), P(a2|yn), …, P(am|yn).

3. If the feature attributes are conditionally independent, then by Bayes' theorem we have the following derivation:

P(yi|x) = P(x|yi) P(yi) / P(x)

Because the denominator is constant for all categories, we only need to maximize the numerator. And because the feature attributes are conditionally independent, we have:

P(x|yi) P(yi) = P(a1|yi) P(a2|yi) ⋯ P(am|yi) P(yi) = P(yi) ∏ (j = 1..m) P(aj|yi)
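The derivation above reduces classification to picking the category that maximizes P(yi) ∏ P(aj|yi). Here is a minimal sketch of that decision rule; the probability tables are hypothetical, made up purely to show the mechanics:

```python
# Pick the category y maximizing P(y) * prod_j P(a_j | y), the numerator
# from the derivation above. The tables below are hypothetical.
from math import prod

priors = {"y1": 0.6, "y2": 0.4}          # P(y)
likelihoods = {                           # P(a_j | y) per attribute value
    "y1": {"a1=v": 0.5, "a2=w": 0.3},
    "y2": {"a1=v": 0.2, "a2=w": 0.8},
}

def classify(attrs):
    scores = {
        y: priors[y] * prod(likelihoods[y][a] for a in attrs)
        for y in priors
    }
    return max(scores, key=scores.get)

print(classify(["a1=v", "a2=w"]))  # y1: 0.6*0.5*0.3=0.09 beats y2: 0.4*0.2*0.8=0.064
```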

Based on the above analysis, the flow of naive Bayesian classification can be represented as follows (validation is not considered for now):

As you can see, the entire naive Bayesian classification is divided into three stages:

The first stage is the preparation stage. Its task is to make the necessary preparations for naive Bayesian classification. The main work is to determine the feature attributes according to the specific situation, partition each feature attribute appropriately, and then manually classify a portion of the items, forming the training sample set. The input of this stage is all the data to be classified; the output is the feature attributes and the training samples. This is the only stage in the whole process that must be completed manually, and its quality has an important influence on everything that follows: the quality of the classifier is determined to a great extent by the feature attributes, the feature attribute partitions, and the quality of the training samples.

The second stage is the classifier training stage. Its task is to generate the classifier. The main work is to compute the frequency of each category in the training samples and the conditional probability estimate of each feature attribute partition for each category, and to record the results. The input is the feature attributes and the training samples; the output is the classifier. This stage is mechanical: according to the formulas discussed above, it can be completed automatically by a program.

The third stage is the application stage. Its task is to classify items using the classifier. The input is the classifier and the items to be classified; the output is the mapping between items and categories. This stage is also mechanical and is completed by a program.
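The training stage described above amounts to frequency counting over labeled samples. The following sketch illustrates it for discrete attributes; the sample data is hypothetical:

```python
# Training stage as frequency counting: estimate P(y) and P(a_j | y)
# from labeled samples. The samples below are hypothetical.
from collections import Counter, defaultdict

samples = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("rainy", "mild"), "yes"),
]

def train(samples):
    class_counts = Counter(label for _, label in samples)
    attr_counts = defaultdict(Counter)   # attr_counts[y][(position, value)]
    for attrs, label in samples:
        for j, v in enumerate(attrs):
            attr_counts[label][(j, v)] += 1
    n = len(samples)
    priors = {y: c / n for y, c in class_counts.items()}
    cond = {
        y: {k: c / class_counts[y] for k, c in attr_counts[y].items()}
        for y in class_counts
    }
    return priors, cond  # the "classifier": recorded frequency estimates

priors, cond = train(samples)
print(priors["yes"])              # 3/5 = 0.6
print(cond["yes"][(0, "rainy")])  # 3/3 = 1.0
```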

1.4.2. Estimating the conditional probabilities of feature attribute partitions under each category, and Laplace calibration

This section discusses estimating P(a|y).

As can be seen from the above, computing the conditional probability P(a|y) of each partition is the key step of naive Bayesian classification. When a feature attribute is discrete, P(a|y) can be estimated very conveniently as the frequency with which each partition occurs within each category in the training samples. Below we focus on the case where the feature attribute is continuous.

When a feature attribute is continuous, its values are usually assumed to follow a Gaussian distribution (also known as a normal distribution), that is:

g(a, η, σ) = 1 / (√(2π) σ) · exp(−(a − η)² / (2σ²))

and

P(ak|yi) = g(ak, η(yi), σ(yi))

Therefore, as long as we compute the mean and standard deviation of this feature attribute within each category of the training samples, we can substitute them into the formula above to obtain the desired estimate. The computation of the mean and standard deviation is not covered here.
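The Gaussian estimate is easy to code directly. The sketch below computes the mean and (population) standard deviation of a hypothetical set of attribute values within one category and evaluates the density:

```python
# Gaussian estimate of P(a|y) for a continuous feature attribute.
from math import exp, pi, sqrt

def gaussian(a, mean, std):
    """Density g(a; eta, sigma) used to estimate P(a|y)."""
    return exp(-(a - mean) ** 2 / (2 * std ** 2)) / (sqrt(2 * pi) * std)

def estimate(values):
    """Mean and population standard deviation of one attribute in one category."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std

# Hypothetical attribute values observed within one category:
mean, std = estimate([0.1, 0.2, 0.3])
print(round(gaussian(0.2, mean, std), 3))  # density at the mean, ~4.886
```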

Another issue that needs discussing is what happens when P(a|y) = 0. This occurs when a certain feature attribute partition never appears within some category, and it greatly degrades classifier quality, since a single zero factor wipes out the whole product. To solve this problem we introduce Laplace calibration, whose idea is very simple: add 1 to the count of every partition under every category. If the number of training samples is sufficiently large, this does not affect the results, and it resolves the embarrassing situation of a zero frequency.

1.4.3. A naive Bayesian classification example: detecting fake accounts in an SNS community

Below, naive Bayesian classification is used to solve a real-world problem. For simplicity, the data in the example has been simplified somewhat.

The problem is this: for an SNS community, fake accounts (accounts registered with false identities, or a user's secondary accounts) are a common problem. As the operator of an SNS community, we want to detect these fake accounts, both to exclude their interference from operational analysis reports and to strengthen our understanding and supervision of the community.

Purely manual detection would require a great deal of manpower and be very inefficient; introducing an automatic detection mechanism would greatly improve efficiency. The problem, plainly stated, is to classify all accounts in the community into the two categories of real accounts and fake accounts. Let us walk through this process step by step.

First, let C = 0 represent a real account and C = 1 a fake account.

*1. Defining feature attributes and partitioning*

In this step we look for feature attributes that can help us distinguish real accounts from fake ones. In real applications the feature attributes are numerous and the partitions much finer, but here, for simplicity, we use a small number of feature attributes and coarse partitions, and the data has been modified.

We choose three feature attributes: a1: number of log posts / number of days registered; a2: number of friends / number of days registered; a3: whether a real avatar is used. In an SNS community, these three values are easy to obtain or compute directly from the database.

The partitions are as follows: a1: {a ≤ 0.05, 0.05 < a < 0.2, a ≥ 0.2}; a2: {a ≤ 0.1, 0.1 < a < 0.8, a ≥ 0.8}; a3: {a = 0 (no), a = 1 (yes)}.
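The partitions above can be encoded as simple threshold functions that map a raw attribute value to a partition index (0, 1, or 2):

```python
# Map raw attribute values to the partition indices defined above.
def partition_a1(a):   # log posts / days registered
    if a <= 0.05:
        return 0
    return 1 if a < 0.2 else 2

def partition_a2(a):   # friends / days registered
    if a <= 0.1:
        return 0
    return 1 if a < 0.8 else 2

def partition_a3(a):   # real avatar: already 0 (no) or 1 (yes)
    return int(a)

print(partition_a1(0.1), partition_a2(0.2), partition_a3(0))  # 1 1 0
```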

*2. Get Training Samples*

Here we use 10,000 accounts that have been manually inspected by operations staff as the training samples.

*3. Calculate the frequency of each category in the training sample*

Divide the number of real accounts and the number of fake accounts in the training sample by 10,000 respectively to obtain P(C = 0) and P(C = 1).

*4. Calculate the frequency of each feature attribute partition under each category*

*5. Use the classifier for identification*

Below we use the trained classifier to identify an account. This account uses a fake avatar, its ratio of log posts to days registered is 0.1, and its ratio of friends to days registered is 0.2.

As can be seen, although the user does not use a real avatar, the classifier is nonetheless more inclined to place this account in the real-account category. This example also shows the robustness of naive Bayesian classification against individual attributes when the feature attributes are sufficiently numerous.
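The original post's probability tables are not reproduced here, so the sketch below uses assumed priors and conditional probabilities purely to show the mechanics of step 5. The account's attributes (a1 ratio 0.1, a2 ratio 0.2, no real avatar) fall in partitions 1, 1, and 0:

```python
# Step 5 mechanics with ASSUMED probability tables. The actual
# training-sample statistics are not reproduced; every number below is a
# hypothetical stand-in chosen only for illustration.
from math import prod

priors = {0: 0.89, 1: 0.11}   # assumed P(C=0), P(C=1)
cond = {                       # assumed P(partition | C)
    0: {("a1", 1): 0.5, ("a2", 1): 0.7, ("a3", 0): 0.2},
    1: {("a1", 1): 0.1, ("a2", 1): 0.2, ("a3", 0): 0.9},
}

def classify(attrs):
    scores = {c: priors[c] * prod(cond[c][a] for a in attrs) for c in priors}
    return max(scores, key=scores.get), scores

# Account from the text: a1 ratio 0.1 (middle partition), a2 ratio 0.2
# (middle partition), fake avatar (a3 = 0).
label, scores = classify([("a1", 1), ("a2", 1), ("a3", 0)])
print(label)  # under these assumed numbers, 0: judged a real account
```

Under these assumed numbers the real-account score (0.89 × 0.5 × 0.7 × 0.2) dominates the fake-account score (0.11 × 0.1 × 0.2 × 0.9), matching the qualitative conclusion in the text.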

1.5. Evaluation of classifiers

Although other classification algorithms will be covered in later articles, I would first like to discuss how to evaluate the quality of a classifier.

First, a definition: the accuracy of a classifier is the proportion of items to be classified that the classifier classifies correctly.

Regression testing is often used to evaluate a classifier's accuracy. The simplest method is to classify the training data with the constructed classifier and give an accuracy assessment based on the results. This is not a good method, however, because using the training data as test data may lead to overfitting and overly optimistic results. A better approach is to split the labeled data at the start of construction, build the classifier with one part, and use the other part to measure its accuracy.
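The split-and-measure approach can be sketched as a generic holdout evaluation; `train` and `classify` here are hypothetical stand-ins for any classifier-construction and prediction routines:

```python
# Holdout evaluation: build the classifier on one part of the labeled data
# and measure accuracy on the held-out part.
import random
from collections import Counter

def holdout_accuracy(samples, train, classify, test_fraction=0.3, seed=0):
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * (1 - test_fraction))
    train_set, test_set = samples[:cut], samples[cut:]
    model = train(train_set)
    correct = sum(classify(model, x) == y for x, y in test_set)
    return correct / len(test_set)

# Toy demonstration with a trivial majority-class "classifier":
data = [((i,), i % 2) for i in range(20)]
majority_train = lambda s: Counter(y for _, y in s).most_common(1)[0][0]
majority_classify = lambda model, x: model
print(holdout_accuracy(data, majority_train, majority_classify))
```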
