Algorithm Grocery Store: Classification Algorithms, Part 1: Naive Bayes Classification

0. Preface

I have always liked algorithms. In my opinion, algorithms embody the essence of human wisdom and contain an unparalleled beauty. The pleasure of applying an algorithm I have learned to practice and solving a real problem is one I cannot find anywhere else.

I have always wanted to write blog posts about algorithms, and I have written two scattered articles before, but perhaps because the topic is niche compared with engineering articles, they did not arouse much interest. I am looking for a job after graduation, and to improve my chances I decided to review my algorithm knowledge, so I am taking this opportunity to write a series of articles on algorithms. The main purpose is to reinforce my own review: I believe that writing the review material into articles of one's own is more effective than simply reading and doing exercises, because it also triggers one's own thinking. It would be even better if it arouses your interest as well.

I named this series the "Algorithm Grocery Store" because these articles are "miscellaneous": I will not discuss the basic content of any data structure textbook, such as stacks, linked lists, binary trees, searching, and sorting. Instead, each article starts from a topic, such as probability algorithms, classification algorithms, NP problems, or genetic algorithms, and may involve algorithms and data structures, discrete mathematics, probability theory, statistics, operations research, data mining, formal languages and automata, and so on. Its content structure is therefore like a grocery store. Of course, I will do my best to write this "miscellaneous" content so that it is not messy at all.

1.1 Summary

Bayesian classification is the general term for a family of classification algorithms, all of which are based on Bayes' theorem, hence the collective name. As the first article on classification algorithms, this one first introduces the classification problem and defines it formally. Next, it introduces Bayes' theorem, the foundation of Bayesian classification algorithms. Finally, it discusses the simplest kind of Bayesian classification, naive Bayes classification, through examples.

1.2 Overview of the Classification Problem

Everyone is familiar with classification. It is no exaggeration to say that each of us performs classification operations every day without being aware of it. For example, when you see a stranger, your brain subconsciously judges whether the person is male or female; walking down the street, you may say to a friend, "That person looks rich at first glance; that one over there looks non-mainstream." These are all classification operations.

From a mathematical perspective, the classification problem can be defined as follows:

Known sets: $C = \{y_1, y_2, \ldots, y_n\}$ and $I = \{x_1, x_2, \ldots, x_m\}$. Determine a mapping rule $y = f(x)$ such that for every $x_i \in I$ there is one and only one $y_j \in C$ with $y_j = f(x_i)$. (Fuzzy sets from fuzzy mathematics are not considered here.)

C is called the category set, where each element is a category; I is called the item set, where each element is an item to be classified; and f is called the classifier. The task of a classification algorithm is to construct the classifier f.
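To make the definition concrete, here is a toy sketch in Python; the thresholding rule and the category names are our own invented example, not the article's:

```python
# A toy classifier f: I -> C that assigns each item exactly one category.
# The thresholding rule below is an invented example.
def f(x):
    return "non-negative" if x >= 0 else "negative"

items = [3, -1, 0]            # the item set I
print([f(x) for x in items])  # ['non-negative', 'negative', 'non-negative']
```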

It should be emphasized that classification problems often use empirical methods to construct the mapping rule. That is, in general, a classification problem lacks sufficient information to construct a 100% correct mapping rule; instead, by learning from empirical data, it achieves correct classification in a probabilistic sense. Therefore, the trained classifier does not necessarily map every item to be classified correctly; its quality depends on the construction method of the classifier, the characteristics of the data to be classified, and the number of training samples.

For example, a doctor diagnosing a patient is a typical classification process. No doctor can see a patient's condition directly; he can only observe the patient's symptoms and test data to infer it. Here the doctor acts like a classifier, and the accuracy of his diagnosis is closely related to how he was educated (the construction method), whether the patient's symptoms are pronounced (the characteristics of the data to be classified), and his experience (the number of training samples).

1.3 The Basis of Bayesian Classification: Bayes' Theorem

Every time I mention Bayes' theorem, a sense of reverence wells up in me, not because the theorem is deep, but because it is particularly useful. It solves a common problem in real life: knowing the probability of one event conditioned on another, how do we obtain the probability with the two events swapped? That is, given P(A|B), how do we obtain P(B|A)? First, let us explain what conditional probability is:

P(A|B) denotes the probability that event A occurs given that event B has occurred, and it is called the conditional probability of A given B. Its basic formula is $P(A|B) = \frac{P(AB)}{P(B)}$.
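As a quick numeric check of this formula (the counts below are made up for illustration):

```python
# Check P(A|B) = P(AB) / P(B) with made-up frequencies:
# out of 100 trials, B occurred 40 times, A and B together 10 times.
p_b = 40 / 100
p_ab = 10 / 100
print(p_ab / p_b)  # P(A|B) = 0.25
```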

Bayes' theorem is useful because we often encounter this situation in life: we can easily obtain P(A|B) directly, while P(B|A) is difficult to obtain directly, yet P(B|A) is what we are more concerned about. Bayes' theorem is the road from P(A|B) to P(B|A).

Bayes' theorem is stated below without proof:

$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$
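A one-line check of the theorem with illustrative values (the probabilities are assumptions, not from the article):

```python
# P(B|A) = P(A|B) * P(B) / P(A), with illustrative values.
p_a_given_b, p_b, p_a = 0.9, 0.1, 0.2
print(p_a_given_b * p_b / p_a)  # P(B|A) = 0.45
```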

1.4 Principles and Procedures of Naive Bayes Classification

1.4.1 The Principle of Naive Bayes Classification

Naive Bayes classification is a very simple classification algorithm; it is called naive Bayes because the idea behind it is truly naive. The basic idea is this: for a given item to be classified, compute the probability of each category appearing under the condition that this item appears, and take the category with the largest such probability as the item's category. In plain terms, it works like this: you see a Black man on the street, and I ask you to guess where he is from; you will most likely guess Africa. Why? Because among Black people, Africans account for the highest proportion. Of course, he may also be American or Asian, but with no other information available, we choose the category with the highest conditional probability. This is the ideological basis of naive Bayes.

The formal definition of Naive Bayes classification is as follows:

1. Let $x = \{a_1, a_2, \ldots, a_m\}$ be an item to be classified, where each $a_i$ is a feature attribute of $x$.

2. Let $C = \{y_1, y_2, \ldots, y_n\}$ be the set of categories.

3. Compute $P(y_1|x), P(y_2|x), \ldots, P(y_n|x)$.

4. If $P(y_k|x) = \max\{P(y_1|x), P(y_2|x), \ldots, P(y_n|x)\}$, then $x \in y_k$.
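Step 4 is simply an argmax over the posterior probabilities. A minimal Python sketch, with hypothetical posterior values:

```python
# Step 4 as code: choose the category with the largest posterior P(y|x).
# The posterior values below are hypothetical placeholders.
posteriors = {"y1": 0.62, "y2": 0.02, "y3": 0.36}

predicted = max(posteriors, key=posteriors.get)
print(predicted)  # 'y1'
```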

Now the key is how to compute the conditional probabilities in step 3. We can proceed as follows:

1. Find a set of items to be classified whose categories are already known. This set is called the training sample set.

2. From the training samples, obtain the conditional probability estimate of each feature attribute under each category, that is: $P(a_1|y_1), P(a_2|y_1), \ldots, P(a_m|y_1)$; $P(a_1|y_2), P(a_2|y_2), \ldots, P(a_m|y_2)$; $\ldots$; $P(a_1|y_n), P(a_2|y_n), \ldots, P(a_m|y_n)$.

3. If the feature attributes are conditionally independent, then by Bayes' theorem:

$$P(y_i|x) = \frac{P(x|y_i)\,P(y_i)}{P(x)}$$

Since the denominator is a constant for all categories, we only need to maximize the numerator. And because the feature attributes are conditionally independent, we have:

$$P(x|y_i)\,P(y_i) = P(a_1|y_i)\,P(a_2|y_i)\cdots P(a_m|y_i)\,P(y_i) = P(y_i)\prod_{j=1}^{m} P(a_j|y_i)$$
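In practice, this product is often evaluated in log space to avoid floating-point underflow when there are many feature attributes. A short sketch with hypothetical numbers (the function name is ours):

```python
import math

def nb_score(prior, likelihoods):
    """Log-space naive Bayes score: log P(y) + sum of log P(a_j|y).

    Working in log space avoids floating-point underflow when the
    product runs over many feature attributes.
    """
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical numbers: two categories, three feature attributes.
score_y1 = nb_score(0.89, [0.5, 0.7, 0.2])
score_y2 = nb_score(0.11, [0.1, 0.2, 0.9])
print(score_y1 > score_y2)  # True: the item is assigned to y1
```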

Based on the above analysis, the process of naive Bayes classification can be summarized as follows (model validation is not considered for now):

Naive Bayes classification can be seen as proceeding in three stages:

The first stage is the preparation stage. Its task is to make the necessary preparations for naive Bayes classification: determine the feature attributes according to the actual situation, divide each feature attribute appropriately, and then manually classify a portion of the items to form the training sample set. The input of this stage is all the data to be classified, and the output is the feature attributes and the training samples. This is the only stage of naive Bayes classification that must be completed manually, and its quality has an important impact on the entire process: the quality of the classifier is largely determined by the feature attributes, the feature attribute divisions, and the quality of the training samples.

The second stage is the classifier training stage. Its task is to generate the classifier: compute the frequency of each category in the training samples and the conditional probability estimate of each feature attribute division under each category, and record the results. The input is the feature attributes and the training samples; the output is the classifier. This stage is mechanical and can be computed automatically by a program according to the formulas discussed above.

The third stage is the application stage. Its task is to classify items using the classifier. The input is the classifier and the items to be classified; the output is the mapping between the items and the categories. This stage is also mechanical and is completed by a program.

1.4.2 Estimating the Conditional Probability of Feature Attribute Divisions Under Each Category, and Laplace Calibration

This section discusses how to estimate P(a|y).

As can be seen above, computing the conditional probability P(a|y) of each division is a key step in naive Bayes classification. When a feature attribute takes discrete values, P(a|y) can be estimated conveniently by counting the frequency of each division within each category in the training samples. The following focuses on the case where a feature attribute takes continuous values.

When a feature attribute takes continuous values, it is usually assumed to follow a Gaussian (normal) distribution, that is:

$$P(a_k|y_i) = g(a_k, \mu_{y_i}, \sigma_{y_i}), \quad \text{where} \quad g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Therefore, we only need to compute the mean and standard deviation of this feature attribute within each category of the training samples, and then substitute them into the formula above to obtain the required estimate. The calculation of the mean and standard deviation is not described here.
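A minimal sketch of this Gaussian estimation in plain Python (the function names and training values are ours, for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def fit_gaussian(values):
    """Mean and standard deviation of one feature within one category."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

# Hypothetical training values of a continuous feature for category y:
mu, sigma = fit_gaussian([0.12, 0.18, 0.09, 0.15])
print(gaussian_pdf(0.10, mu, sigma))  # estimate of P(a = 0.10 | y)
```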

Another question to be discussed is what to do when P(a|y) = 0, that is, when a feature attribute division never appears together with a certain category: a single zero factor wipes out the whole product and greatly degrades classifier quality. To solve this problem, we introduce Laplace calibration. The idea is very simple: add 1 to the count of every division under every category. If the training sample set is large enough, this does not affect the results, and it resolves the awkward situation where a frequency is 0.
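A sketch of this add-one calibration for one discrete feature attribute; note that for the estimates to remain a probability distribution, the denominator grows by the number of divisions (the counts below are hypothetical):

```python
def laplace_smoothed(counts, total):
    """Add-one estimates of P(division | category) for one feature.

    counts -- occurrences of each division within the category
    total  -- number of training samples in the category

    Adding 1 to every count (and the number of divisions to the
    denominator, so the estimates still sum to 1) removes zeros.
    """
    k = len(counts)
    return [(c + 1) / (total + k) for c in counts]

# Hypothetical counts: the middle division never occurs in this category.
print(laplace_smoothed([30, 0, 20], 50))  # no zero probabilities
```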

1.4.3 A Naive Bayes Classification Example: Detecting Non-Real Accounts in an SNS Community

Below, we use an example to show how naive Bayes classification solves a practical problem. For the sake of simplicity, the data in this example has been simplified.

Non-real accounts are a common problem in SNS communities. As operators of an SNS community, we hope to detect these non-real accounts, so as to eliminate their interference with operational analysis reports and to strengthen our understanding and supervision of the community.

Manual detection requires a great deal of manpower and is very inefficient; introducing an automatic detection mechanism would greatly improve work efficiency. The problem amounts to classifying all accounts in the community into real accounts and non-real accounts. Below we implement this process step by step.

Let c = 0 denote a real account and c = 1 denote a non-real account.

1. Determine feature attributes and their divisions

In this step, we need to find feature attributes that help distinguish real accounts from non-real accounts. In real applications there are many feature attributes and the divisions are much finer, but for the sake of simplicity we use a small number of feature attributes with coarse divisions, and the data has been modified.

We select three feature attributes: a1: number of log entries / number of days since registration, a2: number of friends / number of days since registration, and a3: whether a real profile picture is used. In an SNS community, these three values can be obtained or computed directly from the database.

The divisions are as follows: a1: {a <= 0.05, 0.05 < a < 0.2, a >= 0.2}; a2: {a <= 0.1, 0.1 < a < 0.8, a >= 0.8}; a3: {a = 0 (no), a = 1 (yes)}.
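A small helper that maps raw values to these divisions might look as follows (the function name and index encoding are ours, not the article's):

```python
def discretize(a1, a2, a3):
    """Map raw feature values to the division indices above (our helper)."""
    d1 = 0 if a1 <= 0.05 else (1 if a1 < 0.2 else 2)
    d2 = 0 if a2 <= 0.1 else (1 if a2 < 0.8 else 2)
    d3 = int(a3)  # 0 = no real profile picture, 1 = real profile picture
    return d1, d2, d3

print(discretize(0.1, 0.2, 0))  # (1, 1, 0)
```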

2. Obtain training samples

Here we use 10,000 accounts that have been manually verified by operations staff as the training samples.

3. Calculate the frequency of each category in the training sample

Divide the number of real accounts and the number of non-real accounts in the training samples by 10,000 to obtain estimates of the prior probabilities P(c = 0) and P(c = 1).
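For instance, with a hypothetical split of the training set (the counts are placeholders, not the article's figures):

```python
# Hypothetical split of the 10,000 training accounts (placeholder counts).
n_real, n_fake, n_total = 8900, 1100, 10000
p_c0 = n_real / n_total  # P(c = 0)
p_c1 = n_fake / n_total  # P(c = 1)
print(p_c0, p_c1)        # 0.89 0.11
```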

4. Calculate the frequency of each feature attribute division under each category

5. Use the classifier for identification

Next, we use the classifier trained above to identify an account with the following features: it does not use a real profile picture, the ratio of log entries to registration days is 0.1, and the ratio of friends to registration days is 0.2. Substituting the corresponding divisions into $P(c)\prod_j P(a_j|c)$ for c = 0 and c = 1 and comparing the two values gives the classification.
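The original article carries out this computation with the concrete frequency tables from step 4; the sketch below uses hypothetical estimates purely to illustrate the mechanics:

```python
# Hypothetical probability estimates (placeholders, not the article's figures).
priors = {0: 0.89, 1: 0.11}  # c = 0: real account, c = 1: non-real account
likelihoods = {
    0: {("a1", 1): 0.5, ("a2", 1): 0.7, ("a3", 0): 0.2},
    1: {("a1", 1): 0.1, ("a2", 1): 0.2, ("a3", 0): 0.9},
}

# The account to classify: a1 = 0.1 -> division 1, a2 = 0.2 -> division 1,
# a3 = 0 (no real profile picture).
item = [("a1", 1), ("a2", 1), ("a3", 0)]

scores = {}
for c, prior in priors.items():
    score = prior
    for feature in item:
        score *= likelihoods[c][feature]
    scores[c] = score

print(scores)                       # e.g. {0: 0.0623, 1: 0.00198}
print(max(scores, key=scores.get))  # 0: classified as a real account
```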

As you can see, although this user does not use a real profile picture, the classifier is more inclined to classify the account as a real account. This example also shows that when there are many feature attributes, naive Bayes classification is resistant to interference from individual attributes.

1.5 Classifier Evaluation

Although other classification algorithms will be covered later in this series, I would like to first discuss how to evaluate the quality of a classifier.

First, a definition: the accuracy of a classifier is the proportion of items that the classifier classifies correctly among all items it classifies.

Regression testing is usually used to evaluate classifier accuracy. The simplest method is to classify the training data with the constructed classifier and compute the accuracy from the results. However, this is not a good method, because using the training data as the test data may lead to overly optimistic results due to overfitting. A better method is to split the data into two parts at the beginning: construct the classifier on one part, then use the other part to test the classifier's accuracy.
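A sketch of that holdout evaluation (the helper, its signature, and the toy majority-class baseline are ours, assuming a generic training function):

```python
import random
from collections import Counter

def holdout_accuracy(data, train_fn, ratio=0.7, seed=0):
    """Split data into train/test parts, fit a classifier, measure accuracy.

    data     -- list of (item, label) pairs
    train_fn -- maps a training list to a predict(item) function
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    train, test = shuffled[:cut], shuffled[cut:]
    predict = train_fn(train)
    correct = sum(1 for item, label in test if predict(item) == label)
    return correct / len(test)

# Usage with a trivial majority-class "classifier" as the training function:
data = [(0.1, "real"), (0.3, "real"), (0.05, "fake"), (0.2, "real")]
majority = lambda train: (
    lambda item: Counter(label for _, label in train).most_common(1)[0][0]
)
print(holdout_accuracy(data, majority))
```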

This article is published under the Attribution-NonCommercial 3.0 license. You are welcome to repost and quote it, but you must retain the attribution to the author, Zhang Yang (including the link), and it must not be used for commercial purposes. If you have any questions, please contact me.

From: http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html
