Machine Learning (IV): Naive Bayes, a Simple Classification Algorithm

Source: Internet
Author: User

This article is compiled from "Machine Learning in Action" and Http://write.blog.csdn.net/postedit
Basic Principles of Mathematics:

At its core is the very simple Bayes formula:

P(B|A) = P(A|B) P(B) / P(A)

Basic idea:

For an object x to be classified, compute the probability that x belongs to each category y1, y2, ...; x is assigned to the category with the highest probability.
Algorithm process:

1. Suppose the object to be classified, x, has m feature attributes a1, a2, ..., am.

2. There are n categories y1, y2, ..., yn.

3. Calculate P(y1|x), P(y2|x), ..., P(yn|x).

4. If P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, then x belongs to category yk.

The key to the whole algorithm is the third step: computing the conditional probability of each category given x.


Find the probability of each category:

These probabilities can be obtained with Bayes' formula:

1. From the training sample set (objects whose categories are already known), estimate the conditional probability of each attribute value under each category, i.e. P(aj|yi).

2. Bayes' theorem then gives the following derivation (naive Bayes ignores any dependence between the attributes and treats each one as conditionally independent):

P(yi|x) = P(x|yi) P(yi) / P(x) = P(yi) ∏j P(aj|yi) / P(x)

The denominator P(x) is the same for every category, so only the numerator needs to be maximized.

P(aj|yi) is estimated from the training samples, and P(yi) follows directly from the category counts in the training set.
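The decision rule above can be sketched as follows. This is a minimal illustration, assuming the class priors and conditional probability tables have already been estimated from training data; the names `prior` and `cond_prob` are hypothetical, not from the article.

```python
from math import prod

def classify(x, prior, cond_prob):
    """Return the class yi maximizing P(yi) * prod_j P(aj | yi).

    x         -- list of attribute values [a1, ..., am]
    prior     -- dict: class -> P(yi)
    cond_prob -- dict: class -> list of dicts, cond_prob[yi][j][aj] = P(aj | yi)
    """
    best_class, best_score = None, -1.0
    for yi, p_yi in prior.items():
        # P(x) is the same for every class, so comparing numerators suffices.
        score = p_yi * prod(cond_prob[yi][j][aj] for j, aj in enumerate(x))
        if score > best_score:
            best_class, best_score = yi, score
    return best_class
```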
How to estimate P(aj|yi):

1. When the feature attribute is a discrete value:

Simply count the fraction of training samples in category yi whose attribute value falls in the division aj.
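A sketch of this frequency estimate, assuming the training data is given as a list of attribute vectors and a parallel list of labels (both names are illustrative):

```python
def estimate_discrete(samples, labels, j, aj, yi):
    """P(aj | yi) ~= (# class-yi samples with attribute j equal to aj) / (# class-yi samples)."""
    in_class = [x for x, y in zip(samples, labels) if y == yi]
    matching = sum(1 for x in in_class if x[j] == aj)
    return matching / len(in_class)
```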

2. When the feature attribute is a continuous value:

Assume the attribute's values obey a normal distribution within each category, i.e.

P(aj|yi) = g(aj, μ_yi, σ_yi), where g(a, μ, σ) = (1 / (√(2π) σ)) · exp(−(a − μ)² / (2σ²))

So it suffices to compute, for each category in the training samples, the mean and standard deviation of the attribute's values, and substitute them into the formula above to obtain the estimate.
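A minimal sketch of the Gaussian estimate; `values` stands for the attribute values observed among the training samples of one category (an assumed input shape, not from the article):

```python
import math

def gaussian_density(a, mu, sigma):
    """Normal density g(a; mu, sigma), used as the estimate of P(aj | yi)."""
    return math.exp(-((a - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def estimate_continuous(values, a):
    """Fit mu and sigma to one category's attribute values, then evaluate the density at a."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return gaussian_density(a, mu, sigma)
```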

How to deal with P(aj|yi) = 0:

Introduce Laplace calibration:

Add 1 to the count of every division under each category. As long as the number of training samples is large enough, this does not noticeably affect the result, and it resolves the awkward situation of a zero probability.
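Laplace calibration in one line; `k` is the number of possible divisions of the attribute, so the probabilities still sum to 1 after smoothing:

```python
def laplace_estimate(count_aj_yi, count_yi, k):
    """Smoothed P(aj | yi) = (division count + 1) / (class count + k)."""
    return (count_aj_yi + 1) / (count_yi + k)
```

An unseen division (count 0) now yields a small positive probability instead of zeroing out the whole product.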
Part II: Application

Detecting fake accounts in an SNS community

The following uses naive Bayes classification to solve a practical problem; for simplicity, the data in the example has been simplified appropriately.

The problem is this: in an SNS community, fake accounts (accounts using a false identity, or a user's secondary accounts) are a widespread problem. As operators of the community, we want to detect these fake accounts, both to keep them from distorting operational analysis reports and to strengthen our understanding and supervision of the community.

Purely manual detection would cost a great deal of manpower and be very inefficient; introducing an automatic detection mechanism would greatly improve work efficiency. Plainly put, the task is to classify every account in the community into one of two categories, real or fake. Below we work through this process step by step.

First let c = 0 denote a real account and c = 1 a fake account.

1. Determine the feature attributes and their divisions

This step identifies features that can help us distinguish real accounts from fake ones. In practical applications there are many feature attributes and the divisions are finer, but for the sake of simplicity we use a small number of attributes with coarse divisions, and the data has been modified.

We select three feature attributes: a1, the ratio of log entries to registration days; a2, the ratio of friends to registration days; and a3, whether the account uses a real avatar. In an SNS community, all three can be obtained or computed directly from the database.

The divisions are as follows: a1: {a <= 0.05, 0.05 < a < 0.2, a >= 0.2}; a2: {a <= 0.1, 0.1 < a < 0.8, a >= 0.8}; a3: {a = 0 (no), a = 1 (yes)}.
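Mapping a raw attribute value to its division index can be sketched with a small helper (the function name and return convention are illustrative assumptions):

```python
def divide(a, low, high):
    """Return division index 0, 1, or 2 for a <= low, low < a < high, a >= high."""
    if a <= low:
        return 0
    if a >= high:
        return 2
    return 1

# a1 uses thresholds (0.05, 0.2); a2 uses (0.1, 0.8); a3 is already 0/1.
```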

2. Obtain training samples

We use 10,000 accounts that have been manually checked by operations staff as the training samples.

3. Calculate the frequency of each category in the training samples

Divide the numbers of real and fake accounts in the training samples by 10,000 to obtain P(c = 0) and P(c = 1).

4. Calculate the frequency of each feature-attribute division under each category

5. Use the classifier to identify an account

Now we use the classifier trained above to identify an account: it does not use a real avatar, its ratio of log entries to registration days is 0.1, and its ratio of friends to registration days is 0.2.

Comparing P(c) ∏j P(aj|c) for the two categories, we see that although the user does not use a real avatar, the classifier still leans toward placing this account in the real-account category. The example also demonstrates naive Bayes classification's robustness against any individual attribute when there are enough feature attributes.
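The final comparison can be sketched end to end. The article's original probability tables did not survive extraction, so every number below is an illustrative placeholder, not the original data; only the structure of the computation follows the text.

```python
# Hypothetical class priors P(c = 0), P(c = 1) -- placeholders, not the article's figures.
prior = {0: 0.89, 1: 0.11}

# cond[c] = [P(0.05 < a1 < 0.2 | c), P(0.1 < a2 < 0.8 | c), P(a3 = 0 | c)],
# i.e. the divisions the account under test falls into -- all values hypothetical.
cond = {0: [0.5, 0.7, 0.2], 1: [0.1, 0.2, 0.9]}

# Compare P(c) * prod_j P(aj | c); P(x) cancels, so the numerator decides.
scores = {c: prior[c] * cond[c][0] * cond[c][1] * cond[c][2] for c in prior}
prediction = max(scores, key=scores.get)
print(prediction)  # with these placeholder numbers, the classifier leans toward c = 0 (real)
```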
