Re-understanding the Decision Tree Series and Logistic Regression (I)

Source: Internet
Author: User
Tags: id3

First, understanding decision trees, from the intuitive to the in-depth

We know that decision trees can be used for classification as well as for regression. Here we mainly discuss the classification case; regression is handled similarly.

For example, suppose a bank wants to decide whether to issue a credit card to a user. It will make the decision based on the user's basic information. Suppose the information we know about the user is as follows:

Age: youth, middle-aged, elderly

Has a job: yes, no

Owns a house: yes, no

Credit situation: poor, very poor, average, good, very good

Gender: male, female

The classification result is of course one of two: issue the credit card, or do not issue it.

If we have learned the logistic regression algorithm, we know that it takes these features, feeds them into the sigmoid function to obtain a probability, and then assigns the class according to the size of that probability.
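
For reference, a minimal sketch of that logistic regression step (the weights and encoded feature values below are made up for illustration; only the sigmoid itself is standard):

from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# hypothetical weighted sum of the user's encoded feature values
z = 0.8 * 1 + (-0.5) * 0 + 1.2 * 1           # made-up weights and features
probability = sigmoid(z)                      # probability of "issue the card"
prediction = "issue card" if probability >= 0.5 else "no card"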

So how does a decision tree classify?

A decision tree works very much like human decision-making: we look at the value of each feature of the current sample, filter step by step, and finally reach a decision. The process goes like this:

For example, the bank first checks whether the person owns a house. If there is no house, it then looks at which stage the person's age falls into: if the person is young, he or she has potential, so it may issue the credit card; if middle-aged, the person may not be able to repay the credit card debt, so it does not issue one.

The decision-making process of a decision tree is in fact a collection of if-then rules: each path from the root node to a leaf node corresponds to one rule.
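
As an illustration only (the article does not spell out the full rule set, so the branch for house owners is an assumption), the path just described can be written as nested if-then statements:

def issue_credit_card(owns_house, age_group):
    # one root-to-leaf path corresponds to one if-then rule
    if owns_house == "yes":
        return "issue card"            # assumed branch: house owners get the card
    # no house: check the age group next, as in the example above
    if age_group == "youth":
        return "issue card"            # young applicants have potential
    return "no card"                   # middle-aged (and other) applicants: repayment risk

print(issue_credit_card("no", "youth"))        # issue card
print(issue_credit_card("no", "middle-aged"))  # no card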

We can also understand its theoretical meaning from another angle: a decision tree also represents the conditional probability distribution of the classes given the features. Here is why. Any machine learning model has to learn its parameters from the samples, and the decision tree is no exception: during training, every sample ends up in some leaf node. As said above, the path from the root node to a leaf node is a combination of rules, and rules are really conditions; in other words, the samples that fall into the same leaf node form the subset of the original samples that satisfies the same set of conditions (the same rule path).

The samples falling into the same leaf node do not necessarily all belong to one class, but most of them certainly do, with only a few belonging to other classes. Suppose we have two classes and 10 samples fall into a leaf node, 8 of which are "issue the card" and only 2 are "do not issue": then the class of this leaf node is "issue the card". In other words, for the samples in this leaf, that is, under the same conditions, the probability of issuing the card is 80% and the probability of not issuing it is 20%; we choose the class with the highest conditional probability as the class of the leaf node. This is the relationship between the decision tree and conditional probability.
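
The 10-sample leaf above can be checked with a quick majority count (a sketch with made-up labels):

from collections import Counter

leaf_labels = ["issue"] * 8 + ["no issue"] * 2   # the 10 samples that fell into one leaf
counts = Counter(leaf_labels)

leaf_class, majority = counts.most_common(1)[0]
print(leaf_class)                                # issue  -> the leaf's class
print(majority / len(leaf_labels))               # 0.8    -> P(issue | leaf conditions)
print(counts["no issue"] / len(leaf_labels))     # 0.2    -> P(no issue | leaf conditions)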

Second, feature selection

Hangyuan Li's "Statistical Learning Methods" says that method = model + strategy + algorithm. The model is what we usually call the hypothesis function, for example the linear regression model, the logistic regression model, or the decision tree model; the strategy is the loss function, for example mean squared error or maximum entropy; and the algorithm is the method for optimizing the loss function, for example gradient descent or quasi-Newton methods.

From the discussion above, we know that a decision tree branches at many internal nodes and finally reaches a leaf node, which outputs the result. So the question is: how do we build a decision tree, and what is its loss function? To understand the loss function of a decision tree, let's first look at how one is built.

The algorithm for building a decision tree usually selects the best feature recursively and splits the training data according to that feature, so that each sub-dataset gets the best possible classification. If a sub-dataset can already be classified correctly, we return a leaf node directly; if it is not separated well, we continue selecting the best feature to split it, until all samples are basically classified correctly.

The description above is rather formal. In plain words: we first pick a good feature and split the samples on it, trying as much as possible to put samples of the same class into the same block. If the samples in a block basically belong to one class after the split, that block becomes a leaf node directly; if not, we keep splitting it in the same way until every sample falls into a leaf node. A minimal sketch of this recursion is shown below.
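
A minimal sketch of this recursion, assuming each sample is a dict of feature -> value and the selection criterion is passed in as a function (the criterion itself, information gain, is the topic of the next sections):

from collections import Counter

def build_tree(rows, labels, features, choose_feature):
    """Recursively split `rows` until each block is (almost) pure."""
    if len(set(labels)) == 1 or not features:         # block is pure, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]   # leaf node = majority class of the block
    best = choose_feature(rows, labels, features)     # pick the "best" feature by some criterion
    node = {"feature": best, "children": {}}
    for value in set(r[best] for r in rows):          # one child per value of the chosen feature
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [f for f in features if f != best], choose_feature)
    return node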

Two questions remain: what counts as the best feature, and how do we measure it?

That is the topic of the next section: information gain. (Last week I took part in Baidu's campus recruiting and was asked in the interview what this concept is called; helplessly, I only knew the principle and had forgotten that the technical term is "information gain". Sigh.)

Third, information gain: the theory behind feature selection

We first explain entropy, then introduce conditional entropy, and finally arrive at information gain and its improved version, the information gain ratio.

a) Entropy

The concept: entropy measures the uncertainty of a random variable. How do we measure how random a variable is, intuitively? Take tossing a coin. We know the probabilities of heads and tails are both 0.5; if you are asked to guess the outcome, you will say you cannot, the randomness is too large and it is pure luck. But if I tell you the probability of heads is 0.8 and the probability of tails is 0.2, what do you guess? You will certainly guess heads: there is still some randomness, but not as much as in the 0.5 case, so guessing heads gives you the better odds. And if I tell you the probability of heads is 100%, do you still think the outcome is random? Of course not, it is completely certain. What about 0%? Think about that one yourself.

In layman's terms, the more uniform the probability distribution of a random variable, the greater its randomness.

Entropy is defined from exactly this angle. Let's look at its mathematical definition.

Suppose X is a discrete random variable taking finitely many values, with probability distribution

P(X = x_i) = p_i, \quad i = 1, 2, \dots, n

Then the entropy of the random variable X is

H(X) = -\sum_{i=1}^{n} p_i \log p_i

This formula fits the intuitive notion of entropy we just discussed. Suppose the log in the formula is base 2: when all the p_i are equal, the entropy reaches its maximum; when the distribution is deterministic (for a binary variable, when p equals 0 or 1), the entropy is 0.
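
A quick sanity check of the coin example (a sketch; entropy here is computed in bits, i.e. log base 2):

from math import log2

def entropy(probs):
    """H = -sum(p * log2 p); terms with p == 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   -> maximal uncertainty for two outcomes
print(entropy([0.8, 0.2]))   # ~0.72 -> noticeably less uncertain
print(entropy([1.0, 0.0]))   # 0.0   -> no uncertainty at all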

b) Conditional entropy

Having understood the definition of entropy, conditional entropy is much simpler. Suppose there are two random variables (X, Y) with joint probability distribution

P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, \dots, n; \; j = 1, 2, \dots, m

Conditional entropy H(Y|X) measures the uncertainty of Y once X is known, which is easy to grasp. What needs attention is the exact definition: it is the mathematical expectation, taken over X, of the entropy of the conditional distribution of Y given X. That is, for each different value x_i of X we take the entropy of the distribution of Y given X = x_i, and the expectation is simply their weighted sum:

H(Y|X) = \sum_{i=1}^{n} p_i H(Y \mid X = x_i), \quad \text{where } p_i = P(X = x_i)

Note: when the probabilities in the entropy and the conditional entropy are estimated from data (in particular by maximum likelihood estimation, that is, computed from the sample), the corresponding quantities are called the empirical entropy and the empirical conditional entropy, respectively.
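
A sketch of the weighted-sum definition, estimated from paired samples and reusing the entropy helper above (the toy data is made up):

from collections import Counter, defaultdict

def conditional_entropy(x_values, y_values):
    """H(Y|X) = sum_i P(X = x_i) * H(Y | X = x_i), estimated from the sample."""
    groups = defaultdict(list)
    for x, y in zip(x_values, y_values):
        groups[x].append(y)                            # split Y by the value of X
    n = len(x_values)
    h = 0.0
    for ys in groups.values():
        weight = len(ys) / n                           # empirical P(X = x_i)
        probs = [c / len(ys) for c in Counter(ys).values()]
        h += weight * entropy(probs)                   # entropy of Y inside this group
    return h

# toy data: X = owns a house, Y = issue the card
X = ["yes", "yes", "no", "no", "no", "no"]
Y = ["issue", "issue", "issue", "no", "no", "no"]
print(conditional_entropy(X, Y))   # ~0.54: the uncertainty about Y left after seeing X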

c) Information gain

Here it comes: what is information gain? Information gain measures how much the uncertainty about the class Y is reduced once the information of feature X is known. To understand this sentence, combine entropy and conditional entropy: before we know X, the uncertainty of Y is the entropy; after we know X, the uncertainty of Y is the conditional entropy; the difference between the two is the information gain of X with respect to Y, that is, how much knowing X reduces the uncertainty of Y.

It is just like judging whether a person is good or bad. When you know nothing about the person, you cannot judge and can only guess at random. Once you learn something about the person from others, your guess is no longer completely random: the information from others is the condition, you judge based on these known conditions, and the amount by which the randomness decreases is the information gain.

In mathematical form:

g(D, A) = H(D) - H(D|A)

Note that D here refers to the training set, and H(D) and H(D|A) refer to the empirical entropy and the empirical conditional entropy, respectively.
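
In code, the information gain is just the difference between the two quantities sketched above (reusing the entropy and conditional_entropy helpers from the previous subsections):

from collections import Counter

def label_entropy(labels):
    """Empirical entropy H(D) of the class labels."""
    n = len(labels)
    return entropy([c / n for c in Counter(labels).values()])

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A)."""
    return label_entropy(labels) - conditional_entropy(feature_values, labels)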

Returning to the question in the title: how do we choose features while constructing the decision tree, and what makes a feature good? Intuitively, a good feature should split the samples into blocks that are very pure, that is, each block basically contains a single class. So what does this have to do with information gain?

Imagine that we choose a feature that is unrelated to how the classes are distributed: after splitting, every block is still a jumble containing all kinds of classes. If instead we choose a good feature, the blocks after the split are no longer jumbled: in each block certain classes become much more numerous and the other classes become rarer. Selecting a feature and splitting on it is exactly the process of fixing a condition, so after a good split the classes within each block are no longer jumbled, and the corresponding entropy becomes smaller. Note that the entropy here is the conditional entropy.

By now it should be clear: we split the originally mixed-up samples into relatively pure blocks, and the better the split, the purer the blocks become. Quantifying this improvement gives exactly the information gain.

Next, the mathematical description, which is not difficult.

Let D be the training dataset and |D| its sample size, i.e. the number of samples. Suppose there are K classes C_k, k = 1, 2, ..., K, and |C_k| is the number of samples belonging to class C_k, so that \sum_{k=1}^{K} |C_k| = |D|. Suppose feature A has n distinct values {a_1, a_2, ..., a_n}; according to the value of A, D is partitioned into n subsets D_1, D_2, ..., D_n, where |D_i| is the number of samples in D_i and \sum_{i=1}^{n} |D_i| = |D|. Let D_{ik} be the set of samples in D_i that belong to class C_k, i.e. D_{ik} = D_i \cap C_k, with |D_{ik}| its size.

Then the information gain of feature A with respect to the training dataset D is

g(D, A) = H(D) - H(D|A)

where

H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}

H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}

The formulas look complicated, but they are just probability calculations and are not hard to understand.
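
As a usage example, here is the same calculation on a tiny made-up version of the credit-card data, reusing the helpers just defined (the numbers are invented for illustration):

# six made-up applicants: feature A = owns a house, class = issue / no
owns_house = ["yes", "yes", "no", "no", "no", "no"]
label      = ["issue", "issue", "issue", "no", "no", "no"]

print(label_entropy(label))                    # H(D)   = 1.0  (3 "issue" vs 3 "no")
print(conditional_entropy(owns_house, label))  # H(D|A) ~ 0.54
print(information_gain(owns_house, label))     # g(D,A) ~ 0.46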

d) Information gain ratio

We said above that features are selected according to information gain, but this criterion has a problem: it is biased toward features with many possible values. How to understand this? The more values a feature has, the more blocks the samples are split into, and the more likely the different classes are to be separated. Conversely, if a feature has only two values, it is hard to split the original samples into pure blocks by that feature alone.

To solve this problem, the concept of the information gain ratio was introduced, and we select features according to the size of the information gain ratio.

Information gain ratio:

g_R(D, A) = \frac{g(D, A)}{H_A(D)}

Note that H_A(D) here is not the empirical entropy of the class distribution of the original samples, but the entropy of the original samples with respect to feature A: you treat the values of feature A as if they were the classes and compute the empirical entropy of the sample set. It is also different from the empirical conditional entropy; be careful to distinguish the three.

H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}, where n is the number of values of feature A.
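
A sketch of the ratio, continuing the helpers and toy data above; H_A(D) is simply the entropy of feature A's own value distribution:

def feature_entropy(feature_values):
    """H_A(D): entropy of the distribution of feature A's values."""
    n = len(feature_values)
    return entropy([c / n for c in Counter(feature_values).values()])

def information_gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    return information_gain(feature_values, labels) / feature_entropy(feature_values)

print(information_gain_ratio(owns_house, label))   # the gain, penalized for many-valued features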

Fourth, ID3

ID3's detailed algorithm flow will not be repeated here; just note a few points:

1. The same feature does not appear more than once on any path from the root to a leaf: once a feature has been used for a split, it is removed from the candidate set.

2. At each node, the algorithm selects the feature with the largest information gain as the splitting feature.

3. The algorithm creates a leaf node when all instances at the node belong to the same class, or basically belong to one class (the information gain is below a threshold), or there are no candidate features left.

The only difference between the corresponding C4.5 and ID3 is that C4.5 selects features by the information gain ratio. A minimal sketch combining these points follows.
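
Below is a minimal ID3-style sketch, not the book's exact algorithm, built on the information_gain helper above; it implements the leaf conditions just listed, and the threshold value is arbitrary. Swapping information_gain for information_gain_ratio in the line that scores features gives the C4.5-style selection instead.

from collections import Counter

def id3(rows, labels, features, threshold=1e-3):
    """rows: list of dicts {feature name: value}; labels: the class of each row."""
    if len(set(labels)) == 1:                        # all instances share one class
        return labels[0]
    if not features:                                 # no candidate features left
        return Counter(labels).most_common(1)[0][0]  # leaf = majority class
    gains = {f: information_gain([r[f] for r in rows], labels) for f in features}
    best = max(gains, key=gains.get)                 # feature with the largest information gain
    if gains[best] < threshold:                      # gain below the threshold -> leaf
        return Counter(labels).most_common(1)[0][0]
    node = {"feature": best, "children": {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = id3(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],      # a used feature never reappears on a path
            threshold)
    return node

# usage with the toy data above
rows = [{"owns_house": h} for h in owns_house]
print(id3(rows, list(label), ["owns_house"]))
# {'feature': 'owns_house', 'children': {'yes': 'issue', 'no': 'no'}}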
