[Machine learning & Algorithm] Decision Trees and the Iterative Decision Tree (GBDT)

Source: Internet
Author: User
Tags: ID3

Having covered trees as a data structure (see the previous post on the various trees in data structures), let's now look at the tree-based algorithms in machine learning, including ID3, C4.5, CART, and the ensemble-based tree models Random Forest and GBDT. This post gives a brief introduction to the basic ideas behind each tree algorithm and focuses on GBDT, the algorithm often called the "Dragon-Slaying Sword" of machine learning.

1. The Decision Tree Model

A decision tree is a basic classification and regression method that can be viewed as a set of if-then rules. A decision tree consists of nodes and directed edges: internal nodes represent feature attributes, and leaf nodes represent categories.

An illustration of a decision tree:

The decision tree partitions the entire feature space step by step according to attribute values, thereby separating samples of different classes, as shown below:

It is not hard to see that there are countless decision trees that can partition the samples correctly, so what makes a decision tree a good one?

The criterion for a good decision tree is one that contradicts the training data as little as possible while generalizing well. In other words, a good decision tree not only classifies the training samples well but also has a low error rate on the test set.

2. Basic knowledge of decision trees

A complete decision tree learning algorithm consists of three main steps, namely:

1) Feature selection;

2) Decision tree generation;

3) Decision tree pruning.

Before introducing the decision tree learning algorithms, let us briefly review some basic concepts:

1) Entropy

In information theory and statistics, entropy measures the uncertainty of a random variable. Let X be a discrete random variable taking finitely many values, with probability distribution:

P(X = x_i) = p_i,  i = 1, 2, ..., n

The entropy of the random variable X is defined as:

H(X) = -∑ p_i · log p_i,  i = 1, 2, ..., n

Entropy depends only on the distribution of X, not on the actual values X takes. It measures uncertainty: the larger the entropy, the more uncertain the random variable is. In classification terms, larger entropy means greater uncertainty about the category, and smaller entropy means less. When the random variable takes only two values, the entropy H(p) varies with the probability p as follows:


When p = 0 or p = 1, H(p) = 0 and the random variable has no uncertainty at all; when p = 0.5, H(p) = 1 (using log base 2) and the uncertainty of the random variable is greatest.
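To make the definition concrete, here is a minimal sketch (not from the original post) that computes the entropy of a discrete distribution; the helper name `entropy` is illustrative only.

```python
import math

def entropy(probs):
    """H(X) = -sum(p_i * log2(p_i)); terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0   -> maximum uncertainty for a binary variable
print(entropy([0.9, 0.1]))  # ~0.469 -> much less uncertainty
```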

Conditional entropy: measures the uncertainty of the random variable Y given the random variable X.

Let (X, Y) be a pair of random variables with joint probability distribution P(X = x_i, Y = y_j) = p_ij (i = 1, 2, ..., n; j = 1, 2, ..., m). The conditional entropy H(Y|X) of Y given X is defined as the expectation over X of the entropy of the conditional distribution of Y:

H(Y|X) = ∑ p_i · H(Y | X = x_i)

Here, p_i = P(X = x_i), i = 1, 2, ..., n.

2) Information gain

Information gain measures how much the uncertainty about the class Y is reduced once the feature X is known.

The information gain g(D, A) of feature A on training data set D is defined as the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given feature A:

g(D, A) = H(D) - H(D|A)

A feature with larger information gain has stronger classification ability, as the sketch below illustrates.
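The snippet below is a hedged sketch (not from the original post) that computes the empirical entropy, conditional entropy, and information gain on a tiny made-up data set; the data and helper names are illustrative only.

```python
from collections import Counter
import math

def entropy(labels):
    """Empirical entropy H(D) from a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A), where H(D|A) is the weighted entropy of each subset."""
    n = len(labels)
    h_d_given_a = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        h_d_given_a += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_d_given_a

# Toy data: does a person play outside? Feature = weather.
weather = ["sunny", "sunny", "rain", "rain", "rain", "sunny"]
play    = ["yes",   "yes",   "no",   "no",   "yes",  "yes"]
print(information_gain(weather, play))  # ~0.459
```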

3) Information gain ratio

The information gain ratio g_R(D, A) is defined as the ratio of the information gain g(D, A) to the entropy H_A(D) of the training data set D with respect to the values of feature A, i.e.

g_R(D, A) = g(D, A) / H_A(D)

where H_A(D) = -∑ (|D_i| / |D|) · log2(|D_i| / |D|), summed over i = 1, ..., n, and n is the number of distinct values of feature A.
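Continuing the illustrative sketch above (it reuses the `entropy`, `information_gain`, `weather`, and `play` names defined there), the gain ratio simply divides the gain by the entropy of the feature's own value distribution:

```python
def gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D); H_A(D) is the entropy of A's value distribution."""
    h_a = entropy(feature_values)   # reuse entropy() on the feature values themselves
    if h_a == 0:                    # constant feature -> ratio undefined, treat as 0
        return 0.0
    return information_gain(feature_values, labels) / h_a

print(gain_ratio(weather, play))    # ~0.459 / 1.0 = 0.459 (two equally common values)
```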

4) Gini index

In a classification problem with K classes, where p_k is the probability of a sample belonging to class k, the Gini index of the probability distribution is defined as:

Gini(p) = ∑ p_k (1 - p_k) = 1 - ∑ p_k²

For a binary classification problem, if the probability of a sample belonging to the first class is p, the Gini index of the distribution is:

Gini(p) = 2p(1 - p)

For a given sample set D, its Gini index is:

Gini(D) = 1 - ∑ (|C_k| / |D|)²

Here, C_k is the subset of samples in D belonging to class k, and K is the number of classes.

If sample set D is split into D1 and D2 according to whether feature A takes a particular possible value a, then under the condition of feature A the Gini index of set D is defined as:

Gini(D, A) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)

The Gini index Gini(D) reflects the uncertainty of set D: the larger the Gini index, the greater the uncertainty of the sample set, similar to entropy.
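As a small self-contained sketch (not from the original post; the toy data is made up), the two Gini formulas above can be computed like this:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum((|C_k| / |D|)^2) over classes k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(feature_values, labels, value):
    """Gini(D, A): weighted Gini of D1 (A == value) and D2 (A != value)."""
    d1 = [y for x, y in zip(feature_values, labels) if x == value]
    d2 = [y for x, y in zip(feature_values, labels) if x != value]
    n = len(labels)
    return (len(d1) / n) * gini(d1) + (len(d2) / n) * gini(d2)

labels  = ["yes", "yes", "no", "no", "yes", "yes"]
feature = ["sunny", "sunny", "rain", "rain", "rain", "sunny"]
print(gini(labels))                          # 1 - (4/6)^2 - (2/6)^2 ≈ 0.444
print(gini_split(feature, labels, "sunny"))  # (3/6)*0 + (3/6)*0.444 ≈ 0.222
```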

3. ID3, C4.5 & CART

In fact, different decision tree learning algorithms differ only in their feature-selection criterion; the tree-generation process is the same (greedily choosing the best feature at each step).

The core of the ID3 algorithm is to apply the information-gain criterion at each node of the decision tree: each time it splits on the feature with the largest information gain and constructs the tree recursively.

Using information gain to split the training set has a serious drawback: features with many distinct values tend to have larger information gain, so ID3 is biased toward features with more values.

To address this shortcoming of ID3, the C4.5 algorithm selects features according to the information gain ratio, which corrects the bias.

CART stands for Classification and Regression Tree, which can be used for both classification and regression. As a regression tree, CART selects splits by minimizing the squared error; as a classification tree, it selects features by minimizing the Gini index; in both cases it recursively generates a binary tree.
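As a hedged illustration (assuming scikit-learn is available; this is not code from the original post), CART-style trees can be fit as follows: the classifier uses the Gini criterion, and the regressor's default criterion minimizes squared error.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0, 0], [0, 1], [1, 0], [1, 1]]

# Classification tree: splits chosen by Gini impurity ("gini" is the default criterion).
clf = DecisionTreeClassifier(criterion="gini", max_depth=3)
clf.fit(X, [0, 0, 1, 1])
print(clf.predict([[1, 0]]))   # -> [1]

# Regression tree: the default criterion minimizes squared error within each leaf.
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, [14.0, 16.0, 24.0, 26.0])
print(reg.predict([[1, 1]]))   # -> [26.]
```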

Pruning of decision trees: during generation, a decision tree greedily selects features so as to fit the training data as well as possible (in the extreme case, it can fit the training set with zero error). Pruning simplifies the model and prevents the decision tree from overfitting. For specific pruning strategies, see Hangyuan Li's Statistical Learning Methods.

4. Random Forest

Random Forest is a classification model that combines ensemble learning with decision trees; it uses the ensemble idea (a voting strategy) to improve on the classification performance of a single decision tree (colloquially, "three cobblers with their wits combined equal one Zhuge Liang").

By combining ensemble learning with decision trees, the Random Forest algorithm gains many advantages; most notably, each tree in a random forest is grown to its fullest extent and there is no pruning process.

Random Forest introduces two sources of randomness for training: randomly selecting samples (bootstrap sampling) and randomly selecting features. These two forms of randomness are critical to the classification performance of random forests; because of them, random forests are not prone to overfitting and are robust to noise (e.g., insensitive to missing values).

A detailed description of random forests can be found in the previous post on Random Forest.
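As a hedged sketch (assuming scikit-learn; not code from the original post), the two kinds of randomness correspond to `bootstrap=True` (random samples) and `max_features` (random feature subsets) in `RandomForestClassifier`:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# bootstrap=True  -> each tree sees a random bootstrap sample of the rows
# max_features    -> each split considers only a random subset of the features
# Trees are grown fully (no pruning), and the forest votes for the final class.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # test accuracy of the voted ensemble
```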

5. GBDT

The iterative decision tree GBDT (Gradient Boosting Decision Tree), also known as MART (Multiple Additive Regression Trees) or GBRT (Gradient Boosting Regression Tree), is another ensemble-based tree model, but it is fundamentally different from Random Forest. It is worth mentioning that GBDT is one of the most commonly used algorithms in machine learning competitions, because it applies to a wide variety of scenarios and, more commendably, achieves excellent accuracy. This is why many people call GBDT the "Dragon-Slaying Sword" of machine learning.

So how does this formidable algorithm actually work? To answer that, we have to talk about the "GB" in GBDT: gradient boosting. The principle of gradient boosting is fairly involved, but not fully grasping it does not prevent us from understanding GBDT; for a detailed explanation of gradient boosting, see Wikipedia.

Here I quote another blogger's explanation to illustrate how gradient boosting works in GBDT:

The following passage is from "GBDT (MART) Iterative Decision Tree Primer | Brief Introduction".

"Boosting, iterative, that is, by iterating over trees to make decisions together. How does this come true? Is it that each tree is trained independently, for example a this person, the first tree thought is 10 years old, the second tree thinks is 0 years old, the third tree thinks is 20 years old, we take the average 10 year old to make the final conclusion? Of course not! And not to say that this is the voting method is not GBDT, as long as the training set is not changed, independent training three times the three trees must be identical, it is completely meaningless. As I said before, GBDT is a summation of all the tree's conclusions, so it can be thought that the conclusion of each tree is not the age itself, but the cumulative amount of the age. The core of the GBDT is that each tree learns the residuals of all previous tree conclusions and that this residual is a cumulative amount that can get a real value after a predicted value. For example, A's true age is 18 years old, but the first tree predicts the age is 12 years old, the difference is 6 years old, namely the residual difference is 6 years old. So in the second tree we set the age of a to 6 years old to study, if the second tree can really divide a into a 6-year-old leaf node, the sum of the two trees is the true age of A; if the second tree concludes that it is 5 years old, a still has a 1-year-old residual, and a in the third tree becomes 1 years old and continues to learn. This is the meaning of gradient boosting in GBDT. ”

From this we can see the essential difference between GBDT and Random Forest: GBDT does not simply combine trees; it is built on learning from residuals. The sketch below illustrates this residual-learning loop, and after it we walk through a classic GBDT example.
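This is a minimal sketch (not the original author's code, assuming scikit-learn), using depth-1 regression trees ("stumps") and squared-error residuals; the toy feature encoding (roughly [shopping level, answers-questions flag]) is made up to mirror the example that follows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=3, learning_rate=1.0, max_depth=1):
    """Each new tree fits the residuals left by the sum of all previous trees."""
    y = np.asarray(y, dtype=float)
    pred, trees = np.zeros_like(y), []
    for _ in range(n_trees):
        residuals = y - pred                      # what the previous trees got wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # conclusions are summed, not averaged
        trees.append(tree)
    return trees

def predict_gbdt(trees, X, learning_rate=1.0):
    """The final prediction is the sum of every tree's output."""
    return sum(learning_rate * tree.predict(X) for tree in trees)

# Toy usage: ages 14, 16, 24, 26 with two made-up features per person.
X_toy = [[1, 0], [1, 1], [2, 0], [2, 1]]
trees = fit_gbdt(X_toy, [14, 16, 24, 26], n_trees=2)
print(predict_gbdt(trees, X_toy))                 # -> [14. 16. 24. 26.]
```

Note the key design choice this sketch highlights: unlike a random forest, which averages or votes over independently trained trees, the boosted trees here are trained sequentially and their outputs are added together.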

Suppose we have a training set of only four people, A, B, C, and D, whose ages are 14, 16, 24, and 26. A and B are high-school students (a freshman and a senior, respectively); C is a fresh graduate and D is an employee who has worked for two years. If we train a traditional regression decision tree, we get the result shown in Figure 1:

Figure 1

Now let us do the same thing with GBDT. Because the data set is so small, we limit each tree to at most two leaf nodes, i.e., each tree has only one split, and we limit learning to only two trees. We get the result shown in Figure 2:

Figure 2

The first tree's split is the same as in Figure 1: because A and B are close in age and C and D are close in age, they are split into two groups, and each group uses its average age as the prediction. Then the residuals are computed (a residual means: A's predicted value + A's residual = A's actual value), so, for example, B's residual is 16 - 15 = 1 and A's residual is 14 - 15 = -1. (Note that a sample's predicted value is the sum of all preceding trees; here there is only one tree in front, so it is simply 15. If there were more trees, their outputs would all be summed to form the prediction.) The residuals of A, B, C, D are -1, 1, -1, 1 respectively. We then replace the original values of A, B, C, D with these residuals and let the second tree learn from them; if its predictions match the residuals, simply adding the second tree's conclusion to the first tree's gives the true ages. The data here makes this easy: the second tree has only two values, 1 and -1, and splits directly into two nodes. At this point everyone's residual is 0, i.e., everyone gets their true predicted value.

The results of the final GBDT are:

A: 14-year-old high-school freshman, shops little, often asks seniors questions; predicted age A = 15 - 1 = 14;

B: 16-year-old high-school senior, shops little, often gets asked questions by juniors; predicted age B = 15 + 1 = 16;

C: 24-year-old fresh graduate, shops a lot, often asks seniors questions; predicted age C = 25 - 1 = 24;

D: 26-year-old employee with two years of work experience, shops a lot, often gets asked questions by juniors; predicted age D = 25 + 1 = 26.

So where does the "gradient" come in? Think back to the end of the first tree: no matter what the cost function is (squared error, absolute error, or anything else that measures error), the residual vector (-1, 1, -1, 1) is the globally optimal direction for reducing it; that is the gradient.
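To connect this to the gradient explicitly, here is a small sketch (illustrative only) using the four ages from the example: for the squared-error loss L = ½(y - F)², the negative gradient with respect to the current prediction F is exactly the residual vector (-1, 1, -1, 1) that the second tree learns.

```python
import numpy as np

ages       = np.array([14.0, 16.0, 24.0, 26.0])   # true ages of A, B, C, D
tree1_pred = np.array([15.0, 15.0, 25.0, 25.0])   # first tree: group averages

# For squared loss L = 0.5 * (y - F)^2, dL/dF = -(y - F),
# so the negative gradient equals the residual y - F.
residuals = ages - tree1_pred
print(residuals)                  # [-1.  1. -1.  1.]  -> what the second tree fits

tree2_pred = np.array([-1.0, 1.0, -1.0, 1.0])     # second tree fits the residuals exactly
print(tree1_pred + tree2_pred)    # [14. 16. 24. 26.] -> true ages recovered
```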

Note: Figure 1 and Figure 2 give the same final result, so why do we need GBDT? The answer is to prevent overfitting. Overfitting means that, in order to make the training set more accurate, the model learns many "rules that hold only on the training set", so those rules no longer apply once the data set changes. As long as a tree is allowed enough leaf nodes, it can always reach 100% accuracy on the training set; but between training accuracy and actual (test) accuracy, the latter is what we really want. We can see that Figure 1 uses three features (internet time, time period, online purchase amount) to reach 100% accuracy, and the branch "internet time > 1.1h" is clearly overfitted: in this data set, A may happen to be online 1.09 hours a day and B 1.05 hours, but using whether internet time exceeds 1.1 hours to judge everyone's age is obviously contrary to common sense. In contrast, although the boosting approach of Figure 2 uses two trees, it actually gets the job done with only two features, the second being the ratio of asking to answering questions, so the splitting basis in Figure 2 is clearly more plausible.

It can be seen that GBDT, like Random Forest, is resistant to overfitting and highly accurate.

6. Reference Content

[1] Hangyuan Li, Statistical Learning Methods

[2] GBDT (MART) Iterative Decision Tree Primer | Brief Introduction
