Algorithms in Machine Learning (1) - Random Forest and GBDT, Model Combinations Based on Decision Trees



The decision tree algorithm has many good properties: training time complexity is low, prediction is fast, and the model is easy to display (it is easy to render the learned tree as a picture). At the same time, a single decision tree has weaknesses, most notably a tendency to overfit; techniques such as pruning can reduce this, but they are not enough.

Many algorithms build on decision trees, such as Boosting and Bagging. These algorithms ultimately produce N trees (N may be several hundred or more), which greatly reduces the weaknesses of a single decision tree, in the spirit of many weak learners combining into one strong one. Although each of the hundreds of decision trees is simple (compared with a single C4.5 tree), their combination is very powerful.

In recent years, many papers at heavyweight conferences such as ICCV (ICCV 2009 alone had quite a few) have been related to Boosting and random forests. Model combination + decision tree algorithms come in two basic forms: random forest and GBDT (Gradient Boost Decision Tree). Newer model combination + decision tree algorithms are extensions of these two. This article focuses primarily on GBDT and only roughly outlines random forest, since it is relatively simple.

Before reading this article, it is recommended to look at "Machine Learning and Mathematics (3)" and the papers cited there; the GBDT part of this article is mainly based on it, while the random forest part is relatively independent.

Basic content:

This section only briefly covers the basics and mainly points to other people's articles. For random forest and GBDT, two concepts matter most: information gain and the decision tree itself. Andrew Moore's Decision Trees Tutorial and Information Gain tutorial are especially recommended. Moore's Data Mining Tutorial series is very good; after reading those two articles you can continue with the rest of the series.

A decision tree is essentially a method of partitioning the space with hyperplanes: each split divides the current space into two parts. Consider, for example, the following decision tree (the figures are not reproduced in this copy):

It divides the space like this:

In this way, each leaf node corresponds to a disjoint region of the space. When making a decision, we step through the tree according to the values of the input sample's features along each dimension, and the sample finally falls into exactly one of the N regions (assuming the tree has N leaf nodes).
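As a concrete illustration (a minimal sketch assuming scikit-learn is available, not code from the original article), a fitted tree maps every sample to exactly one leaf region:

```python
# Minimal sketch: a fitted decision tree assigns each sample to exactly one leaf,
# and the leaves are the disjoint regions of the input space.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 4.0], [4.0, 3.5]])
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

print(tree.apply(X))    # leaf index for every sample (its region)
print(tree.predict(X))  # class associated with that region
```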

Random Forest:

Random forest has become quite a popular algorithm recently, and it has many advantages:

• It performs well on many datasets and often has a clear advantage over other algorithms.
• It can handle very high-dimensional data and does not require feature selection.
• After training, it can report which features are important.
• While the forest is being built, it produces an unbiased estimate of the generalization error.
• Training is fast.
• During training it can detect interactions between features.
• It is easy to parallelize.
• The method is relatively simple to implement.

As the name implies, a random forest builds a forest in a random way. The forest consists of many decision trees, and there is no correlation between the individual trees. Once the forest is built, when a new input sample arrives, each decision tree in the forest makes its own judgment about which class the sample belongs to (for classification); the class chosen most often becomes the prediction for that sample.
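The voting step can be sketched in a few lines (a hypothetical helper of my own; it only assumes each tree exposes a scikit-learn style predict method):

```python
# Majority voting across the trees of a forest: every tree votes for a class,
# and the class with the most votes is the prediction.
from collections import Counter

def forest_predict(trees, x):
    """trees: list of fitted classifiers; x: a single feature vector."""
    votes = [t.predict([x])[0] for t in trees]
    return Counter(votes).most_common(1)[0][0]
```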

When building each decision tree, two things matter: sampling and complete splitting. First come two random sampling steps: the random forest samples the input data both by row and by column. Row sampling is done with replacement, so the sampled set may contain duplicate samples; if there are N input samples, N samples are drawn. Because of this, each tree is trained on something less than the full sample set, which makes overfitting relatively less likely. Column sampling then selects m of the M features (m << M). After that, a decision tree is built on the sampled data by splitting completely, so that every leaf node either cannot be split further or contains samples that all belong to the same class. Most decision tree algorithms include an important pruning step, but not here: the two random sampling steps already provide enough randomness that overfitting does not appear even without pruning.
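The two random sampling steps can be sketched like this (the helper name and NumPy implementation are my own illustration, not the article's code):

```python
# Row sampling with replacement (bootstrap) plus column sampling of m features.
import numpy as np

def sample_rows_and_columns(X, y, m, rng=None):
    rng = rng or np.random.default_rng()
    n_samples, n_features = X.shape
    rows = rng.integers(0, n_samples, size=n_samples)      # N draws, with replacement
    cols = rng.choice(n_features, size=m, replace=False)   # m of the M features
    return X[np.ix_(rows, cols)], y[rows], cols
```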

Each individual tree obtained this way is very weak, but together they are strong. I like to describe the random forest algorithm with a metaphor: each decision tree is an expert who is good at one narrow field (since we learn each tree from only m of the M features). The random forest then contains many experts proficient in different fields, so a new problem (new input data) can be viewed from many different angles, and the experts vote on the final answer.

For the random forest procedure, refer to Mahout's random forest page, which explains it quite clearly. The one part that may be unclear there is information gain, for which you can look at Moore's page recommended earlier.

Gradient Boost Decision Tree:

GBDT is a widely used algorithm that can be applied to both classification and regression, and it performs well on a lot of data. The algorithm goes by several other names, such as MART (Multiple Additive Regression Trees), GBRT (Gradient Boost Regression Tree), and TreeNet; they are all the same thing (see Wikipedia on gradient boosting), originally due to Friedman.

Gradient Boost is really a framework into which many different algorithms can be plugged; see the explanation in "Machine Learning and Mathematics (3)". "Boost" means "to improve": a Boosting algorithm is generally an iterative process in which each new round of training tries to improve on the previous result.

The original Boost algorithm assigns every sample a weight at the start, with all samples equally important initially. The model obtained at each training step will get some data points right and some wrong. After each step, we increase the weights of the misclassified points and decrease the weights of the correctly classified points. Points that keep being misclassified will thus receive "serious attention", i.e. ever higher weights. After N iterations (N is specified by the user), we have N simple learners, which are then combined (by weighting them, having them vote, etc.) into a final model.
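The re-weighting idea can be sketched as follows (an illustrative toy update of my own, not the exact AdaBoost rule; the function name and the factor are made up for the example):

```python
# Toy re-weighting step in the spirit of classic Boost: misclassified samples
# gain weight, correctly classified samples lose weight.
import numpy as np

def reweight(weights, y_true, y_pred, factor=2.0):
    wrong = (y_true != y_pred)
    new_w = np.where(wrong, weights * factor, weights / factor)
    return new_w / new_w.sum()   # keep the weights normalized
```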

The difference between Gradient Boost and traditional Boost is that each round of computation aims to reduce the residual left by the previous round, and to eliminate that residual it builds a new model in the gradient direction of residual reduction. So in Gradient Boost, each new model is built to shrink the residual of the previous models along the gradient direction, which is quite different from traditional Boost's re-weighting of right and wrong samples.
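For the simplest case, squared-error regression, the negative gradient is exactly the residual y - F(x), and the loop can be sketched like this (a minimal illustration assuming scikit-learn, not the article's code):

```python
# Gradient boosting for squared loss: each round fits a small tree to the
# current residuals and adds a damped version of it to the running estimate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, lr=0.1):
    F = np.zeros_like(y, dtype=float)       # initial estimate F(x) = 0
    trees = []
    for _ in range(n_rounds):
        residual = y - F                    # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        F += lr * tree.predict(X)           # move F so the residual shrinks
        trees.append(tree)
    return trees
```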

An important special case for classification is Multi-Class Logistic, i.e. the multi-class logistic problem, which applies when the number of classes is greater than 2. In the classification result, sample x need not belong to only one class: we can obtain the probability that x belongs to each of several classes (one can also say that the estimate y of sample x follows a certain geometric distribution). This really belongs to the discussion of Generalized Linear Models and will not be covered here; perhaps a dedicated article later. The conclusion we need is: if a classification problem fits such a distribution, the Logistic transformation can be used for the subsequent steps.

Assume that a sample x may belong to K categories, with estimates F1(x), ..., FK(x) respectively. The Logistic transformation below is a smooth way to normalize these values (so that the resulting probabilities sum to 1), and the result is the probability pk(x) that x belongs to category k:
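The formula image itself is not reproduced in this copy; the transformation being described is presumably the standard softmax from Friedman's multi-class formulation:

$$p_k(x) = \frac{\exp\big(F_k(x)\big)}{\sum_{l=1}^{K} \exp\big(F_l(x)\big)}$$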

For the result of the Logistic transformation, the loss function is:
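The loss formula image is likewise missing; the multi-class loss (multinomial deviance) it refers to is presumably:

$$L\big(\{y_k, F_k(x)\}_{k=1}^{K}\big) = -\sum_{k=1}^{K} y_k \log p_k(x)$$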

where yk encodes the true label of the input sample: yk = 1 when sample x belongs to category k, and yk = 0 otherwise.

Substituting the Logistic transformation into the loss function and differentiating, we obtain the gradient of the loss:
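The gradient formula image is also missing; differentiating the loss above with respect to Fk(x) gives the (negative) gradient that is used as the residual below:

$$\tilde{y}_k = -\frac{\partial L}{\partial F_k(x)} = y_k - p_k(x)$$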

The above is rather abstract, so here is an example:

Suppose the input x may belong to 5 categories (1, 2, 3, 4, 5). In the training data x belongs to category 3, so y = (0, 0, 1, 0, 0). Suppose the current estimate is F(x) = (0, 0.3, 0.6, 0, 0); the Logistic transformation gives p(x) = (0.16, 0.21, 0.29, 0.16, 0.16), and the gradient is g = y - p = (-0.16, -0.21, 0.71, -0.16, -0.16).
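These numbers are easy to verify (a quick check of my own, not from the article):

```python
# Recompute the example: softmax/Logistic transform of F(x), then gradient y - p.
import numpy as np

F = np.array([0.0, 0.3, 0.6, 0.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

p = np.exp(F) / np.exp(F).sum()
g = y - p

print(np.round(p, 2))  # ~[0.16 0.22 0.30 0.16 0.16], close to the truncated figures above
print(np.round(g, 2))  # ~[-0.16 -0.22 0.70 -0.16 -0.16]
```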

Observing these numbers, we can draw some rather interesting conclusions. Suppose gk is the gradient of this sample in a certain dimension (a certain class):

If gk > 0, the larger gk is, the more the probability in this dimension should be raised. For example, the probability 0.29 in the third dimension above should be increased; this dimension is moving in the "correct direction".

The closer gk is to 0, the more accurate the estimate already is.

If gk < 0, the more negative it is, the more the probability in this dimension should be reduced. For example, the 0.21 in the second dimension should be reduced; this dimension is moving in the "wrong direction".

The closer a negative gk is to 0, the better, since it means this estimate is "not bad".

In general, the ideal gradient for a sample is as close to 0 as possible. So we want the function estimate to push the gradient toward the opposite of its current sign (in dimensions where it is > 0, push it in the negative direction; where it is < 0, push it in the positive direction), so that the gradient ends up as close to 0 as possible. The algorithm will then pay serious attention to the samples with relatively large gradients, similar in spirit to Boost.

Once we have the gradient, the question is how to reduce it. The approach used here is iteration plus decision trees. At initialization, an estimate function F(x) is chosen arbitrarily (it can be a random value, or simply F(x) = 0). Then, at every iteration, a decision tree is built from the current gradient of each sample and the function takes a step against the gradient, so that after N iterations the gradient becomes smaller and smaller.

The decision trees built here are not quite ordinary decision trees. First, the number of leaf nodes is fixed at J: once J leaf nodes have been produced, no new nodes are generated.

The algorithm flow is as follows (taken from the TreeBoost paper; the formula lines themselves are images and are not reproduced here):

0. Give an initial value for the estimate F(x).

1. Build M decision trees (iterate M times).

2. Apply the Logistic transformation to the current function estimate F(x).

3. Perform the following operations for each of the K categories (this for loop can also be viewed as a vector operation: each sample point xi corresponds to K possible categories yi, so yi, F(xi) and p(xi) are all K-dimensional vectors, which may make it easier to understand).

4. Compute, for each sample, the gradient direction along which the residual decreases.

5. Using each sample point x and its residual-reducing gradient direction, fit a decision tree consisting of J leaf nodes.

6. Once the decision tree is built, the gain of each leaf node can be obtained from this formula (the gain is used during prediction).

In fact, each gain is itself a K-dimensional vector, indicating, for a sample that falls into that leaf during prediction, the value contributed to each of the K classes. For example, suppose a GBDT model has three decision trees; when a sample point is predicted, it falls into three leaf nodes whose gains are (assuming a 3-class problem):

(0.5, 0.8, 0.1), (0.2, 0.6, 0.3), (0.4, 0.3, 0.3). The resulting classification is the second, since category 2 accumulates the largest value across the trees (summing per class gives (1.1, 1.7, 0.7)) and is also the category most trees favor.
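That prediction step is easy to check numerically (illustrative snippet of my own, not the article's code):

```python
# Sum each tree's leaf gains per class and pick the class with the largest total.
import numpy as np

gains = np.array([[0.5, 0.8, 0.1],
                  [0.2, 0.6, 0.3],
                  [0.4, 0.3, 0.3]])

totals = gains.sum(axis=0)   # [1.1, 1.7, 0.7]
print(totals.argmax() + 1)   # -> 2, i.e. category 2
```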

7. Combine the newly built decision tree with the previous model to obtain a new model, so the contributions accumulate across trees (as illustrated in step 6).
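Putting the numbered steps together, here is a simplified sketch of the K-class loop (my own illustration assuming scikit-learn; the leaf values are simply learning-rate scaled tree outputs rather than the exact gain formula from the TreeBoost paper):

```python
# Simplified multi-class GBDT: one regression tree per class per round,
# each fitted to the residual y_k - p_k(x) from the Logistic transform.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_multiclass_fit(X, y, K, M=100, J=8, lr=0.1):
    n = X.shape[0]
    Y = np.eye(K)[y]                       # one-hot labels, shape (n, K)
    F = np.zeros((n, K))                   # step 0: initial estimate F(x) = 0
    trees = []
    for _ in range(M):                     # step 1: M rounds
        P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)   # step 2: Logistic transform
        round_trees = []
        for k in range(K):                 # step 3: loop over the K classes
            residual = Y[:, k] - P[:, k]   # step 4: residual / gradient direction
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, residual)  # step 5: J-leaf tree
            F[:, k] += lr * tree.predict(X)    # steps 6-7: add this tree's contribution
            round_trees.append(tree)
        trees.append(round_trees)
    return trees
```

Prediction would sum each class's tree outputs over all rounds, as in the three-tree example above, and take the class with the largest total (or apply the Logistic transformation to get probabilities).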

That is roughly the story of the GBDT algorithm; hopefully it fills in the parts the previous article did not make clear :)

Implementation:

Once you understand the algorithm, you still need to implement it, or at least read other people's code. The gradient boosting page on Wikipedia is recommended; it lists several open-source implementations, for example: http://elf-project.sourceforge.net/
