Algorithms in Machine Learning (1) - Decision Tree Model Combinations: Random Forest and GBDT


Copyright:

This article was written by leftnoteasy and published at http://leftnoteasy.cnblogs.com. It may be reproduced in whole or in part, but please credit the source. If there is any problem, please contact wheeleast@gmail.com.

Preface:

The decision tree algorithm has many good properties: low training time complexity, a fast prediction process, and a model that is easy to display (a decision tree can easily be drawn as an image). At the same time, a single decision tree has some drawbacks, such as over-fitting; techniques such as pruning can reduce this, but they are not enough.

Many model-combination algorithms (such as boosting and bagging) are related to decision trees. The final result of these algorithms is a set of N trees (there may be hundreds of trees or more), which greatly reduces the problems caused by a single decision tree. It is a bit like the saying that three cobblers put together match a Zhuge Liang: although each of these hundreds of decision trees is simple (compared with, say, C4.5), they are very powerful in combination.

In recent years, many papers at important conferences such as ICCV have been related to boosting and random forests. Model combination + decision trees has two basic forms: random forest and GBDT (gradient boosted decision tree); other, newer combinations of models and decision trees are extensions of these two. This article focuses mainly on GBDT; random forest is only sketched roughly because it is relatively simple.

Before reading this article, you are advised to first read Machine Learning and Mathematics (3) and the papers referenced there. The GBDT part of this article is mainly based on that material, while the random forest part is relatively independent.

Basic Content:

This is only a brief introduction to the basic content, mostly by pointing to other people's articles. Two things matter for both random forest and GBDT: information gain and decision trees. Andrew Moore's decision trees tutorial and information gain tutorial are recommended here; Moore's data mining tutorial series is very good. After reading those two, you can continue with this article.

A decision tree is actually a way of partitioning the space with hyperplanes: each split cuts the current space into two parts. For example, the following decision tree:

divides the space as follows:

In this way, each leaf node corresponds to a non-intersecting region of the space. When making a decision, an input sample falls into exactly one of the N regions (assuming there are N leaf nodes) according to the value of each of its feature dimensions.
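To make this concrete, here is a minimal sketch (scikit-learn assumed; the toy data is made up for illustration) that trains a small tree and shows which leaf region each sample falls into:

```python
# Minimal sketch: a decision tree partitions the feature space into
# as many disjoint regions as it has leaf nodes.
# Assumes scikit-learn is installed; the toy data below is made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 1.0], [3.5, 3.0], [0.5, 3.0]])
y = np.array([0, 0, 1, 1, 0])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each sample falls into exactly one leaf (one region of the space).
print("number of leaves:", tree.get_n_leaves())
print("leaf id per sample:", tree.apply(X))        # region index for each sample
print("prediction per sample:", tree.predict(X))   # class attached to that region
```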

Random Forest:

Random forest has become a popular algorithm in recent years. It has many advantages:

    • Excellent performance on many data sets
    • On many current data sets, it has a clear advantage over other algorithms.
    • It can handle high-dimensional data (data with many features) without feature selection.
    • After training, it can report which features are important (see the sketch after this list).
    • An unbiased estimate of the generalization error is obtained while the forest is being built.
    • Training is fast.
    • Interactions between features can be detected during training.
    • It is easy to parallelize.
    • It is easy to implement.
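As a rough illustration of several of these points (the out-of-bag estimate of the generalization error, feature importance, and easy parallelization), a sketch using scikit-learn's RandomForestClassifier on a synthetic data set might look like this:

```python
# Sketch of the advantages listed above using scikit-learn's RandomForestClassifier
# (assumed available); the synthetic data set is only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,      # out-of-bag estimate of the generalization error
    n_jobs=-1,           # trees are independent, so training parallelizes easily
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
# Which features turn out to be important after training:
print("top features:", sorted(range(20), key=lambda i: -rf.feature_importances_[i])[:5])
```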

As the name suggests, a random forest builds a forest in a random way. The forest consists of many decision trees, and the trees in a random forest are not correlated with one another. After the forest is built, when a new input sample arrives, every decision tree in the forest judges it separately and predicts a class (for classification); the class that is chosen most often becomes the prediction for that sample.

When building each decision tree, two points need attention: sampling and full splitting. First, there are two random sampling processes: random forest samples the input data by rows and by columns. Row sampling is done with replacement, so the sampled set may contain duplicate samples; if there are N input samples, N samples are drawn. This means that at training time the input of each tree is not all of the data, which makes over-fitting relatively unlikely. Then column sampling is performed: from the M features, m are selected (m < M). After that, a decision tree is grown on the sampled data by splitting fully, until each leaf node either cannot be split any further or contains samples that all belong to the same category. Many decision tree algorithms include an important step, pruning, but that is not done here: the two random sampling steps already guarantee enough randomness, so even without pruning, over-fitting does not occur.
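A hand-rolled sketch of the two sampling steps just described (bootstrap the rows, pick m of the M columns, then grow an unpruned tree) could look like the following; the helper names build_random_tree and forest_predict are made up for this illustration:

```python
# Illustrative sketch of the two random sampling steps described above.
# Assumes scikit-learn and NumPy; the function names here are made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_random_tree(X, y, m, rng):
    N, M = X.shape
    rows = rng.choice(N, size=N, replace=True)    # row sampling: N of N, with replacement
    cols = rng.choice(M, size=m, replace=False)   # column sampling: m of the M features (m < M)
    tree = DecisionTreeClassifier(
        max_depth=None, min_samples_leaf=1,       # "full split": no pruning at all
    ).fit(X[np.ix_(rows, cols)], y[rows])
    return tree, cols

def forest_predict(forest, X):
    # Majority vote over all trees in the forest.
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
forest = [build_random_tree(X, y, m=3, rng=rng) for _ in range(25)]
print("training accuracy:", (forest_predict(forest, X) == y).mean())
```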

Every individual tree in a random forest obtained this way is very weak, but the combination is very powerful. A metaphor for the random forest algorithm: each decision tree is an expert proficient in one narrow field (since we choose m out of the M features for each tree to learn from), so the random forest contains many experts proficient in different fields. A new problem (new input data) can be viewed from these different perspectives, and in the end the experts vote on the result.

For the random forest training process, see the random forest page of Mahout; information gain is explained clearly there. You can also look at the Moore pages recommended earlier.

Gradient Boosted Decision Tree (GBDT):

GBDT is a widely used algorithm that can be applied to both classification and regression, and it gives good results on a lot of data. The algorithm also goes by other names, such as MART (multiple additive regression trees), GBRT (gradient boosted regression trees), and TreeNet; they are actually all the same thing (see Wikipedia - gradient boosting). The inventor is Friedman.

Gradient boosting is actually a framework into which many different algorithms can be fit; see Machine Learning and Mathematics (3) for details. Boost means "lifting": a boosting algorithm is generally an iterative process in which each new round of training tries to improve on the previous result.

The original boosting algorithm assigns a weight to each sample at the start of the algorithm; initially, every sample is equally important. The model obtained at each training step will estimate some data points correctly and others incorrectly, and after each step we increase the weights of the misclassified points and decrease the weights of the correctly classified points. If some points are always misclassified, they get "serious attention", i.e. a high weight. After N iterations (specified by the user), N simple classifiers (base learners) are obtained, and they are then combined (for example, weighted, or allowed to vote) to produce a final model.
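As an illustration of this reweighting idea, here is a small AdaBoost-style sketch (one common concrete form of the scheme described above, not necessarily the exact variant the author has in mind), with decision stumps as the simple base learners and labels assumed to be in {-1, +1}:

```python
# AdaBoost-style sketch of the reweighting idea described above.
# Labels are assumed to be in {-1, +1}; stumps are the "simple classifiers".
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                 # at the start, every sample is equally important
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Misclassified points get larger weights ("serious attention"),
        # correctly classified points get smaller weights.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def boost_predict(learners, alphas, X):
    score = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(score)                   # weighted vote of the simple classifiers
```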

The difference between gradient boosting and traditional boosting is that each round of computation aims to reduce the residual, and to reduce this residual the new model is built in the direction of the gradient. So in gradient boosting, each new model is built to shrink the residual of the previous model along the gradient direction; this is quite different from traditional boosting, which reweights correct and incorrect samples.
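For the simplest case, regression with squared loss, the negative gradient is exactly the residual y - F(x), and a minimal sketch of "each new model reduces the residual of the previous one" might look like this (a learning rate is added as common practice, although the text above does not mention one):

```python
# Minimal gradient boosting sketch for regression with squared loss,
# where the negative gradient is simply the residual y - F(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1):
    F = np.full(len(y), y.mean())             # initial model: a constant
    trees = []
    for _ in range(n_rounds):
        residual = y - F                       # negative gradient of 1/2 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        F += learning_rate * tree.predict(X)   # move the model so the residual shrinks
        trees.append(tree)
    return y.mean(), trees

def gb_predict(init, trees, X, learning_rate=0.1):
    return init + learning_rate * sum(t.predict(X) for t in trees)
```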

In classification there is a very important problem called multi-class logistic, that is, multi-class logistic regression. It applies when the number of classes is greater than 2 and, in the classification result, a sample X does not necessarily belong to only one class: we can obtain the probability that X belongs to each of the classes (we can also say that the estimate y of sample X follows a certain geometric distribution). This is actually what is discussed under generalized linear models; it will not be covered here and may get its own chapter later. For now, only the conclusion is needed: if a classification problem conforms to such a distribution, the logistic transformation can be used for the subsequent computation.

Assume that a sample X may belong to K classes, with estimated values $F_1(x), \ldots, F_K(x)$. The logistic transformation is $p_k(x) = \exp(F_k(x)) / \sum_{l=1}^{K} \exp(F_l(x))$. The logistic transformation is a smooth process that normalizes the data (so that the probabilities sum to 1), and its result is the probability $p_k(x)$ of class k.

For the result of the logistic transformation, the loss function is $L(\{y_k, F_k(x)\}) = -\sum_{k=1}^{K} y_k \log p_k(x)$.

Here, $y_k$ encodes the true class of the input sample: when sample X belongs to class k, $y_k = 1$; otherwise, $y_k = 0$.

Substituting the logistic transformation into the loss function and differentiating gives the gradient of the loss: $\tilde{y}_k = y_k - p_k(x)$.

The above is rather abstract; here is an example:

Assume that input sample X may belong to 5 classes (1, 2, 3, 4, 5), and in the training data X belongs to class 3, so y = (0, 0, 1, 0, 0). Suppose the model's estimate is F(x) = (0, 0.3, 0.6, 0, 0); after the logistic transformation, p(x) = (0.16, 0.21, 0.29, 0.16, 0.16), and y - p gives the gradient g = (-0.16, -0.21, 0.71, -0.16, -0.16). Here is an interesting conclusion:

Let g_k be the gradient of a sample in one dimension (one class):

When g_k is greater than 0, the larger it is, the more the probability p_k(x) of this dimension should be increased. For example, the probability 0.29 of the third dimension should be increased; the estimate should "move in the correct direction".

The smaller it is, the more "accurate" the estimate already is.

When g_k is less than 0, the more negative it is, the more the probability of this dimension should be reduced. For example, the 0.21 of the second dimension should be reduced; the estimate should "move away from the wrong direction".

The closer it is to zero, the more the estimate is already "not wrong".

In general, for a sample, the ideal gradient is a gradient as close to 0 as possible. So we should make the estimated values of the function drive the gradient toward the opposite direction (in dimensions where the gradient is > 0, move it toward the negative direction; in dimensions where it is < 0, move it toward the positive direction), eventually making the gradient as close to 0 as possible. Moreover, the algorithm pays close attention to samples whose gradients are relatively large, which is similar to boosting.
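The numbers in the example above can be checked directly with a tiny NumPy calculation (the values match the text up to rounding):

```python
# Reproducing the 5-class example above: softmax of F(x), then gradient y - p.
import numpy as np

F = np.array([0.0, 0.3, 0.6, 0.0, 0.0])    # model estimates F_k(x)
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # x belongs to class 3

p = np.exp(F) / np.exp(F).sum()            # logistic (softmax) transformation
g = y - p                                  # gradient per class

print(np.round(p, 2))   # [0.16 0.22 0.3  0.16 0.16], approximately as in the text
print(np.round(g, 2))   # [-0.16 -0.22  0.7  -0.16 -0.16]
```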

Once the gradient is obtained, the next task is to make it smaller. The approach here is iteration + decision trees: at initialization an estimation function F(x) is given (F(x) may be a random value, or F(x) = 0); then, at each iteration step, a decision tree is built from the gradient of each sample, and the function is updated in the direction that drives the gradient toward zero, so that after N iterations the gradient becomes smaller and smaller.

The decision trees built here are not quite the same as ordinary decision trees. First, the number of leaf nodes J is fixed: once J nodes have been generated, no new nodes are created.
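As a side note, this kind of fixed-leaf-count tree can be requested directly in scikit-learn (a tiny illustration, not something the original text shows):

```python
# A regression tree that stops growing once it reaches J leaf nodes,
# as used for the trees inside GBDT (scikit-learn assumed).
from sklearn.tree import DecisionTreeRegressor

J = 8
tree = DecisionTreeRegressor(max_leaf_nodes=J)  # grows best-first until J leaves exist
```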

The algorithm procedure is as follows (adapted from the treeboost paper):

0. An initial value is given for the estimation function (for example, F_k0(x) = 0 for every class k).

1. Build M rounds of decision trees (iterate m = 1, ..., M).

2. Apply the logistic transformation to the current estimates F_k(x) to obtain the probabilities p_k(x).

3. For each of the K classes, perform the following operations (this for loop can also be understood as a vector operation: each sample point x_i corresponds to K possible classes, so y_i, F(x_i) and p(x_i) are all K-dimensional vectors, which may make it easier to understand).

4. Compute the gradient direction along which the residual decreases, i.e. y_k - p_k(x) for each sample.

5. Based on the gradient direction of each sample point x, i.e. its residual reduction, fit a decision tree with J leaf nodes.

6. Once the decision tree is built, the gain of each leaf node is obtained from the corresponding formula (this gain is used at prediction time).

Each gain is actually a K-dimensional vector, giving, for a sample point that falls into this leaf node at prediction time, the value contributed to each of the K classes. For example, suppose GBDT produced three decision trees; when a sample point is predicted, it falls into one leaf node of each of the three trees, and their gains are (for a 3-class problem):

(0.5, 0.8, 0.1), (0.2, 0.6, 0.3), (0.4, 0.3, 0.3). The final classification is then class 2, because class 2 is favored by the most trees (and its summed value, 0.8 + 0.6 + 0.3 = 1.7, is the largest).

7. Combine the current decision tree with all the previous ones to form the new model (so prediction works as in the example in step 6).
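Putting steps 0 through 7 together, a rough Python sketch of the K-class procedure could look like the following; it loosely follows Friedman's treeboost paper, and the leaf-gain formula used here should be checked against that paper:

```python
# Rough sketch of the K-class procedure in steps 0-7 above, loosely following
# Friedman's treeboost paper; check the paper for the exact leaf-gain formula.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_multiclass(X, Y, K, M=50, J=8):
    """X: (N, d) inputs; Y: (N, K) one-hot labels. Returns one tree (plus leaf gains) per class per round."""
    N = len(X)
    F = np.zeros((N, K))                                      # step 0: initial estimates F_k0(x) = 0
    models = []
    for m in range(M):                                        # step 1: M rounds of trees
        P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)  # step 2: logistic transformation
        round_trees = []
        for k in range(K):                                    # step 3: loop over the K classes
            g = Y[:, k] - P[:, k]                             # step 4: gradient (residual direction)
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, g)  # step 5: J-leaf tree on the gradient
            leaf_of = tree.apply(X)
            gains = {}                                        # step 6: gain of each leaf node
            for j in np.unique(leaf_of):
                r = g[leaf_of == j]
                denom = (np.abs(r) * (1.0 - np.abs(r))).sum() + 1e-12  # epsilon avoids division by zero
                gains[j] = (K - 1.0) / K * r.sum() / denom
            F[:, k] += np.array([gains[j] for j in leaf_of])  # step 7: add this tree to the model
            round_trees.append((tree, gains))
        models.append(round_trees)
    return models
```

At prediction time a new sample is dropped through every tree, the gains of the leaves it lands in are summed per class, and the class with the largest total (after the logistic transformation) is chosen, as in the three-tree example of step 6.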

That is roughly it for the GBDT algorithm; I hope this makes up for what was not explained clearly in the previous article :)

 

Implementation:

Once you understand the algorithm, you need to implement it yourself, or look at code written by others. The gradient boosting page on Wikipedia is recommended here; it lists several open-source implementations, for example: http://elf-project.sourceforge.net/

References:

Besides the references already linked in the text, the main reference is Friedman's paper: Greedy Function Approximation: A Gradient Boosting Machine.
