Copyright Notice:
This article was published by LeftNotEasy at http://leftnoteasy.cnblogs.com. It may be reproduced or partially reused, but please indicate the source. If there is any problem, please contact [email protected]
Objective:
The decision tree algorithm has many good properties: training time complexity is low, prediction is relatively fast, and the model is easy to display (the learned tree can easily be rendered as a picture), and so on. At the same time, a single decision tree has some weaknesses, most notably that it over-fits easily; methods such as pruning can reduce this, but they are not enough.
There are many algorithms that combine model ensembles (such as boosting, bagging, etc.) with decision trees. The final result of these algorithms is a set of N trees (N may be in the hundreds or more), which greatly reduces the problems caused by a single decision tree. It is a bit like the saying that three cobblers together equal one Zhuge Liang: although each of the hundreds of decision trees is simple (relative to a single C4.5 decision tree), they are very powerful in combination.
In recent years, papers at heavyweight conferences such as ICCV include many that are related to boosting and random forests. Model combination + decision tree algorithms come in two basic forms, random forest and GBDT (Gradient Boost Decision Tree); other, newer model combination + decision tree algorithms are derived from these two. This article focuses primarily on GBDT; random forest is covered only roughly because it is relatively simple.
Before reading this article, it is recommended to take a look at Machine Learning and Mathematics (3). The GBDT part here is mainly based on that article, whereas the random forest part is relatively independent.
Basic content:
This section only covers the basic background and mainly refers to other people's articles. For random forest and GBDT, two things matter most: first information gain, and second the decision tree itself. Andrew Moore's Decision Trees Tutorial and Information Gain Tutorial are particularly recommended here. Moore's data mining tutorial series is very good; after reading the two items mentioned above, you can continue with the rest of the series.
A decision tree is really a method of partitioning the space with hyperplanes: every time it splits, it divides the current space into two parts. For example, the following decision tree:
divides the space in the following way:
In this way each leaf node ends up in a disjoint region of the space. When making a decision, an input sample steps down the tree according to the value of each feature dimension and finally falls into exactly one of the N regions (assuming the tree has N leaf nodes).
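As a concrete illustration (my own minimal sketch, not from the original article, assuming scikit-learn is available), the snippet below fits a shallow tree on 2-D points and uses `apply()` to show which disjoint leaf region each sample falls into:

```python
# Minimal sketch: a shallow decision tree partitions 2-D space into
# disjoint leaf regions; apply() reports the leaf (region) index per sample.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)                      # 200 points in the unit square
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # a simple diagonal concept

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

leaf_ids = tree.apply(X)                  # leaf index = which region a point lands in
print("number of leaf regions:", len(np.unique(leaf_ids)))
print("prediction for (0.9, 0.8):", tree.predict([[0.9, 0.8]])[0])
```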
Random Forest:
Random forest is an algorithm that has been quite popular recently. It has many advantages:
- performs well on many data sets
- on many current data sets, it has a significant advantage over other algorithms
- it can handle very high-dimensional data (many features) without having to do feature selection
- after training, it can report which features are more important
- when building a random forest, an unbiased estimate of the generalization error is obtained
- it can detect interactions between features during training
- it is relatively simple to implement
As its name implies, a random forest builds a forest in a random way. The forest is made up of many decision trees, and the individual trees in a random forest are not correlated with one another. After the forest is obtained, when a new input sample arrives, every decision tree in the forest makes a separate judgment about which category the sample should belong to (for a classification task); the class that receives the most votes is taken as the prediction for that sample. A sketch of this process follows the next paragraph.
In the process of building each decision tree there are two points to note: sampling and complete splitting. First come the two random sampling steps: random forest samples the input data both by rows and by columns. Row sampling is done with replacement, so the sampled set may contain duplicate samples; assuming the input has N samples, the sampled set also has N samples. This means that at training time the input of each tree is not the full sample set, which makes over-fitting relatively unlikely. Column sampling then selects m features out of the M available (m << M). A decision tree is then built on the sampled data by splitting completely, so that every leaf node either cannot be split any further or contains samples that all belong to the same category. Many decision tree algorithms include an important pruning step, but it is not needed here: the two random sampling steps already guarantee enough randomness, so even without pruning, over-fitting does not occur.
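As a hedged sketch of the two sampling steps and the majority vote (my own illustration, not the author's code; it uses scikit-learn's plain `DecisionTreeClassifier` rather than a hand-rolled tree):

```python
# Minimal random-forest sketch: row sampling with replacement, column
# sampling of m out of M features, fully grown trees, majority vote.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=25, m_features=None, seed=0):
    rng = np.random.RandomState(seed)
    N, M = X.shape
    m = m_features or max(1, int(np.sqrt(M)))    # a common default: m = sqrt(M)
    forest = []
    for _ in range(n_trees):
        rows = rng.randint(0, N, size=N)             # bootstrap: N samples, with replacement
        cols = rng.choice(M, size=m, replace=False)  # random subset of m features
        tree = DecisionTreeClassifier()              # no depth limit: split completely, no pruning
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def forest_predict(forest, x):
    votes = [tree.predict(x[cols].reshape(1, -1))[0] for tree, cols in forest]
    return Counter(votes).most_common(1)[0][0]       # majority vote
```

Note that this follows the per-tree column sampling described above; Breiman's original formulation re-samples the feature subset at every split rather than once per tree.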
Every single tree in a random forest obtained this way is very weak, but combining them is very powerful. The random forest algorithm can be described with the following analogy: each decision tree is an expert in one narrow field (since each tree learns from only m of the M features), so the random forest contains many experts proficient in different fields. A new problem (a new input sample) can be viewed from different perspectives, and the final result comes from the individual experts voting on it.
For the random forest training process, please refer to the Mahout random forest page. That page is written quite clearly; the only part that may be unclear is information gain, for which you can look at Moore's page recommended earlier.
Gradient Boost decision Tree:
GBDT is a widely used algorithm that can be used for both classification and regression, and it gives good results on a lot of data. The algorithm has several other names, such as MART (Multiple Additive Regression Trees), GBRT (Gradient Boost Regression Tree), TreeNet, etc. They are in fact all the same thing (see Wikipedia on gradient boosting); the inventor is Friedman.
Gradient boost is really a framework into which many different algorithms can be plugged; see the explanation in Machine Learning and Mathematics (3). "Boost" means "to lift": a boosting algorithm is generally an iterative process in which each new round of training tries to improve on the previous result.
The original boost algorithm assigns a weight to each sample at the start, so that at the beginning every sample is equally important. The model obtained at each training step will get some data points wrong; at the end of each step we increase the weights of the points that were misclassified and decrease the weights of the points that were classified correctly, so that if some points keep being misclassified they receive "serious attention" and are assigned very high weights. After N iterations (specified by the user) we have N simple classifiers (basic learners), which are then combined (for example by weighting them, or by letting them vote, etc.) into the final model.
The difference between gradient boost and traditional boost is that each round of computation aims to reduce the previous residual, and to eliminate this residual, the new model is built in the gradient direction along which the residual decreases. So in gradient boost, each new model is built so that the residual of the previous model is pushed down along the gradient direction; this is very different from traditional boost, which reweights correctly and incorrectly classified samples.
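The reweighting idea can be written down concretely; the following is a minimal sketch of the classic AdaBoost update (my addition as an illustration, not part of the original article), using decision stumps as the basic learners and labels in {+1, -1}:

```python
# Minimal AdaBoost-style reweighting sketch: misclassified points gain
# weight each round, so later learners focus on them.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                  # everyone equally important at the start
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)       # wrong points (y*pred = -1) get larger weights
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas                  # final model: sign(sum_k alpha_k * h_k(x))
```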
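In Friedman's formulation (restating the general form from the referenced paper, since the original post's formula images are not reproduced here), each round fits a base learner to the negative gradient of the loss and adds it to the current estimate:

$$\tilde{y}_i = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}, \qquad F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m),$$

where $h(x; a_m)$ is the base learner fit to the pseudo-responses $\tilde{y}_i$ and $\rho_m$ is a step size found by line search.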
For classification problems there is a very important notion called multi-class logistic, i.e. the logistic treatment of multi-class problems. It applies to problems with more than two categories, where the classification result is not a single hard class assignment for sample x but rather the probability of x belonging to each of several classes (or: the estimate y of sample x follows a certain geometric distribution). This is actually part of what is discussed under generalized linear models, which deserve a separate chapter when there is a chance. Here only the conclusion is needed: if a classification problem conforms to the geometric distribution, then the logistic transformation can be used for the subsequent operations.
Suppose a sample x may belong to any of K classes, with estimates F1(x), ..., FK(x). The logistic transformation smooths and normalizes the data (so that the resulting values sum to 1); the result is the probability pk(x) that x belongs to class k.
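The transformation itself (restated from Friedman's paper in place of the missing formula image) is the softmax:

$$p_k(x) = \frac{\exp\big(F_k(x)\big)}{\sum_{l=1}^{K} \exp\big(F_l(x)\big)}, \qquad k = 1, \dots, K.$$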
For the result of the logistic transformation, the loss function is:
where yk indicates the true class of the input sample: when sample x belongs to class k, yk = 1, otherwise yk = 0.
Substituting the equation of the logistic transformation into the loss function and taking derivatives, the gradient of the loss function is obtained:
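In full (again restated from the referenced paper in place of the missing image), this is the K-class multinomial deviance:

$$L\big(\{y_k, F_k(x)\}_{k=1}^{K}\big) = -\sum_{k=1}^{K} y_k \log p_k(x).$$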
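Its negative gradient with respect to Fk(x) takes the simple residual form (restated from Friedman's paper):

$$\tilde{y}_k = -\frac{\partial L}{\partial F_k(x)} = y_k - p_k(x).$$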
The above is rather abstract, so here is an example:
Assume the input data x may belong to one of 5 categories (1, 2, 3, 4, 5). In the training data, x belongs to category 3, so y = (0, 0, 1, 0, 0). Assume the model's current estimate is F(x) = (0, 0.3, 0.6, 0, 0); the logistic transformation gives P(x) = (0.16, 0.21, 0.29, 0.16, 0.16), and y − P yields the gradient G = (−0.16, −0.21, 0.71, −0.16, −0.16). Some interesting conclusions can be drawn from observing this:
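The quoted numbers are truncated to two decimals; a few lines of Python (my own check, not from the original) reproduce them:

```python
# Verify the 5-class example: softmax of F(x), then the gradient y - p.
import numpy as np

F = np.array([0.0, 0.3, 0.6, 0.0, 0.0])   # current estimates F_k(x)
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # the true class is category 3

p = np.exp(F) / np.exp(F).sum()            # logistic (softmax) transformation
g = y - p                                  # gradient for this sample

print(p)   # approx [0.162 0.219 0.295 0.162 0.162], i.e. the (0.16, 0.21, 0.29, ...) above
print(g)   # approx [-0.162 -0.219 0.705 -0.162 -0.162]
```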
Suppose gk is the gradient of a sample in a certain dimension (a certain category). Then:
- If gk > 0, the probability pk(x) of this dimension should be increased, and the larger gk is, the more it should be increased. For example, the probability of the third dimension above is 0.29 and should be raised; this is moving in the "right direction".
- The smaller gk is (the closer to 0), the more "accurate" the estimate in this dimension already is.
- If gk < 0, the probability of this dimension should be reduced, and the more negative gk is, the more it should be reduced. For example, the second dimension's 0.21 should be lowered; the estimate should move in the direction opposite to the error.
- The larger gk is (the less negative), the less "wrong" the estimate in this dimension is.
In general, for one sample the ideal gradient is a gradient as close to 0 as possible. Therefore we want the estimated function to move the gradient toward zero (in dimensions where it is > 0, move it in the negative direction; in dimensions where it is < 0, move it in the positive direction), so that the gradient eventually becomes as close to 0 as possible. The algorithm thereby pays serious attention to the samples with large gradients, which is similar in spirit to boost.
Once the gradient is obtained, the question is how to reduce it. Here an iteration + decision tree method is used: at initialization an estimate function f(x) is chosen arbitrarily (f(x) can be a random value, or f(x) = 0), and then at each iteration a decision tree is built according to the current gradient of every sample. The function is moved in the direction opposite to the gradient, so that after N iterations the gradient becomes smaller. A minimal sketch follows the next paragraph.
The decision tree built here is not the same as an ordinary decision tree: it has a fixed number of leaf nodes J, and once J nodes have been generated, no new nodes are created.
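As a hedged sketch of this iteration + decision tree loop for the K-class case (my own illustration based on the formulas above, not the author's code; scikit-learn's `max_leaf_nodes` stands in for the fixed J leaf nodes, and a simple learning rate stands in for the per-leaf gains described below):

```python
# K-class GBDT sketch: start from F = 0, and at every round fit one
# J-leaf regression tree per class to the gradient y_k - p_k(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_multiclass(X, y, K, n_rounds=50, J=8, lr=0.1):
    N = len(y)
    Y = np.eye(K)[y]                        # one-hot labels y_ik (y holds class indices 0..K-1)
    F = np.zeros((N, K))                    # initial estimate F(x) = 0
    trees = []
    for _ in range(n_rounds):
        P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)  # logistic transformation
        round_trees = []
        for k in range(K):
            g = Y[:, k] - P[:, k]           # gradient for class k
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, g)
            F[:, k] += lr * tree.predict(X) # move against the gradient
            round_trees.append(tree)
        trees.append(round_trees)
    return trees

def gbdt_predict(trees, X, K, lr=0.1):
    F = np.zeros((len(X), K))
    for round_trees in trees:
        for k, tree in enumerate(round_trees):
            F[:, k] += lr * tree.predict(X)
    return F.argmax(axis=1)                 # largest accumulated score wins
```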
The flow of the algorithm is as follows (taken from the TreeBoost paper):
0. means that an initial value is given
1. means building M trees (iterating M times)
2. means applying the logistic transformation to the function estimate F(x)
3. means doing the following for each of the K classes (in fact this for loop can also be understood as an operation on vectors: each sample point xi corresponds to K possible classes yi, so yi, F(xi), and P(xi) are all K-dimensional vectors, which may make it a little easier to understand)
4. means finding the gradient direction that reduces the residual
5. means that, according to the gradient direction along which each sample point x reduces its residual, a decision tree consisting of J leaf nodes is obtained
6. means that, once the decision tree has been built, the gain of each leaf node can be obtained from this formula (the gain is used at prediction time)
Each gain is actually a K-dimensional vector, indicating, if a sample point falls into this leaf node during prediction, what value is contributed for each of the K categories. For example, suppose GBDT has obtained three decision trees; at prediction time a sample point falls into 3 leaf nodes, whose gains are (assuming a 3-class problem):
(0.5, 0.8, 0.1), (0.2, 0.6, 0.3), (0.4, 0.3, 0.3). The resulting classification is the second category: summing the three gain vectors gives (1.1, 1.7, 0.7), so category 2 has the largest accumulated score (and indeed most of the trees also point to category 2).
7. means merging the decision tree just obtained with the previous decision trees into the new model (similar to the situation illustrated in 6)
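Since the original formula images are not reproduced here, the steps above correspond to the K-class TreeBoost algorithm in Friedman's paper, which can be restated as follows (my transcription; notation as in the paper):

$$
\begin{aligned}
& F_{k0}(x) = 0, \quad k = 1, \dots, K \\
& \text{For } m = 1 \text{ to } M: \\
& \qquad p_k(x) = \exp\big(F_k(x)\big) \Big/ \sum_{l=1}^{K} \exp\big(F_l(x)\big), \quad k = 1, \dots, K \\
& \qquad \text{For } k = 1 \text{ to } K: \\
& \qquad\qquad \tilde{y}_{ik} = y_{ik} - p_k(x_i), \quad i = 1, \dots, N \\
& \qquad\qquad \{R_{jkm}\}_{j=1}^{J} = J\text{-terminal-node tree}\big(\{\tilde{y}_{ik}, x_i\}_{i=1}^{N}\big) \\
& \qquad\qquad \gamma_{jkm} = \frac{K-1}{K} \cdot \frac{\sum_{x_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} |\tilde{y}_{ik}|\,(1 - |\tilde{y}_{ik}|)} \\
& \qquad\qquad F_{km}(x) = F_{k,m-1}(x) + \sum_{j=1}^{J} \gamma_{jkm}\, \mathbf{1}\big(x \in R_{jkm}\big)
\end{aligned}
$$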
That is roughly the GBDT algorithm; hopefully this makes up for what was not stated clearly in the earlier article :)
Implementation:
After reading about the algorithm you will want to implement it, or look at other people's implementations. The gradient boosting page on Wikipedia is recommended here; it lists implementations in several open source packages, for example: http://elf-project.sourceforge.net/
Resources:
In addition to the content referenced in the article (links already given), the main reference is Friedman's paper: Greedy Function Approximation: A Gradient Boosting Machine