Even before my internship I had heard that GBDT is applied very widely, and now I finally have the chance to understand it systematically.
First, by comparing with the Random Forest model from the last lesson, we arrive at AdaBoost-DTree(D).
AdaBoost-DTree is analogous to the AdaBoost-Stump model and can be understood intuitively:
1) In each round, adjust the weight of every sample
2) Obtain g_t = DTree(D, u^(t))
3) Compute the voting strength alpha_t of g_t
Finally, return the linear combination of the g_t.
The weighted error is the tricky part here: can we leave the original tree algorithm untouched and achieve the same effect by manipulating the input data instead?
Recall bagging's weighting strategy: in each round of bootstrapping, a sample's weight is reflected in how many copies of it are drawn.
From a more general point of view, given weights u, if we sample the data with probability proportional to u, then the resulting training set also reflects the weights.
So the prototype of AdaBoost-DTree emerges (a minimal sketch follows this list):
1) Run the AdaBoost procedure
2) Sample the data according to the given weights to generate the training set for each tree
3) Train each tree on its sampled data
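To make these three steps concrete, here is a minimal sketch of the sampling-based AdaBoost-DTree, assuming numpy and scikit-learn's DecisionTreeClassifier as the tree learner (my own choices, not part of the lecture) and labels y in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_dtree_sampling(X, y, T=10, seed=0):
    """Sketch of AdaBoost-DTree where the weights u are realized by
    sampling proportionally to u instead of modifying the tree algorithm."""
    rng = np.random.default_rng(seed)
    N = len(y)
    u = np.full(N, 1.0 / N)                 # initial weights u_n^(1) = 1/N
    trees, alphas = [], []
    for _ in range(T):
        # a point's weight shows up as how many copies of it get drawn
        idx = rng.choice(N, size=N, replace=True, p=u / u.sum())
        tree = DecisionTreeClassifier(max_depth=3)   # keep each tree weak
        tree.fit(X[idx], y[idx])
        pred = tree.predict(X)
        # weighted error epsilon_t, measured on the original data with u
        eps = np.clip(np.sum(u * (pred != y)) / np.sum(u), 1e-10, 1 - 1e-10)
        scale = np.sqrt((1 - eps) / eps)
        alphas.append(np.log(scale))        # voting strength alpha_t
        # AdaBoost update: scale mistakes up, scale correct points down
        u = np.where(pred != y, u * scale, u / scale)
        trees.append(tree)
    return trees, alphas

def adaboost_predict(trees, alphas, X):
    # linear combination of the g_t, then take the sign
    return np.sign(sum(a * t.predict(X) for a, t in zip(alphas, trees)))
```

Here max_depth=3 is an arbitrary cap to keep each tree weak, which connects to the discussion below.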
If the height of the tree is not limited, this method easily produces "autocracy": feed a fully grown tree all the data and it can cut everything apart completely, so its weighted error epsilon_t is 0 and its vote alpha_t = ln(sqrt((1 - epsilon_t)/epsilon_t)) becomes infinite, letting that single tree dominate the ensemble.
A question here: why doesn't this complete-separation situation happen to every tree in a Random Forest? Even in the decision-stump homework, some stumps could separate all the data.
1) My personal feeling: in a Random Forest the samples, the features, and the branching criteria b(x) are all randomized, so the chance of a complete separation is very small.
2) As for decision stumps: if a single stump could separate the data with one cut, we would not bother with such a complex model in the first place.
In short, each tree has to be made weaker.
Pushed to the extreme of weakness, the tree keeps only one level, and the decision tree degenerates into a decision stump.
Next, Prof. Lin explains the optimization view of AdaBoost and the insights behind it; below I record the general line of the derivation.
The core of AdaBoost is how each sample's weight changes, and the so-called insights start from there.
1) Rewrite the AdaBoost sample-weight update taught earlier into the following form (the key point is bringing in alpha_t):
u_n^(t+1) = u_n^(t) * exp(-y_n * alpha_t * g_t(x_n))
2) Unrolling this recursion, the weight of each sample at round T+1 is:
u_n^(T+1) = 1/N * exp(-y_n * Σ_{t=1..T} alpha_t * g_t(x_n))
(As mentioned before, with N samples the initial weight of every sample is the same, 1/N.)
Combining the two points above, u_n^(T+1) is tied to all of the g_t produced in the first T rounds evaluated at the sample point x_n, through the voting score Σ_{t=1..T} alpha_t * g_t(x_n).
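A tiny numeric sanity check of this unrolling, with made-up alpha_t values and predictions g_t(x_n) (my own sketch, not from the lecture):

```python
import numpy as np

# Check that the recursive update
#   u_n^(t+1) = u_n^(t) * exp(-y_n * alpha_t * g_t(x_n))
# unrolls to the closed form
#   u_n^(T+1) = 1/N * exp(-y_n * sum_t alpha_t * g_t(x_n)).
rng = np.random.default_rng(0)
N, T = 5, 4
y = rng.choice([-1, 1], size=N)
preds = rng.choice([-1, 1], size=(T, N))       # g_t(x_n) for each round t
alphas = rng.uniform(0.1, 1.0, size=T)

u = np.full(N, 1.0 / N)
for t in range(T):
    u = u * np.exp(-y * alphas[t] * preds[t])  # recursive form

voting_score = (alphas[:, None] * preds).sum(axis=0)
u_closed = 1.0 / N * np.exp(-y * voting_score) # unrolled form
assert np.allclose(u, u_closed)
```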
Lin points out an insight here: the voting score and the margin can be linked together, analogous to the margin concept in SVM.
1) View each g_t(x_n) as a transform of x_n, with the alpha_t in front as the weight on the transformed feature.
2) In this form it looks very much like the margin in hard-margin SVM.
We hope y_n * (voting score) is as large as possible, because that means the prediction agrees with the label by a wide margin, and consequently u_n^(T+1) gets smaller and smaller as T grows.
Along this line of thought, AdaBoost should at least be making every u_n^(T+1) smaller, so Σ_{n=1..N} u_n^(T) should also decrease gradually as AdaBoost iterates.
So the idea is: the larger y_n * (voting score), the more accurate the prediction → the smaller Σ_n u_n^(T+1), the better.
Thus the optimization objective of AdaBoost can be roughly written down: minimize Σ_n u_n^(T+1) = 1/N * Σ_n exp(-y_n * Σ_{t=1..T} alpha_t * g_t(x_n)).
Enter our old friend the 0/1 error again: the AdaBoost objective turns out to be an upper bound of the 0/1 error, and this error measure is called the exponential error measure:
err_ADA(s, y) = exp(-y * s) >= err_0/1(s, y) = [[sign(s) != y]], where s is the voting score.
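A quick numeric check of this bound over a grid of voting scores (my own sketch):

```python
import numpy as np

# exp(-y*s) upper-bounds the 0/1 error [[sign(s) != y]] for y in {-1, +1}.
s = np.linspace(-3, 3, 601)
for y in (-1, +1):
    err01 = (np.sign(s) != y).astype(float)
    err_exp = np.exp(-y * s)
    assert np.all(err_exp >= err01)
```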
Now that the objective function has been written down, the next question is how to minimize it.
This is troublesome, because the objective nests a sum inside an exp inside another sum, so we need some wisdom from our predecessors.
Imitating gradient descent: suppose the first t-1 rounds of AdaBoost have already been run, and we now want to find a function g_t(x) (written as h(x) for the moment).
In round t we move a step of size eta in the "direction" of the function h(x), so that the objective heads toward its minimum as quickly as possible. As follows:
1) Since the first t-1 rounds have already been executed, the expression can be simplified by merging those terms into the form of u_n^(t).
2) Use a Taylor expansion around 0 to simplify further. (Why expand at 0? My understanding: h(x) only moves a small step away from the existing sum Σ_{τ=1..t-1} alpha_τ * g_τ(x_n), so the change in the exponent, -y_n * eta * h(x_n), is small, close to 0, which is why the expansion can be taken at 0. I do not know whether this understanding is correct; it just feels right.)
At this point the wisdom of our predecessors has greatly simplified the objective function, and two things remain to be found:
1) What is h(x)?
2) What is eta?
The method here is quite ingenious.
1) First factor out the fixed Σ_n u_n^(t), leaving behind the part that varies with h.
2) Then analyze that varying part: making it as small as possible is the same as making the weighted in-sample error E_in^u(h) as small as possible (after matching up the surrounding constants and coefficients).
Hence the conclusion: inside AdaBoost, the base algorithm A, by minimizing the weighted error, is exactly what finds a good g_t!
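A tiny numeric check of that equivalence, with made-up weights u and a made-up binary hypothesis h(x_n) (my own sketch):

```python
import numpy as np

# Check: sum_n u_n * (-y_n * h(x_n)) = -sum(u) + 2 * sum(u) * E_in^u(h),
# so minimizing the varying part over h is minimizing the weighted error.
rng = np.random.default_rng(2)
N = 10
y = rng.choice([-1, 1], size=N)
h = rng.choice([-1, 1], size=N)             # a binary hypothesis h(x_n)
u = rng.uniform(0.1, 1.0, size=N)           # current weights u_n^(t)

lhs = np.sum(u * (-y * h))
E_in_u = np.sum(u * (h != y)) / np.sum(u)   # u-weighted in-sample error
rhs = -np.sum(u) + 2 * np.sum(u) * E_in_u
assert np.isclose(lhs, rhs)
```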
Now let's see how eta is solved for.
The core is to rewrite E_ADA into a form that can be differentiated with respect to eta:
E_ADA(eta) = Σ_n u_n^(t) * exp(-y_n * eta * g_t(x_n))
Split the sum by whether g_t classifies each point correctly:
E_ADA,1 = Σ_{correct n} u_n^(t) * exp(-eta)    (only the exp(-eta) part; the misclassified points contribute nothing here)
E_ADA,2 = Σ_{incorrect n} u_n^(t) * exp(+eta)  (only the exp(+eta) part)
Then E_ADA = E_ADA,1 + E_ADA,2 = (Σ_n u_n^(t)) * ((1 - epsilon_t) * exp(-eta) + epsilon_t * exp(+eta)).
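A numeric check of this split, again with made-up data (my own sketch):

```python
import numpy as np

# sum_n u_n * exp(-y_n * eta * g(x_n))
#   = sum(u) * ((1 - eps) * exp(-eta) + eps * exp(+eta)),
# where eps is the u-weighted error rate of g.
rng = np.random.default_rng(1)
N, eta = 8, 0.7
y = rng.choice([-1, 1], size=N)
g = rng.choice([-1, 1], size=N)
u = rng.uniform(0.1, 1.0, size=N)

lhs = np.sum(u * np.exp(-y * eta * g))
eps = np.sum(u * (g != y)) / np.sum(u)
rhs = np.sum(u) * ((1 - eps) * np.exp(-eta) + eps * np.exp(eta))
assert np.isclose(lhs, rhs)
```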
The subsequent step of taking the derivative is natural, and it validates the earlier conclusion: the optimal step is eta_t = ln(sqrt((1 - epsilon_t)/epsilon_t)), which is exactly alpha_t. The previous lesson gave this conclusion directly without saying why; this time we get a relatively theoretical derivation.
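And a quick grid-search confirmation of that minimizer for a few values of epsilon_t (my own sketch):

```python
import numpy as np

# eta = ln(sqrt((1 - eps) / eps)) should minimize
# (1 - eps) * exp(-eta) + eps * exp(+eta).
etas = np.linspace(0.0, 3.0, 30001)
for eps in (0.1, 0.25, 0.4):
    obj = (1 - eps) * np.exp(-etas) + eps * np.exp(etas)
    eta_grid = etas[np.argmin(obj)]
    eta_closed = np.log(np.sqrt((1 - eps) / eps))
    assert abs(eta_grid - eta_closed) < 1e-3
```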
Then comes the generalization to the more general GradientBoost.
The way to generalize is to generalize the error measure function, as follows:
Following this line of thought, we move in the direction of regression.
The overall targets are two:
1) Solve for the form of the function h(x)
2) Solve for the magnitude of the move along h(x)
First, the form of h(x).
Regression generally uses the squared error, so Taylor-expand it directly:
1) The first term is a constant, because both y_n and s_n are known.
2) The second term is the derivative with respect to s evaluated at the point s_n (for squared error this is 2 * (s_n - y_n)).
At this point it looks as if h(x) should just be infinitely large in the negative-gradient direction, which is not sensible, so a penalty on the magnitude of h(x) is added.
After working through the penalized objective, h finally takes a pleasingly clever form: regression on the residuals y_n - s_n.
Next, solve for the magnitude of the move.
After some manipulation, alpha_t also comes out: it is given by a single-variable linear regression of the residuals on g_t(x).
With the groundwork done, the form of GBDT can be given succinctly (see the sketch after this list):
1) Use C&RT to learn on {(x_n, y_n - s_n)}; keep this round's learned tree g_t(x).
2) Run a single-variable linear regression on {(g_t(x_n), residual)}, minimizing the objective to find alpha_t (the eta above).
3) Update s_n <- s_n + alpha_t * g_t(x_n).
After enough rounds of learning, return the combined GBDT G(x) = Σ_t alpha_t * g_t(x).
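Putting the three steps together, here is a minimal sketch of GBDT for squared-error regression (my own code, assuming numpy and scikit-learn's DecisionTreeRegressor as a stand-in for C&RT; neither is part of the lecture notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_regression(X, y, T=100, max_depth=3):
    """Minimal GBDT sketch for squared-error regression following the
    three steps above; C&RT is approximated by DecisionTreeRegressor."""
    N = len(y)
    s = np.zeros(N)                      # s_n: current score of each sample
    trees, alphas = [], []
    for _ in range(T):
        residual = y - s
        # 1) learn g_t with C&RT on {(x_n, y_n - s_n)}
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        g = tree.predict(X)
        # 2) single-variable linear regression (no intercept) of the
        #    residuals on g_t(x_n): alpha_t = <g, residual> / <g, g>
        alpha = float(g @ residual) / (float(g @ g) + 1e-12)
        # 3) update s_n <- s_n + alpha_t * g_t(x_n)
        s += alpha * g
        trees.append(tree)
        alphas.append(alpha)

    def predict(X_new):
        # G(x) = sum_t alpha_t * g_t(x)
        return sum(a * t.predict(X_new) for a, t in zip(alphas, trees))

    return trees, alphas, predict
```

Rough usage: trees, alphas, predict = gbdt_regression(X_train, y_train); y_hat = predict(X_test).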
"Gradient Boosted decision Tree" heights Field machine learning technology