Reprint Address: http://blog.csdn.net/w28971023/article/details/8240756
GBDT (Gradient Boosting Decision Tree), also known as MART (Multiple Additive Regression Trees), is an iterative decision-tree algorithm. It consists of multiple decision trees, and the conclusions of all the trees are summed to produce the final answer. When it was first proposed it was regarded, together with SVM, as one of the algorithms with strong generalization ability. In recent years it has drawn further attention as a machine-learning model used in search ranking.
Before we start, here are two recommended pieces of English literature; readers comfortable with the English originals can read them directly:
"1" boosting decision Tree Getting Started tutorial http://www.schonlau.net/publication/05stata_boosting.pdf
"2" Lambdamart for search sort Getting Started tutorial http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
GBDT is built from three main concepts: Regression Decision Tree (the DT), Gradient Boosting (the GB), and Shrinkage (an important refinement of the algorithm; most implementations today follow this version). Once you have these three concepts in hand, you understand how GBDT works. Understanding how it is used for search ranking additionally requires the concept of RankNet, which is touched on at the end. Each piece is described in turn below, and together they spell out the whole picture.
I. DT: Regression Decision Tree
When most people hear "decision tree" (DT), the first thing that comes to mind is the C4.5 classification tree. But if you start out thinking of the trees in GBDT as classification trees, you will stumble from one pit into another and never get anywhere (the author speaks from painful experience). So do not think of GBDT as many classification trees. Decision trees fall into two main categories: regression trees and classification trees. The former predict real values, such as tomorrow's temperature, a user's age, or the relevance of a web page; the latter predict label values, such as sunny/cloudy/fog/rain, a user's gender, or whether a page is spam. The key point is that the former's results can be meaningfully added: 10 years + 5 years - 3 years = 12 years. The latter's cannot: what would male + male + female be, male or female? Since the core of GBDT is to accumulate the results of all its trees into the final result (as in the age example, where -3 means adding minus 3), and classification results obviously cannot be accumulated, the trees in GBDT are regression trees, not classification trees. This is essential to understanding GBDT (GBDT can also be used for classification, but even then its trees are not classification trees). So how does a regression tree work?
The following uses gender/age prediction as an example. Each instance is a person whose gender/age we already know, and the features include how much time the person spends online each day, the time of day they go online, and the amount they spend on online shopping.
For comparison, start with the classification tree. At each branch, a C4.5 classification tree exhaustively tries every threshold of every feature and picks the feature and threshold whose split into feature <= threshold and feature > threshold gives the largest information gain (intuitively, the split that pushes each branch's male/female ratio as far away from 1:1 as possible). It branches according to that criterion to obtain two new nodes and continues in the same way, until every leaf node contains only one gender or a predefined termination condition is reached. If a final leaf node's gender is not unique, the majority gender on that node is taken as the leaf node's gender.
The overall process of a regression tree is similar, but every node (not only the leaf nodes) has a predicted value. In the age example, a node's predicted value equals the average age of all the people who fall in that node. Branching still exhaustively tries every threshold of every feature to find the best split point, but the criterion is no longer maximum information gain; it is minimum squared error: the sum over everyone of (actual age - predicted age)^2, divided by N. This is easy to understand: the more people are predicted wrongly, and the more absurd the errors, the larger the squared error, so minimizing the squared error finds the most reliable basis for the split. Branching continues until every leaf node has a unique age (usually too much to hope for) or a predefined termination condition is reached (such as a maximum number of leaves). If a final leaf node's ages are not unique, the average age of the people on that node is used as the leaf node's predicted age. If this is still unclear, search for "Regression Tree", or read the regression-tree section of the first paper recommended above.
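To make the split criterion concrete, here is a minimal sketch in plain Python (not from the original post; the toy numbers are made up) of choosing a single split of one numeric feature by minimizing squared error:

```python
# Minimal sketch: choose the split of a single numeric feature that minimizes
# the total squared error of the two branches.

def best_split(feature_values, ages):
    def squared_error(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)              # a node's prediction is the mean age
        return sum((y - mean) ** 2 for y in ys)

    best_err, best_threshold = None, None
    for threshold in sorted(set(feature_values)):
        left  = [a for f, a in zip(feature_values, ages) if f <= threshold]
        right = [a for f, a in zip(feature_values, ages) if f > threshold]
        err = squared_error(left) + squared_error(right)
        if best_err is None or err < best_err:
            best_err, best_threshold = err, threshold
    return best_threshold, best_err

# e.g. daily hours online for four people and their ages:
print(best_split([1.0, 1.2, 3.5, 4.0], [14, 16, 24, 26]))   # -> (1.2, 4.0)
```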
II. GB: Gradient Boosting
That is a big title, but I do not actually want to dwell on the general theory of gradient boosting here, because not understanding that theory does not prevent you from understanding how gradient boosting is used in GBDT. Readers who like to get to the bottom of things can read the English Wikipedia article: http://en.wikipedia.org/wiki/gradient_boosted_trees#gradient_tree_boosting
Boosting means iterating: multiple trees jointly make the decision. How is that achieved? Is each tree trained independently, so that for person A the first tree says 10 years old, the second says 0, the third says 20, and we take the average of 10 as the final conclusion? Of course not. Leaving aside the fact that such voting would not be GBDT at all, as long as the training set does not change, three independently trained trees would be identical, which would be completely pointless. As said earlier, GBDT sums the conclusions of all its trees, so each tree's conclusion cannot be the age itself; it is an increment toward the age. The core of GBDT is that each tree learns the residual of the conclusions of all the trees before it, and this residual is the increment which, added to the current prediction, gives the true value. For example, A's true age is 18, but the first tree predicts 12, so the difference, that is, the residual, is 6. In the second tree we therefore set A's age to 6 and learn that. If the second tree really can put A into a leaf node worth 6, then the sum of the two trees is A's true age; if the second tree's conclusion is 5, A still has a residual of 1, so A's age becomes 1 in the third tree and learning continues. This is the meaning of Gradient Boosting in GBDT, simple as that.
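A minimal sketch of this residual-fitting loop, in plain Python; `fit_one_tree` is a hypothetical stand-in for whatever trains a single regression tree and returns it as a callable:

```python
# Sketch of the GBDT idea: each new tree is trained on the residuals left over
# by the sum of all previous trees, and the final prediction is the sum of all trees.

def gbdt_fit(X, y, n_trees, fit_one_tree):
    trees = []
    predictions = [0.0] * len(y)                      # running sum of all trees so far
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_one_tree(X, residuals)             # this tree learns residuals, not y itself
        trees.append(tree)
        predictions = [pi + tree(xi) for pi, xi in zip(predictions, X)]
    return trees

def gbdt_predict(trees, xi):
    return sum(tree(xi) for tree in trees)            # the conclusions of all trees are summed
```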
III. A Worked Example of How GBDT Operates
Still age prediction. To keep things simple, the training set has only four people, A, B, C, D, with ages 14, 16, 24, 26. A and B are a first-year and a third-year high-school student; C and D are a fresh graduate and an employee who has worked for two years. Training a traditional regression decision tree gives the result shown in Figure 1:
Now let us do the same thing with GBDT. Because the data set is so small, we limit each tree to at most two leaf nodes, i.e. each tree has only one split, and we limit learning to two trees in total. We get the result shown in Figure 2:
The first tree splits the same way as in Figure 1: since A and B are close in age, and C and D are close in age, they are divided into two groups, and each group uses its average age as the prediction. Now the residuals are computed (the meaning of a residual is: A's predicted value + A's residual = A's actual value), so A's residual is 14 - 15 = -1 (note that A's predicted value is the sum of all preceding trees; there is only one tree in front, so it is simply 15; if there were more trees, they would have to be summed to give the predicted value). The residuals of A, B, C, D are -1, 1, -1, 1 respectively. We then replace the original values of A, B, C, D with their residuals and hand them to the second tree to learn. If our predictions match their residuals, then simply adding the second tree's conclusion to the first tree's gives the true ages. The data here are obviously cooked so that it works out: the second tree has only the two values 1 and -1 and splits directly into two nodes. At this point everyone's residual is 0, that is, everyone's prediction equals their true value.
In other words, the predicted values of A, B, C, D now coincide with their true ages. Perfect! (A short code sketch reproducing these numbers follows the list below.)
A: 14-year-old first-year high-school student; shops little, often asks older students questions; predicted age A = 15 - 1 = 14
B: 16-year-old third-year high-school student; shops little, is often asked questions by younger students; predicted age B = 15 + 1 = 16
C: 24-year-old fresh graduate; shops a lot, often asks senior colleagues questions; predicted age C = 25 - 1 = 24
D: 26-year-old employee with two years of work experience; shops a lot, is often asked questions by junior colleagues; predicted age D = 25 + 1 = 26
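The following tiny sketch reproduces these numbers with two hand-written one-split trees (the split rules and leaf values are simply taken from Figures 1 and 2; the attribute names are made up for illustration):

```python
# Reproducing the toy example: true ages 14, 16, 24, 26 for A, B, C, D.
# Tree 1 splits on purchase amount and predicts the group means 15 and 25.
# Tree 2 fits the residuals and splits on "mostly asks vs. mostly answers questions".

people = {
    "A": {"age": 14, "buys_a_lot": False, "mostly_asks": True},
    "B": {"age": 16, "buys_a_lot": False, "mostly_asks": False},
    "C": {"age": 24, "buys_a_lot": True,  "mostly_asks": True},
    "D": {"age": 26, "buys_a_lot": True,  "mostly_asks": False},
}

def tree1(p):                       # leaf values: (14 + 16) / 2 = 15, (24 + 26) / 2 = 25
    return 25.0 if p["buys_a_lot"] else 15.0

def tree2(p):                       # leaf values: the mean residuals -1 and +1
    return -1.0 if p["mostly_asks"] else 1.0

for name, p in people.items():
    prediction = tree1(p) + tree2(p)            # sum of the two trees
    print(name, prediction, "true age:", p["age"])
# prints 14.0, 16.0, 24.0, 26.0 -- every residual is now zero
```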
So where does the gradient come in? Think back to the end of the first tree: no matter what the cost function is, whether squared error or absolute error, as long as it measures error, the residual vector (-1, 1, -1, 1) is the globally optimal direction to move in, and that is the gradient.
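To make that connection explicit, here is the standard one-line derivation, assuming the squared-error cost (which is what the example above implicitly uses):

```latex
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\qquad\Longrightarrow\qquad
-\,\frac{\partial L}{\partial F} = y - F = \text{residual}
```

So fitting the next tree to the residuals is fitting it to the negative gradient of this cost at the current prediction, which is where the "gradient" in the name comes from.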
With that, we have covered GBDT's core concepts and how it runs. Yes, it is that simple. But three questions come up easily here:
1) Since Figure 1 and Figure 2 produce exactly the same final result, why do we need GBDT at all?
The answer is overfitting. Overfitting means that, in order to fit the training set more accurately, the model learns many "rules that only hold on this training set", so those rules no longer apply when the data set changes. In fact, as long as a tree is allowed enough leaf nodes, it can always reach 100% accuracy on the training set (in the extreme case, each leaf contains only one instance). But between training accuracy and actual accuracy (or test accuracy), the latter is what we really want.
We can see that Figure 1 uses three features (time spent online, time of day online, online purchase amount) to reach 100% accuracy, and the branch "time online > 1.1h" has clearly overfitted: on this data set, A may happen to spend 1.09h online per day and B 1.05h, but using whether someone spends more than 1.1 hours online to judge everyone's age is obviously contrary to common sense.
By contrast, although the boosting in Figure 2 uses two trees, it in fact uses only two features, the second of which is a question/answer ratio, so Figure 2's basis is clearly more dependable. (Of course, the data here was deliberately constructed by the author, which is why it works out so neatly.) In fact, the main benefit of boosting is that computing residuals at each step effectively increases the weight of the wrongly predicted instances, while the instances that are already predicted correctly tend toward 0, so later trees can focus more and more on the instances that were wrongly predicted earlier. It is like building an internet product: first solve the needs of 60% of users, then the needs of the next 35%, and only then worry about the last 5%, so the product improves gradually, because different types of users may have completely different needs and must be analyzed separately. Doing it in reverse order, or trying to be perfect from the start, usually ends in nothing.
2) Where is the gradient, then? Is this not "G"BDT?
So far we have not used any derivative or gradient. In the version of GBDT described so far, the gradient is indeed not used: the residual serves as the absolute direction of global optimization, and no gradient has to be computed.
3) This is not boosting, is it? That is not how AdaBoost defines it.
This is boosting, just not AdaBoost. GBDT is not AdaBoost + decision tree. Just as mentioning decision trees makes everyone think of C4.5, mentioning boost makes most people think of AdaBoost. AdaBoost is a different boosting method: it assigns different weights to instances according to whether they were classified wrongly and uses these weights in the cost function, so that "the weights of wrongly classified samples keep growing and get more and more attention". Bootstrap re-sampling has a similar idea, except that at each iteration it does not change the model itself or compute residuals; instead it draws N instances from the N-instance training set (a single instance can be sampled repeatedly) and trains the next round on those N new instances. Because the data set changes from iteration to iteration, the trained models differ as well, and an instance that was predicted wrongly more often earlier has a higher probability of being drawn, so the same effect of gradually focusing on the wrongly predicted instances, and gradually improving, can be achieved. The AdaBoost approach has been shown in practice to be a good way to prevent overfitting, but this has not been proven theoretically. GBDT can also incorporate this option: most GBDT implementations include bootstrap re-sampling alongside the use of residuals, though opinions differ on whether it should be used. One drawback of re-sampling is its randomness: training twice on the same data set gives different results, i.e. the model is not stably reproducible, which is a big challenge for evaluation; for example, it becomes hard to say whether one model is better because you picked better features or because of the random factors in the sampling.
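For illustration, here is a rough sketch of how such re-sampling could be bolted onto the residual-fitting loop from earlier (this is row subsampling in the style of stochastic gradient boosting, not AdaBoost's weighting; `fit_one_tree` is the same hypothetical helper as before):

```python
import random

# Sketch: each tree is trained on a random sample of the instances rather than
# on all of them. A fixed seed is used because, as noted above, re-sampling
# otherwise makes two runs on the same data produce different models.

def gbdt_fit_resampled(X, y, n_trees, fit_one_tree, sample_fraction=0.5, seed=0):
    rng = random.Random(seed)
    trees = []
    predictions = [0.0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        idx = rng.sample(range(len(y)), max(1, int(sample_fraction * len(y))))
        tree = fit_one_tree([X[i] for i in idx], [residuals[i] for i in idx])
        trees.append(tree)
        predictions = [pi + tree(xi) for pi, xi in zip(predictions, X)]
    return trees
```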
IV. Shrinkage
The idea of Shrinkage is that taking many small steps toward the result avoids overfitting more easily than approaching the result in a few big jumps. In other words, it does not fully trust any single tree; it believes each tree learns only a small part of the truth, so it accumulates only a small amount each time and makes up the shortfall by learning more trees. This is clearer when written as equations.
Without shrinkage (here y_i denotes the target learned by the i-th tree, and y(1~i) denotes the combined prediction of the first i trees):

y(i+1) = residual(y1~yi), where residual(y1~yi) = y_true - y(1~i)
y(1~i) = sum(y1, ..., yi)

Shrinkage does not change the first equation; it only changes the second one to:

y(1~i) = y(1~i-1) + step * yi
That is, shrinkage still uses the residual as the learning target, but of each tree's result it only takes a small fraction (step * residual) toward the target. step is usually small, for example 0.01 to 0.001 (note that step is not a gradient step size), so each tree's residual shrinks gradually rather than abruptly. Intuitively this is easy to understand as well: instead of repairing the whole error with the residual in one stride, it repairs only a little at a time, in effect breaking one big stride into many small steps. Essentially, shrinkage assigns a weight to every tree and multiplies its output by that weight; it has nothing to do with the gradient, and this weight is step. Like AdaBoost, the fact that shrinkage reduces overfitting is backed by experience; a theoretical proof has yet to be seen.
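In code, shrinkage is a one-line change to the loop sketched earlier; `step` is the shrinkage weight (e.g. 0.01), and the same `step` must be used when predicting:

```python
# Same boosting loop as before, but each tree's contribution is scaled by a small
# step, so every tree only repairs a little of the remaining residual.

def gbdt_fit_shrinkage(X, y, n_trees, fit_one_tree, step=0.01):
    trees = []
    predictions = [0.0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_one_tree(X, residuals)       # the learning target is still the residual
        trees.append(tree)
        predictions = [pi + step * tree(xi) for pi, xi in zip(predictions, X)]
    return trees

def gbdt_predict_shrinkage(trees, xi, step=0.01):
    return sum(step * tree(xi) for tree in trees)
```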
V. Scope of Application of GBDT
This version of GBDT can be used for almost all regression problems (linear and nonlinear); compared with logistic regression, which handles only linear problems, GBDT has a much wider range of application. It can also be used for binary classification (set a threshold: above the threshold is a positive example, below is a negative example).
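A trivial sketch of that thresholding (assuming the trees were trained against 0/1 labels; the 0.5 cut-off is only an illustrative choice):

```python
# Binary classification with a regression GBDT: accumulate the real-valued score
# and cut it at a threshold.

def classify(trees, xi, threshold=0.5):
    score = sum(tree(xi) for tree in trees)
    return 1 if score > threshold else 0
```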
VI. Application to Search Ranking: RankNet
Search ranking cares about the relative order of the documents rather than their absolute scores, so a new cost function is needed. RankNet essentially defines that cost function, and it can be combined with different underlying algorithms (GBDT, neural networks, ...).
What is actually used for search ranking is the LambdaMART algorithm. It must be pointed out that LambdaMART's iterations do not fit the residual, because the cost function needed for ranking is used here; lambda serves as a substitute for the residual, using something like gradient * step to simulate it. The MART solution procedure here therefore differs slightly from the residual-based version described earlier, and that is the difference to keep in mind.
Like all machine learning, learning to rank for search needs a training set, which is generally produced by human labeling: each (query, doc) pair is given a grade (for example 1, 2, 3, 4), where a higher grade means more relevant and should be ranked nearer the top. However, these absolute grades have little value in themselves; for example, it is hard to claim that the gap between a grade-1 and a grade-2 document is half the gap between a grade-1 and a grade-3 document. Relevance is a highly subjective judgment, and that kind of quantitative labeling is not really possible. But it is easy for annotators to say "A and B are both good, but document A is more relevant than document B, so A gets 4 points and B gets 3 points." RankNet builds its error measure, i.e. its cost function, on exactly this. Specifically, for any two documents A and B, RankNet passes the difference of their human-labeled grades through a sigmoid to estimate the probability P1 that they are in the right order versus reversed, and computes the probability P2 in the same way from the difference of the machine-learned scores (the advantage of the sigmoid is that the machine-learned score can be any real value: as long as its differences stay consistent with the labeled differences, P2 will approach P1). From P1 and P2 the cross entropy is obtained, and that is the cost function: the lower it is, the closer the machine-learned ranking is to the labeled ranking. To reflect the role of NDCG (the most commonly used evaluation criterion in the search-ranking industry), the cost function is additionally multiplied by an NDCG term.
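Written out (this is the standard RankNet formulation; $s_A$ and $s_B$ are the machine-learned scores of documents A and B, $P_{AB}$ is the predicted probability that A should rank above B, and $\bar{P}_{AB}$ is the target probability derived from the human labels, i.e. P2 and P1 in the paragraph above):

```latex
P_{AB} = \frac{1}{1 + e^{-(s_A - s_B)}},
\qquad
C = -\,\bar{P}_{AB}\log P_{AB} \;-\; \bigl(1 - \bar{P}_{AB}\bigr)\log\bigl(1 - P_{AB}\bigr)
```

The lower this cross entropy, the closer the machine-learned pairwise ordering is to the labeled one.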
OK, now that we have the cost function, and it depends on the current score of every document, then even though we do not know the globally optimal direction, we can take derivatives to get the gradient. The gradient is an n-dimensional vector of descent directions for the document scores, where n is the number of documents (that is, the number of query-doc pairs). Simply replace the "compute the residual" logic with "compute the gradient". You can think of it this way: the gradient direction is the best direction for each individual step, and after accumulating enough steps we always reach a local optimum; if that point happens to be the global optimum, the effect is the same as using residuals. From here, the GBDT logic described earlier applies directly. So how is the final ranking produced? Very simply: each sample accumulates a final score through the shrinkage process, and the documents are sorted by that score from high to low (because machine learning produces real-valued predictions, two documents almost never receive exactly equal scores, unlike human labels, so there is rarely any need to worry about how to order documents with identical scores).
In addition, if the number of features is very large, each tree takes a long time to build; in that case each split can randomly sample a subset of the features and search only those for the best split (this is how the ELF source code implements it).
"Reprint" GBDT (MART) Iteration decision tree Getting Started Tutorial | Brief introduction