Machine Learning -- Gradient Boosting Decision Tree (GBDT) & Treelink


From: http://www.cnblogs.com/joneswood/archive/2012/03/04/2379615.html

1. What is treelink?

Treelink is the name used internally at Alibaba Group; its academic name is GBDT (Gradient Boosting Decision Tree). GBDT is one of the two basic forms of the "model combination + decision tree" family of algorithms; the other is random forest, which is simpler than GBDT.

1.1 Decision Tree

The decision tree is one of the most widely used classification algorithms. The result of learning is a decision tree model, which can be expressed as a set of if-else rules. A decision tree essentially partitions the space with hyperplanes: each split divides the current region into two parts. For example, consider the following decision tree:

In this way, each leaf node corresponds to a non-overlapping region of the space. After the decision tree above has been learned, when a sample to be classified is fed in, its two features (x, y) route it down the tree into exactly one leaf node, which gives the classification result. This is the prediction process of a decision tree model. Classic decision tree learning algorithms include ID3 and C4.5.
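To make the if-else view concrete, here is a minimal Python sketch of how a small two-level decision tree on the two features (x, y) reduces to nested if-else rules. The thresholds and class labels are made up for illustration; they are not the ones in the figure above, and a real tree would learn them from data (e.g., with ID3 or C4.5).

def decision_tree_predict(x, y):
    """Classify a point (x, y) with a hand-written two-level decision tree.

    The split thresholds (0.5, 0.3) and the class labels are illustrative only;
    a learned tree would choose them from data."""
    if x < 0.5:          # first split, on feature x
        if y < 0.3:      # second split, on feature y
            return "class A"
        return "class B"
    return "class C"

print(decision_tree_predict(0.2, 0.1))  # -> class A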

 

From: http://www.cnblogs.com/LeftNotEasy/archive/2011/01/02/machine-learning-boosting-and-gradient-boosting.html

At the end of the previous article I mentioned that I had written almost everything I wanted to write about linear classification. Then I heard that the team has recently been preparing a distributed classifier, possibly based on random forests. After reading a few papers: the plain random forest is easy to understand, while the more complicated variants combine it with algorithms such as boosting (see ICCV09). I did not know much about boosting, so I spent some time on it. Speaking of boosting, Jack had previously implemented a gradient boosting decision tree (GBDT) algorithm, which is worth referring to.

Recently a number of papers have pointed out the advantages of model combination: ensembles of simple models, such as GBDT or RF, often work better than a single, more complex model. There are many ways to combine models; randomization (as in random forest) and boosting (as in GBDT) are the typical ones. Today we will mainly go through some of the mathematical foundations of the gradient boosting method (which differs somewhat from traditional boosting); with this foundation in hand, Friedman's Gradient Boosting Machine becomes much easier to read.

This article assumes basic college-level mathematics and familiarity with basic machine learning concepts such as classification and regression.

The main references for this article are PRML and Friedman's Gradient Boosting Machine paper.

 

Boosting method:

Boosting is actually a simple idea: roughly, build M models (for example, classifiers) on a dataset in sequence. Each model is usually simple and is called a weak learner. Before each round of training, the weights of the samples misclassified in the previous round are increased a little, and the data is classified again. The final combined classifier then achieves good results on both the training data and the test data.

The figure (from PRML, p. 660) shows a boosting process. The green line indicates the combined model obtained so far (built from the previous m models), and the dashed line indicates the model trained in the current round. In each round, more attention is paid to the misclassified data. The red and blue points are the data; the larger a point, the higher its weight. Looking at the panel in the lower-right corner, by M = 150 the resulting model can almost completely separate the red and blue points.

Boosting can be expressed using the following formula:

There are N points in the training set. Each point i is assigned a weight w_i (0 <= i < N) indicating its importance. As the models are trained in sequence, the weights are updated: if a point is classified correctly its weight is decreased, and if it is misclassified its weight is increased; initially all weights are equal. The green arrows indicate training the models in turn. As you can imagine, the further the procedure runs, the more the newly trained models focus on the error-prone points (those with high weight). After all rounds finish, we obtain M models corresponding to y_1(x), ..., y_M(x), which are combined by weighting into a final model Y_M(x).
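The formula referenced above did not survive the page scrape; in PRML's AdaBoost notation it is presumably the weighted combination

Y_M(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m\, y_m(x) \right),

where \alpha_m is the weight given to the m-th weak learner y_m(x).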

I think boosting is rather like a person's learning process. When someone starts to learn something, they do exercises and often make mistakes even on simple questions; but further along, once the simple questions are no longer difficult, they move on to more complex ones. After doing enough exercises, they can handle both the hard problems and the easy ones.

 

Gradient boosting method:

In fact, boosting is more of an idea than a single algorithm, and gradient boosting is one boosting method. Its main idea is that each new model is built along the gradient descent direction of the loss function of the models built before it. The loss function describes how unreliable the model is: the larger the loss, the more error-prone the model (there is actually a bias-variance trade-off here, but for now assume that a larger loss simply means a worse model). If our model keeps reducing the loss function, the model keeps improving, and the best way to achieve this is to let the loss function decrease along its gradient direction.

The following content describes gradient boosting mathematically. The math is not too complicated; you can understand it as long as you work through it :)

Gradient descent with additive parameters:

Assume our model can be represented by the following function, where P denotes the parameters (there may be several of them, P = {P0, P1, P2, ...}) and F(x; P) is the prediction function of x parameterized by P. Our model is an additive combination of sub-models: β is the weight of each sub-model and α the parameters inside the sub-model. To optimize F, we can optimize {β, α}, that is, P.
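The function referred to is missing from the scraped page; in Friedman's notation it is presumably the additive expansion

F(x; P) = \sum_{m=1}^{M} \beta_m\, h(x; \alpha_m), \qquad P = \{\beta_m, \alpha_m\}_{m=1}^{M}.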

Still using P to denote the model parameters, we can write down the likelihood function, that is, the loss function of the model F(x; P). The latter part of the expression looks complicated, but you only need to read it as a loss function, so don't be scared off.


Since the model F(x; P) is additive, we can write the parameter P additively as well, as in the following formula. In this way, optimizing P becomes a gradient descent process: suppose we have already obtained the first m-1 models and want to obtain the m-th one. We first compute the gradient of the loss at the first m-1 models in order to get the steepest descent direction; g_m gives that steepest descent direction.


Here there is a very important assumption: the first m-1 models are considered known and fixed; we do not change them, and our goal is only to build the models that come after them. It is like doing things without regretting what you have already done, only working to avoid mistakes in the future:


The new parameter increment we obtain thus lies in the gradient direction of the likelihood (loss) function of P, and ρ_m is the distance descended along that gradient direction.


In the end, we can optimize the formula below to obtain the optimal P:
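The formulas in this derivation are missing from the scraped page; following Friedman's Gradient Boosting Machine paper, the parameter-space steepest-descent steps are presumably

P = \sum_{m=0}^{M} p_m, \qquad g_m = \left[ \frac{\partial \Phi(P)}{\partial P} \right]_{P = P_{m-1}}, \qquad P_{m-1} = \sum_{i=0}^{m-1} p_i,

p_m = -\rho_m\, g_m, \qquad \rho_m = \arg\min_{\rho}\; \Phi(P_{m-1} - \rho\, g_m),

where \Phi(P) denotes the loss (negative log-likelihood) as a function of P.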

Gradient descent with additive functions:

The gradient descent on the likelihood (loss) function above was obtained from the additivity of the parameter P. We can generalize this additivity from parameter space to function space and obtain the following form; f_i(x) here plays the same role as h(x; α) above. Since this is the notation used in the author's paper, I will follow it here:


Similarly, we can obtain the gradient descent direction g_m(x) of the function F(x):


Finally, we obtain the expression for the m-th model F_m(x):
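Again following Friedman's paper, the function-space analogues that the missing formulas presumably showed are

g_m(x) = \left[ \frac{\partial\, E_y[\,L(y, F(x)) \mid x\,]}{\partial F(x)} \right]_{F(x) = F_{m-1}(x)}, \qquad F_m(x) = F_{m-1}(x) - \rho_m\, g_m(x),

with \rho_m obtained by a line search that minimizes the expected loss along -g_m(x).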

 

General gradient descent boosting framework:

Next I will derive the general form of this gradient descent method, following the discussion above:


For the model parameters {β, α}, we can use the following formula, which says: for the N sample points (x_i, y_i), compute the loss under the model F(x; α, β); the optimal {α, β} is the pair that makes this loss smallest. The parameters expand into two M-dimensional sets, {β_m} and {α_m}, m = 1, ..., M:


The gradient descent step is shown below: the parameters {α_m, β_m} of the model F_m(x) we are about to obtain are chosen so that F_m moves in the steepest descent direction of the loss function of the previously obtained model F_{m-1}(x):

For each data point x_i we get a g_m(x_i), and together these give the complete gradient descent direction.


To make F_m(x) point along the direction of g_m(x), we can optimize the following formula, using the least squares method:


We can then obtain β_m based on the α_m just found, and merge the new term into the model:
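The finite-sample formulas that the missing images presumably contained are, in Friedman's notation,

\{\beta_m, \alpha_m\}_{m=1}^{M} = \arg\min \sum_{i=1}^{N} L\!\left(y_i,\; \sum_{m=1}^{M} \beta_m\, h(x_i; \alpha_m)\right),

g_m(x_i) = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}, \qquad \alpha_m = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} \left[\, -g_m(x_i) - \beta\, h(x_i; \alpha) \,\right]^2,

\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} L\!\left(y_i,\; F_{m-1}(x_i) + \beta\, h(x_i; \alpha_m)\right), \qquad F_m(x) = F_{m-1}(x) + \beta_m\, h(x; \alpha_m).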

The flowchart of the algorithm is as follows:
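The flowchart image is missing from the scraped page, so here is a minimal Python sketch of the generic procedure for a squared-error loss, using depth-1 regression stumps as the weak learner h(x; α). Everything here (the function names, the choice of stumps, the fixed shrinkage in place of a line search) is my own illustration of the framework, not the paper's exact algorithm.

import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to the residuals r: choose the
    split that minimizes squared error, predicting the mean of r on each side."""
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, lv, rv = best
    return lambda q, s=s, lv=lv, rv=rv: np.where(q <= s, lv, rv)

def gradient_boost(x, y, M=100, shrinkage=0.1):
    """Generic gradient boosting for squared loss: F_0 = mean(y); each round
    fits a weak learner to the negative gradient (here just the residuals)."""
    F = np.full_like(y, y.mean(), dtype=float)
    trees = []
    for _ in range(M):
        neg_grad = y - F                 # -dL/dF for L = (y - F)^2 / 2
        h = fit_stump(x, neg_grad)       # weak learner approximating the descent direction
        F += shrinkage * h(x)            # fixed shrinkage instead of a line search
        trees.append(h)
    f0 = y.mean()
    return lambda q: f0 + shrinkage * sum(h(q) for h in trees)

# Toy usage: learn y = x^2 on a small grid.
x = np.linspace(-1, 1, 50)
y = x ** 2
model = gradient_boost(x, y)
print(model(np.array([0.0, 0.5, 1.0])))  # approximately [0, 0.25, 1]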


Later in the paper, the author also discusses extensions of this algorithm to other settings; among them, multi-class logistic regression and classification is what GBDT implements. The flowchart is similar to the algorithm above, so I will not keep writing here, or this would turn into a translation of the paper. Please refer to the article: Greedy Function Approximation: A Gradient Boosting Machine, by Friedman.

See also: http://www.cnblogs.com/LeftNotEasy/archive/2011/03/07/random-forest-and-gbdt.html

Summary:

This article mainly covered the boosting and gradient boosting methods. Boosting is mostly an idea: "correct your mistakes once you know where you went wrong." Gradient boosting is a method for optimizing a function (or model) under this idea. It first decomposes the function into an additive form (in fact any function can be written additively; the question is whether putting it into this framework yields a better final result), then performs M iterations, each reducing the loss function along its gradient direction, and finally obtains an excellent model. It is worth mentioning that each step taken in the gradient direction can be regarded as a "small" or "weak" model; in the end these "weak" models are combined by weighting (that is, by the distance descended in the gradient direction at each step) to form a better overall model.

With this gradient-descent view you can do a great many things. Another step on the road of machine learning :)

1.4 treelink Model

Unlike a single decision tree model, a treelink model does not consist of just one decision tree; it consists of many decision trees, usually hundreds of them, and each tree is small (that is, its depth is shallow). At prediction time, an input sample is given an initial value, and then every decision tree is traversed in turn, each one adjusting and correcting the predicted value; the final prediction is as follows:

Here F0 is the initial value and Ti is the i-th decision tree. The initial value is set differently for different problems (regression or classification) and different loss functions; for example, for regression with the Gaussian loss, the initial value is the mean of the targets over the training samples.
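The prediction formula itself is missing from the scraped page; it is presumably

F(x) = F_0 + \sum_{i=1}^{M} T_i(x),

where F_0 is the initial value and T_i(x) is the increment (growth value) of the leaf that x falls into in the i-th tree (implementations typically also scale each tree's contribution by the shrinkage).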

Treelink naturally embodies the idea of boosting: it combines a series of weak classifiers into a strong classifier. No single tree is required to learn very much; each tree learns a little, and the accumulated knowledge forms a powerful model.

Learning a treelink model means building its decision trees one by one. When building a tree, the key step is finding split points (a particular value of a particular feature). The treelink algorithm measures how well a candidate split point separates the samples by how much it reduces the loss function: the larger the reduction, the better the split point. In other words, a split point divides the samples into two parts so as to minimize the loss function of the split samples.
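As a concrete illustration of "the best split point is the one that reduces the loss the most", here is a minimal Python sketch for one feature under a squared-error loss. The function name is mine; real treelink scans candidate split points over every (sampled) feature, not just one.

import numpy as np

def best_split(feature, target):
    """Return (split_value, loss_reduction) for a single feature under squared loss.

    Loss before the split: sum of (t - mean)^2 over all samples.
    Loss after the split:  the same quantity computed on each side separately.
    The best split point maximizes the reduction (before - after)."""
    def sse(t):
        return ((t - t.mean()) ** 2).sum() if len(t) else 0.0

    loss_before = sse(target)
    best_value, best_gain = None, 0.0
    for v in np.unique(feature)[:-1]:      # the largest value would leave one side empty
        left, right = target[feature <= v], target[feature > v]
        gain = loss_before - (sse(left) + sse(right))
        if gain > best_gain:
            best_value, best_gain = v, gain
    return best_value, best_gain

# Toy usage: the loss drops most when splitting between 2.0 and 3.0.
f = np.array([1.0, 2.0, 3.0, 4.0])
t = np.array([0.0, 0.1, 0.9, 1.0])
print(best_split(f, t))                    # -> approximately (2.0, 0.81)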

Treelink training:
1. Estimate the initial value;

2. Build M trees; for each tree:

- Update the estimated values of all samples;

- Randomly select a subset of the samples;

- Grow the tree to J leaf nodes as follows:

• For all current leaves:

- Update the estimated values and compute the gradient, the optimal split point, the optimal increment (growth value), and the gain (the amount by which the loss function decreases);

- Select the leaf with the largest gain and its split point, split that leaf, and split its sample subset at the same time;

- Write the increments (growth values) back onto the leaves;

Treelink prediction (a minimal Python sketch follows this list):

• Initialize the target estimate with the initial value;

• For each of the M trees:

- Follow the tree's decision path according to the features x of the input sample to locate a leaf node;

- Update the target estimate with the increment (growth value) stored on that leaf;

• Output the result;
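Here is a minimal Python sketch of the prediction procedure above, with each tree stored as nested split nodes carrying a leaf increment. The dictionary format and function name are illustrative only, not mllib's actual data structures.

def treelink_predict(sample, initial_value, trees):
    """Treelink-style prediction: start from the initial value, then add the
    increment ('growth value') of the leaf reached in each of the M trees."""
    estimate = initial_value
    for tree in trees:
        node = tree
        while "leaf_value" not in node:   # walk down the decision path to a leaf
            go_left = sample[node["feature"]] <= node["threshold"]
            node = node["left"] if go_left else node["right"]
        estimate += node["leaf_value"]    # update the estimate with the leaf increment
    return estimate

# Toy usage: two tiny trees over a 2-feature sample.
trees = [
    {"feature": 0, "threshold": 0.5,
     "left": {"leaf_value": 0.2}, "right": {"leaf_value": -0.1}},
    {"feature": 1, "threshold": 1.0,
     "left": {"leaf_value": 0.05}, "right": {"leaf_value": 0.3}},
]
print(treelink_predict([0.3, 2.0], initial_value=0.0, trees=trees))  # 0.0 + 0.2 + 0.3 = 0.5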

For example, suppose GBDT has learned three decision trees. A sample to be predicted falls into one leaf node of each of the three trees, and the gains on those leaves are (assuming a 3-class problem):

(0.5, 0.8, 0.1), (0.2, 0.6, 0.3), (0.4, 0.3, 0.3).

Summing per class gives (1.1, 1.7, 0.7), so the final classification result is the second class, the class with the largest accumulated gain (it is also the class favored by the most trees).

2. treelink binary classification performance test

2.1 experiment purpose and data preparation

Historical transaction data is used to predict, via treelink, a seller's transaction status, i.e., whether a deal is made; this is cast as a binary classification problem.

Data format: target feature1 feature2 ... feature13, where target takes only two values: 1 means a deal was made and 0 means no deal; each sample is described by 13 features. In the sample data there are 23285 positive samples (target = 1) and 20430 negative samples, a ratio of 1.14:1. The data is split into a training set of 33715 samples and a test set of 10000 samples.

2.2 model parameter settings

The experiments use the treelink module in the mllib 1.2.0 toolkit. The main task is to tune the treelink parameters during training, guided by the descent trend of the loss function and by prediction accuracy, until an optimal parameter combination is found. The configuration files involved are:

mllib.conf:

[data]
filter = 0
sparse = 0
weighted = 0

[model]
cross_validation = 0
evaluation_type = 3
variable_importance_analysis = 1
cfile_name =
model_name = treelink
log_file = <path for storing the log file>
model_file = <path for storing the model file>
param_file = <path to the treelink configuration file>
data_file = <path to the sample data file>
result_file = <path for saving the classification result file>

 

treelink.conf:

[treelink]
tree_count = 500
max_leaf_count = 4
max_tree_depth = 4
loss_type = logistic
shrinkage = 0.15
sample_rate = 0.66
variable_sample_rate = 0.8
split_balance = 0
min_leaf_sample_count = 5
discrete_separator_type = leave_one_out
fast_train = 0

tree_count: the number of decision trees. More trees mean more thorough learning, but too many trees lead to over-fitting and longer training and prediction times. In practice, choose a relatively large number of trees and watch the loss-reduction trend during training; the point where the curve flattens out indicates an appropriate number of trees. tree_count also interacts with shrinkage: the larger the shrinkage, the faster the learning and the fewer trees required.
shrinkage: the step size, which controls the learning speed. A smaller value means more conservative (slower) learning; a larger value means more aggressive (faster) learning. Generally, set shrinkage to a smaller value and the number of trees to a larger value.
sample_rate: the sample sampling rate. To build trees with different tendencies, each tree is trained on a subset of the samples; using too many samples tends to aggravate over-fitting and local-minimum problems. A sampling ratio of 50%-70% is typical.
variable_sample_rate: the feature sampling rate, i.e., each tree learns from a sampled subset of the features rather than from all of them. When a trained model turns out to rely almost entirely on one or two very strong features while the rest are barely used, set this parameter to a value smaller than 1.

loss_type: pay special attention when setting this parameter. Make sure the optimization goal is consistent with the loss function; otherwise the loss will increase rather than decrease during training.

For more information about the other parameters, see the mllib user manual.

2.3 experiment results

The classification performance of the treelink model was evaluated using lift, F_1, and AUC.

Lift: this metric measures how much the model improves predictive ability compared with not using the model at all. The larger the lift, the better the model performs; lift = 1 means the model provides no improvement.

F_1: a combined measure of precision and recall (coverage). Its value increases only when precision and recall increase together:
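The formula referred to is the standard F_1 score:

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}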

AUC: Area Under the ROC Curve, i.e., the area under the ROC curve, typically between 0.5 and 1. A larger AUC indicates better performance. There are several ways to compute AUC; the method used in this experiment is as follows:

Let the number of positive samples be M, the number of negative samples be N, and the total number of samples be n = M + N. First sort the samples by score in ascending order: the sample with the largest score gets rank n, the second largest gets rank n-1, and so on. Then sum the ranks of all positive samples and subtract M(M+1)/2 (the rank sum that would occur if all positive samples had the smallest scores); this gives the number of (positive, negative) pairs in which the positive sample scores higher than the negative one. Dividing by M x N gives the AUC. Note that when scores are tied, the tied samples must share the same rank; in practice, assign each of them the average of their ranks.
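Here is a minimal Python implementation of the rank-based AUC computation just described (my own sketch of the formula, not the original experiment's script); tied scores receive the average of their ranks.

def auc_by_rank(labels, scores):
    """AUC = (sum of positive-sample ranks - M*(M+1)/2) / (M*N), where ranks
    start at 1 in ascending score order and ties share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1                        # extend the block of tied scores
        avg_rank = (i + j) / 2.0 + 1.0    # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    M = sum(labels)                       # number of positive samples
    N = len(labels) - M                   # number of negative samples
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - M * (M + 1) / 2.0) / (M * N)

# Toy usage: one negative sample outranks one positive sample -> AUC = 0.75.
print(auc_by_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))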

Note: lift, F_1, and the ROC curve can be obtained with machine learning packages in the R environment. No ready-made tool was found for the AUC calculation above, so it was implemented in Python.

 

Model Evaluation Result:

AUC = 0.9999, lift = 1.9994, F_1 = 0.9999

Note: the evaluation results above reflect treelink's classification performance after lengthy parameter tuning, and they look suspiciously good. It is not yet possible to tell whether over-fitting has occurred; that can only be verified by repeating the experiments on other data.
