Treelink Model Test Report


1. What is Treelink

Treelink is the internal name used within Alibaba Group; its academic name is GBDT (Gradient Boosting Decision Tree). GBDT belongs to the family of "model combination + decision tree" algorithms. Another algorithm in this family is Random Forest, which is simpler than GBDT.

1.1 The Decision Tree

The decision tree is one of the most widely used classification algorithms. The result of model learning is a decision tree, which can be expressed as a set of if-else rules. A decision tree actually divides the space with hyperplanes: each split cuts the current space into two parts, as in the following decision tree:

In this way, each leaf node corresponds to a non-intersecting region of the space. After the decision tree above has been learned, when we input a sample to be classified, its two features (x, y) route it to exactly one leaf node, which yields the classification result. This is the classification process of the decision tree model. Classical decision tree learning algorithms include ID3 and C4.5.
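As a minimal sketch of what such a learned tree amounts to (the split thresholds and class labels below are hypothetical, not taken from the original figure):

# A learned decision tree expressed as if-else rules over features (x, y).
# The split thresholds (3.0, 5.0) and the labels are illustrative assumptions.
def tree_classify(x, y):
    if x < 3.0:                # hyperplane x = 3.0 splits the space
        if y < 5.0:            # hyperplane y = 5.0 splits the left half
            return "class A"   # leaf 1: region {x < 3, y < 5}
        return "class B"       # leaf 2: region {x < 3, y >= 5}
    return "class A"           # leaf 3: region {x >= 3}

print(tree_classify(2.0, 7.0))  # lands in leaf 2 -> "class B"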

1.2 The Boosting Method

Boosting is one idea for combining models (another important idea is bagging). A boosting algorithm is an iterative process: each new round of training tries to improve on the previous result. The original boosting algorithm assigns a weight to every sample; at the start, all samples are equally important. The model obtained at each training step classifies some data points correctly and others incorrectly. After each step, the weights of the misclassified points are increased and the weights of the correctly classified points are decreased, so points that keep being misclassified receive "serious attention", that is, a high weight. After M iterations, we obtain M simple classifiers (basic learners) and combine them (for example, by weighting them or letting them vote) into a final model. The boosting process:

The green line in the figure indicates the current combined model (composed of the previous m basic learners), and the dotted line indicates the current basic learner. At each round, the model pays more attention to the misclassified data. The red and blue points in the figure represent data; the larger a point, the higher its weight. When M = 150, the resulting model can almost perfectly separate the red and blue points. Boosting can also be expressed with the following formula:
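Y_M(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m y_m(x) \right)

(a reconstruction of the missing formula image, assuming the standard AdaBoost combination of basic learners y_m(x) with weights \alpha_m)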

In the formula, Y_M(x) is the final model obtained by learning. In short, the defining feature of a boosting algorithm is that it "corrects itself wherever it knows it was wrong"!

1.3 Gradient Boosting

Gradient boosting is a boosting method. The difference between gradient boosting and traditional boosting is that each iteration aims to reduce the residual of the previous iteration. To eliminate the residual, a new model is created in the direction of the gradient of the loss function. In gradient boosting, therefore, each new model is built to drive the previous model's residual toward zero along the gradient direction; this differs from traditional boosting, which re-weights correctly and incorrectly classified samples.
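A minimal sketch of this idea for squared loss, where the negative gradient is exactly the residual; scikit-learn's DecisionTreeRegressor stands in for the basic learner, and the data, shrinkage, and tree depth are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0.0, 6.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=200)

F = np.full(len(y), y.mean())        # F0: initial estimate (mean target)
trees, shrinkage = [], 0.15
for m in range(100):
    residual = y - F                 # negative gradient of squared loss
    t = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    F += shrinkage * t.predict(X)    # step in the gradient direction
    trees.append(t)

def predict(X_new):
    return y.mean() + shrinkage * sum(t.predict(X_new) for t in trees)

print(predict(np.array([[1.5]])))    # close to sin(1.5) ~ 0.997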

1.4 The Treelink Model

Unlike a plain decision tree model, treelink is not composed of a single decision tree; it is composed of multiple decision trees, usually hundreds of them, and each tree is small (that is, the depth of each tree is shallow). During prediction, an input sample is assigned an initial value, then each decision tree is traversed in turn, and the predicted value is adjusted and corrected at each step. The final prediction is:
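F(x) = F_0 + \sum_{i=1}^{N} T_i(x)

(a reconstruction of the missing formula image, assuming the standard additive form implied by the description below, with N trees in total)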

Here F0 is the initial value and Ti is a decision tree. For different problems (regression or classification) and different loss functions, the initial value is set differently. For example, for regression with the Gaussian loss function, the initial value is the mean of the targets of the training samples.

Treelink naturally embodies the idea of boosting: it combines a series of weak classifiers into a strong classifier. It does not require any single tree to learn too much; every tree learns a little, and the accumulated knowledge forms a powerful model.

The learning process of the treelink model is the process of building multiple decision trees. When building a tree, the most important step is finding a split point (a value of some feature). In the treelink algorithm, the reduction of the loss function measures how well a candidate split point separates the samples: the more the loss is reduced, the better the split point. That is, the chosen split point divides the samples into two parts so as to minimize the loss function value on the split samples.
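A minimal sketch of this split-point search, assuming squared loss on a single feature (treelink's actual loss depends on loss_type); the gain of a candidate split is the parent's loss minus the children's losses:

import numpy as np

def sse(y):
    # Loss of a node that predicts the mean of its samples (squared error).
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    # Scan candidate thresholds of one feature; keep the split
    # whose loss reduction (gain) is largest.
    best_gain, best_thr = 0.0, None
    parent = sse(y)
    for thr in np.unique(x)[1:]:
        gain = parent - sse(y[x < thr]) - sse(y[x >= thr])
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.1, 0.2, 0.1, 5.0, 5.2, 5.1])
print(best_split(x, y))   # splits between 3 and 10, i.e. threshold 10.0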

Treelink training:

1. Estimate the initial value;

2. Create M trees as follows (a code sketch of the tree-growing step follows this list):

- Update the estimated values of all samples;

- Randomly select a subset of the samples;

- Create J leaves as follows:

  • For all current leaves: update the estimated value and compute the gradient, the optimal split point, the optimal growth value, and the gain (the amount by which the split reduces the loss function);

  • Select the leaf with the largest gain and its split point, split that leaf, and split its sample subset at the same time;

- Write the growth values to the leaves.
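A compact, self-contained sketch of the leaf-wise growth step above, again assuming squared loss and a single feature; J, the loss, and the data layout are illustrative assumptions, not treelink's actual internals:

import numpy as np

def sse(y):
    # Squared-error loss of a node that predicts its mean.
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    # Return (gain, threshold) of the best split, or (0.0, None).
    best_gain, best_thr = 0.0, None
    for thr in np.unique(x)[1:]:
        gain = sse(y) - sse(y[x < thr]) - sse(y[x >= thr])
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr

def grow_tree(x, residual, J=4):
    # Leaves are index sets; repeatedly split the leaf with the largest gain.
    leaves = [np.arange(len(x))]
    while len(leaves) < J:
        splits = [best_split(x[idx], residual[idx]) for idx in leaves]
        i = max(range(len(leaves)), key=lambda k: splits[k][0])
        gain, thr = splits[i]
        if thr is None:
            break                      # no split reduces the loss further
        idx = leaves.pop(i)
        leaves += [idx[x[idx] < thr], idx[x[idx] >= thr]]
    # Growth value of a leaf = mean residual (optimal for squared loss).
    return [(idx, float(residual[idx].mean())) for idx in leaves]

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
r = np.array([0.1, 0.2, 0.1, 5.0, 5.2, 5.1])
for idx, growth in grow_tree(x, r, J=3):
    print(idx, growth)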

Treelink prediction (a code sketch follows these steps):

• Assign the initial estimated value to the target;

• For each of the M trees:

- Locate the leaf node by following the decision tree path according to the features (x) of the input data;

- Update the target estimate with the growth value stored on that leaf node;

• Output the result.
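A minimal sketch of this prediction loop; the nested-tuple tree representation and the numbers are illustrative assumptions:

# Hypothetical representation: a leaf is a float (its growth value);
# an internal node is (feature_index, threshold, left_child, right_child).
def leaf_growth(node, x):
    while not isinstance(node, float):
        feat, thr, left, right = node
        node = left if x[feat] < thr else right   # follow the tree path
    return node

def treelink_predict(f0, trees, x):
    score = f0                        # start from the initial value F0
    for t in trees:                   # each tree corrects the estimate
        score += leaf_growth(t, x)
    return score

trees = [(0, 3.0, -0.2, 0.4),
         (1, 5.0, 0.1, (0, 1.0, -0.3, 0.2))]
print(treelink_predict(0.5, trees, [2.0, 6.0]))   # ~0.5 (= 0.5 - 0.2 + 0.2)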

For example, suppose GBDT has produced three decision trees. When a sample is predicted, it falls into one leaf of each of the three trees, and the gains it receives are (assuming a 3-class problem):

(0.5, 0.8, 0.1), (0.2, 0.6, 0.3), (0.4, 0.3, 0.3)

In this way, the final classification result is the second class, because class 2 receives the highest score in the most trees.
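A tiny check of this example: per-tree voting (as described above) and summing the per-class gains both select class 2 here, and the summation is how GBDT class scores are usually combined:

gains = [(0.5, 0.8, 0.1), (0.2, 0.6, 0.3), (0.4, 0.3, 0.3)]

# Vote: index of the largest gain in each tree (index 1 = class 2).
votes = [max(range(3), key=lambda c: g[c]) for g in gains]
print(votes)    # [1, 1, 0] -> class 2 wins in 2 of 3 trees

# Sum: accumulate the per-class scores across the trees.
totals = [sum(g[c] for g in gains) for c in range(3)]
print(totals)   # approximately [1.1, 1.7, 0.7] -> class 2 is largest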

2. Treelink Binary Classification Performance Test

2.1 Lab Purpose and Data Preparation

Historical transaction data is used to predict a seller's transaction status with treelink, that is, whether a deal occurs; this is cast as a binary classification problem.

Data format: target feature1 feature2 ... feature13, where target takes only two values (1 indicates a deal, 0 indicates no deal) and each sample is described by 13 features. In the sample data there are 23285 positive samples (target = 1) and 20430 negative samples, a ratio of 1.14:1. All sample data is split into a training set of 33715 samples and a test set of 10000 samples.

2.2 Model Parameter Settings

The tests use the treelink module of the mllib 1.2.0 toolkit. The main task is to tune treelink's parameters during training according to the descending trend of the loss function and the prediction accuracy, until an optimal parameter combination is found. The configuration files involved are:

Mllib.conf:

[data]
filter = 0
sparse = 0
weighted = 0

[model]
cross_validation = 0
evaluation_type = 3
variable_importance_analysis = 1
cfile_name =
model_name = treelink
log_file = <log file path>
model_file = <model file path>
param_file = <treelink parameter file path>
data_file = <sample data file path>
result_file = <classification result file path>

 

Treelink.conf:

[treelink]
tree_count = 500
max_leaf_count = 4
max_tree_depth = 4
loss_type = logistic
shrinkage = 0.15
sample_rate = 0.66
variable_sample_rate = 0.8
split_balance = 0
min_leaf_sample_count = 5
discrete_separator_type = leave_one_out
fast_train = 0

Tree_count: the number of decision trees. The more trees, the more adequate the learning, but too many trees cause overfitting and consume training and prediction time. You can start with a relatively large number of trees, then observe the loss-reduction trend during training; the point where the loss curve flattens indicates an appropriate tree count. Tree_count and shrinkage are related: the larger the shrinkage, the faster the learning and the fewer trees required.

Shrinkage: the step size, which represents the learning speed. A smaller value means more conservative (slower) learning; a larger value means more aggressive (faster) learning. Generally, set shrinkage to a smaller value and the number of trees to a larger value.

Sample_rate: the sample sampling rate. To construct trees with different tendencies, each tree is trained on a subset of the samples. Using all of the samples tends to cause overfitting and local-minimum problems. A sampling ratio of 50%-70% is typical.

Variable_sample_rate: the feature sampling rate; each tree learns from a subset of the sample's features rather than all of them. If a trained model relies almost entirely on one or two very strong features while the other features are barely used, set this parameter to a value smaller than 1.

Loss_type: pay special attention when setting this parameter. Make sure the optimization goal is consistent with the loss function; otherwise the loss will increase rather than decrease during training.

For more information about the other parameters, see the mllib user manual.

2.3 Lab Results

The classification performance of the treelink model was evaluated using lift, F_1, and AUC.

Lift: this indicator measures how much better the model's predictions are than a baseline that does not use the model. The larger the lift, the better the model performs; lift = 1 means the model provides no improvement.

F_1: a combined indicator of coverage (recall) and accuracy (precision); it increases only when precision and coverage increase together:
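F_1 = \frac{2 \cdot P \cdot R}{P + R}

(a reconstruction of the missing formula image: the standard F_1 definition, where P is precision/accuracy and R is recall/coverage)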

AUC: Area Under the ROC Curve; its value equals the area under the ROC curve and lies between 0.5 and 1. The larger the AUC, the better the performance. There are multiple ways to compute AUC; the method used in this experiment is as follows:

Let the number of positive samples be M, the number of negative samples be N, and the total sample count be n = M + N. First sort the samples by score in ascending order, so the sample with the largest score has rank n, the sample with the second-largest score has rank n-1, and so on. Then sum the ranks of all the positive samples and subtract M(M+1)/2, which removes the pairs in which a positive sample merely outranks another positive sample; the remainder counts the positive-negative pairs in which the positive sample's score exceeds the negative sample's. Dividing by M x N gives the AUC. Note that samples with equal scores must receive the same rank; in practice, all samples with an equal score take the average of their ranks.

Note: the lift, F_1, and ROC curves can be obtained with machine-learning packages in the R environment. No common tool was found for this AUC calculation, so it was implemented in Python.
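A minimal re-implementation of the rank-based AUC described above (a sketch, not the original script used in the experiment); tied scores receive the average of their ranks:

def auc_by_rank(scores, labels):
    """AUC = (sum of positive ranks - M(M+1)/2) / (M*N), with ranks
    1..n assigned in ascending score order and ties averaged."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:                       # average the ranks of tied scores
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2.0    # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    M = sum(labels)                    # number of positive samples
    N = n - M                          # number of negative samples
    pos_rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (pos_rank_sum - M * (M + 1) / 2.0) / (M * N)

print(auc_by_rank([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75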

 

Model Evaluation Result:

AUC = 0.9999, Lift = 1.9994, F_1 = 0.9999.

Note: the above evaluation results reflect treelink's classification performance after a long period of parameter tuning, and they are suspiciously good. It is not yet possible to rule out overfitting; verification will require repeating the experiments on other data.
