"Young Mans, in the mathematics you don ' t understand things. You just get used to them. "
XGBoost (eXtreme Gradient Boosting) is an efficient implementation of the gradient boosting algorithm. Because it delivers strong results and high efficiency in practical applications, it is widely favored by industry.
To understand the principle of the XGBoost algorithm, we first need to understand boosting. Simply put, boosting is a machine learning method that combines a set of individual learners into a stronger composite learner. It emphasizes strong dependencies among the individual learners, so it can be regarded as a serial ensemble learning method; in contrast, the bagging algorithm is a parallel ensemble learning method.
The basic principle of boosting is: first train a base learner on the initial sample distribution; then adjust the distribution according to how that learner performed, so that poorly handled samples receive more attention; then train the next base learner on the adjusted distribution, and keep iterating until the specified number of base learners is reached.
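Below is a minimal sketch of that loop in Python. The doubling factor used to reweight misclassified samples is a placeholder for illustration (AdaBoost, discussed next, derives the exact factor from the weighted error), and decision stumps stand in for an arbitrary base learner.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    """Generic boosting loop: each round trains a base learner on a
    reweighted sample distribution, shifting weight toward samples
    that the previous learner got wrong."""
    weights = np.full(len(y), 1.0 / len(y))  # start from a uniform distribution
    learners = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        miss = stump.predict(X) != y         # samples this learner mishandles
        weights[miss] *= 2.0                 # placeholder: give them more attention
        weights /= weights.sum()             # renormalize to a distribution
        learners.append(stump)
    return learners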
The most famous member of the boosting family is the AdaBoost algorithm proposed by Freund and Schapire in 1997, which can be interpreted as an "additive model": the final learner is a weighted sum of T base learners. Each new base learner is trained with extra weight on the samples the previous learners misclassified, and the weight adjustment reduces the influence of base learners that performed poorly, ultimately reducing the bias of the ensemble.
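Concretely, for binary classification with labels in $\{-1, +1\}$, the additive model is

$$H(x) = \operatorname{sign}\Big(\sum_{t=1}^{T} \alpha_t\, h_t(x)\Big), \qquad \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},$$

where $h_t$ is the t-th base learner and $\epsilon_t$ its weighted training error, so more accurate base learners receive larger weights in the sum.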
In 2001, Friedman proposed the gradient boosting framework, which extends the loss function to the general case: the regression residuals are generalized to the negative gradient of the loss (strictly speaking, pseudo-residuals), a new base learner is fit to these pseudo-residuals, and its optimal superposition weight is computed. When the base learner is chosen to be a decision tree (such as a CART tree), the result is the GBDT algorithm.
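Here is a minimal Python sketch of this framework for the squared-error loss $L(y, F) = \frac{1}{2}(y - F)^2$, whose negative gradient is exactly the ordinary residual $y - F$; a fixed learning rate stands in for the per-step optimal weight.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1):
    """Gradient boosting with squared-error loss: each round fits a
    regression tree to the pseudo-residuals of the current model."""
    base = y.mean()                           # initial constant model
    F = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - F                      # negative gradient = pseudo-residual
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        F += learning_rate * tree.predict(X)  # superpose the new base learner
        trees.append(tree)
    return base, trees

def predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)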
As a classification tree, CART does not fit the regression loss required by the gradient boosting framework, so the Gini-index splitting criterion is replaced by minimum squared error: splitting stops when the predictions within a branch are all identical or the limit on leaf nodes is reached, and the mean target value of the samples in each leaf is used as that leaf's prediction. In addition, GBDT borrows some ideas from bagging ensemble learning, such as improving the model's generalization ability through random subsampling and selecting optimal parameters through cross-validation.
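Both borrowed ideas are exposed directly in scikit-learn's GradientBoostingRegressor, a common GBDT implementation used here purely for illustration: subsample < 1.0 trains each tree on a random fraction of the data, and GridSearchCV picks parameters by cross-validation.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(subsample=0.8, random_state=0),  # random subsampling
    param_grid={"n_estimators": [100, 200], "max_depth": [2, 3]},
    cv=5,                                                      # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_)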
In 2014, Tianqi Chen proposed the XGBoost algorithm, which can be regarded as a further optimization built on GBDT. First, XGBoost introduces a regularization term into the loss function of the base learner, controlling overfitting during training. Second, XGBoost uses not only the first derivative to compute pseudo-residuals, but also the second derivative: a second-order Taylor expansion of the loss lets it fit the new base learner and prune candidate splits quickly and approximately. XGBoost also contains many engineering optimizations, such as support for parallel computation, improved computational efficiency, handling of sparse training data, and so on.
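The regularized objective from Chen's paper is $\mathrm{Obj} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k)$ with $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$, where T is the number of leaves and w the leaf weights. Below is a minimal usage sketch with the xgboost Python package (assuming it is installed), with the parameters that map to the points above; the specific values are arbitrary examples.

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,  # lambda: L2 regularization on leaf weights
    gamma=0.1,       # gamma: minimum loss reduction to split (prunes weak splits)
    n_jobs=4,        # parallel split finding across features
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))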
To sum up, the XGBoost algorithm derives from the boosting ensemble method and absorbed the advantages of bagging along the way; the gradient boosting framework's custom loss functions broaden the range of problems it can solve, while the larger set of tunable parameters it exposes lets it be optimized for a given problem scenario. Finally, its carefully engineered implementation ensures stable results while processing large-scale data efficiently, with extensible support for different programming languages. Together, these factors make it one of the mainstream machine learning algorithms in industry.
Friends who want to dig deeper into XGBoost can reply with the keyword "xgboost" in the backstage of the Data Small Shrimp WeChat public account to download Tianqi Chen's paper and slides and savor them carefully.
The jianghu of data science is surging,
Roam it together with Data Small Shrimp ~