Python Machine Learning Case Series Tutorial -- the LightGBM Algorithm

Source: Internet
Author: User
Tags: xgboost

Full Stack Engineer Development Manual (author: Shangpeng)

Installation

pip install lightgbm

GitHub site: https://github.com/Microsoft/LightGBM

Chinese documentation: http://lightgbm.apachecn.org/cn/latest/index.html

Introduction to LightGBM

The emergence of XGBoost let data practitioners bid farewell to traditional machine learning algorithms such as RF, GBM, SVM, and LASSO. Now Microsoft has launched a new boosting framework that aims to challenge XGBoost's position.

As the name suggests, LightGBM combines two ideas: "Light" means lightweight, and "GBM" stands for gradient boosting machine.

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

Faster training speed

Lower memory usage

Better accuracy

Support for parallel learning

Capability of handling large-scale data

Shortcomings of XGBoost

XGBoost's disadvantages, or rather its deficiencies, are:

Each iteration requires traversing the entire training data multiple times. If the whole training set is loaded into memory, the size of the training data is limited; if it is not loaded into memory, repeatedly reading and writing the training data costs a great deal of time.

The pre-sorted method has two costs. First, the space consumption is large: the algorithm must store the feature values of the data as well as the results of sorting the features (for example, the sorted indices, used later to compute split points quickly), which consumes roughly twice the memory of the training data. Second, the time cost is also large: when enumerating each candidate split point, the split gain must be computed, which is expensive.

It is not cache-friendly. After pre-sorting, accessing the gradients by feature becomes random access, and different features access the gradients in different orders, so the cache cannot be exploited well. Meanwhile, when growing each level of the tree, a row-index-to-leaf-index array must be accessed randomly, and different features access it in different orders, which also causes many cache misses.

Features of LightGBM

The points above are not so much criticisms of XGBoost as the issues the LightGBM authors focused on when designing the new algorithm: solve these problems, and you address the shortcomings the original model left unsolved.

Generally speaking, LightGBM has the following main features:

Histogram-based decision tree algorithm

Leaf-wise leaf growth strategy with depth limitation

Histogram difference (subtraction) acceleration

Direct support for categorical features

Cache hit rate optimization

Histogram-based sparse feature optimization

Multithreading optimization

The first two features deserve particular attention.

Histogram algorithm

The basic idea of the histogram algorithm is to discretize continuous floating-point feature values into k integers and build a histogram with k bins. While traversing the data, the discretized value is used as an index to accumulate statistics into the histogram. After one pass over the data, the histogram holds the required statistics, and the best split point can then be found by traversing only the k discrete bins of the histogram.
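To make this concrete, here is a minimal, simplified sketch of histogram-based split finding for a single feature, written for illustration only (the function name and the variance-style gain formula are assumptions, not LightGBM's actual implementation): feature values are bucketed into k bins, per-bin gradient statistics are accumulated in one pass, and only the k bins are scanned to choose a split.

import numpy as np

def histogram_best_split(feature, gradients, k=32):
    """Illustrative sketch: find a split for one feature via a k-bin histogram."""
    # 1. Discretize the continuous feature into k integer bins
    edges = np.quantile(feature, np.linspace(0, 1, k + 1)[1:-1])
    bins = np.searchsorted(edges, feature)          # bin index for every sample

    # 2. One pass over the data: accumulate per-bin statistics
    grad_sum = np.zeros(k)
    count = np.zeros(k)
    for b, g in zip(bins, gradients):
        grad_sum[b] += g
        count[b] += 1

    # 3. Scan only the k bins to find the best split point
    total_grad, total_cnt = grad_sum.sum(), count.sum()
    best_gain, best_bin = -np.inf, None
    left_grad = left_cnt = 0.0
    for b in range(k - 1):
        left_grad += grad_sum[b]
        left_cnt += count[b]
        right_grad, right_cnt = total_grad - left_grad, total_cnt - left_cnt
        if left_cnt == 0 or right_cnt == 0:
            continue
        # illustrative gain: how well the split separates the gradient sums
        gain = (left_grad ** 2 / left_cnt + right_grad ** 2 / right_cnt
                - total_grad ** 2 / total_cnt)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain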

Leaf-wise leaf growth strategy with depth limitation

Level-wise growth splits all leaves of the same layer in one pass over the data, which makes multithreading easy and helps control model complexity, so it is not prone to overfitting. In practice, however, level-wise is an inefficient algorithm, because it treats the leaves of the same layer indiscriminately, which brings a lot of unnecessary overhead: many leaves have low split gain and there is no need to search and split them.

Leaf-wise is a more efficient strategy: at each step it finds, among all current leaves, the one with the largest split gain, splits it, and repeats. Therefore, for the same number of splits, leaf-wise can reduce the error more and achieve better accuracy than level-wise.

The disadvantage of leaf-wise is that it may grow much deeper decision trees and overfit. Therefore, LightGBM adds a maximum depth limit on top of leaf-wise growth, which prevents overfitting while maintaining high efficiency.
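As a rough illustration of this growth strategy (not LightGBM's internals; the split_gain() and split() methods are hypothetical), a leaf-wise loop can be sketched as keeping all current leaves in a max-heap keyed by split gain, always splitting the most promising one, and stopping at the leaf or depth limit:

import heapq
import itertools

def grow_leaf_wise(root, max_leaves=31, max_depth=6):
    """Illustrative leaf-wise growth: always split the leaf with the largest gain.
    Assumes each leaf object provides split_gain() and split() -> (left, right)."""
    tie = itertools.count()                            # tie-breaker so leaves are never compared
    heap = [(-root.split_gain(), next(tie), 0, root)]  # max-heap via negated gain
    n_leaves = 1
    while heap and n_leaves < max_leaves:
        neg_gain, _, depth, leaf = heapq.heappop(heap)
        if -neg_gain <= 0 or depth >= max_depth:
            continue                                   # no useful split, or depth limit reached
        left, right = leaf.split()                     # split the most promising leaf
        n_leaves += 1                                  # one leaf became two
        heapq.heappush(heap, (-left.split_gain(), next(tie), depth + 1, left))
        heapq.heappush(heap, (-right.split_gain(), next(tie), depth + 1, right))
    return root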

XGBoost vs. LightGBM

Decision tree algorithm

XGBoost uses the pre-sorted algorithm to locate split points precisely. First, all features are sorted by their numeric values. Second, at each split, a cost of O(#data) is paid to find the best split point for each feature. Finally, the best feature and split point are chosen, and the data are split into left and right child nodes.
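For contrast with the histogram sketch above, here is a minimal illustration of the pre-sorted, exact split search on a single feature (again with an illustrative gain formula, not XGBoost's actual one); every boundary between consecutive sorted samples is evaluated, which is the O(#data) per-feature cost mentioned above.

import numpy as np

def presorted_best_split(feature, gradients):
    """Illustrative sketch of exact split finding on one pre-sorted feature.
    feature and gradients are 1-D NumPy arrays of equal length."""
    order = np.argsort(feature)                  # pre-sorting step (done once per feature)
    g = gradients[order]
    total_grad, n = g.sum(), len(g)
    best_gain, best_threshold = -np.inf, None
    left_grad = 0.0
    for i in range(n - 1):                       # O(#data) candidate split points
        left_grad += g[i]
        right_grad = total_grad - left_grad
        # illustrative gain, as in the histogram sketch above
        gain = (left_grad ** 2 / (i + 1) + right_grad ** 2 / (n - i - 1)
                - total_grad ** 2 / n)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (feature[order[i]] + feature[order[i + 1]]) / 2
    return best_threshold, best_gain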

Disadvantages

This pre-sorting algorithm can pinpoint split points exactly, but it carries a large overhead in both space and time. (i) Memory: it needs roughly twice the memory of the training data, because it must keep the feature values and also the sorted index values (for quickly computing split points later). (ii) Time: computing the split gain requires traversing every candidate split point, which is expensive.

LightGBM uses the histogram algorithm instead, which consumes less memory and has lower complexity when splitting the data.

The idea is to discretize continuous floating-point features into k discrete values and construct a histogram with k bins. The training data is then traversed once, accumulating the statistics of each discrete value in the histogram. When selecting a split, only the k discrete bins of the histogram need to be traversed to find the optimal split point.

Advantages and disadvantages of the histogram algorithm: the histogram algorithm is not perfect. Because the features are discretized, the split points it finds are not exact, which can affect the results. On real data sets, however, the discretized split points have little effect on the final accuracy, and sometimes even improve it. The reason is that a single decision tree is itself a weak learner; the histogram algorithm acts as a form of regularization and effectively prevents overfitting. The time cost drops from O(#data * #features) to O(k * #features), and because of the discretization #bin is much smaller than #data, so the speed-up is large.

The histogram algorithm can be accelerated further: the histogram of a leaf node can be obtained directly by subtracting the histogram of its sibling from the histogram of its parent. Normally, building a histogram requires traversing all the data on the leaf; with this trick, only the k bins of the histogram need to be traversed, roughly doubling the speed.
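The subtraction trick itself is simple; below is a tiny illustrative sketch (with made-up arrays, assuming each leaf's histogram stores per-bin gradient sums and sample counts): one child's histogram is built from its data, and the sibling's is obtained by subtracting it from the parent's over just the k bins.

import numpy as np

# Per-leaf histogram: per-bin gradient sums and sample counts (k bins each)
k = 32
parent_hist = {"grad": np.random.rand(k),
               "count": np.random.randint(1, 100, k).astype(float)}

# Suppose the smaller child's histogram was built by scanning its data
left_hist = {"grad": parent_hist["grad"] * 0.4,
             "count": parent_hist["count"] * 0.4}

# The sibling's histogram comes for free by subtraction over the k bins
right_hist = {"grad": parent_hist["grad"] - left_hist["grad"],
              "count": parent_hist["count"] - left_hist["count"]}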

Decision tree growth strategy

XGBoost grows trees with a level-wise (depth-wise) strategy, layer by layer, as shown in Figure 1. It can split all leaves of the same layer at once, which makes multithreading easy and makes overfitting less likely, but treating the leaves of the same layer indiscriminately brings a lot of unnecessary overhead, because many leaves have low split gain and there is no need to search and split them.

LightGBM uses the leaf-wise growth strategy, as shown in Figure 2: from all current leaves it finds the one with the largest split gain (usually also the one with the most data), splits it, and repeats. Therefore, for the same number of splits, leaf-wise can reduce the error more and achieve better accuracy than level-wise. Its disadvantage is that it may grow much deeper decision trees and overfit, so LightGBM adds a maximum depth limit on top of leaf-wise growth to prevent overfitting while maintaining high efficiency.

Network communication Optimization

Because XGBoost uses the pre-sorted algorithm, its communication cost is very high, so its parallel version also relies on a histogram algorithm. LightGBM's histogram algorithm has a small communication cost, and by using collective communication algorithms it can achieve near-linear speed-up in parallel computation.

LightGBM supports categorical features

In fact, most machine learning tools cannot use categorical features directly; they generally require converting categorical features into one-hot features, which hurts space and time efficiency, yet categorical features are very common in practice. With this in mind, LightGBM optimizes its support for categorical features: they can be fed in directly, without an additional 0/1 expansion, and decision rules for categorical features are built into the decision tree algorithm.
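As a small usage sketch (the DataFrame and its column names are made up for illustration), categorical columns can be handed to LightGBM directly, either by naming them via the categorical_feature argument or by using the pandas category dtype, instead of one-hot encoding them:

import lightgbm as lgb
import numpy as np
import pandas as pd

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({
    "city": pd.Categorical(np.random.choice(["beijing", "shanghai", "shenzhen"], 200)),
    "age": np.random.randint(18, 60, 200),
})
y = np.random.randint(0, 2, 200)

# No one-hot expansion needed: declare the categorical column explicitly ...
train_set = lgb.Dataset(df, label=y, categorical_feature=["city"])

# ... or rely on the pandas "category" dtype being detected automatically
clf = lgb.LGBMClassifier(n_estimators=20)
clf.fit(df, y)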

LightGBM parameter tuning

For the meaning of all parameters, see: http://lightgbm.apachecn.org/cn/latest/Parameters.html

The parameter tuning process is roughly as follows (a combined example of these settings is sketched after the list):

(1) num_leaves

LightGBM uses the leaf-wise algorithm, so when tuning the complexity of the trees it uses num_leaves rather than max_depth.

Approximate conversion relationship: num_leaves = 2^(max_depth)

(2) For data sets with an unbalanced class distribution: set params['is_unbalance'] = 'true'

(3) Bagging parameters: bagging_fraction and bagging_freq (must be set together), plus feature_fraction

(4) min_data_in_leaf and min_sum_hessian_in_leaf
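As referenced above, here is a minimal sketch of how these tuning parameters might be combined into a single parameter dictionary; the concrete values are illustrative assumptions, not recommendations:

params = {
    'objective': 'binary',
    'num_leaves': 31,               # used instead of max_depth; roughly 2^max_depth
    'max_depth': 5,                 # optional extra guard against overfitting
    'is_unbalance': True,           # for unbalanced class distributions
    'bagging_fraction': 0.8,        # sample fraction per bagging round ...
    'bagging_freq': 5,              # ... must be set together with bagging_fraction
    'feature_fraction': 0.8,        # feature subsampling per tree
    'min_data_in_leaf': 20,         # minimum number of samples per leaf
    'min_sum_hessian_in_leaf': 1e-3,
}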

LightGBM example: sklearn interface

This example uses LightGBM through its sklearn-style interface, covering model creation, training, prediction, and grid-search parameter optimization.

import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Load data
print('Load data...')
iris = load_iris()
data = iris.data
target = iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
# df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
# df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')
# y_train = df_train[0].values
# y_test = df_test[0].values
# X_train = df_train.drop(0, axis=1).values
# X_test = df_test.drop(0, axis=1).values

print('Start training...')
# Create and train the model
gbm = lgb.LGBMRegressor(objective='regression', num_leaves=31,
                        learning_rate=0.05, n_estimators=20)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=5)

print('Start predicting...')
# Predict on the test set
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# Evaluate the model
print('The RMSE of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
# Feature importances
print('Feature importances:', list(gbm.feature_importances_))

# Grid search for parameter optimization
estimator = lgb.LGBMRegressor(num_leaves=31)
param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40]
}
gbm = GridSearchCV(estimator, param_grid)
gbm.fit(X_train, y_train)
print('Best parameters found by grid search are:', gbm.best_params_)
LightGBM example: native interface
# coding: utf-8
# pylint: disable = invalid-name, C0111
import json
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

iris = load_iris()
data = iris.data
target = iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Load your own data
# print('Load data...')
# df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
# df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')
# y_train = df_train[0].values
# y_test = df_test[0].values
# X_train = df_train.drop(0, axis=1).values
# X_test = df_test.drop(0, axis=1).values

# Create datasets in LightGBM's own format
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Parameters are passed as a dictionary
params = {
    'task': 'train',
    'boosting_type': 'gbdt',    # type of boosting
    'objective': 'regression',  # objective function
    'metric': {'l2', 'auc'},    # evaluation metrics
    'num_leaves': 31,           # number of leaves
    'learning_rate': 0.05,      # learning rate
    'feature_fraction': 0.9,    # feature selection ratio
    'bagging_fraction': 0.8,    # sample fraction used for bagging
    'bagging_freq': 5,          # perform bagging every k iterations
    'verbose': 1                # <0: fatal only, =0: errors (warnings), >0: info
}

print('Start training...')
# Train (lgb.cv is also available for cross-validation)
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)

print('Save model...')
# Save the model to a file
gbm.save_model('model.txt')

print('Start predicting...')
# Predict on the test set
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# Evaluate the model
print('The RMSE of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
