Application of LR (Logistic Regression) & Xgboost in CTR Prediction
This article will be updated continuously; comments and exchanges are welcome ~
Determined to become a good "alchemist" (model tuner), I felt the pressure as soon as I started working on CTR. Data is the most important factor, and after all I have not done much tuning yet, so I will slowly accumulate experience.
In CTR prediction, the two biggest problems are:
- Imbalanced data. Among the huge number of ad impressions, the number of samples that actually convert is very small.
- Sparse data. The feature information for each sample is far from complete.
LR and xgboost are two commonly used models in CTR prediction, each with its own strengths. Facebook uses xgboost (for feature extraction) + LR (for prediction): the GBDT model is good at handling continuous feature values, while LR is good at handling discrete ones. The continuous features are fed into xgboost; after training we obtain K trees, with n_1, n_2, ..., n_K leaf nodes respectively. Each sample falls on exactly one leaf node in each tree, so using the leaf-node IDs as features gives a sparse feature vector of dimension n_1 + n_2 + ... + n_K, in which exactly K entries are 1 and the rest are 0. These features, together with some other discrete features, are then fed into LR for binary classification.
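A minimal sketch of this xgboost-leaves + LR pipeline, using the sklearn-style xgboost API; the data, variable names, and hyperparameters below are all invented for illustration:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Train K boosting trees on the continuous features.
gbdt = xgb.XGBClassifier(n_estimators=30, max_depth=4)
gbdt.fit(X_train, y_train)

# 2. For every sample, record the leaf index it falls into in each tree
#    (shape: n_samples x K).
leaf_train = gbdt.apply(X_train)
leaf_test = gbdt.apply(X_test)

# 3. One-hot encode the leaf indices: a sparse vector of dimension
#    n_1 + n_2 + ... + n_K with exactly K ones per sample.
enc = OneHotEncoder(handle_unknown='ignore')
lr_train = enc.fit_transform(leaf_train)
lr_test = enc.transform(leaf_test)

# 4. Feed the sparse leaf features (optionally concatenated with other
#    discrete features) into LR for binary classification.
lr = LogisticRegression(max_iter=1000)
lr.fit(lr_train, y_train)
y_pred_prob = lr.predict_proba(lr_test)[:, 1]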
Why are the features for logistic regression (LR) discretized first?
In industry, continuous values are rarely fed to a logistic regression model directly; instead, continuous features are discretized into a series of 0/1 features before being given to the model. This has the following advantages:
1. Sparse-vector inner products are fast to compute, the results are easy to store, and the representation scales well.
2. Discretized features are robust to outliers: for example, a feature "age > 30" is 1, otherwise 0. If the feature is not discretized, an abnormal data point such as "age = 300" would disturb the model greatly.
3. Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N binary variables, each has its own weight, which is equivalent to introducing nonlinearity into the model and enhances its expressive power and fitting capacity.
4. After discretization, features can be crossed, going from M + N variables to M * N variables, which further introduces nonlinearity and enhances expressive power.
5. After discretization the model is more stable. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different sample just by turning one year older. Of course, samples near the interval boundaries behave in exactly the opposite way, so how to choose the intervals is an art in itself (a small sketch illustrating this follows the list).
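A small sketch of the age example from points 2 and 5, assuming pandas; the bucket edges are made up for illustration:

import pandas as pd

ages = pd.Series([18, 25, 29, 31, 45, 300])   # 300 is an outlier

# Discretize into intervals, then one-hot encode: the outlier 300 simply
# falls into the last bucket instead of distorting a continuous weight,
# and a user going from 29 to 30 only moves by one bucket.
buckets = pd.cut(ages, bins=[0, 20, 30, 40, 60, float('inf')],
                 labels=['<=20', '20-30', '30-40', '40-60', '>60'])
one_hot = pd.get_dummies(buckets, prefix='age')
print(one_hot)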
Roughly speaking: discretization simplifies computation, simplifies the model, strengthens its generalization ability, and makes it less susceptible to noise.
For multi-class problems, is xgboost's loss function the softmax function?
In xgboost, for binary classification only one regression tree is built per iteration, using the logistic cross-entropy loss. For a multi-class problem with, say, 10 classes, each iteration builds 10 trees, one per class, each outputting the probability of its class, and the loss should be the softmax loss (to be confirmed).
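A minimal sketch of this multi-class setup, with illustrative data and parameters; with objective 'multi:softprob' and num_class=10, each boosting round grows 10 trees, one per class:

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=10, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {'objective': 'multi:softprob',  # softmax over the classes
          'num_class': 10,
          'max_depth': 4}
booster = xgb.train(params, dtrain, num_boost_round=5)

# 5 rounds x 10 classes = 50 trees in total.
print(len(booster.get_dump()))            # expected: 50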
How do LR and xgboost handle missing values?
For LR, features are generally one-hot encoded first; during this process, the missing values of each feature get their own column indicating whether the value is missing. In xgboost, missing values are represented with np.nan; during training, xgboost presumably does not take the missing samples into account when computing feature importance.
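A rough sketch of both treatments; the column names and toy data are invented:

import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'city': ['bj', None, 'sh', 'bj'],
                   'age': [23, 35, np.nan, 41],
                   'label': [0, 1, 0, 1]})

# For LR: one-hot encode, keeping a separate "is missing" column per feature.
lr_features = pd.get_dummies(df[['city']], dummy_na=True)

# For xgboost: leave missing entries as np.nan; DMatrix treats them as
# missing and the trees learn a default direction for them.
dtrain = xgb.DMatrix(df[['age']], label=df['label'], missing=np.nan)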
How do LR and xgboost deal with data imbalance?
What should be noted when tuning LR?
In fact, LR has relatively few parameters: mainly the regularization method and the penalty coefficient, and for binary classification there are even fewer parameters to tune. What matters more is feature processing and feature selection. C: the penalty coefficient, adjusted according to performance on the validation set. tol: iteration stops when the difference between two consecutive iterations is smaller than tol; this should be set according to your own data. verbose: outputs the training progress; at the start of an experiment it usually helps to watch the training output to find ways to improve.
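A minimal sketch of these parameters using sklearn's LogisticRegression (note that in sklearn, C is the inverse of the regularization strength, so a smaller C means stronger regularization); the values here are just placeholders:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2',   # regularization method
                        C=1.0,          # regularization parameter, tuned on the validation set
                        tol=1e-4,       # stop when the improvement between iterations falls below tol
                        verbose=1,      # print training progress
                        max_iter=1000)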
Generally speaking, data that LR separates well can be handled even better by many other models, so few people put much effort specifically into tuning LR. In our project, the positive and negative samples should be roughly linearly separable, so whichever solver is used should be able to reach a fairly accurate solution.
For more on how LR views the data, and the difference between LR and the perceptron, see the reference: on a linearly separable dataset, is the perceptron model a convex optimization problem?
What should be paid attention to when tuning xgboost?
Xgboost has many more parameters than LR, but its performance is generally better. When training xgboost, we usually split off a test set from the training data and set up a watchlist; during training we pick the best model by watching its performance on the test set. For example:
watchlist = [(dm_train, 'train'), (dm_test, 'test')]  # evaluation sets monitored during training
best_model = xgb.train(params, dm_train, n_round, evals=watchlist, early_stopping_rounds=n_stop_round)
y_pred_prob = best_model.predict(dm_test, ntree_limit=best_model.best_ntree_limit)  # predict with the best iteration found by early stopping
The problem with this is that the resulting model is selected according to the test set, which makes it easy to overfit the test set. So we need cross-validation to select the optimal model: perform cross-validation on the training set, choose the best-performing model, and then predict on the test set. Strictly speaking, though, this only means that once the parameters have been chosen, we obtain the optimal number of boosting rounds this way instead of using early_stopping_rounds on the test set; for the parameter tuning itself we still need grid search. early_stopping_rounds: the number of rounds for early stopping; it requires at least one element in evals, and if there are several, the last one is used, e.g. evals = [(dtrain, 'train'), (dval, 'val')]. verbose_eval (boolean or integer): also requires at least one element in evals. If True, the evaluation results on evals are printed at every iteration; if a number, say 5, they are printed once every 5 iterations.
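A rough sketch of that procedure, assuming dm_train, params, n_round and n_stop_round are defined as in the snippet above and that the objective is binary (for the auc metric); the outer parameter search (max_depth, eta, ...) would then wrap this in an ordinary grid search:

import xgboost as xgb

# Cross-validate on the training set, using early stopping to find the
# optimal number of boosting rounds for the current params.
cv_result = xgb.cv(params, dm_train, num_boost_round=n_round, nfold=5,
                   metrics='auc', early_stopping_rounds=n_stop_round)
best_n_round = len(cv_result)                  # rounds kept after early stopping
final_model = xgb.train(params, dm_train, best_n_round)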
As for xgboost, I am still a novice, so I will note down some write-ups from the experts and learn slowly. It does not feel much easier than training a neural network; tuning xgboost seems even more troublesome than NN "alchemy".
Experience with xgboost parameter tuning
On scientific parameter tuning for xgboost
Kaggle Advanced Series: Zillow competition feature extraction and model fusion (LB ~0.644)