It has been nearly five months since I finished this Kaggle competition; I'm writing a summary today to prepare for autumn recruitment interviews.
Task: build a model that predicts whether a user will download the app after clicking a mobile app ad, based on roughly four days of click data (about 200 million clicks) provided by the organizer.
- The data volume is large: about 200 million records.
- The data is unbalanced: clicks that led to a download are far fewer than clicks that did not.
How to handle the unbalanced dataset:
The usual tricks are oversampling and undersampling: as the names suggest, sample the majority class down a bit or sample the minority class up a bit. In extreme cases, when the minority class is tiny, you can also augment it, for instance by adding noise to the few positive samples. But our prediction problem is a continuous time series, so we cannot resample different parts of it at different rates, which rules out over- and under-sampling here. What we do instead is handle the imbalance inside the algorithm, by introducing a weighting coefficient tied to the imbalance rate, which is roughly 99.7% in this dataset. Concretely, our model (LightGBM) has a parameter called is_unbalance: set it to True and it detects the imbalance rate and reweights automatically, as in the sketch below.
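A minimal sketch of this setting, assuming the LightGBM Python API and a toy stand-in for the real data:

```python
import numpy as np
import lightgbm as lgb

# Toy stand-in for the click log: 7 raw features and roughly 0.3% positives,
# mimicking the ~99.7% / 0.3% split described above.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 7))
y_train = (rng.random(10_000) < 0.003).astype(int)

params = {
    "objective": "binary",
    "metric": "auc",
    # Let LightGBM detect the class ratio and reweight automatically.
    "is_unbalance": True,
    # Alternative: set scale_pos_weight = n_negative / n_positive explicitly.
}

model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=100)
```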
Next, for this unbalanced dataset we also need a more reliable evaluation metric. Plain accuracy is no good. For example, take a classifier that decides whether people have a rare disease: a "fool" classifier that always answers "healthy" can reach an accuracy of several nines, say 99.9%, simply because actual patients are extremely rare in the population, yet the number is meaningless, the classifier separates nothing. So we cannot evaluate with accuracy; the usual metric is AUC (Area Under the Curve), defined as the area under the ROC curve, which obviously cannot exceed 1. The ROC curve plots the false positive rate on the horizontal axis against the true positive rate on the vertical axis, each ranging from 0 to 1, and the AUC is the area under that curve. The larger the AUC, the better the model ranks positives above negatives. Since AUC is unaffected by class imbalance, it is the metric commonly used here; the sketch below contrasts the two metrics.
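A small sketch of why accuracy misleads here while AUC does not, using scikit-learn metrics on made-up labels with the same kind of imbalance:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 100,000 clicks with ~0.3% positives (downloads).
rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.003).astype(int)

# "Fool" classifier: always predicts "no download".
y_pred = np.zeros_like(y_true)          # hard labels for accuracy
y_score = np.zeros(len(y_true))         # constant scores for AUC

print(accuracy_score(y_true, y_pred))   # ~0.997 — looks impressive, means nothing
print(roc_auc_score(y_true, y_score))   # 0.5 — no discriminative power at all
```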
That covers how the unbalanced dataset was handled.
The following is the main workflow.
- The first step is to clean the data and attach labels to it.
- The second step is feature engineering. First, expand compound fields: a timestamp is a one-dimensional value, but breaking it into day, hour, minute and second turns it into four dimensions. Second, frequencies: for example, how often a given device id appears in the whole dataset. The device id is a category feature; turning it into a frequency makes it a continuous-value feature. For frequencies we also introduce a confidence factor: with outlier-like counts, appearing once versus twice in 200 million records makes essentially no difference, yet on the raw frequency scale that is a factor of two. We do not want that gap, so we apply a log to the numerator and denominator, which narrows it. Third, cross combinations between features: for example, for all records sharing the same device id, the count, mean and variance of their channel numbers; these statistics are very informative. Through feature engineering our features grew from 7 to about 50 dimensions (a sketch of these transformations follows after this list).
- The third step is building the model. We used LightGBM here. Two key points about how LightGBM works: 1. it is a tree model, with binary decision trees at the bottom; 2. it applies boosting, one kind of ensemble learning, on top of those trees. The idea of boosting is to build many models sequentially: the second model does not fit the true target but the residual between the previous model's prediction and the truth, and the final prediction adds all the models together. So it is an additive model that keeps approaching the true relationship. Those are the two defining characteristics of LightGBM (a toy illustration of the residual-fitting idea is also given below).
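As referenced above, here is a sketch of the feature-engineering transformations (timestamp split, frequency with a log "confidence factor", and group statistics) on a toy pandas frame; the column names and the exact log formula are illustrative assumptions, not the competition code:

```python
import numpy as np
import pandas as pd

# Toy click log; column names are illustrative, not the exact competition schema.
df = pd.DataFrame({
    "click_time": pd.to_datetime([
        "2017-11-06 14:32:21", "2017-11-06 14:35:02", "2017-11-07 09:01:45",
    ]),
    "device": [1, 1, 2],
    "channel": [245, 280, 245],
})

# 1. Split the one-dimensional timestamp into four dimensions.
df["day"] = df["click_time"].dt.day
df["hour"] = df["click_time"].dt.hour
df["minute"] = df["click_time"].dt.minute
df["second"] = df["click_time"].dt.second

# 2. Category feature -> frequency (continuous value). The log acts as the
#    "confidence factor": it shrinks the gap between counts of 1 and 2.
#    (This particular formula is an assumption for illustration.)
device_count = df.groupby("device")["device"].transform("count")
df["device_freq"] = np.log1p(device_count) / np.log1p(len(df))

# 3. Cross features: statistics of the channel number within each device group.
grp = df.groupby("device")["channel"]
df["device_channel_count"] = grp.transform("count")
df["device_channel_mean"] = grp.transform("mean")
df["device_channel_var"] = grp.transform("var")
```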
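And a toy illustration of the boosting idea, where each new tree fits the residual of the current ensemble and the predictions are added up; this uses plain scikit-learn regression trees for clarity, not LightGBM's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: the point is only the additive, residual-fitting structure.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

prediction = np.zeros_like(y)
learning_rate = 0.1
for _ in range(100):
    residual = y - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # add the new model's contribution

print(np.mean((y - prediction) ** 2))              # training error shrinks tree by tree
```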
Q: What is the difference between LightGBM and GBDT / XGBoost?
A: The differences are:
1. Both need to find the split point with the best gain when growing a tree, but with this much data that search is very expensive. XGBoost chose a pre-sorting algorithm: to find the best split point it first pre-sorts the feature values, which brings the search cost down, but maintaining the sorted structure costs extra memory and time. LightGBM does not use that traditional approach; it uses a statistical one, the histogram algorithm. The advantage is that it does not operate on every individual value: the values are dropped into a fixed number of bins, so the complexity falls quickly, since neighbouring values are effectively merged into one bucket. The histograms often look roughly normal, with a single peak, and by default finding that peak is treated as finding the best split point. In machine learning the split found this way is often not the absolute best one, but after a few more iterations it achieves the same effect, trading a little precision for a large gain in speed (a toy sketch of the binning idea follows after this list).
2. Because a boosting model keeps refining its fit, it can be very accurate, but that also makes it prone to overfitting. XGBoost controls this by limiting the depth of the trees, while LightGBM limits the number of leaf nodes instead.
3. XGBoost does not support category features directly. For example, "you are a man, I am a woman" is a category feature, whereas age (1 year old, 2 years old, 20 years old) is a continuous-value feature. Category features normally have to be one-hot encoded before being fed to XGBoost, but LightGBM does not need this step because the handling is built in (see the sketch after this list).
4. LightGBM is optimized for parallel training and is faster.
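As mentioned in point 1, here is a toy sketch of the histogram idea: bin the feature once, then search split points only over the bins; this is a conceptual illustration, not LightGBM's internal code:

```python
import numpy as np

# Instead of scanning every sorted feature value as a split candidate
# (pre-sorting), drop the values into a fixed number of bins and only
# consider splits at bin boundaries.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)                      # one continuous feature
y = (x + rng.normal(scale=0.5, size=x.shape) > 0).astype(float)

n_bins = 255                                        # LightGBM's default max_bin
edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)

# One O(n) pass builds the histogram; the split search then loops over
# 255 bins instead of a million sorted values.
pos_per_bin = np.bincount(bins, weights=y, minlength=n_bins)
cnt_per_bin = np.bincount(bins, minlength=n_bins)

# Pick the boundary that best separates the labels (a crude purity-style gain).
cum_pos = np.cumsum(pos_per_bin)
cum_cnt = np.cumsum(cnt_per_bin)
left_rate = cum_pos[:-1] / np.maximum(cum_cnt[:-1], 1)
right_rate = (cum_pos[-1] - cum_pos[:-1]) / np.maximum(cum_cnt[-1] - cum_cnt[:-1], 1)
best = int(np.argmax(np.abs(left_rate - right_rate)))
print("best split near x =", edges[best + 1])
```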
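And for point 3, a sketch of the categorical-feature difference: one-hot encoding for a model without native support versus declaring the column categorical in LightGBM; the data and parameter choices here are illustrative only:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": rng.integers(0, 2, size=1_000),   # category feature (integer codes)
    "age": rng.integers(1, 60, size=1_000),     # continuous-value feature
    "label": rng.integers(0, 2, size=1_000),
})

# A model without native categorical support would need one-hot encoding first:
one_hot = pd.get_dummies(df["gender"], prefix="gender")

# LightGBM can take the raw codes directly by declaring the column categorical:
train_set = lgb.Dataset(
    df[["gender", "age"]],
    label=df["label"],
    categorical_feature=["gender"],
)
model = lgb.train({"objective": "binary", "verbose": -1}, train_set, num_boost_round=10)
```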
Q: Why use LightGBM? Did you try other models, for example an LR model?
A: LR and LightGBM have different application scenarios. LR, and the FM family built on logistic regression, suit sparse matrices, while tree models such as LightGBM suit data that is less sparse and has more continuous values. A common industry pattern is to use a tree model to generate features, which come out sparse, and then feed that sparse matrix into LR for the final classification. Because our samples are not sparse, we used LightGBM directly (a sketch of the tree-features-plus-LR pipeline is given below).
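A sketch of the tree-features-plus-LR pipeline mentioned above (tree leaf indices one-hot encoded and fed to logistic regression); this is for illustration only, our competition model did not use this step:

```python
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Dense toy features standing in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 10))
y = (X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(size=5_000) > 0).astype(int)

# 1. Train a small tree model.
gbm = lgb.train({"objective": "binary", "num_leaves": 15, "verbose": -1},
                lgb.Dataset(X, label=y), num_boost_round=30)

# 2. The leaf index each sample falls into, per tree, becomes a sparse
#    one-hot feature matrix.
leaf_idx = gbm.predict(X, pred_leaf=True)           # shape: (n_samples, n_trees)
leaf_onehot = OneHotEncoder().fit_transform(leaf_idx)

# 3. Feed the sparse leaf features to logistic regression as the final classifier.
lr = LogisticRegression(max_iter=1000).fit(leaf_onehot, y)
```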
- Run the model. The resulting AUC is 0.98.
Kaggle Contest Summary