The previous article dealt with small data sets (when getting started, I don't recommend beginning with large ones): you don't have to worry about machine memory, out-of-core/online learning, or distributed training, and can focus on the model itself.
Next I tried two competitions related to ad CTR prediction. Both had already closed, but fortunately you can still submit results and see roughly where you would have ranked.

6. Display Advertising Challenge
Predict click-through rates on display ads. https://www.kaggle.com/c/criteo-display-ad-challenge
This is an ad CTR prediction competition sponsored by Criteo, a well-known advertising company. The data contains about 40 million training samples and 5 million test samples, with 13 numeric features and 26 categorical features; the evaluation metric is logloss.
The industry standard for CTR is generally LR, but with all kinds of feature combinations/transforms the feature space can reach billions of dimensions. I also went with LR here. Missing values were filled with the most frequent value, and the 26 categorical features were one-hot encoded. Plotting the numeric features with pandas showed that they are far from normally distributed and heavily skewed, so instead of scaling them to [0,1] I binned each one into 6 intervals based on its five-number summary (min, 25%, median, 75%, max), with abnormally small/large values falling into intervals 1 and 6 as outliers, and then one-hot encoded the bins. The final feature space is about 1 million dimensions and the training file is 20+ GB.
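As a rough illustration of the binning step, here is a minimal sketch; the column names (num_1..num_13, label) and the CSV layout are assumptions for the example, not the actual Criteo file format.

```python
import pandas as pd

def bin_by_quantiles(s):
    """Map a skewed numeric column into at most 6 intervals based on its
    five-number summary (min, 25%, median, 75%, max); values below min or
    above max (outliers) fall into the first and last interval."""
    q = s.quantile([0.0, 0.25, 0.5, 0.75, 1.0]).values
    edges = [-float("inf")] + list(q) + [float("inf")]
    return pd.cut(s, bins=edges, labels=False, duplicates="drop")

# hypothetical sample file with columns label, num_1..num_13, cat_1..cat_26
df = pd.read_csv("train_sample.csv")
for col in [f"num_{i}" for i in range(1, 14)]:
    filled = df[col].fillna(df[col].mode()[0])     # fill missing with the mode
    df[col + "_bin"] = bin_by_quantiles(filled)

# the binned numeric columns and the 26 categorical columns can now be
# one-hot encoded together (pd.get_dummies on a small sample, or a custom
# sparse encoder for the full data, as discussed next)
```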
Pitfalls worth emphasizing: 1. one-hot encoding is best implemented yourself, unless your machine has enough memory (the off-the-shelf route loads everything into numpy as a dense array); 2. train LR with SGD or mini-batches in out-of-core mode (see http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html), again unless your memory is large enough; 3. think twice before coding: with this much data, the time cost of rerunning after a mid-way error is high. A sketch of points 1 and 2 is given below.
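A minimal sketch combining points 1 and 2: a hand-rolled sparse one-hot encoder plus out-of-core SGD training with partial_fit. The file name, column names, and the tiny value-to-column vocabulary are placeholders; in practice the vocabulary would be built in a first streaming pass over the data.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import SGDClassifier

# toy vocabulary mapping (column, value) pairs to column indices;
# in practice this is built by a first streaming pass over the data
vocab = {("cat_1", "a9f2"): 0, ("cat_1", "77c1"): 1, ("cat_2", "xx01"): 2}
cat_cols = ["cat_1", "cat_2"]

def one_hot(chunk):
    """Encode a chunk as a scipy CSR matrix without ever densifying it."""
    rows, cols = [], []
    for i, values in enumerate(chunk[cat_cols].itertuples(index=False)):
        for col_name, value in zip(cat_cols, values):
            j = vocab.get((col_name, value))
            if j is not None:
                rows.append(i)
                cols.append(j)
    data = np.ones(len(rows))
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(chunk), len(vocab)))

# logistic loss ("log" here; newer sklearn versions call it "log_loss")
clf = SGDClassifier(loss="log", penalty="l2", alpha=1e-6)
for chunk in pd.read_csv("train.csv", chunksize=100_000):  # never load the full file
    y = chunk.pop("label").values
    clf.partial_fit(one_hot(chunk), y, classes=np.array([0, 1]))
```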
I found that sklearn's LR and liblinear's LR perform quite differently: with sklearn, L2 regularization beat L1, while with liblinear L1 beat L2; my understanding is that their optimization methods differ. The best result was liblinear LR with L1, logloss = 0.46601, roughly 227th/718 on the leaderboard, which also fits the intuition that lasso produces sparse models.
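For reference, a quick way to compare the two penalties with sklearn's liblinear solver on a held-out slice; X_train/X_valid and y_train/y_valid are assumed to be sparse one-hot matrices and labels prepared as above, and C is an untuned placeholder.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# compare L1 (lasso-like, sparse weights) vs L2 logistic regression;
# X_train/X_valid, y_train/y_valid are assumed to exist already
for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0)
    clf.fit(X_train, y_train)
    p = clf.predict_proba(X_valid)[:, 1]
    print(penalty,
          "logloss:", log_loss(y_valid, p),
          "nonzero weights:", (clf.coef_ != 0).sum())
```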
I also tried xgboost on its own: logloss = 0.46946. GBRT may simply not work well on such high-dimensional sparse features. Facebook has a paper on using GBRT outputs as transformed features fed into a downstream linear classifier, which achieved good results and is worth a read (Practical Lessons from Predicting Clicks on Ads at Facebook).
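The GBDT-to-LR idea from that paper can be sketched as follows with xgboost and sklearn; X_train/X_valid, y_train and all hyperparameters are illustrative placeholders, not what the paper or the winners actually used.

```python
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# 1) fit a small GBDT; each tree partitions the samples into leaves
gbdt = xgb.XGBClassifier(n_estimators=30, max_depth=6)
gbdt.fit(X_train, y_train)

# 2) apply() returns, per sample, the leaf index reached in every tree;
#    these indices act as learned categorical "crossed" features
leaves_train = gbdt.apply(X_train)
leaves_valid = gbdt.apply(X_valid)

# 3) one-hot encode the leaf indices and feed them to a linear classifier
enc = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(solver="liblinear", C=1.0)
lr.fit(enc.fit_transform(leaves_train), y_train)
pred = lr.predict_proba(enc.transform(leaves_valid))[:, 1]
```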
I only experimented with LR as a baseline, and there is plenty of headroom left. The solutions shared by the forum winners are worth studying, for example: 1. the Vowpal Wabbit tool, which does not distinguish between categorical and numeric features; 2. the libFFM tool for feature cross-combinations; 3. the feature hashing trick (sketched below); 4. adding each feature's CTR as a new feature; 5. multi-model ensembles; and so on.
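As a small illustration of point 3, the hashing trick maps raw "column=value" strings straight into a fixed-width sparse vector, so no vocabulary has to be kept in memory; the sample rows and the 2**20 output dimension below are arbitrary choices for the example.

```python
from sklearn.feature_extraction import FeatureHasher

# hash "column=value" tokens into a fixed 2**20-dimensional sparse vector;
# collisions are possible but rarely hurt much at this dimensionality
hasher = FeatureHasher(n_features=2 ** 20, input_type="string")
rows = [
    ["cat_1=a9f2", "cat_2=77c1", "num_3_bin=4"],
    ["cat_1=b03e", "cat_2=77c1", "num_3_bin=1"],
]
X = hasher.transform(rows)   # scipy.sparse CSR matrix, shape (2, 2**20)
print(X.shape, X.nnz)
```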
7. Avito Context Ad Clicks
Predict if context ads will earn a user's click. https://www.kaggle.com/c/avito-context-ad-clicks
Unlike the previous CTR competition, this data is not anonymized: the features have clear meanings, and the UserInfo/AdInfo/SearchInfo tables need to be joined with the SearchStream file to form complete training/test samples. The data contains 392,356,948 training samples and 15,961,515 test samples, consisting mostly of ID-type categorical features and raw text such as query/title. The evaluation metric is again logloss.
Because the data is so large, producing even one full set of results is very time-consuming, so following the experience from competition 6 I only ran liblinear lasso LR for now. The ultimate goal is to predict for contextual ads, so to reduce the data volume I filtered the non-contextual rows out of *SearchStream. I have not yet used VisitStream, PhoneRequestStream, or Params, although they actually carry valuable signals (for example various similarities between query and title) that could be tried later.
With data this big, sklearn and pandas already struggle on a small-memory machine, so I had to implement the left join and one-hot encoder myself in an out-of-core way, which is really slow. As in competition 6, the price numeric feature was binned into intervals and treated as a categorical feature, then one-hot encoded together with the other categorical features. The final feature space is about 6 million dimensions, stored as a sparse matrix of course, and the train file is about 40 GB. A rough sketch of the chunked join is shown below.
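Here is that chunked left-join sketch. The file names and key columns follow the competition's schema as I recall it, and the ObjectType filter value and chunk size are assumptions. Only AdsInfo is shown being merged; the other info tables can be handled the same way (or via a disk-based sort-merge if they do not fit in memory).

```python
import pandas as pd

# dimension table kept in memory (relatively small), indexed by its join key
ads = pd.read_csv("AdsInfo.tsv", sep="\t").set_index("AdID")

first_chunk = True
for chunk in pd.read_csv("trainSearchStream.tsv", sep="\t", chunksize=1_000_000):
    chunk = chunk[chunk["ObjectType"] == 3]          # keep only rows assumed to be contextual ads
    chunk = chunk.join(ads, on="AdID", how="left")   # streaming left join against AdsInfo
    chunk.to_csv("train_joined.tsv", sep="\t", mode="a",
                 header=first_chunk, index=False)
    first_chunk = False
```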
liblinear apparently does not support mini-batch training, so to save trouble I had to find a large-memory server just to run lasso LR. Because so much valuable information was filtered out as described above, and I did not use anything like libFM or libFFM for feature crosses, the result was not good: logloss was only 0.05028, ranking 248th/414 on the leaderboard.
To do well in competitions you really need to study how the top players approach them and read the relevant papers; blindly running experiments on your own eats too much time. Keep at it!

Summary and thoughts
Participating in Kaggle has improved my practical machine learning skills, given me a better feel for problems and data, and a general sense of which scenarios each model suits. Of course there is still a lot to improve: feature combination/transformation/hashing tricks, model ensembling methods, and making the implementation scalable (for example, using pipelines).
PS: be sure to pick a few efficient toolkits that suit you, read the source code of two or three of them, and ideally customize and optimize them for your needs. I hope you will all join Kaggle; you are welcome to explore and improve together with me.