Link to the Kaggle discussion thread: https://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10555/3-idiots-solution-libffm
---------------------------------------------------------------------------------------------------------------
Practical engineering experience with feature processing:
1. Transforming infrequent features into a special tag. Conceptually, infrequent features should include very little or even no information, so it should be very hard for a model to extract that information. In fact, these features can be very noisy. We got a significant improvement (more than 0.0005) by transforming these features into a special tag.
In a machine learning problem with categorical features, if a categorical feature takes some value A only a handful of times, that value effectively carries no information; we can treat it as noise. The practical treatment is: if a value of a categorical feature appears fewer than, say, 10 times, map it to a single fixed "rare" value (the threshold has to be determined experimentally; machine learning is an experimental art). Collapsing rare values of a categorical feature into one constant improves the result: in the author's experiments the score improved by 0.0005 (roughly a 0.005 lift in CTR, which is a big lift). Again, the threshold itself is chosen by experiment; see the code sketch after this list.
2. Transforming numerical features (I1-I13) to categorical features. Empirically we observe that using categorical features is always better than using numerical features. However, too many features are generated if numerical features are directly transformed into categorical features, so we use v <- floor(log(v)^2) to reduce the number of features generated.
In practice we tend to use discrete features: they are easier for a model to learn from than continuous features. Discrete features have two advantages over continuous features: first, they are more interpretable; second, they are easier to handle. The transform v <- floor(log(v)^2) compresses the range of a continuous feature, so it can be converted into a discrete feature without producing too many distinct values; it acts as a compression step. (In this Kaggle solution the discretized features are hashed, i.e. mapped into a space of about one million dimensions.)
3. Data normalization. According to our empirical experience, instance-wise data normalization makes the optimization problem easier to solve. For example, for the vector x = (0, 0, 1, 1, 0, 1, 0, 1, 0), instance-wise normalization divides each element of x by the 2-norm of x; the normalized x becomes (0, 0, 0.5, 0.5, 0, 0.5, 0, 0.5, 0). The hashing trick does not contribute any improvement on the leaderboard. We apply the hashing trick only because it makes it easier for us to generate features.
In practice, normalize each feature vector instance-wise. The hashing trick does not improve the result much by itself; it is simply a convenient way to discretize and index features (for discretizing both continuous and categorical features, the hashing trick is a very handy tool). A combined sketch of these three steps follows.
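Below is a minimal sketch of the three preprocessing steps above (rare-value tagging, log-based discretization of the numeric fields, instance-wise normalization, plus hashing into a fixed-size index space). The threshold of 10, the one-million-dimensional hash space, and the helper names are illustrative assumptions, not the authors' exact code; rows are assumed to be dicts keyed by field name (e.g. I1-I13 and the categorical fields).

```python
import hashlib
import math
from collections import Counter

HASH_DIM = 10 ** 6       # assumed ~1M-dimensional hash space
RARE_THRESHOLD = 10      # values seen fewer than this many times become the special "rare" tag

def discretize_numeric(v):
    """Compress a numeric field with v <- floor(log(v)^2) before treating it as categorical."""
    if v is None or v == '':
        return 'missing'
    v = float(v)
    if v > 2:
        v = math.floor(math.log(v) ** 2)
    return str(int(v))

def build_rare_sets(rows, cat_fields):
    """Collect, per categorical field, the set of values that occur too rarely to keep."""
    counts = {f: Counter(row[f] for row in rows) for f in cat_fields}
    return {f: {v for v, c in counts[f].items() if c < RARE_THRESHOLD} for f in cat_fields}

def hash_token(token):
    """Hashing trick: map a feature token to a stable index in the fixed-size space."""
    return int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16) % HASH_DIM

def make_features(row, num_fields, cat_fields, rare_sets):
    """Turn one raw row into an instance-wise normalized sparse vector of hashed indices."""
    tokens = []
    for f in num_fields:
        tokens.append(f + '=' + discretize_numeric(row.get(f)))
    for f in cat_fields:
        v = row.get(f) or 'missing'
        if v in rare_sets[f]:
            v = 'rare'                       # special tag for infrequent values
        tokens.append(f + '=' + v)
    indices = [hash_token(t) for t in tokens]
    value = 1.0 / math.sqrt(len(indices))    # every non-zero entry becomes 1 / ||x||_2
    return [(i, value) for i in indices]
```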
---------------------------------------------------------------------------------------------------------------
Using GBDT trees to generate GBDT features yields nonlinear features. Generating features with GBDT trees is a heuristic; there is no rigorous theoretical justification for it.
The features generated by the GBDT trees are discrete: each sample is represented by the index of the leaf it falls into in each tree. Assuming each tree has 255 leaf nodes and there are 30 trees, one-hot encoding the leaf indices produces 255*30 features.
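A minimal sketch of this construction, using scikit-learn's GradientBoostingClassifier on synthetic data rather than the authors' own GBDT setup; the dataset, tree count, and depth here are illustrative, not the competition settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Fit a small GBDT; each tree partitions the samples into its leaves.
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=7, random_state=0)
gbdt.fit(X, y)

# apply() returns, for every sample, the leaf index it reaches in every tree:
# shape (n_samples, n_estimators, 1) for binary classification.
leaves = gbdt.apply(X)[:, :, 0]

# One-hot encode the leaf indices: one binary feature per (tree, leaf) pair.
encoder = OneHotEncoder(handle_unknown='ignore')
gbdt_features = encoder.fit_transform(leaves)
print(gbdt_features.shape)   # (n_samples, total number of distinct leaves across all trees)
```

The resulting leaf-index features are then treated like any other categorical features downstream.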
---------------------------------------------------------------------------------------------------------------
Q: Does special care need to be taken when generating the GBDT features to avoid overfitting?
I didn't think it was needed, but it looks like what I do has some kind of overfitting problem when I try to implement it naively.
A: If you find that the GBDT overfits, I think reducing the number of trees and the depth of each tree could help.
GBDT overfitting can be handled by reducing the number of trees and reducing the depth of each tree.
Note: extracting features with GBDT can itself overfit. GBDT overfits for two reasons: too many trees, or trees that are too deep. One way to judge overfitting is to gather statistics on the generated GBDT features over the samples: if only a handful of samples fall into some leaf node of a tree, that leaf can be regarded as noise, which is overfitting. A small diagnostic sketch follows.
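As a quick check for the symptom just described (leaves that capture only a handful of samples), continuing from the `leaves` array of the previous sketch; the cutoff of 10 samples per leaf is an assumed value.

```python
# Count how many training samples land in each leaf of each tree.
min_samples = 10   # assumed cutoff: leaves with fewer samples are treated as noisy
for tree_idx in range(leaves.shape[1]):
    leaf_ids, counts = np.unique(leaves[:, tree_idx], return_counts=True)
    noisy = leaf_ids[counts < min_samples]
    if len(noisy) > 0:
        print(f"tree {tree_idx}: {len(noisy)} leaves with fewer than {min_samples} samples")
```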
---------------------------------------------------------------------------------------------------------------
Q: Thanks for such a great solution! Hope you don't mind another little question.
How and when did you choose the threshold for transforming infrequent features?
A: This threshold was selected based on experiments. We tried values like 2, 5, 10, and 20, and chose the best one among them. The threshold is determined by experiment.
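A toy sketch of that selection procedure: sweep a few candidate thresholds, collapse values rarer than the threshold into a "rare" tag, retrain, and keep the threshold with the best validation logloss. The synthetic data, the candidate list, and the logistic-regression scorer are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Toy categorical column with a long tail of rare values, and a CTR that depends on it.
cat = rng.zipf(2.0, size=20000)
ctr = np.where(cat % 3 == 0, 0.40, 0.15)
y = (rng.random(cat.size) < ctr).astype(int)
cat = cat.astype(str).reshape(-1, 1)
cat_tr, cat_va, y_tr, y_va = train_test_split(cat, y, random_state=0)

best = None
for threshold in [2, 5, 10, 20]:
    values, counts = np.unique(cat_tr, return_counts=True)
    rare = set(values[counts < threshold])
    tag = lambda col: np.where(np.isin(col, list(rare)), 'rare', col)
    encoder = OneHotEncoder(handle_unknown='ignore')
    model = LogisticRegression(max_iter=1000)
    model.fit(encoder.fit_transform(tag(cat_tr)), y_tr)
    loss = log_loss(y_va, model.predict_proba(encoder.transform(tag(cat_va)))[:, 1])
    if best is None or loss < best[1]:
        best = (threshold, loss)
print("best threshold:", best[0], "validation logloss:", round(best[1], 5))
```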
Machine learning is an experimental science: some thresholds have to be determined by experiment, and many parameters have no empirical formula; the experiments decide.
---------------------------------------------------------------------------------------------------------------
Q: How do you calculate the average CTR?
A: Suppose we have an impression whose label is 1. If the prediction for this impression is 0, then we should get an infinite logloss. In practice this is undesirable, so the submission system applies a ceiling (C) to the logloss: if the logloss for an impression is greater than this ceiling, it is truncated. Note that we do not know the value of this ceiling. Using this behaviour, we can hack the average CTR with two submissions. The first submission contains all ones, so the logloss we get is P1 = (nr_non_click * C) / nr_instance. The second submission contains all zeros, so the logloss we get is P2 = (nr_click * C) / nr_instance.
We can then get the average CTR = P2 / (P1 + P2).
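The unknown ceiling C cancels out: P1 + P2 = (nr_non_click + nr_click) * C / nr_instance = C, so P2 / (P1 + P2) = nr_click / nr_instance, which is exactly the average CTR. A tiny simulation of the trick; the label distribution, the ceiling value, and the scored_logloss helper are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = (rng.random(100000) < 0.22).astype(int)   # hidden test labels, true CTR ~ 0.22
CEILING = 25.0                                     # assumed ceiling, unknown to participants

def scored_logloss(predictions, labels, ceiling=CEILING):
    """Mean logloss as the leaderboard would report it, with per-instance truncation."""
    eps = 1e-15
    p = np.clip(predictions, eps, 1 - eps)
    per_instance = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return np.minimum(per_instance, ceiling).mean()

# Two constant submissions: all ones, then all zeros.
p1 = scored_logloss(np.ones(labels.size), labels)
p2 = scored_logloss(np.zeros(labels.size), labels)

print("recovered average CTR:", p2 / (p1 + p2))
print("true average CTR:     ", labels.mean())
```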
---------------------------------------------------------------------------------------------------------------