Kaggle Address
Reference Model
In fact, the key point of this project is the large number of discrete (categorical) features. The usual way to handle a discrete dimension is to turn each of its levels into a separate dimension whose value is only 0 or 1, but this inevitably causes the number of dimensions to explode. This project is a typical case: after using the merge function to join the user table and the activity table, there are a large number of discrete dimensions. When there are too many dimensions, a common remedy is the so-called "hash trick."
Suppose the discrete dimension is a user's education level. Splitting each level into its own dimension gives the following dimensions:
Graduate or above, undergraduate, college, high school, middle school, primary school
Now take a hash function whose output size is 5:
hash(graduate or above) = 2
hash(undergraduate) = 3
hash(college) = 4
hash(high school) = 2
hash(middle school) = 0
hash(primary school) = 0
Because the hash size is 5, no result can be greater than or equal to 5; every value falls in the range 0-4.
So the original 6 features "graduate or above, undergraduate, college, high school, middle school, primary school" become 5 features, where each new feature's value is the number of times that hash value occurred:

[2, 0, 2, 1, 1]

0 appeared twice, 1 appeared zero times, 2 appeared twice, and so on.
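The counting above can be sketched in a few lines of Python. The bucket assignments are copied from the worked example; a real implementation would compute them with an actual hash function (e.g. MurmurHash) modulo 5:

```python
# Hashing trick sketch reproducing the worked example above.
# Bucket numbers are taken from the text, not computed by a real hash.
N_BUCKETS = 5
BUCKET = {
    "graduate or above": 2,
    "undergraduate": 3,
    "college": 4,
    "high school": 2,
    "middle school": 0,
    "primary school": 0,
}

def hash_features(features):
    """Count how many features fall into each of the N_BUCKETS buckets."""
    counts = [0] * N_BUCKETS
    for f in features:
        counts[BUCKET[f]] += 1
    return counts

print(hash_features(BUCKET.keys()))  # → [2, 0, 2, 1, 1]
```

Collisions (graduate or above and high school both landing in bucket 2) are how the dimensionality is reduced: some information is lost, but the feature space is bounded by the hash size.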
A formula appears in the reference model:
f <- ~ . - people_id - activity_id - date.x - date.y - 1
To explain: this formula is passed to the hashed.model.matrix function. Because the matrix is built only to reduce dimensionality, there is no dependent variable, so the left side of the tilde is left empty. Each minus term excludes a dimension, and the final "- 1" removes the intercept, since hashed.model.matrix would otherwise produce an extra intercept column as the first column.
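A rough Python analogue of what the minus terms in the formula do (the column names beyond the four in the formula are hypothetical):

```python
# Rough Python analogue (assumption) of the R formula
#   f <- ~ . - people_id - activity_id - date.x - date.y - 1
# The "." means all columns; each minus term excludes one,
# so only the remaining columns reach the hashing step.
EXCLUDED = {"people_id", "activity_id", "date.x", "date.y"}

def formula_columns(all_columns):
    """Keep every column ('.') except the explicitly subtracted ones."""
    return [c for c in all_columns if c not in EXCLUDED]

cols = ["people_id", "activity_id", "char_1", "char_2", "date.x"]
print(formula_columns(cols))  # → ['char_1', 'char_2']
```

The "- 1" (dropping the intercept) has no direct analogue here; in R it simply suppresses the constant column that model-matrix builders add by default.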
param <- list(objective = "binary:logistic",
              eval_metric = "auc",
              booster = "gblinear",
              eta = 0.03)
The param above is the parameter list passed to xgboost. You can see the familiar logistic regression objective; eta is the learning rate, i.e. the shrinkage applied to the weight update of each boosting step.
The booster parameter can be either gblinear or gbtree, and the two deserve a brief introduction: gblinear fits a linear model at each boosting round, while gbtree fits decision trees.
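For readers on the Python side, the same configuration in xgboost's Python API would look roughly like this (a sketch; the post itself uses the R interface, and the training call is shown only in a comment):

```python
# The same xgboost configuration as the R `param` list above.
params = {
    "objective": "binary:logistic",  # logistic regression for 0/1 labels
    "eval_metric": "auc",            # area under the ROC curve
    "booster": "gblinear",           # linear booster; "gbtree" would use trees
    "eta": 0.03,                     # learning rate: shrinks each boosting step
}

# With a prepared DMatrix `dtrain` (not shown), training would be e.g.:
# import xgboost as xgb
# model = xgb.train(params, dtrain, num_boost_round=300)
print(params["booster"])  # → gblinear
```

With gblinear, boosting repeatedly refits a linear model on the (hashed, sparse) feature matrix, which pairs naturally with the hash-trick features built earlier; gbtree would instead learn tree splits over the same matrix.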