Kaggle user classification problem


Kaggle Address

Reference Model

In fact, the key point of this project is the presence of a large number of discrete features. The usual way to handle a discrete dimension is to turn each level of each discrete feature into its own dimension (much like pivoting rows into columns in SQL), where that dimension's value is only 0 or 1. But this inevitably leads to an explosion of dimensions. This project is typical: after joining the people table and the activity table with the merge function, there are a large number of discrete dimensions. This is where a method for handling too many dimensions, called the "hashing trick," comes in.

Suppose one of your discrete dimensions is a user's education level. Splitting each feature level into its own dimension gives the following dimensions:

Graduate or above, undergraduate, college, high school, middle school, primary school

Now take a hash function whose output size is 5:

hash(graduate or above) = 2
hash(undergraduate) = 3
hash(college) = 4
hash(high school) = 2
hash(middle school) = 0
hash(primary school) = 0

Because the hash size is 5, every result is less than 5, i.e., in the range 0–4.

So the original 6 features "graduate or above, undergraduate, college, high school, middle school, primary school" become 5 features, where each feature's value is the number of times that hash value occurs:

[2,0,2,1,1]

Hash value 0 appeared twice, 1 appeared zero times, 2 appeared twice, and so on.
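The counting step above can be sketched in a few lines of Python. This is a minimal illustration: the fixed lookup table below simply reproduces the example's hash values, whereas a real hashing trick would compute something like `hash(level) % 5` with a proper hash function (e.g., MurmurHash).

```python
# Toy lookup table reproducing the example's hash values (an assumption
# for illustration; a real implementation would hash each level directly).
hash_table = {
    "graduate or above": 2,
    "undergraduate": 3,
    "college": 4,
    "high school": 2,
    "middle school": 0,
    "primary school": 0,
}

HASH_SIZE = 5

def hash_features(levels):
    """Map a list of feature levels to a HASH_SIZE-dimensional count vector."""
    counts = [0] * HASH_SIZE
    for level in levels:
        counts[hash_table[level]] += 1
    return counts

# The 6 original levels collapse into 5 hashed dimensions.
print(hash_features(list(hash_table)))
```

Note that hash(high school) collides with hash(graduate or above); collisions like this are the price the hashing trick pays for keeping the feature space bounded.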

A formula appears in the Reference Model:

f <- ~ . - people_id - activity_id - date.x - date.y - 1

To explain: because hashed.model.matrix is used here for dimensionality reduction, we do not care about a dependent variable, so nothing is written to the left of the tilde. Subtracting a term excludes that dimension (here people_id, activity_id, date.x, and date.y), and the final "- 1" is there because hashed.model.matrix would otherwise produce an intercept column as the first column.



param <- list(objective = "binary:logistic",
              eval_metric = "auc",
              booster = "gblinear",
              eta = 0.03)

The param list above holds the boosting parameters passed to xgboost. You can see the familiar logistic regression objective; eta is the learning rate, i.e., the factor by which each boosting step's weight adjustment is scaled down.

The booster parameter can be set to gblinear or gbtree, depending on whether linear models or decision trees are used as the base learners.
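For comparison, the same parameters in xgboost's Python interface would be passed as a plain dict (a sketch assuming the Python xgboost package; the original post uses the R interface):

```python
# xgboost parameters mirroring the R param list above.
params = {
    "objective": "binary:logistic",  # logistic regression for binary classification
    "eval_metric": "auc",            # note: xgboost metric names are lowercase
    "booster": "gblinear",           # linear base learner; "gbtree" would use trees
    "eta": 0.03,                     # learning rate scaling each boosting step
}
```

This dict would then be passed as the first argument to `xgb.train` along with the training DMatrix.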

