Introduction to Data Science: A First Look at Kaggle with XGBoost

Source: Internet
Author: User
Tags: datetime, homesite, numeric, xgboost

Kaggle is currently one of the best places for beginners to practice machine learning on real data: it offers real datasets, a large number of experienced competitors, and an active culture of discussion and sharing.

Tree-based boosting/ensemble methods have a strong track record in competitions, and Tianqi Chen's high-quality implementation, XGBoost, makes it easy and efficient to build solutions on top of them; many winning entries have used XGBoost.

This article records a complete (abbreviated, but end-to-end) process of building a model from scratch with xgboost. If you already know the basic concepts of machine learning but have not yet worked with real data, this article may be for you :)

Hands-on: the Homesite competition

Not having heard something is not as good as having heard it; having heard it is not as good as having seen it.
--Xunzi

So let's start with a specific competition: the Homesite competition. The insurance company provides historical user information, insurance quotes, and whether each user eventually bought the policy. The goal is a classification model that predicts, from user information and a quote, whether the user will purchase that policy.

Many contestants release their code for reference (on Kaggle, code shared in the discussion forum is called a kernel). Let's start from an xgboost-based kernel that is simple but works reasonably well.

After downloading the data for the Homesite competition and the kernel code mentioned above, the files are laid out as follows:
Required files

If you have installed NumPy, scikit-learn, Pandas, xgboost, and the other required packages, entering the src folder and running the code will produce a result.

Code explanation

A line-by-line walkthrough would be too verbose, so we only note the important parts.

Read the data

First import the required packages, then read the data and see what it looks like:
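A minimal sketch of this step. Since the real train.csv from Kaggle is not bundled here, a tiny inline sample stands in for it; the column names (QuoteNumber, Original_Quote_Date, QuoteConversion_Flag) follow the Homesite data, but the rows are illustrative:

```python
import io
import pandas as pd

# In the real kernel the data comes from Kaggle's train.csv / test.csv;
# this inline sample keeps the snippet self-contained.
csv = io.StringIO(
    "QuoteNumber,Original_Quote_Date,Field6,QuoteConversion_Flag\n"
    "1,2013-08-16,B,0\n"
    "2,2014-04-22,F,1\n"
    "4,2014-08-25,F,0\n"
)
train = pd.read_csv(csv)

# A first look at the data: shape and the first few rows.
print(train.shape)
print(train.head())
```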

Read the data

Remove meaningless features

QuoteNumber merely identifies each quote and carries no physical meaning, so it can be safely removed from the features without affecting model training:
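A sketch of the drop, on an illustrative frame standing in for the Homesite training data:

```python
import pandas as pd

# Illustrative stand-in for the Homesite training data.
train = pd.DataFrame({
    "QuoteNumber": [1, 2, 4],
    "Original_Quote_Date": ["2013-08-16", "2014-04-22", "2014-08-25"],
    "QuoteConversion_Flag": [0, 1, 0],
})

# QuoteNumber is a row identifier, not a feature: drop it.
train = train.drop("QuoteNumber", axis=1)
print(train.columns.tolist())
```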
Remove an unused column

Convert features to a more physically meaningful format

We find that the dates in the data are plain str values, and we need to convert them to datetime:
Check Date Properties
Convert Date to DateTime
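A sketch of the conversion with `pd.to_datetime`, again on stand-in data:

```python
import pandas as pd

# Stand-in for the Original_Quote_Date column, stored as plain strings.
train = pd.DataFrame({"Original_Quote_Date": ["2013-08-16", "2014-04-22"]})

# Parse the strings into a proper datetime64 column.
train["Date"] = pd.to_datetime(train["Original_Quote_Date"])
print(train["Date"].dtype)
```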

If we break the datetime down into year, month, and day, we get features with clearer physical meaning:
Convert datetime to year/month/day
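A sketch of the extraction using the pandas `.dt` accessor; the exact set of derived columns (year, month, weekday) is an assumption about what the kernel keeps:

```python
import pandas as pd

# Stand-in for the parsed date column.
train = pd.DataFrame({"Date": pd.to_datetime(["2013-08-16", "2014-04-22"])})

# Split the parsed date into separate numeric features.
train["Year"] = train["Date"].dt.year
train["Month"] = train["Date"].dt.month
train["Weekday"] = train["Date"].dt.dayofweek
print(train[["Year", "Month", "Weekday"]])
```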

You can then remove the datetime column:
Remove the datetime column

Check for missing values

Strictly speaking, this step should be done earlier, but for the data in this article it makes little difference.

There are indeed missing values in the data:

Check whether there are missing values

Let's see where these missing values are and what they look like:
See where the missing values are
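A sketch of both checks with `isnull()`; the column names here are illustrative stand-ins for the Homesite columns that contain gaps:

```python
import numpy as np
import pandas as pd

# Stand-in frame with some deliberately missing entries.
train = pd.DataFrame({
    "PersonalField84": [2.0, np.nan, 1.0],
    "PropertyField29": [np.nan, 0.0, np.nan],
    "Field6": ["B", "F", "F"],
})

# Does any column contain missing values at all?
print(bool(train.isnull().any().any()))

# Which columns, and how many missing entries in each?
counts = train.isnull().sum()
print(counts[counts > 0])
```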

Although XGBoost has built-in handling of missing values, more reasonable treatment always requires analyzing the specific data; see the detailed data cleaning/visualization kernel for how it handles missing values.
Here we keep it simple and fill all missing values with a value that cannot occur in the data:
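A sketch of the simple fill; -1 is assumed as the "impossible" sentinel value, and the columns are stand-ins:

```python
import numpy as np
import pandas as pd

# Stand-in frame with missing entries.
train = pd.DataFrame({
    "PersonalField84": [2.0, np.nan, 1.0],
    "PropertyField29": [np.nan, 0.0, np.nan],
})

# Fill every missing entry with -1, a value that never occurs in the
# real columns, so the model can tell "missing" apart from real data.
train = train.fillna(-1)
print(bool(train.isnull().any().any()))
```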
Simple filling of missing values

Of course, more careful treatment is possible here, such as filling numeric columns with a number, but for this task it does not affect the result.

LabelEncode the categorical features

Many features in real-world data are not numeric but categorical, e.g. red/blue/white. Although decision trees are naturally good at handling categorical features, we still need to convert the original string values into category numbers.

Comparing columns, we find that train and test are not identical: the QuoteConversion_Flag column in train is absent from test (QuoteConversion_Flag indicates whether the quote was eventually converted; the test set naturally cannot have it, or there would be nothing to predict).
Columns of the train set and test set

The other columns are the same:
The difference is only QuoteConversion_Flag
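A sketch of the column comparison as a set difference, on stand-in frames:

```python
import pandas as pd

# Stand-ins for the train and test frames: test lacks the target column.
train = pd.DataFrame({"Field6": ["B"], "Year": [2013],
                      "QuoteConversion_Flag": [0]})
test = pd.DataFrame({"Field6": ["F"], "Year": [2014]})

# Columns present in train but not in test: just the target.
diff = set(train.columns) - set(test.columns)
print(diff)
```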

Apply LabelEncoder to the non-numeric features:
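A sketch of the encoding loop. Fitting the encoder on the combined train and test values (so both frames share one mapping) is a common pattern in Homesite kernels, assumed here; the data is again a stand-in:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frames with one string-typed column.
train = pd.DataFrame({"Field6": ["B", "F", "F"], "Year": [2013, 2014, 2014]})
test = pd.DataFrame({"Field6": ["F", "B", "B"], "Year": [2014, 2013, 2013]})

# Encode each object-typed column; fit on train+test values so both
# frames use one consistent string -> integer mapping.
for col in train.columns:
    if train[col].dtype == "object":
        enc = LabelEncoder()
        enc.fit(pd.concat([train[col], test[col]]).astype(str))
        train[col] = enc.transform(train[col].astype(str))
        test[col] = enc.transform(test[col].astype(str))
print(train["Field6"].tolist())
```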

When done, the strings have been converted to category numbers:
Columns after label encoding

Use CV (cross-validation) for model selection of the XGBoost classifier

Use cross-validation to tune the hyperparameters; the parameter space is of course your own choice:
Cross-validation; StratifiedKFold keeps the proportion of positive and negative samples in each fold consistent with the full sample

Finally, export the result with the best model:
Output from the best model
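A sketch of writing the submission file. Kaggle expects a QuoteConversion_Flag probability per QuoteNumber; a logistic regression on synthetic data stands in for the tuned XGBoost model here, purely so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the real features and the tuned model.
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(40, 3), rng.randint(0, 2, 40)
X_test = rng.rand(10, 3)
quote_numbers = np.arange(1, 11)

model = LogisticRegression().fit(X_train, y_train)

# The submission pairs each QuoteNumber with the predicted
# probability of conversion (class 1).
submission = pd.DataFrame({
    "QuoteNumber": quote_numbers,
    "QuoteConversion_Flag": model.predict_proba(X_test)[:, 1],
})
submission.to_csv("submission.csv", index=False)
print(submission.shape)
```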

Of course, you can also look at how the model behaves under each parameter setting:
Comparison of model results under each parameter setting
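A sketch of inspecting per-parameter scores via `GridSearchCV.cv_results_`; a decision tree on synthetic data stands in for the XGBoost classifier so the snippet is self-contained:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a simple learnable rule.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)

# After fitting, cv_results_ holds the mean CV score for every
# parameter combination that was tried.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [1, 2, 3]}, cv=3)
grid.fit(X, y)
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```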

With that, we are basically done. Brief as it is, this is a roughly complete pass through a real data analysis task.
