Kaggle is currently one of the best places for newcomers to practice machine learning on real data: it offers real datasets, a large number of experienced competitors, and a good atmosphere of discussion and sharing.
Tree-based boosting/ensemble methods have achieved good results in practice, and Tianqi Chen's high-quality implementation, XGBoost, makes building a solution on top of them simpler and more efficient; many winning solutions in Kaggle competitions use XGBoost.
This article records a complete (abbreviated, but end-to-end) process of building a model from scratch with xgboost. If you already have some basic machine-learning concepts but have not yet worked with real data, this article may be right for you :).

Hands-on: the Homesite competition
Not having heard something is not as good as having heard it; having heard it is not as good as having seen it; having seen it is not as good as knowing it; knowing it is not as good as putting it into practice.
-- Xunzi
So let's start with a specific competition: the Homesite competition. The data, provided by an insurance company, contains past users' information, the insurance quotes they received, and whether they ultimately bought the insurance. The goal is a classification model that predicts, from a user's information and quote, whether that user will purchase the policy.
Many contestants release their code for others to reference (on Kaggle, code shared in the discussions is called a kernel). Let's start from an xgboost-based kernel that is simple but works reasonably well.
After downloading the data required for the Homesite competition and the kernel code mentioned above, the files are organized as follows:
Required files
If you have installed NumPy, scikit-learn, pandas, xgboost, and the other required packages, you can enter the src folder and run the code to produce a result.

Code explanation
A line-by-line walkthrough would be too verbose, so I will only comment on the important parts.

Read the data
First import the required packages, then read the data and take a look at what it contains:
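A minimal sketch of this step, assuming the competition's train.csv and test.csv sit next to the script (the file paths are illustrative):

```python
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn import preprocessing

# Read the competition data (paths are illustrative)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Quick look at the shapes and the first few rows
print(train.shape, test.shape)
print(train.head())
```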
Read the data

Remove meaningless features
QuoteNumber is only an identifier for each record and has no physical meaning, so it can safely be removed from the features without hurting the trained model:
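A sketch of dropping that column (assuming the column is named QuoteNumber, as in the Homesite data):

```python
# QuoteNumber is just an identifier, not a predictive feature
train = train.drop('QuoteNumber', axis=1)
test = test.drop('QuoteNumber', axis=1)
```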
Remove the unused column

Convert features into a more physically meaningful form
Looking at the data, the date is just a str; we need to convert it to a datetime:
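A sketch of checking and converting the date column, assuming the raw column is named Original_Quote_Date as in the public Homesite kernels:

```python
# The raw quote date is stored as plain strings (dtype 'object')
print(train['Original_Quote_Date'].dtype)

# Parse the strings into proper datetime values
train['Date'] = pd.to_datetime(train['Original_Quote_Date'])
test['Date'] = pd.to_datetime(test['Original_Quote_Date'])

# The original string column is no longer needed
train = train.drop('Original_Quote_Date', axis=1)
test = test.drop('Original_Quote_Date', axis=1)
```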
Check Date Properties
Convert Date to DateTime
If we break the datetime down into components such as the year and the month, we get features with better physical meaning:
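A sketch of that conversion; the exact set of derived columns is your choice (the public kernels commonly add the day of the week as well):

```python
# Derive coarser, more interpretable date features
for df in (train, test):
    df['Year'] = df['Date'].dt.year
    df['Month'] = df['Date'].dt.month
    df['Weekday'] = df['Date'].dt.dayofweek
```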
Convert the datetime into month/day features
The datetime column itself can then be removed:
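For example:

```python
# The raw datetime column is no longer needed after extracting its parts
train = train.drop('Date', axis=1)
test = test.drop('Date', axis=1)
```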
Remove the datetime column

Check for missing values
Strictly speaking, this step should be done earlier, but for the data in this article it makes little difference.
We find that there are indeed missing values in the data:
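A quick way to check (a minimal sketch):

```python
# True if any cell in the training set is missing
print(train.isnull().values.any())
```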
Check if there are missing values
Let's see where these missing values are and what they look like:
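One way to inspect them (a sketch):

```python
# Count missing values per column and show only the affected columns
null_counts = train.isnull().sum()
print(null_counts[null_counts > 0])

# Look at a few rows that contain missing values
print(train[train.isnull().any(axis=1)].head())
```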
See where the missing values are.
Although XGBoost has built-in handling of missing values, more reasonable handling always requires analysing the specific data; you can look at the detailed data cleaning/visualization kernel to see how it deals with missing values.
Here we handle them simply, filling all missing values with a value that cannot occur in the data:
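A sketch using -1 as the filler; the exact "unlikely" value is an arbitrary choice, as long as it cannot appear in the real data:

```python
# Fill every missing value with a sentinel that does not occur in the data.
# (If a string column had missing values, a string sentinel such as 'missing'
# would be safer before label encoding.)
train = train.fillna(-1)
test = test.fillna(-1)
```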
Simple filling of missing values. Of course, more careful handling is possible here, for example filling numeric columns with numbers, but for this task it makes little difference.

LabelEncode the category-type features
Many features in real-world data are not numeric but categorical, such as red/blue/white. Although decision trees are naturally good at handling categorical features, we still need to convert the original string values into category numbers.
Looking at the columns, the features in train and test are not identical: the QuoteConversion_Flag column in train is not in test (QuoteConversion_Flag indicates whether the deal was eventually made; the test set of course cannot have it, otherwise there would be nothing to predict):
Columns of the train set and the test set
The other columns are the same:
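A quick way to confirm this (a sketch):

```python
# The only column present in train but not in test should be the target
print(set(train.columns) - set(test.columns))   # e.g. {'QuoteConversion_Flag'}
print(set(test.columns) - set(train.columns))   # expected to be empty
```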
The difference is only QuoteConversion_Flag
Do LabelEncode on the non-numeric features:
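A sketch following the usual pattern in the public Homesite kernels: fit each encoder on the union of the train and test values so that both sets share one consistent mapping:

```python
from sklearn import preprocessing

for f in train.columns:
    if train[f].dtype == 'object':              # string/categorical column
        lbl = preprocessing.LabelEncoder()
        # Fit on train + test together so the mapping is consistent
        lbl.fit(list(train[f].values) + list(test[f].values))
        train[f] = lbl.transform(list(train[f].values))
        test[f] = lbl.transform(list(test[f].values))
```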
LabelEncode
Once this is done, the strings have been converted into category numbers:
Columns after label encoding

Use CV (cross-validation) to tune the parameters of the XGB classifier
Use CV to tune the parameters; the parameter space, of course, is your own choice:
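A sketch of the grid search, written against the current scikit-learn/xgboost APIs; the parameter grid below is purely illustrative:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import xgboost as xgb

# Separate features and target
y = train['QuoteConversion_Flag']
X = train.drop('QuoteConversion_Flag', axis=1)

# Illustrative parameter grid -- choose your own search space
param_grid = {
    'max_depth': [4, 6],
    'n_estimators': [100, 300],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
}

# StratifiedKFold keeps the positive/negative ratio of each fold
# consistent with the full sample
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

clf = GridSearchCV(xgb.XGBClassifier(objective='binary:logistic'),
                   param_grid, scoring='roc_auc', cv=cv, verbose=1)
clf.fit(X, y)
```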
CV; StratifiedKFold keeps the proportion of positive and negative samples in each fold consistent with that in the full sample
Finally, export the result:
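A sketch, assuming the competition's sample_submission.csv is available to provide the submission format:

```python
# Best cross-validated score and the parameters that achieved it
print(clf.best_score_)
print(clf.best_params_)

# Predict the purchase probability on the test set with the best model
preds = clf.best_estimator_.predict_proba(test)[:, 1]

# Write the submission file (format taken from sample_submission.csv)
sample = pd.read_csv('sample_submission.csv')
sample['QuoteConversion_Flag'] = preds
sample.to_csv('xgb_submission.csv', index=False)
```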
Output Best Model
Of course, you can also see how the model behaves under each parameter combination:
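With current scikit-learn this information lives in cv_results_ (a sketch):

```python
# Compare mean/std test scores across the parameter grid
results = pd.DataFrame(clf.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False))
```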
Comparison of model results under each parameter combination
At this point we are basically done. Although brief, this covers the general, complete flow of hands-on data analysis.