Note: For professional reasons, all the figures in this article have been altered and are not the real numbers. I am sorry that I cannot post the source code.
Goal:
Analyze the customer characteristics associated with each type of insurance product
Background:
At present, the promotion analysis system used by the marketing department only analyzes the information returned from customer surveys, and it covers just four dimensions: age, gender, marital status, and income. Its prediction accuracy is low. The marketing department wants to identify, from existing life insurance customer information, the key factors that affect customers' choice of insurance products, so that marketing activities can be targeted more precisely.
Modeling process:
Input: Personal information was extracted from the existing tens of millions of customer records. After cleaning, more than 100 features remained, including marital status, age, income, height and weight, occupational risk level, and residential area. The category of the product each customer holds is used as the class label: savings insurance, life insurance, term insurance, investment insurance, etc.
Algorithm:
First, use a decision tree to make rough predictions and verify the validity of the input data; then use a random forest to output the important features.
The advantage of the decision tree is that it is intuitive, easy to implement, and can handle both discrete and continuous variables, and adding variables requires little change to the process. One year of customer information was extracted from the data as the training set, and a decision tree was built to predict the category of insurance product chosen by each customer.
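Since the original source code cannot be posted, here is only a minimal sketch of this first step, assuming scikit-learn and a small synthetic table. The three features, the label rule, and the tree depth are all made up for illustration and are not the real data or the real settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned customer table (NOT real data).
n = 2000
X = np.column_stack([
    rng.integers(18, 70, n),        # age
    rng.integers(0, 2, n),          # married (0/1), a discrete variable
    rng.normal(50_000, 15_000, n),  # income, a continuous variable
])

# Hypothetical product label in {0, 1, 2, 3}, derived from age and marital
# status so that the tree has a real signal to find.
y = (X[:, 0] >= 40).astype(int) * 2 + X[:, 1].astype(int)

# Hold out part of the data to measure the hit rate on unseen customers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# One tree handles the discrete and continuous columns without extra encoding.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)

hit_rate = tree.score(X_test, y_test)  # fraction of correct predictions
print(f"hit rate: {hit_rate:.2f}")
```

The hit rate here is high only because the synthetic label is a clean function of two features; real customer data will not separate this easily.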
Results Analysis:
The first run achieved a hit rate of only 40%. Its confusion matrix shows the following:
The decision tree performs very poorly on the last class, to the point of having almost no effect, and it does not separate the third and fourth classes well.
The last class is investment insurance, which indicates that the existing customer features are not enough to distinguish investment insurance from the other classes; more features need to be added.
The third and fourth classes are in fact both term insurance: one is defined by a fixed payment period, the other by coverage up to a fixed age. The difference between them is essentially small, so they can be merged.
Temporarily filter out the investment insurance customers, merge the two term insurance classes into one, and re-run the model to get a new confusion matrix.
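To make this kind of reading of a confusion matrix concrete, here is a small sketch using only NumPy. The five-class encoding (0 = savings, 1 = life, 2 = term by fixed period, 3 = term to a fixed age, 4 = investment) and the toy labels are assumptions for illustration, not the real results.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are the true class, columns are the predicted class."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

def per_class_recall(m):
    """Share of each true class that was predicted correctly."""
    row_totals = m.sum(axis=1)
    return np.divide(
        np.diag(m), row_totals,
        out=np.zeros(len(m)), where=row_totals > 0,
    )

# Toy labels (hypothetical): classes 2 and 3 get confused with each other,
# and class 4 is never predicted correctly.
y_true = [0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 4]
y_pred = [0, 0, 1, 1, 1, 2, 3, 2, 3, 0, 1, 2]

m = confusion_matrix(y_true, y_pred, n_classes=5)
hit_rate = np.trace(m) / m.sum()   # overall share of correct predictions
recall = per_class_recall(m)

print(m)
print("hit rate:", hit_rate)
print("recall per class:", recall)
```

A row whose diagonal entry is near zero corresponds to a class the model cannot recognize, and two rows that spill into each other's columns correspond to classes that are candidates for merging.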
The classification has improved, and the hit rate reaches 60%.
Classes 2, 3, and 4 appear to be well separated; only the first class, savings insurance, is not distinguished well. If that part of the customer information is filtered out, a good hit rate can be achieved.
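The filter-and-merge step before the re-run can be sketched as follows. This is an illustration only: the integer class codes (2 and 3 for the two term insurance classes, 4 for investment insurance) and the helper name `filter_and_merge` are assumptions, not the original implementation.

```python
import numpy as np

INVESTMENT = 4   # hypothetical code for investment insurance
MERGE = {3: 2}   # fold "term to a fixed age" into "term by fixed period"

def filter_and_merge(X, y):
    """Drop investment insurance rows, then merge the term insurance labels."""
    X = np.asarray(X)
    y = np.asarray(y)
    keep = y != INVESTMENT
    y_merged = np.array([MERGE.get(int(label), int(label)) for label in y[keep]])
    return X[keep], y_merged

# Toy example: six customers with two features each.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 2, 3, 4, 3])

X_new, y_new = filter_and_merge(X, y)
print(X_new)
print(y_new)
```

The model is then retrained on `X_new` and `y_new`, and the confusion matrix is recomputed on the reduced set of classes.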
Besides being more accurate than a single decision tree, the more important advantage of the random forest is that it outputs feature importance. And that is exactly what the marketing department needs.
The end result shows that over the past 10 years, customers' marital status, age, and height and weight have contributed the most to their choice of insurance products.
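As a rough sketch of how a random forest produces such a feature ranking, assuming scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute: the feature names, the synthetic data, and the number of trees below are made up for illustration and do not reproduce the real result.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic customer table (NOT real data); "noise" is a deliberately
# irrelevant column that should end up near the bottom of the ranking.
feature_names = ["age", "married", "income", "noise"]
n = 2000
age = rng.integers(18, 70, n)
married = rng.integers(0, 2, n)
income = rng.normal(50_000, 15_000, n)
noise = rng.normal(0, 1, n)
X = np.column_stack([age, married, income, noise])

# Hypothetical product label that depends only on age and marital status.
y = (age >= 40).astype(int) * 2 + married

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:10s} {importance:.3f}")
```

Impurity-based importances can favor features with many distinct values, so a ranking like this is best treated as a starting point for discussion with the business side rather than a final answer.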
The results of the model are finally presented in Tableau, for example:
The trend of each feature's contribution over time
The number of policies in each category of the important features
Random forest: life insurance customer information analysis