For China's major telecom operators, whose overall market size is relatively stable, retaining existing customers is the most important way to secure profits. Predicting the likelihood of customer churn therefore directly determines whether an operator's customer-retention efforts are focused in the right place. This article uses a collected customer churn dataset to demonstrate churn prediction based on the C5.0 algorithm (data download: click to open link).
I. Data structure and preliminary analysis
Read in and view the data (see below). There are 10 variables in total: ID is the unique identifier of each user and must be removed before the prediction analysis, and the churn flag (churn in the code below) is the outcome variable, where "0" means not churned and "1" means churned.
> customers <- read.csv("Customer.csv", stringsAsFactors = FALSE)
> customers$churn <- factor(customers$churn)  ## C5.0 requires the outcome variable to be a factor
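As a quick check of the structure described above, str() lists each variable with its type and first few values (a minimal sketch; the exact column names depend on the file):
> str(customers)  ## 10 variables in total: ID, the churn flag, and 8 predictors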
Looking at overall churn (see below), the number of retained users is clearly larger than the number of churned users.
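That comparison can be reproduced with a simple frequency table (a minimal sketch using the churn column defined above):
> table(customers$churn)  ## raw counts: 0 = not churned, 1 = churned
> prop.table(table(customers$churn))  ## the same split as proportions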
In addition, we can use crosstabs to examine the relationship between each variable and churn (see the series of figures below).
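For each categorical predictor, such a crosstab can be produced as below; contract here is a hypothetical column name standing in for whichever variable is being examined:
> round(prop.table(table(customers$contract, customers$churn), margin = 1), 3)  ## churn rate within each level of the predictor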
From these figures it is not hard to see that users who have signed a service contract, have changed their plan, have made associated purchases, are group users, or subscribe to higher-priced packages have a higher probability of staying; conversely, users without these characteristics churn with relatively high probability.
At the same time, we can examine the distributions of the users' behavioral variables, namely months of use (months_used), extra call time (extra_call_time), and extra traffic (extra_traffic); see the figures below.
> par(mfrow = c(1, 3))  ## set the plotting area to 1 row x 3 columns so the three histograms share one row
> hist(customers$months_used, main = "Distribution of months used", xlab = "Months used", ylab = "Frequency")
> hist(customers$extra_call_time, main = "Distribution of extra call time", xlab = "Extra call time", ylab = "Frequency")
> hist(customers$extra_traffic, main = "Distribution of extra traffic", xlab = "Extra traffic", ylab = "Frequency")
As the histograms show, months of use is mostly concentrated at 12 to 14 months, while extra call time and extra traffic both have fairly concentrated distributions.
II. Churn prediction and model evaluation
First, randomly split the raw data into a training set and a test set.
> set.seed(1)  ## fix the random seed so the split is reproducible
> t_sample <- sample(4975, 4000)  ## draw random row indices; the training set holds 4,000 of the 4,975 records
> c_train <- customers[t_sample, ]  ## extract the training set
> c_test <- customers[-t_sample, ]  ## the test set is everything not drawn into the training set
Check that the class proportions in the training and test sets are consistent with a random split (see below); they basically meet the requirement.
> prop.table(table(c_train$churn))
> prop.table(table(c_test$churn))
Then, train a C5.0 decision tree on the training set.
> c_train <- c_train[-1]  ## drop the ID column
> c_test <- c_test[-1]
> library(C50)
> c_model <- C5.0(c_train[-9], c_train$churn)  ## the predictors must exclude the class variable churn (column 9)
> summary(c_model)  ## inspect the tree
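The fitted tree can also be rendered graphically, which is optional; calling plot() on a C5.0 model assumes the partykit package is installed:
> plot(c_model)  ## draw the fitted decision tree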
As the output above shows, the tree has 7 branches and correctly classifies 3,965 of the training records, an error rate of only 0.9%.
Next, use the test set to evaluate the model c_model:
> c_t_model <- predict(c_model, c_test)
> table(c_test$churn, c_t_model)  ## crosstab of actual vs. predicted values to check accuracy
As the crosstab shows, the model's prediction accuracy reaches 98.15%, with only 18 misclassifications: 4 churned users were predicted as retained, and 14 retained users were predicted as churned.
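The accuracy figure can be recomputed directly from the crosstab (a small sketch reusing the objects created above):
> conf_tab <- table(c_test$churn, c_t_model)
> sum(diag(conf_tab)) / sum(conf_tab)  ## correct predictions (the diagonal) divided by all 975 test records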
Next, try to optimize the model (there are two ways to optimize it; click to open link).
Adaptive boosting combines many weak learners so that the ensemble performs much better than any single one of them. In the C5.0 algorithm, boosting is enabled through the trials parameter, which specifies the number of separate decision trees to build for the ensemble.
> c_model_boost10 <- C5.0(c_train[-9], c_train$churn, trials = 10)  ## trials = 10 is the de facto standard starting value
> c_t_model_b <- predict(c_model_boost10, c_test)
> table(c_test$churn, c_t_model_b)
The prediction accuracy rose to 98.46%, which is not a marked improvement. Moreover, the number of churned users predicted as retained grew to 6, two more than under the original model, and misclassifying churned users as retained is the costlier error (it can be addressed by adding a cost matrix; click to open link), so the optimization effect is not obvious.
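As a sketch of that cost-matrix idea (not part of the original analysis): C5.0() accepts a costs matrix that makes some errors more expensive than others. The penalty of 4 below is an arbitrary illustrative value, and the row/column orientation (rows = predicted, columns = actual) follows a common published convention that should be double-checked against the installed C50 version:
> cost_dims <- list(predicted = c("0", "1"), actual = c("0", "1"))
> error_cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = cost_dims)  ## cost 4 when a churned user ("1") is predicted as retained ("0")
> c_model_cost <- C5.0(c_train[-9], c_train$churn, costs = error_cost)
> table(c_test$churn, predict(c_model_cost, c_test))  ## check whether churned-as-retained errors drop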