Customer churn prediction with the C5.0 algorithm in R

Last Update:2018-08-22 Source: Internet

Author: User


For China's major telecom operators, the overall market size is relatively stable, so retaining existing customers is the most important part of protecting profits. Predicting how likely each customer is to churn therefore determines whether the operator focuses its retention effort on the right customers. This article uses the "Bear" base case on customer churn to demonstrate churn prediction with the C5.0 algorithm. The data set can be downloaded from the link in the original post.

I. Data structure and preliminary analysis

Read and view the data (see below). There are 10 variables in total. ID uniquely identifies each user and has to be removed before the predictive analysis. The lost-user variable is the outcome: "0" means the user has not been lost, "1" means the user has been lost.

> customers <- read.csv("Customer.csv", stringsAsFactors = FALSE)
> str(customers) ## view the structure of the data
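The read-and-inspect step can be tried without the real file. The following is a minimal sketch under assumptions: the inline CSV text, its four columns, and the names `customers_demo` and `features_demo` are hypothetical stand-ins, not the actual contents of Customer.csv.

```r
# Hypothetical miniature of Customer.csv; the real file has 10 columns.
csv_text <- "id,contract,months_used,churned
1,1,13,0
2,0,12,1
3,1,14,0
4,0,12,1"

customers_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(customers_demo)   # column types and first values
dim(customers_demo)   # number of rows and columns

# The ID only identifies a row, so it is dropped before modelling.
features_demo <- customers_demo[, names(customers_demo) != "id"]
```

Dropping the column by name rather than by position keeps the step correct even if the column order in the file changes.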

Look at overall user churn (see below) to compare the number of lost users with the number of users who have not been lost.

We can also use crosstabs to examine the relationship between each variable and user loss (see the figures below).

From the figures above it is easy to see that users who signed a service contract, changed their behavior, made linked purchases, are group users, or are on higher-priced packages are more likely to stay. Conversely, users without these characteristics have a relatively high probability of being lost.
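The overall counts and the crosstabs described above can be produced with base R's `table()` and `prop.table()`. This is a sketch on made-up data: the `demo` data frame and its `contract` and `churned` columns are assumptions chosen for illustration, not values from the article's data set.

```r
# Hypothetical data: 1 = signed a contract / lost, 0 = otherwise.
demo <- data.frame(
  contract = c(1, 1, 1, 1, 0, 0, 0, 0),
  churned  = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# Overall churn counts and shares.
overall <- table(demo$churned)
print(prop.table(overall))

# Crosstab: rows are contract status, columns are churn status.
ct <- table(contract = demo$contract, churned = demo$churned)

# Row-wise shares give the churn rate within each contract group.
row_share <- prop.table(ct, margin = 1)
print(round(row_share, 2))
```

Using `margin = 1` normalizes each row separately, which is what makes the two groups comparable even when they differ in size.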

We can also look at how the users' behavior is distributed, including the number of months of use, the extra traffic, and the extra call length (see the figure below).

> ## column names appear here in English translation; match them to the headers in Customer.csv
> par(mfrow = c(1, 3)) ## lay out the plotting area as 1 row and 3 columns, so that the three graphs sit in the same row
> hist(customers$months_used, main = "Distribution of months of use", xlab = "Months of use", ylab = "Frequency")
> hist(customers$extra_call_length, main = "Distribution of extra call length", xlab = "Extra call length", ylab = "Frequency")
> hist(customers$extra_traffic, main = "Distribution of extra traffic", xlab = "Extra traffic", ylab = "Frequency")

As the histograms show, most users have 12-14 months of use, and the distributions of extra call length and extra traffic are fairly concentrated.

II. Churn prediction and model evaluation

First, you need to randomly divide the raw data into training sets and test sets.

> set.seed(1) ## make the random draw reproducible
> t_sample <- sample(4975, 4000) ## random row indices for the training set, which contains 4,000 records
> c_train <- customers[t_sample, ] ## extract the training set
> c_test <- customers[-t_sample, ] ## extract the test set by excluding the training rows

Check whether the records in the training set and the test set meet the random distribution requirement. As the figure below shows, they basically do.

> prop.table(table(c_train$lost_user))
> prop.table(table(c_test$lost_user))
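The split-and-check step can be sketched on simulated data. Everything here is an assumption made for illustration: the `demo` data frame, the 30% churn probability, and the 80% training share (chosen to be close to the 4,000 of 4,975 records used in the article).

```r
set.seed(1)  # any fixed seed makes the draw reproducible

# Hypothetical data standing in for the customer table.
n <- 1000
demo <- data.frame(
  months_used = sample(10:16, n, replace = TRUE),
  churned     = rbinom(n, size = 1, prob = 0.3)
)

# Draw about 80% of the row indices for training; the rest is the test set.
train_idx  <- sample(nrow(demo), size = round(0.8 * nrow(demo)))
train_demo <- demo[train_idx, ]
test_demo  <- demo[-train_idx, ]

# The class shares should be close in both subsets.
print(prop.table(table(train_demo$churned)))
print(prop.table(table(test_demo$churned)))
```

Sizing the sample from `nrow(demo)` instead of a hard-coded row count keeps the split valid if the data set grows or shrinks.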

Then train a model with the C5.0 decision tree algorithm.

> c_train <- c_train[-1]
> c_test <- c_test[-1] ## drop the ID column

> library(C50)
> c_model <- C5.0(c_train[-9], as.factor(c_train$lost_user)) ## drop the class variable "lost user" from the predictors; C5.0 needs a factor outcome
> summary(c_model) ## view the tree

As the output above shows, the tree has 7 branches and correctly classifies 3,965 of the 4,000 training records, an error rate of only 0.9%.

Next, the test set is used to evaluate the model c_model.

> c_t_model <- predict(c_model, c_test)

> table(c_test$lost_user, c_t_model) ## use a crosstab to check prediction accuracy

The prediction accuracy of the model is about 98.15%, with only 18 wrong predictions: 4 lost users were predicted as not lost, and 14 users who were not lost were predicted as lost.
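The reported accuracy follows directly from the error counts and the size of the test set (4,975 - 4,000 = 975 records). The sketch below redoes that arithmetic and also shows the general diagonal-over-total formula for a confusion table. The article does not say how the 957 correct predictions split between the two classes, so the diagonal values in `conf_demo` are placeholders; the accuracy does not depend on that split.

```r
n_test <- 4975 - 4000   # 975 test records
fn <- 4                 # lost users predicted as not lost
fp <- 14                # users not lost predicted as lost

accuracy <- (n_test - (fn + fp)) / n_test
print(round(100 * accuracy, 2))

# General form: correct predictions sit on the diagonal of the table.
accuracy_from_table <- function(tab) sum(diag(tab)) / sum(tab)

# Placeholder diagonal (700 + 257 = 957 correct); off-diagonals from the text.
conf_demo <- matrix(c(700, fn, fp, 257), nrow = 2,
                    dimnames = list(actual = c("0", "1"),
                                    predicted = c("0", "1")))
print(accuracy_from_table(conf_demo))
```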

Next, try to optimize the model (the article linked from the original post describes two ways to do this).

Adaptive boosting combines many weak learners so that the combined model performs much better than any single one. In C5.0, boosting is enabled through the trials parameter, which sets the number of separate decision trees used in the model.

> c_model_boost10 <- C5.0(c_train[-9], as.factor(c_train$lost_user), trials = 10) ## trials = 10 has become the de facto standard
> c_t_model_b <- predict(c_model_boost10, c_test)
> table(c_test$lost_user, c_t_model_b)
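To compare a single tree with a boosted model without the original data, the following sketch fits both on simulated records. The data, the churn rule, and all variable names are assumptions for illustration only, and the model part runs only if the C50 package is installed.

```r
set.seed(1)
n <- 400
demo <- data.frame(
  contract    = rbinom(n, 1, 0.5),
  months_used = sample(1:24, n, replace = TRUE)
)
# Hypothetical rule with noise: no contract and short tenure means more churn.
p_churn <- ifelse(demo$contract == 0 & demo$months_used < 12, 0.8, 0.1)
demo$churned <- factor(rbinom(n, 1, p_churn), levels = c(0, 1))

train <- demo[1:300, ]
test  <- demo[301:400, ]
predictors <- c("contract", "months_used")

if (requireNamespace("C50", quietly = TRUE)) {
  # Single tree versus 10 boosting trials on the same training data.
  single  <- C50::C5.0(train[, predictors], train$churned)
  boosted <- C50::C5.0(train[, predictors], train$churned, trials = 10)

  pred_single  <- predict(single, test)
  pred_boosted <- predict(boosted, test)

  print(table(actual = test$churned, predicted = pred_single))
  print(table(actual = test$churned, predicted = pred_boosted))
}
```

On data this simple the two tables may well be identical; boosting tends to help most when a single tree leaves systematic errors for later trials to correct.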

The forecast accuracy rose to 98.46%, which is not a large improvement. However, 6 lost users are now predicted as not lost, 2 more than with the original model, and predicting a lost user as not lost is the more costly error (this can be addressed by adding a cost matrix; see the linked article). The optimization effect is therefore limited.
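The point about unequal error costs can be made concrete with a small calculation. The single-tree error counts (4 and 14) come from the text. The boosted model's total of 15 errors is derived from the reported 98.46% accuracy on the 975 test records, which leaves 9 false alarms next to the 6 missed churners. The cost weights are purely hypothetical: a missed churner is assumed to cost five times as much as a false alarm.

```r
n_test <- 4975 - 4000                          # 975 test records

# Boosted model: total errors derived from the reported 98.46% accuracy.
errors_boost <- round(n_test * (1 - 0.9846))   # 15 errors in total

single  <- c(missed_churn = 4, false_alarm = 14)
boosted <- c(missed_churn = 6, false_alarm = errors_boost - 6)

# Hypothetical cost weights, not taken from the article.
cost <- c(missed_churn = 5, false_alarm = 1)

total_cost <- c(single = sum(single * cost), boosted = sum(boosted * cost))
print(total_cost)
```

Under these assumed weights the boosted model is the more expensive one despite its higher accuracy. The `C5.0()` function accepts a `costs` argument for feeding such weights into training; its exact layout should be checked in the package documentation.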

