Abstract: This article shows how to apply machine learning algorithms to a concrete problem. Taking the customer churn rate of telecom operators as an example, it walks through the basic machine learning workflow: defining the problem, analyzing the data, evaluating algorithms, and presenting the final results. The user data comes from a public dataset on the Internet.
1 Defining the problem
Customer churn is an important issue for telecom operators, and it is also a popular case study. It is commonly estimated that acquiring a new customer costs far more than retaining an existing one (often 5-20 times as much), so retaining current customers is very valuable to an operator. In this article, we analyze the customer churn problem on a public dataset and walk through how to apply machine learning prediction to a practical application.
Of course, the actual scenario is much more complex than this example, and if you want to apply it to a project, you need to do a specific analysis of the different scenarios and data.
In machine learning terms, this is a supervised learning problem, and specifically a binary classification problem: each record consists of a set of features plus a final label, churned or retained. Now let's get into the specifics.
2 Analyzing data
Let's start by importing the data and then looking at the basics of the data.
2.1 Data Import
Import the CSV with pandas:
import pandas as pd
import numpy as np

ds = pd.read_csv('./churn.csv')
col_names = ds.columns.tolist()
print("Column names:")
print(col_names)
print(ds.shape)
Output:
Column names:
['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']
(3333, 21)
As you can see, the dataset has 3,333 rows and 21 columns: 20 feature columns plus the final classification label.
2.2 Basic information and types
We can print some data and have a basic understanding of the data and values.
peek = ds.head(5)
print(peek)
Output:
  State  Account Length  Area Code     Phone Int'l Plan VMail Plan  ...
0    KS             128        415  382-4657         No        Yes  ...
1    OH             107        415  371-7191         No        Yes  ...
2    NJ             137        415  358-1921         No         No  ...

   Eve Charge  Night Mins  Night Calls  Night Charge  Intl Mins  Intl Calls  ...
0       16.78       244.7           91         11.01       10.0           3  ...
1       16.62       254.4          103         11.45       13.7           3  ...
2       10.30       162.6          104          7.32       12.2           5  ...

   Intl Charge  CustServ Calls  Churn?
0         2.70               1  False.
1         3.70               1  False.
2         3.29               0  False.

... (remaining rows and columns omitted)
We can see the dataset's 20 features: state, account length, area code, phone number, international plan, voicemail plan, number of voicemail messages, daytime minutes, daytime calls, daytime charge, evening minutes, evening calls, evening charge, night minutes, night calls, night charge, international minutes, international calls, international charge, and number of customer service calls, plus the churn label.
- Some of the personal information is clearly unrelated to churn. State and area code indicate the customer's location; whether location relates to churn is unknown, and a raw location code by itself should carry little signal. Perhaps a certain state has a strong competitor, but that is a blind guess and not significant for now, so we delete these columns.
- The phone number is just an identifier, so it is not needed.
- The international plan and voicemail plan may be related to churn; keep them for now.
- Minutes, number of calls, and charge are counted separately for day, evening, and night. This is important information to keep.
- Customer service calls: customers who call to complain more often may be more likely to churn. This is important information to keep.
- Churned or not: this is the classification label.
Then we can look at the type of data, as follows:
ds.info()
Output:
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
State             3333 non-null object
Account Length    3333 non-null int64
Area Code         3333 non-null int64
Phone             3333 non-null object
Int'l Plan        3333 non-null object
VMail Plan        3333 non-null object
VMail Message     3333 non-null int64
Day Mins          3333 non-null float64
Day Calls         3333 non-null int64
Day Charge        3333 non-null float64
Eve Mins          3333 non-null float64
Eve Calls         3333 non-null int64
Eve Charge        3333 non-null float64
Night Mins        3333 non-null float64
Night Calls       3333 non-null int64
Night Charge      3333 non-null float64
Intl Mins         3333 non-null float64
Intl Calls        3333 non-null int64
Intl Charge       3333 non-null float64
CustServ Calls    3333 non-null int64
Churn?            3333 non-null object
dtypes: float64(8), int64(8), object(5)
memory usage: 546.9+ KB
We can see int, float, and object dtypes. Columns that are not numeric should be converted to numeric form, unless the algorithm (such as a decision tree) can handle categorical values directly.
So we convert the 'Churn?' label, and since "Int'l Plan" and "VMail Plan" take only yes/no values, we convert those two columns to 0/1 as well.
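As a quick sketch of this conversion, here is a tiny synthetic frame standing in for the real dataset (the column names and the 'yes'/'no' and 'True.'/'False.' values are taken from the outputs above):

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the churn dataset.
demo = pd.DataFrame({
    "Int'l Plan": ['no', 'yes', 'no'],
    'VMail Plan': ['yes', 'no', 'no'],
    'Churn?':     ['False.', 'True.', 'False.'],
})

# Map yes/no columns to 1/0 and the label to a binary target.
for col in ["Int'l Plan", 'VMail Plan']:
    demo[col] = np.where(demo[col] == 'yes', 1, 0)
y = np.where(demo['Churn?'] == 'True.', 1, 0)

print(demo["Int'l Plan"].tolist())  # [0, 1, 0]
print(y.tolist())                   # [0, 1, 0]
```

The article itself uses get_dummies for the two plan columns later (section 3.2); np.where is the equivalent one-column shortcut when a feature has exactly two values.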
2.3 Descriptive statistics
describe() returns summary statistics for each numeric column:

- count
- mean
- standard deviation
- minimum
- 25th percentile
- 50th percentile (median)
- 75th percentile
- maximum

Comparing each column's count against the total number of rows is also a quick way to get the number and proportion of NA values.
       Account Length    Area Code  VMail Message     Day Mins    Day Calls
count     3333.000000  3333.000000    3333.000000  3333.000000  3333.000000
mean       101.064806   437.182418       8.099010   179.775098   100.435644
std         39.822106    42.371290      13.688365    54.467389    20.069084
min          1.000000   408.000000       0.000000     0.000000     0.000000
25%         74.000000   408.000000       0.000000   143.700000    87.000000
50%        101.000000   415.000000       0.000000   179.400000   101.000000
75%        127.000000   510.000000      20.000000   216.400000   114.000000
max        243.000000   510.000000      51.000000   350.800000   165.000000

        Day Charge     Eve Mins    Eve Calls   Eve Charge   Night Mins
count  3333.000000  3333.000000  3333.000000  3333.000000  3333.000000
mean     30.562307   200.980348   100.114311    17.083540   200.872037
std       9.259435    50.713844    19.922625     4.310668    50.573847
min       0.000000     0.000000     0.000000     0.000000    23.200000
25%      24.430000   166.600000    87.000000    14.160000   167.000000
50%      30.500000   201.400000   100.000000    17.120000   201.200000
75%      36.790000   235.300000   114.000000    20.000000   235.300000
max      59.640000   363.700000   170.000000    30.910000   395.000000

       Night Calls  Night Charge    Intl Mins   Intl Calls  Intl Charge
count  3333.000000   3333.000000  3333.000000  3333.000000  3333.000000
mean    100.107711      9.039325    10.237294     4.479448     2.764581
std      19.568609      2.275873     2.791840     2.461214     0.753773
min      33.000000      1.040000     0.000000     0.000000     0.000000
25%      87.000000      7.520000     8.500000     3.000000     2.300000
50%     100.000000      9.050000    10.300000     4.000000     2.780000
75%     113.000000     10.590000    12.100000     6.000000     3.270000
max     175.000000     17.770000    20.000000    20.000000     5.400000

       CustServ Calls
count     3333.000000
mean         1.562856
std          1.315491
min          0.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          9.000000
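Since describe()'s count row is what exposes missing values, here is a minimal sketch of counting NAs directly (on a tiny synthetic frame with one missing value; the real churn.csv happens to have none):

```python
import numpy as np
import pandas as pd

# Synthetic frame with one missing value for illustration.
demo = pd.DataFrame({'Day Mins': [179.8, np.nan, 216.4],
                     'Day Calls': [100, 101, 114]})

na_count = demo.isnull().sum()   # number of NAs per column
na_ratio = demo.isnull().mean()  # proportion of NAs per column
print(na_count['Day Mins'], round(na_ratio['Day Mins'], 2))  # 1 0.33
```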
2.4 Graphical understanding of your data
The information so far gives only a preliminary picture, which is not enough for applying machine learning algorithms. Let's pick a few dimensions to understand the data further. We can inspect tables of numbers directly, or draw graphics with matplotlib, which conveys more.
- Each feature's own distribution
- The relationship between features and the classification
- The relationship between features and each other
Here, in the interest of time, we show only part of this. Some of these relationships are not used directly by the algorithm itself, but they matter for further algorithm improvement.
2.4.1 The features themselves
Let's take a look at the churn ratio and the distribution of the number of customer service calls:
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure()
fig.set(alpha=0.2)  # set the alpha parameter of the chart colors

plt.subplot2grid((2, 3), (0, 0))  # lay out several subplots in one figure
ds['Churn?'].value_counts().plot(kind='bar')  # bar chart of churned vs retained
plt.title(u"stat for churn")
plt.ylabel(u"number")

plt.subplot2grid((2, 3), (0, 2))
ds['CustServ Calls'].value_counts().plot(kind='bar')  # bar chart of service-call counts
plt.title(u"stat for cusServCalls")
plt.ylabel(u"number")

plt.show()
It's easy to understand.
Our features come in three kinds of dimensions — minutes, number of calls, and charge — each counted for day, evening, night, and international. Let's take daytime as an example.
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure()
fig.set(alpha=0.2)  # set the alpha parameter of the chart colors

plt.subplot2grid((2, 5), (0, 0))  # lay out several subplots in one figure
ds['Day Mins'].plot(kind='kde')  # kernel density estimate of daytime minutes
plt.xlabel(u"Mins")
plt.ylabel(u"density")
plt.title(u"dis for day mins")

plt.subplot2grid((2, 5), (0, 2))
ds['Day Calls'].plot(kind='kde')  # kernel density estimate of daytime calls
plt.xlabel(u"call")
plt.ylabel(u"density")
plt.title(u"dis for day calls")

plt.subplot2grid((2, 5), (0, 4))
ds['Day Charge'].plot(kind='kde')  # kernel density estimate of daytime charge
plt.xlabel(u"Charge")
plt.ylabel(u"density")
plt.title(u"dis for day charge")

plt.show()
We can see that the distributions are basically Gaussian, which matches our expectations. Gaussian-distributed features are good news for some of our subsequent algorithmic processing.
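One way to quantify "basically Gaussian" beyond eyeballing the density plot is to check skewness and excess kurtosis, which should both be near zero. A minimal sketch on synthetic data (the mean and std are borrowed from the Day Mins statistics above; the real column would be checked the same way):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Day Mins column.
rng = np.random.default_rng(7)
day_mins = pd.Series(rng.normal(loc=179.8, scale=54.5, size=3333))

# Near-zero skewness and excess kurtosis are consistent with a Gaussian.
print(round(day_mins.skew(), 2), round(day_mins.kurt(), 2))
```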
2.4.2 Correlation between features and the classification
Let's look at the correlations between some features and the classification — for example, the international plan:
import matplotlib.pyplot as plt

fig = plt.figure()
fig.set(alpha=0.2)  # set the alpha parameter of the chart colors

int_yes = ds['Churn?'][ds['Int\'l Plan'] == 'yes'].value_counts()
int_no = ds['Churn?'][ds['Int\'l Plan'] == 'no'].value_counts()
df_int = pd.DataFrame({u'int plan': int_yes, u'no int plan': int_no})
df_int.plot(kind='bar', stacked=True)
plt.title(u"statistic between int plan and churn")
plt.xlabel(u"int or not")
plt.ylabel(u"number")
plt.show()
We can see that customers on the international plan churn at a noticeably higher rate. Perhaps they have more alternatives, or higher requirements for the service, and need special treatment — it may be worth calling them to collect more feedback.
Let's look at another one:
# Look at the relation between customer service calls and the outcome
fig = plt.figure()
fig.set(alpha=0.2)  # set the alpha parameter of the chart colors

cus_0 = ds['CustServ Calls'][ds['Churn?'] == 'False.'].value_counts()
cus_1 = ds['CustServ Calls'][ds['Churn?'] == 'True.'].value_counts()
df = pd.DataFrame({u'churn': cus_1, u'retain': cus_0})
df.plot(kind='bar', stacked=True)
plt.title(u"Statistic between customer service call and churn")
plt.xlabel(u"Call service")
plt.ylabel(u"Num")
plt.show()
We can see that the number of customer service calls is strongly related to the final classification: beyond 3 calls, the churn rate rises rapidly. This is a very key indicator.
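The same relationship can be read off numerically with a groupby. A minimal sketch on a tiny synthetic sample (column names and label values as in the dataset above; the counts are illustrative, not the real data):

```python
import pandas as pd

# Tiny synthetic sample standing in for the churn dataset.
demo = pd.DataFrame({
    'CustServ Calls': [0, 1, 1, 4, 4, 5],
    'Churn?':         ['False.', 'False.', 'True.', 'True.', 'True.', 'True.'],
})
demo['churned'] = (demo['Churn?'] == 'True.').astype(int)

# Churn rate per number of service calls; a jump past ~3 calls is the signal.
rate = demo.groupby('CustServ Calls')['churned'].mean()
print(rate.to_dict())  # {0: 0.0, 1: 0.5, 4: 1.0, 5: 1.0}
```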
3 Preparing data
Well, we've seen a lot and have a certain understanding of the data. Let's start working on the data in detail.
3.1 Removing unrelated columns
First, based on our analysis of the problem, we remove the three unrelated columns: state, phone number, and area code.
- We'll do this together with the next step.
3.2 Convert to numeric type
Some features are not numeric and cannot be used directly by the algorithms, so let's convert them:
# Isolate target data
ds_result = ds['Churn?']
Y = np.where(ds_result == 'True.', 1, 0)

dummies_int = pd.get_dummies(ds['Int\'l Plan'], prefix='_int\'l Plan')
dummies_voice = pd.get_dummies(ds['VMail Plan'], prefix='VMail')
ds_tmp = pd.concat([ds, dummies_int, dummies_voice], axis=1)

# We don't need these columns
to_drop = ['State', 'Area Code', 'Phone', 'Churn?', 'Int\'l Plan', 'VMail Plan']
df = ds_tmp.drop(to_drop, axis=1)

print("after convert")
print(df.head(5))
Output:
after convert
   Account Length  VMail Message  Day Mins  Day Calls  Day Charge  Eve Mins
0             128             25     265.1        110       45.07     197.4
1             107             26     161.6        123       27.47     195.5
2             137              0     243.4        114       41.38     121.2

   Intl Calls  Intl Charge  CustServ Calls  _int'l Plan_no  _int'l Plan_yes
0           3         2.70               1               1                0
1           3         3.70               1               1                0
2           5         3.29               0               1                0

   VMail_no  VMail_yes
0         0          1
1         0          1
2         1          0

... (remaining rows and columns omitted)
We can see from the result that all the data are now numeric, and the columns that are not meaningful to us have been removed.
3.3 Scale data range
We need to do some scaling, because some attributes span very different ranges.
- For logistic regression and gradient descent, features on very different scales can greatly slow convergence.
- We scale all features here, but you could also restrict this to a few prominent ones.
# scale
X = df.values.astype(np.float64)

# This is important
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)
print("Unique target labels:", np.unique(Y))
Output:
Feature space holds 3333 observations and 19 featuresUnique target labels: [0 1]
You could also consider other transformations, such as dimensionality reduction. In practice, however, we usually build a model first to get a baseline result, and then optimize gradually. So we'll stop the data preparation here.
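For reference, the dimensionality-reduction option mentioned above can be sketched with PCA. This is a minimal sketch on a synthetic standardized matrix standing in for X (the real X from the scaling step would be passed the same way):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic standardized feature matrix standing in for X.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 19))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95, random_state=7)
X_reduced = pca.fit_transform(X_demo)
print(X_reduced.shape)
```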
4 Evaluation algorithm
We will run several algorithms, compare their results, and then pick the better ones, as follows:
# Prepare models
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
from matplotlib import pyplot

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# Evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, random_state=7)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# Boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()
LR: 0.860769 (0.021660)LDA: 0.852972 (0.021163)KNN: 0.896184 (0.016646)CART: 0.920491 (0.012471)NB: 0.857179 (0.015487)SVM: 0.921091 (0.016828)
As we can see, SVM and CART perform relatively well.
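A natural next step is to tune the hyperparameters of the better models. Here is a minimal grid-search sketch over the decision tree's depth — synthetic data stands in for the churn features, and the grid values are illustrative, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the churn data.
X_demo, y_demo = make_classification(n_samples=300, n_features=19, random_state=7)

# Small, illustrative grid over tree depth.
grid = GridSearchCV(DecisionTreeClassifier(random_state=7),
                    param_grid={'max_depth': [3, 5, 10, None]},
                    cv=5, scoring='accuracy')
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies to SVC (e.g. a grid over C and gamma), at a higher compute cost.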
5 Improving results
This section shows how to improve the results with ensemble algorithms, such as random forests and gradient boosting (e.g. XGBoost).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

num_trees = 100
max_features = 3
kfold = KFold(n_splits=10, random_state=7)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
# 0.954696013379
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

seed = 7
num_trees = 100
kfold = KFold(n_splits=10, random_state=seed)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
# 0.953197209185
As you can see, both ensemble algorithms improve clearly on any single algorithm. Going further, you could also tune the number of trees, but the additional effect would likely be marginal.
6 Show Results
Here is how to save the trained model, and how to load it back and apply it.
# store
from pickle import dump
from pickle import load
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

seed = 7
num_trees = 100
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
model.fit(X_train, Y_train)

# save the model to disk
filename = 'finalized_model.sav'
dump(model, open(filename, 'wb'))

# some time later...
# load the model from disk
loaded_model = load(open(filename, 'rb'))
result = loaded_model.score(X_test, Y_test)
print(result)
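Applying the loaded model to a new customer record is then a single call. A self-contained sketch of the round-trip (synthetic data stands in for the real feature matrix, and the file name is illustrative):

```python
from pickle import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the churn feature matrix.
X_demo, y_demo = make_classification(n_samples=200, n_features=19, random_state=7)
model = GradientBoostingClassifier(n_estimators=50, random_state=7).fit(X_demo, y_demo)

# Round-trip through pickle, then score one "new" customer row.
with open('model_demo.sav', 'wb') as f:
    dump(model, f)
with open('model_demo.sav', 'rb') as f:
    loaded = load(f)

pred = loaded.predict(X_demo[:1])         # class label: 1 = will churn
proba = loaded.predict_proba(X_demo[:1])  # class probabilities, useful for ranking customers
print(int(pred[0]), proba.shape)
```

In production, predict_proba is often more useful than the hard label: it lets you rank customers by churn risk and target retention offers at the top of the list.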
7 PostScript
This article used the customer churn problem to show how to apply the machine learning prediction workflow to an actual project.
- From a business perspective, this is just a demo application; real scenarios can be much more complex.
- From a process perspective, further data analysis can improve algorithm performance, and some features may deserve different treatment. For example, missing-value handling: the data here is complete, so that step was skipped.
- Some steps are not explained in great detail here; readers can refer to the blog series on mastering machine learning with Python.
Reference articles
1 http://blog.yhat.com/posts/predicting-customer-churn-with-sklearn.html
2 http://blog.csdn.net/han_xiaoyang/article/details/49797143 — Han Xiaoyang's blog.
Learn the machine learning forecasting process (telecom customer churn rate problem)