[Machine Learning in Action] Using Scikit-learn to Predict Customer Churn


Customer Churn

"Loss rate" is a business term that describes the customer's departure or stop payment of a product or service rate. This is a key figure in many organizations, as it is usually more expensive to get new customers than to retain the existing costs (in some cases, 5 to 20 times times the cost).

Understanding what keeps customers engaged is therefore incredibly valuable: it is a rational basis for developing retention strategies and for rolling out operational practices aimed at keeping customers from walking out the door. As a result, companies are becoming increasingly interested in developing better churn-detection techniques, and many are turning to data mining and machine learning for new and creative approaches.

This is a story about using Python to model customer churn. Let's walk through the concrete implementation steps.

The Dataset

The dataset I will use is a long-standing telecom customer dataset that you can download here.

The data is straightforward. Each row represents a subscribing telephone customer. Each column contains customer attributes such as the phone number, call minutes used at different times of day, charges incurred for the service, lifetime account duration, and whether or not the customer is still a customer.

from __future__ import division
import pandas as pd
import numpy as np

churn_df = pd.read_csv('churn.csv')
col_names = churn_df.columns.tolist()

print "Column names:"
print col_names

to_show = col_names[:6] + col_names[-6:]

print "\nSample data:"
churn_df[to_show].head(6)
Column names:
['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']

Sample data:

   State  Account Length  Area Code     Phone Int'l Plan VMail Plan  Night Charge  Intl Mins  Intl Calls  Intl Charge  CustServ Calls Churn?
0     KS             128        415  382-4657         no        yes         11.01       10.0           3         2.70               1 False.
1     OH             107        415  371-7191         no        yes         11.45       13.7           3         3.70               1 False.
2     NJ             137        415  358-1921         no         no          7.32       12.2           5         3.29               0 False.
3     OH              84        408  375-9999        yes         no          8.86        6.6           7         1.78               2 False.
4     OK              75        415  330-6626        yes         no          8.41       10.1           3         2.73               3 False.
5     AL             118        510  391-8027        yes         no          9.18        6.3           6         1.70               0 False.

The code below simply drops the irrelevant columns and converts the string columns to boolean values (since models don't handle "yes" and "no" very well). The remaining numeric columns are left untouched.

# Isolate target data
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.', 1, 0)

# We don't need these columns
to_drop = ['State', 'Area Code', 'Phone', 'Churn?']
churn_feat_space = churn_df.drop(to_drop, axis=1)

# 'yes'/'no' has to be converted to boolean values
# NumPy converts these from boolean to 1. and 0. later
yes_no_cols = ["Int'l Plan", "VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'

# Pull out features for future use
features = churn_feat_space.columns

X = churn_feat_space.as_matrix().astype(np.float)

# This is important
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

print "Feature space holds %d observations and %d features" % X.shape
print "Unique target labels:", np.unique(y)

Many predictors care about the relative size of different features, even though those scales may be arbitrary. For instance, the number of points a basketball team scores per game is naturally a couple of orders of magnitude larger than its win percentage, but that doesn't mean the latter is 100 times less important. StandardScaler fixes this by normalizing each feature to roughly the -1.0 to 1.0 range, which keeps models from misbehaving. Well, at least for that reason.
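As a quick illustration of what the scaler does (a toy sketch, not part of the original post): each column is centered on its mean and divided by its standard deviation, so a feature measured in the hundreds and a feature measured in tenths end up on the same footing.

from sklearn.preprocessing import StandardScaler
import numpy as np

# Toy example: two features on wildly different scales
toy = np.array([[100.0, 0.1],
                [200.0, 0.2],
                [300.0, 0.3]])

print StandardScaler().fit_transform(toy)
# Each column now has zero mean and unit variance:
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]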

Great, I now have a feature space 'X' and a set of target values 'y'. On to the predictions!

How good is your model?

Express, test, cycle. A machine learning pipeline should be anything but static. There are always new features to design, new data to use, and new classifiers to consider, each with unique parameters to tune. For every change, it's critical to be able to ask, "Is the new version better than the last one?" So how do I do that?

As a good start, cross-validation will be used throughout this post. Cross-validation attempts to avoid overfitting (training on and predicting the same data points) while still producing a prediction for every observation in the dataset. This is accomplished by systematically hiding different subsets of the data while training a set of models. After training, each model predicts on the subset that was hidden from it, emulating multiple train/test splits. When done correctly, every observation has a "fair" corresponding prediction.

The following is an example of using the Scikit-learn library.

from sklearn.cross_validation import KFold

def run_cv(X, y, clf_class, **kwargs):
    # Construct a kfolds object
    kf = KFold(len(y), n_folds=5, shuffle=True)
    y_pred = y.copy()

    # Iterate through folds
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with key word arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred

I decided to compare three fairly distinct algorithms: support vector machines, random forest, and k-nearest-neighbors, and see which one predicts the classes most accurately.

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN

def accuracy(y_true, y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)

print "Support vector machines:"
print "%.3f" % accuracy(y, run_cv(X, y, SVC))
print "Random forest:"
print "%.3f" % accuracy(y, run_cv(X, y, RF))
print "K-nearest-neighbors:"
print "%.3f" % accuracy(y, run_cv(X, y, KNN))

Comparison results:

Support vector machines:
0.918
Random forest:
0.943
K-nearest-neighbors:
0.896
Accuracy

Measurements aren't golden formulas that always spit out high numbers for good models and low numbers for bad ones. They convey some sense of a model's performance, and it's the job of the human designer to decide how much each number is worth. The problem with accuracy is that outcomes aren't equally costly. If my classifier predicts a customer will churn and they don't, that's not ideal, but it's forgivable. However, if my classifier predicts a customer will stay, I take no action, and then they churn... that's really bad.
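To make that asymmetry concrete before moving on to confusion matrices, here is a small sketch (not part of the original post) that counts the two kinds of errors separately for the random forest predictions produced by run_cv above:

# Count the two error types separately (illustrative sketch)
y_pred_rf = run_cv(X, y, RF)

# "Forgivable" error: predicted churn, but the customer actually stayed
false_alarms = np.sum((y_pred_rf == 1) & (y == 0))

# "Really bad" error: predicted the customer would stay, but they churned
missed_churners = np.sum((y_pred_rf == 0) & (y == 1))

print "False alarms (predicted churn, stayed): %d" % false_alarms
print "Missed churners (predicted stay, churned): %d" % missed_churners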

I'll use another built-in scikit-learn function to construct a confusion matrix. A confusion matrix is a way of visualizing the predictions made by a classifier; it's simply a table showing the distribution of predictions for a specific class. The x-axis indicates the true class of each observation (whether the customer churned or not), and the y-axis corresponds to the class predicted by the model (whether my classifier said the customer would churn or not).

from sklearn.metrics import confusion_matrix

y = np.array(y)
class_names = np.unique(y)

confusion_matrices = [
    ("Support Vector Machines", confusion_matrix(y, run_cv(X, y, SVC))),
    ("Random Forest", confusion_matrix(y, run_cv(X, y, RF))),
    ("K-Nearest-Neighbors", confusion_matrix(y, run_cv(X, y, KNN))),
]

# Pyplot code not included to reduce clutter
from churn_display import draw_confusion_matrices
%matplotlib inline

draw_confusion_matrices(confusion_matrices, class_names)
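The churn_display module isn't shown in the post ("Pyplot code not included to reduce clutter"). A minimal sketch of what a draw_confusion_matrices helper might look like with matplotlib, assuming it simply draws one labeled heatmap per model, is:

import matplotlib.pyplot as plt

def draw_confusion_matrices(confusion_matrices, class_names):
    # One heatmap per (model name, confusion matrix) pair.
    # sklearn's confusion_matrix puts true classes on the rows, so it is
    # transposed here to match the axis layout described above
    # (x-axis = true class, y-axis = predicted class).
    for title, matrix in confusion_matrices:
        fig = plt.figure()
        ax = fig.add_subplot(111)
        cax = ax.matshow(matrix.T, cmap=plt.cm.Blues)
        fig.colorbar(cax)
        ax.set_xticklabels([''] + [str(c) for c in class_names])
        ax.set_yticklabels([''] + [str(c) for c in class_names])
        ax.set_xlabel('True class')
        ax.set_ylabel('Predicted class')
        ax.set_title(title, y=1.15)
        # Annotate each cell with its count
        for i in range(matrix.T.shape[0]):
            for j in range(matrix.T.shape[1]):
                ax.text(j, i, str(matrix.T[i, j]), ha='center', va='center')
        plt.show()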
Which algorithm is best?

Deciding how to evaluate a predictor that gives probabilities rather than classes is a bit trickier. If I predict there's a 20% chance of rain tomorrow, I don't get to live out all the possible outcomes of the universe. It either rains or it doesn't.

What helps is that the predictors aren't making just one prediction; they're making 3000+ of them. So for every time I predict an event to occur 20% of the time, I can check how often those events actually happen. Below I use pandas to help me compare the predictions made by random forest against the actual outcomes.
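The run_prob_cv helper used in the next block isn't defined anywhere in this post. A minimal sketch, mirroring run_cv above (and reusing the KFold import) but returning class probabilities via predict_proba instead of hard labels, might look like this:

def run_prob_cv(X, y, clf_class, **kwargs):
    # Same cross-validation loop as run_cv, but keep the predicted
    # probability of each class rather than the predicted label
    kf = KFold(len(y), n_folds=5, shuffle=True)
    y_prob = np.zeros((len(y), 2))
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        # predict_proba returns one column per class: [P(stay), P(churn)]
        y_prob[test_index] = clf.predict_proba(X_test)
    return y_prob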

import warnings
warnings.filterwarnings('ignore')

# Use 10 estimators so predictions are all multiples of 0.1
pred_prob = run_prob_cv(X, y, RF, n_estimators=10)
pred_churn = pred_prob[:, 1]
is_churn = y == 1

# Number of times a predicted probability is assigned to an observation
counts = pd.value_counts(pred_churn)

# Calculate true probabilities
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[pred_churn == prob])
true_prob = pd.Series(true_prob)

# pandas-fu
counts = pd.concat([counts, true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
counts
  

Output results:

We can see that random forest predicted that 75 people would have a 0.9 probability of churning, whereas in reality that group had a churn rate of about 0.97.

Calibration and Discrimination

Using the DataFrame above, I can draw a pretty simple graphic to help visualize the probability measurements. The x-axis represents the churn probability that random forest assigned to a group of individuals. The y-axis is the actual churn rate within that group, and each point is scaled relative to the group's size.

from ggplot import *
%matplotlib inline

baseline = np.mean(is_churn)
ggplot(counts, aes(x='pred_prob', y='true_prob', size='count')) + \
    geom_point(color='blue') + \
    stat_function(fun=lambda x: x, color='red') + \
    stat_function(fun=lambda x: baseline, color='green') + \
    xlim(-0.05, 1.05) + ylim(-0.05, 1.05) + \
    ggtitle("Random Forest") + \
    xlab("Predicted probability") + ylab("Relative frequency of outcome")


You may also have noticed that I'm plotting two lines using stat_function().

The red line represents a perfect prediction for a given group, i.e. when the predicted churn probability equals the observed frequency of the outcome. The green line shows the baseline churn probability for the dataset, which is about 0.15.

Calibration is a relatively simple measurement and can be summed up like this: events predicted to happen 60% of the time should happen 60% of the time. For all the individuals I predict to have a churn risk between 30% and 40%, the true churn rate for that group should be about 35%. In the graph above, think of it as how close my predictions are to the red line.
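As a rough sketch of how to turn that idea into a single number (the original post uses its own helper for this, which isn't reproduced here), one option is a count-weighted average of the squared gaps between each predicted probability and the observed churn rate of its group, computed from the counts DataFrame built earlier:

def calibration_error(counts):
    # Weighted mean squared gap between predicted probability and the
    # observed churn rate of each group; 0.0 would be perfect calibration
    weights = counts['count'] / counts['count'].sum()
    gaps = (counts['pred_prob'] - counts['true_prob']) ** 2
    return np.sum(weights * gaps)

print "Calibration error: %.4f" % calibration_error(counts)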

Original English article: http://blog.yhat.com/posts/predicting-customer-churn-with-sklearn.html
