R in Practice: Forecasting Employee Turnover (Data Analysis and Modeling)

Source: Internet
Author: User
Tags: random seed, ggplot

I. Background

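Why do our best and most experienced employees leave prematurely? The data comes from Kaggle, and the goal is to predict which valuable employee will leave next. We analyze the data to see which factors affect resignation and what the main reasons are, and then predict which outstanding employees are likely to leave. Variable description: the analysis below uses satisfaction_level, last_evaluation, number_project, average_montly_hours, time_spend_company, promotion_last_5years, and salary, with left (1 = left, 0 = stayed) as the target variable.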

<textarea readonly= "readonly" name= "code" class= "Python" >
################### ============== Load packages =================== #################
# Check the current working directory before importing the data file
getwd()
# Set the working directory to the folder that holds the data file (adjust to your own path)
setwd("C:\\Users\\Administrator\\Desktop\\employee turnover forecast")
library(plyr)      # dependency of Rmisc; if you also need dplyr, load plyr first
library(dplyr)     # filter()
library(ggplot2)   # ggplot()
library(DT)        # datatable() creates an interactive data table
library(caret)     # createDataPartition() stratified sampling function
library(rpart)     # rpart()
library(e1071)     # naiveBayes() naive Bayes
library(pROC)      # roc() ROC curve
library(Rmisc)     # multiplot() splits the drawing area
################### ============= Import data ================== #################
hr <- read.csv("HR_comma_sep.csv")
# View the first 6 rows of the data file
head(hr)
</textarea>
II. Descriptive Analysis

Examine the main descriptive statistics of each variable.

Explore how employee satisfaction, performance evaluation, average monthly working hours, and years at the company relate to turnover.
Explore how the number of projects, promotion within the last five years, and salary relate to turnover.

<textarea readonly= "readonly" name= "code" class= "Python" >
################### ============= Descriptive analysis ================== ###############
# View the basic structure of the data
str(hr)
# Compute the main descriptive statistics of the data
summary(hr)

# The models fitted later require the target variable to be a factor, so convert left from int to factor
hr$left <- factor(hr$left, levels = c('0', '1'))

##----- Explore how satisfaction, performance evaluation, and monthly working hours relate to leaving -----##
# Draw a box plot of satisfaction by turnover status
box_sat <- ggplot(hr, aes(x = left, y = satisfaction_level, fill = left)) +
  geom_boxplot() +
  theme_bw() +  # a ggplot theme
  labs(x = 'left', y = 'satisfaction_level') # set the axis labels
box_sat
</textarea>

Box plot of employee satisfaction by turnover status

<textarea readonly= "readonly" name= "code" class= "Python" >
# Draw a box plot of performance evaluation by turnover status
box_eva <- ggplot(hr, aes(x = left, y = last_evaluation, fill = left)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = 'left', y = 'last_evaluation')

box_eva
</textarea>

Box plot of performance evaluation by turnover status

<textarea readonly= "readonly" name= "code" class= "Python" >
# Draw a box plot of average monthly working hours by turnover status
box_mon <- ggplot(hr, aes(x = left, y = average_montly_hours, fill = left)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = 'left', y = 'average_montly_hours')

box_mon
</textarea>

Box plot of average monthly working hours by turnover status

<textarea readonly= "readonly" name= "code" class= "Python" >
# Draw a box plot of years at the company by turnover status
box_time <- ggplot(hr, aes(x = left, y = time_spend_company, fill = left)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = 'left', y = 'time_spend_company')

box_time
</textarea>

Box plot of years at the company by turnover status

<textarea readonly= "readonly" name= "code" class= "Python" >
# Combine these plots in one drawing area; cols = 2 arranges the four plots in two columns
multiplot(box_sat, box_eva, box_mon, box_time, cols = 2)
</textarea>

Combined figure: employee satisfaction, performance evaluation, average monthly working hours, and years at the company, each plotted against turnover status.

<textarea readonly= "readonly" name= "code" class= "Python" >
###------- Explore how the number of projects, promotion within five years, and salary relate to turnover ------###
# To plot the number of projects as bars, first convert this variable to a factor
hr$number_project <- factor(hr$number_project,
                            levels = c('2', '3', '4', '5', '6', '7'))
# Draw a percentage stacked bar chart of number of projects by turnover status
bar_pro <- ggplot(hr, aes(x = number_project, fill = left)) +
  geom_bar(position = 'fill') + # position = 'fill' draws a percentage stacked bar
  theme_bw() +
  labs(x = 'number_project', y = 'proportion') # x shows the project count, y the share who stayed or left

bar_pro
</textarea>

Percentage stacked bar chart of number of projects by turnover status

<textarea readonly= "readonly" name= "code" class= "Python" >
# Draw a percentage stacked bar chart of promotion within 5 years by turnover status
bar_5years <- ggplot(hr, aes(x = as.factor(promotion_last_5years), fill = left)) +
  geom_bar(position = 'fill') +
  theme_bw() +
  labs(x = 'promotion_last_5years', y = 'proportion')
bar_5years
</textarea>

Percentage stacked bar chart of promotion within 5 years by turnover status

<textarea readonly= "readonly" name= "code" class= "Python" >
# Draw a percentage stacked bar chart of salary level by turnover status
bar_salary <- ggplot(hr, aes(x = salary, fill = left)) +
  geom_bar(position = 'fill') +
  theme_bw() +
  labs(x = 'salary', y = 'proportion')

bar_salary
</textarea>

Percentage stacked bar chart of salary level by turnover status

<textarea readonly= "readonly" name= "code" class= "Python" >
# Combine these plots in one drawing area; cols = 3 arranges them in one row of three columns
multiplot(bar_pro, bar_5years, bar_salary, cols = 3)
</textarea>
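
The percentage stacked bars drawn with position = 'fill' show, for each category, the share of employees who stayed and who left. As a rough sketch, the same shares can be computed as a table with base R. The toy data frame below is hypothetical and only stands in for hr; its values are made up for illustration.

```r
# Hypothetical toy data standing in for hr (values are invented)
toy <- data.frame(
  salary = c('low', 'low', 'low', 'low', 'medium', 'medium', 'high', 'high'),
  left   = factor(c(1, 1, 0, 0, 1, 0, 0, 0), levels = c(0, 1))
)

# Count employees per salary level and turnover status
counts <- table(toy$salary, toy$left)

# margin = 1 normalizes each row, so every salary level sums to 1,
# which is what position = 'fill' displays as bar segments
props <- prop.table(counts, margin = 1)
round(props, 2)
```

Reading the numbers alongside the chart makes it easier to compare categories whose bars look similar.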

III. Modeling and Prediction: Decision Tree

<textarea readonly= "readonly" name= "code" class= "Python" >
############## =============== Select high-performing employees =============== ###################
# filter() keeps the rows that meet the conditions
# number_project was converted to a factor above, so convert it back to a number before comparing
hr_model <- filter(hr, last_evaluation >= 0.70 | time_spend_company >= 4 |
                     as.numeric(as.character(number_project)) > 5)
############### ============ Define the cross-validation method ========== ##############
# method = 'cv' selects cross-validation; number = 5 means 5-fold cross-validation
train_control <- trainControl(method = 'cv', number = 5)
</textarea>
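
To make the 5-fold setting concrete, here is a minimal hand-rolled sketch of what k-fold cross-validation does: split the rows into five folds, fit on four, and score on the held-out fold. It is an illustration under stated assumptions, not caret's implementation. It uses the built-in iris data instead of the HR data, assigns folds by simple random shuffling, and calls rpart() directly with its default settings; caret additionally uses the resamples to choose the tree's complexity parameter.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1234)
k <- 5
n <- nrow(iris)

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- sapply(seq_len(k), function(i) {
  train_rows <- iris[fold_id != i, ]   # four folds for fitting
  valid_rows <- iris[fold_id == i, ]   # one fold held out for scoring
  fit  <- rpart(Species ~ ., data = train_rows, method = 'class')
  pred <- predict(fit, valid_rows, type = 'class')
  mean(pred == valid_rows$Species)     # accuracy on the held-out fold
})

fold_accuracy        # one accuracy value per held-out fold
mean(fold_accuracy)  # the cross-validated accuracy estimate
```

Averaging over folds gives a less optimistic estimate than scoring the model on the same rows it was fitted on.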

<textarea readonly= "readonly" name= "code" class= "Python" >
################ =========== Stratified sampling ============== ################
set.seed(1234) # set the random seed so that the sampling result is the same on every run
# Stratified sampling on the dependent variable at a 7:3 ratio; returns a vector of row indices
# p = 0.7 means a 7:3 split; list = FALSE returns a vector instead of a list
index <- createDataPartition(hr_model$left, p = 0.7, list = FALSE)
traindata <- hr_model[index, ]  # rows selected by index form the training set
testdata <- hr_model[-index, ]  # the remaining rows form the test set
##################### ============= Decision tree ============= #####################
# Use train() from the caret package to fit a decision tree on the training set with 5-fold cross-validation
# left ~ . models the dependent variable with all other variables as predictors
# trControl sets the resampling scheme; method sets which algorithm to use
rpartmodel <- train(left ~ ., data = traindata,
                    trControl = train_control, method = 'rpart')
# Use rpartmodel to predict the test set ([-7] drops column 7, the dependent variable)
pred_rpart <- predict(rpartmodel, testdata[-7])
newtestdata <- cbind(testdata[-7], pred_rpart)
# Build the confusion matrix (rows: predicted, columns: actual); '1' is the positive class
con_rpart <- table(pred_rpart, testdata$left)
con_rpart
</textarea>
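
The idea behind the stratified 7:3 split can be sketched in base R: sample 70% of the rows within each class, so the training and test sets keep roughly the same share of leavers. This is a simplified illustration, not the internals of createDataPartition(); the outcome vector below is hypothetical.

```r
set.seed(1234)

# Hypothetical outcome vector: 70 employees who stayed (0) and 30 who left (1)
left <- factor(rep(c(0, 1), times = c(70, 30)))

# Within each class, draw 70% of that class's row indices
index <- unlist(lapply(split(seq_along(left), left), function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}))

train_left <- left[index]
test_left  <- left[-index]

# Both sets keep the original class proportions
prop.table(table(train_left))
prop.table(table(test_left))
```

A plain random split could by chance put too few leavers in the test set, which is why the split is done per class.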





IV. Modeling and Prediction: Naive Bayes

<textarea readonly= "readonly" name= "code" class= "Python" >
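################### ============ Naive Bayes ============== #################
# Note: caret's 'nb' method relies on the klaR package, which must be installed
nbmodel <- train(left ~ ., data = traindata,
                 trControl = train_control, method = 'nb')

# Predict the test set and build the confusion matrix (rows: predicted, columns: actual)
pred_nb <- predict(nbmodel, testdata[-7])
con_nb <- table(pred_nb, testdata$left)
con_nb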
</textarea>
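
A confusion matrix built with table() is easier to compare across models once it is reduced to a few rates. The sketch below derives accuracy, sensitivity, and specificity from such a table, treating '1' (left) as the positive class. The two label vectors are hypothetical and made up for illustration; they are not results from the models above.

```r
# Hypothetical actual and predicted labels (invented for illustration)
actual    <- factor(c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 0, 0, 0, 0, 1, 1, 0, 0), levels = c(0, 1))

# Same layout as con_rpart and con_nb: rows are predictions, columns are actual values
cm <- table(predicted, actual)

tp <- cm['1', '1']  # predicted to leave and did leave
tn <- cm['0', '0']  # predicted to stay and did stay
fp <- cm['1', '0']  # predicted to leave but stayed
fn <- cm['0', '1']  # predicted to stay but left

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # share of actual leavers that were caught
specificity <- tn / (tn + fp)  # share of actual stayers correctly identified

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

For a turnover problem, sensitivity is often the rate to watch, since a missed leaver is usually costlier than a false alarm.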

V. Model Evaluation and Application



<textarea readonly= "readonly" name= "code" class= "Python" >
################### ================ ROC ==================== #################
# roc() requires the predicted values to be numeric
pred_rpart <- as.numeric(as.character(pred_rpart))
pred_nb <- as.numeric(as.character(pred_nb))
roc_rpart <- roc(testdata$left, pred_rpart) # holds the information used for plotting below
# False positive rate = 1 - specificity (the true negative rate)
specificity <- roc_rpart$specificities # basis for the horizontal axis: true negative rate
sensitivity <- roc_rpart$sensitivities # recall, also called the true positive rate
# Draw the ROC curve
# Only x and y coordinates are needed; data = NULL declares that no data frame is used
p_rpart <- ggplot(data = NULL, aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(colour = 'red') + # draw the ROC curve
  geom_abline() +             # draw the diagonal
  annotate('text', x = 0.4, y = 0.5, # 'text' adds a text annotation at (0.4, 0.5)
           label = paste('AUC =', round(roc_rpart$auc, 3))) + # 3 keeps three decimal places
  theme_bw() +
  labs(x = '1 - specificity', y = 'sensitivities') # set the axis labels
p_rpart
</textarea>

<textarea readonly= "readonly" name= "code" class= "Python" >
# Repeat the same steps for the naive Bayes predictions
roc_nb <- roc(testdata$left, pred_nb)
specificity <- roc_nb$specificities
sensitivity <- roc_nb$sensitivities
p_nb <- ggplot(data = NULL, aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(colour = 'red') + geom_abline() +
  annotate('text', x = 0.4, y = 0.5,
           label = paste('AUC =', round(roc_nb$auc, 3))) + theme_bw() +
  labs(x = '1 - specificity', y = 'sensitivities')

p_nb
</textarea>

Summary: the AUC of the decision tree (0.93) is higher than the AUC of naive Bayes (0.839), so we choose the decision tree (rpart) model as the final predictive model.
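
The ROC code above passes hard 0/1 class predictions to roc(), which gives a curve with a single operating point. A common alternative, shown here only as a sketch, is to score the ROC on predicted probabilities. The helper auc_rank() is a hypothetical name introduced for illustration; it computes AUC with the rank-based (Mann-Whitney) formula rather than with pROC, and the labels and probabilities are made up.

```r
# Hypothetical helper: AUC from scores via the rank-sum (Mann-Whitney) formula
auc_rank <- function(score, label) {
  r <- rank(score)            # average ranks, so ties count as half a win
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented labels and predicted probabilities of leaving
label <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob  <- c(0.9, 0.8, 0.35, 0.4, 0.2, 0.1, 0.3, 0.7)

# Hard class labels obtained by thresholding at 0.5
hard <- as.numeric(prob >= 0.5)

auc_prob <- auc_rank(prob, label)  # uses the full ranking of employees
auc_hard <- auc_rank(hard, label)  # only sees two distinct score values
c(probabilities = auc_prob, hard_labels = auc_hard)
```

In this toy case the probability-based AUC is higher because thresholding throws away the ordering within each predicted class; the two numbers reported in the summary were computed on hard labels.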

<textarea readonly= "readonly" name= "code" class= "Python" >
######################### ============= Application =============####################

# Use the decision tree model to predict class probabilities; type = 'prob' returns the probability of staying and of leaving
pred_end <- predict(rpartmodel, testdata[-7], type = 'prob')

# Combine the predicted probabilities with the predicted class
data_end <- cbind(round(pred_end, 3), pred_rpart)
# Rename the columns of the prediction result table
names(data_end) <- c('pred.0', 'pred.1', 'pred')

# Generate an interactive data table
datatable(data_end)
</textarea>

Finally, we generate a prediction result table. The first column (pred.0) is the predicted probability that the employee stays, the second column (pred.1) is the predicted probability that the employee leaves, and the third column (pred) is the predicted class, that is, whether the employee is expected to leave.
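
One way such a table might be used is to rank employees by their predicted probability of leaving and flag those above a chosen cut-off. The sketch below assumes a table with the same column names as data_end; the rows and the 0.5 threshold are hypothetical choices for illustration.

```r
# Hypothetical result table with the same column names as data_end
data_end_toy <- data.frame(
  pred.0 = c(0.912, 0.105, 0.640, 0.033),
  pred.1 = c(0.088, 0.895, 0.360, 0.967),
  pred   = c(0, 1, 0, 1)
)

# Sort from highest to lowest predicted probability of leaving
ranked <- data_end_toy[order(-data_end_toy$pred.1), ]

# Flag employees whose probability of leaving reaches the assumed threshold
threshold <- 0.5
at_risk <- ranked[ranked$pred.1 >= threshold, ]
at_risk
```

A lower threshold flags more employees at the cost of more false alarms, so the cut-off is a business decision rather than a property of the model.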
