User analysis based on R language

Source: Internet
Author: User
Tags id3

1. Basic analytical theory

C5.0 is a decision tree algorithm. J. R. Quinlan proposed the ID3 algorithm in 1979, mainly for data with discrete attributes. ID3 was then continuously improved to form C4.5, which adds discretization of continuous attributes on top of ID3. C5.0 is C4.5 adapted to large datasets, with improvements mainly in execution efficiency and memory usage.
The C4.5 algorithm is a revised version of ID3. It replaces information gain with the gain ratio and selects the split variable with the largest gain ratio, which counters ID3's bias toward attributes with many distinct values, a common source of overfitting.
The C5.0 algorithm is a revised version of C4.5 that is suited to large datasets. It uses boosting to improve model accuracy, so it is also known as boosting trees, and the software runs faster and consumes less memory.
A decision tree model is also called a rule inference model. It learns classification rules from training samples and then applies those rules to classify new samples. It is a supervised learning method with two types of variables: the target variable (output variable) and the attribute variables (input variables).
The main difference between a decision tree model and a general statistical classification model is that a decision tree classifies by logical rules, whereas a general statistical classifier does not.
Common algorithms are CHAID, CART, QUEST, and C5.0. At each decision node, the split is chosen so that the "difference" between the resulting groups is as large as possible. The main difference between the decision tree algorithms lies in how this "difference" is measured.
Decision trees handle non-numeric data well, which saves much of the data preprocessing required by neural networks, which can only process numeric data.
C5.0 is one of the classical decision tree algorithms. It can generate multi-branch trees, the target variable is categorical, and the result can be either a decision tree or a rule set. The C5.0 model splits the sample on the field that provides the maximum information gain. Each subset produced by the first split is then split again, usually on another field, and the process repeats until the subsets cannot be split any further. Finally, the lowest-level splits are re-examined, and those that do not contribute significantly to the model are removed or pruned.
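The difference between choosing a split by information gain (ID3) and by gain ratio (C4.5) can be illustrated with a few lines of base R. This is a minimal sketch, not the C5.0 implementation: the helper names (entropy, info_gain, gain_ratio) and the toy data frame are made up for illustration.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by a categorical attribute
info_gain <- function(labels, attr) {
  groups <- split(labels, attr)
  cond_entropy <- sum(sapply(groups, function(g) {
    length(g) / length(labels) * entropy(g)
  }))
  entropy(labels) - cond_entropy
}

# Gain ratio: information gain divided by the entropy of the attribute itself
gain_ratio <- function(labels, attr) {
  split_info <- entropy(attr)
  if (split_info == 0) return(0)
  info_gain(labels, attr) / split_info
}

# Hypothetical toy data: `id` is unique per row, `plan` is a real predictor
toy <- data.frame(
  plan  = c("yes", "yes", "yes", "no", "no", "no", "no", "no"),
  id    = as.character(1:8),
  churn = c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

gains  <- sapply(c("plan", "id"), function(a) info_gain(toy$churn, toy[[a]]))
ratios <- sapply(c("plan", "id"), function(a) gain_ratio(toy$churn, toy[[a]]))
print(gains)
print(ratios)
```

On this toy data the information gain is highest for the meaningless id column, because every row lands in its own pure group, while the gain ratio penalizes that many-valued split and ranks plan first.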
C5.0 advantages:
The C5.0 model is robust in the face of problems such as missing data and a large number of input fields;
The C5.0 model is easier to understand than some other types of models, because the rules derived from the model are very straightforward to interpret;
C5.0 also provides powerful techniques to improve the accuracy of classification.
C5.0 algorithm
The C5.0 algorithm chooses branch variables by the rate of decline of information entropy, which serves as the basis for determining both the best branch variable and the split threshold. A decline in information entropy means that the uncertainty of the information decreases.
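As a rough illustration of how a split threshold can be derived from the decline in entropy, the sketch below tries candidate cut points on a numeric attribute and keeps the one that reduces entropy the most. It is written under several assumptions: it uses the plain entropy decline rather than C5.0's exact criterion, it takes midpoints between consecutive values as candidates, and the function name best_threshold and the sample values are hypothetical.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Try each midpoint between consecutive distinct values of `x` and return
# the threshold with the largest decline in entropy.
best_threshold <- function(x, labels) {
  values <- sort(unique(x))
  if (length(values) < 2) return(NULL)
  candidates <- (head(values, -1) + tail(values, -1)) / 2
  base <- entropy(labels)
  declines <- sapply(candidates, function(threshold) {
    left  <- labels[x <= threshold]
    right <- labels[x > threshold]
    weighted <- (length(left) * entropy(left) +
                 length(right) * entropy(right)) / length(labels)
    base - weighted
  })
  best <- which.max(declines)
  list(threshold = candidates[best], entropy_decline = declines[best])
}

# Hypothetical values, made up for illustration
minutes <- c(100, 120, 150, 180, 230, 250, 270, 300)
churn   <- c("no", "no", "no", "no", "yes", "yes", "yes", "yes")

best_split <- best_threshold(minutes, churn)
print(best_split)
```

Here the classes separate cleanly between 180 and 230, so the chosen threshold is their midpoint and the entropy drops from 1 bit to 0.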

2. Packages to load first
library(C50)
library(dplyr)

3. Example

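library(C50)
# Note: newer C50 releases may no longer bundle the churn data (it is
# reportedly available as mlc_churn in the modeldata package); data(churn)
# assumes a C50 version that still ships it.
data(churn)
churn_data <- churnTrain
outcome_name <- 'churn'
# make the outcome variable easier to read
churn_data[, outcome_name] <- as.factor(ifelse(churn_data[, outcome_name] == 'yes', 'Does_Churn', 'Stays'))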

interesting_interactions <- function(the_data_frame, outcome_name) {
  # install.packages(...) if missing
  require(C50)
  require(dplyr)

  # fit a rule-based C5.0 model on all columns except the outcome
  c5model <- C5.0(x = the_data_frame[, setdiff(names(the_data_frame), outcome_name)],
                  y = the_data_frame[, outcome_name],
                  rules = TRUE)

  # capture the printed rules string, split it on the literal "\n" sequences,
  # strip backslashes and quotes, and drop the first (id) line
  rule_munger <- capture.output(c5model$rules, split = TRUE)
  rule_munger <- strsplit(rule_munger, '\\\\n')
  rule_munger <- gsub(x = rule_munger[[1]], pattern = '\\\\|\"', replacement = '')[-1]

  # extract results into data frame format
  rule_count <- 0
  conds_last <- 0
  cover_last <- 0
  ok_last <- 0
  lift_last <- 0
  class_last <- 0
  rules <- c()
  for (entry in rule_munger) {
    print(entry)
    if (substr(entry, 1, 5) == 'rules') print(entry)
    # track only lines starting with conds or type - ignore the rest
    if (substr(entry, 1, 5) == 'conds' | substr(entry, 1, 4) == 'type') {
      if (substr(entry, 1, 5) == 'conds') {
        rule_count <- rule_count + 1
        conds_last <- strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = '=')[[1]][2]
        # cover is the number of training cases covered by the rule
        cover_last <- strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = '=')[[1]][2]
        # ok is the number of positives covered by class
        ok_last <- strsplit(x = strsplit(x = entry, split = " ")[[1]][3], split = '=')[[1]][2]
        # lift is the estimated accuracy of the rule
        lift_last <- strsplit(x = strsplit(x = entry, split = " ")[[1]][4], split = '=')[[1]][2]
        # class predicted by the rule
        class_last <- strsplit(x = strsplit(x = entry, split = " ")[[1]][5], split = '=')[[1]][2]
      }
      if (substr(entry, 1, 4) == 'type') {
        # variable type
        type_last <- strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = '=')[[1]][2]
        att_last <- strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = '=')[[1]][2]
        # sniff out optional parameters
        elts_last <- ''
        if (grepl(x = entry, pattern = 'elts')) {
          elts_last <- strsplit(x = entry, split = "elts=")[[1]][2]
        }
        cut_last <- ''
        if (grepl(x = entry, pattern = 'cut')) {
          cut_last <- strsplit(x = strsplit(x = entry, split = "cut=")[[1]][2], split = ' ')[[1]][1]
        }
        val_last <- ''
        if (grepl(x = entry, pattern = 'val')) {
          val_last <- strsplit(x = entry, split = "val=")[[1]][2]
        }
        result_last <- ''
        if (grepl(x = entry, pattern = 'result')) {
          result_last <- strsplit(x = entry, split = "result=")[[1]][2]
        }
        # note: result_last is stored before cut_last, so the 'cut' column
        # below holds the comparison operator and the 'result' column holds
        # the threshold; print_rules relies on this order
        rules <- rbind(rules, c(rule_count, conds_last, cover_last, ok_last, lift_last,
                                type_last, att_last, elts_last, result_last, cut_last,
                                val_last, class_last))
      }
    }
  }

  if (!is.null(rules)) {
    rules <- data.frame(rules)
    names(rules) <- c('rule_number', 'conditions', 'cover', 'true_pos', 'lift', 'type',
                      'attribute', 'elts', 'cut', 'result', 'value', 'outcome')
    rules[, 1:6] <- sapply(rules[, 1:6], as.character)
    rules[, 1:6] <- sapply(rules[, 1:6], as.numeric)
    if (length(unique(rules$rule_number)) > 0) {
      # strongest rules first
      rules %>% dplyr::arrange(desc(lift)) -> rules
    }
  }
  return(rules)
}
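The parsing step in the function above boils down to splitting each cleaned-up line on spaces and then on "=". The following sketch shows that idea in isolation. The helper name parse_rule_line is hypothetical, and the two sample lines are assumptions modeled on the field names the function looks for (conds, cover, ok, lift, class, type, att, cut, result), not output copied from a real model.

```r
# Parse one cleaned-up rule line of the form "key=value key=value ..."
# into a named character vector.
parse_rule_line <- function(entry) {
  pairs <- strsplit(strsplit(entry, split = " ")[[1]], split = "=")
  values <- sapply(pairs, function(p) p[2])
  names(values) <- sapply(pairs, function(p) p[1])
  values
}

# Hypothetical sample lines (assumed format, for illustration only)
header    <- parse_rule_line("conds=2 cover=60 ok=60 lift=6.8 class=Does_Churn")
condition <- parse_rule_line("type=3 att=total_intl_calls cut=2 result=<")

print(header)
print(condition)
```

Looking fields up by name, as in header[["cover"]], avoids depending on their position in the line, which is the main fragility of the positional indexing used in the function.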

 

results <- interesting_interactions(the_data_frame = churn_data, outcome_name = outcome_name)

print_rules <- function(rules_found, rulenum) {
  print('')
  print(paste0('Rule #', rulenum))
  dplyr::filter(rules_found, rule_number == rulenum) -> pulled_rule
  dplyr::select(pulled_rule, cover, true_pos, outcome) %>% head(1) -> rule_def
  dplyr::select(pulled_rule, attribute, elts, cut, result, value) -> conditions
  print(paste0('In ', rule_def$cover, ' cases, ',
               round(rule_def$true_pos / rule_def$cover, 2) * 100,
               '% customers ', as.character(rule_def$outcome), ' when:'))
  for (cond_id in seq(nrow(conditions))) {
    cond <- conditions[cond_id, ]
    # attribute elts cut result value
    if (nchar(as.character(cond$elts)) > 0) {
      print(paste0(cond$attribute, ': ', cond$elts))
    } else if (nchar(as.character(cond$value)) > 0) {
      print(paste0(cond$attribute, ' == ', cond$value))
    } else {
      print(paste0(cond$attribute, " ", cond$cut, " ", cond$result))
    }
  }
  print('')
}

for (rule_number in unique(results$rule_number))  print_rules(results, rule_number)

## [1] ""
## [1] "Rule #1"
## [1] "In cases, 100% customers Does_Churn when:"
## [1] "international_plan == yes"
## [1] "total_intl_calls < 2"
## [1] ""
## [1] ""
## [1] "Rule #2"
## [1] "In cases, 100% customers Does_Churn when:"
## [1] "international_plan == yes"
## [1] "total_intl_minutes > 13.1"
## [1] ""
## [1] ""
## [1] "Rule #3"
## [1] "In cases, 100% customers Does_Churn when:"
## [1] "total_day_minutes < 120.5"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #4"
## [1] "In cases, 96% customers Does_Churn when:"
## [1] "total_day_minutes < 160.2"
## [1] "total_eve_charge < 19.83"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #5"
## [1] "In cases, 95% customers Does_Churn when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 246.60001"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #6"
## [1] "In cases, 93% customers Does_Churn when:"
## [1] "total_day_minutes < 264.39999"
## [1] "total_eve_calls <"
## [1] "total_eve_charge < 12.05"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #7"
## [1] "In cases, 90% customers Does_Churn when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] "total_night_minutes > 174.2"
## [1] ""
## [1] ""
## [1] "Rule #8"
## [1] "In cases, 79% customers Does_Churn when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #9"
## [1] "In cases, 62% customers Does_Churn when:"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #10"
## [1] "In 211 cases, 60% customers Does_Churn when:"
## [1] "total_day_minutes > 264.39999"
## [1] ""
## [1] ""
## [1] "Rule #12"
## [1] "In 768 cases, 97% customers Stays when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == yes"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #11"
## [1] "In 2221 cases, 97% customers Stays when:"
## [1] "international_plan == no"
## [1] "total_day_minutes < 223.2"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #13"
## [1] "In cases, 96% customers Stays when:"
## [1] "account_length < 123"
## [1] "total_eve_minutes < 187.7"
## [1] "total_night_minutes < 151.89999"
## [1] ""
## [1] ""
## [1] "Rule #14"
## [1] "In cases, 98% customers Stays when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == yes"
## [1] "total_day_minutes > 264.39999"
## [1] ""
## [1] ""
## [1] "Rule #15"
## [1] "In 1972 cases, 96% customers Stays when:"
## [1] "total_day_minutes < 264.39999"
## [1] "total_intl_minutes < 13.1"
## [1] "total_intl_calls > 2"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #16"
## [1] "In 197 cases, 95% customers Stays when:"
## [1] "total_day_minutes > 120.5"
## [1] "total_day_minutes < 160.2"
## [1] "total_eve_charge > 19.83"
## [1] ""
## [1] ""
## [1] "Rule #17"
## [1] "In 155 cases, 94% customers Stays when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes < 277"
## [1] "total_night_minutes < 126.9"
## [1] ""
## [1] ""
## [1] "Rule #18"
## [1] "In 1675 cases, 89% customers Stays when:"
## [1] "total_day_minutes > 160.2"
## [1] "total_day_minutes < 264.39999"
## [1] "total_eve_charge > 12.05"
## [1] ""
## [1] ""
## [1] "Rule #19"
## [1] "In 434 cases, 89% customers Stays when:"
## [1] "total_eve_charge < 12.26"
## [1] ""
