Automating operations with R language + logistic regression


Summary

Logistic regression is one of the most common binary classification algorithms. Because it is supervised learning, the training stage needs labeled input, and when there are many variables some dimensionality reduction is usually required first. This article explains how to use R to automate variable dimensionality reduction and variable conversion, training and testing, evaluation of coverage and accuracy, and generation of a final scoring configuration table. When labels and training data are available, the configuration table can be produced automatically. Each step comes with detailed implementation code.

Main steps

Implementation Details

1. Generate Training Data

The training data looks like the following:

lable var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
37 0 1012512056 1 158 2 5 1 2 250 2 40
48 0 1028191324 1 158 5 1 0 1 100 0 0
82 0 1042100363 1 158 3 15 8 17 88 7 46
105 0 1059904293 1 158 3 17 4 10 170 5 29
215 0 1056562444 1 158 3 20 10 15 133 3 15
219 1 405594373 1 158 2 8 5 1 800 0 0
309 0 1015664693 1 158 4 18 11 6 300 3 16
312 0 1032736990 1 158 2 6 3 14 42 0 0
319 1 1310159241 1 158 3 8 4 2 400 2 25
350 0 1026266596 1 158 5 34 18 15 226 5 14
380 0 1028432195 1 158 4 19 7 9 211 1 5

In this example there are 10 features, var1 through var10, plus a label. The first row holds the variable names; each data row then starts with one extra column, because the first column is the row number. (The data shown here is only a small excerpt, selected to illustrate the format.)

This format is generated primarily to facilitate loading using the following code:

# Read the data
Data <- read.table("Test_tranning.txt", header = TRUE)

Data is the data frame holding the original training records.
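A quick sanity check of the loaded frame can be done along these lines (a minimal sketch; the column name lable follows the sample header above):

dim(Data)          # number of rows and columns
str(Data)          # types of the label and var1-var10 columns
table(Data$lable)  # class balance of the label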

2. Dimension reduction and variable conversion

First, two basic concepts: the IV value and the WOE value. IV stands for information value and is one of the main criteria for selecting variables in logistic regression; WOE stands for weight of evidence, and computing WOE is a prerequisite for computing IV. To compute WOE, the variable is first segmented (for example into equal-frequency bins), and the WOE of a segment is the logarithm of the ratio of the bad proportion to the good proportion in that segment:

WOE_i = ln( (Bad_i / Bad_total) / (Good_i / Good_total) )

The IV value of each segment is then:

IV_i = (Bad_i / Bad_total - Good_i / Good_total) * WOE_i

and the overall IV value of the variable is the sum over all segments:

IV = Σ_i IV_i

In general, the larger the IV value, the stronger the variable's ability to separate good from bad, so the variables with the largest IV values are usually picked as model inputs. In this article, however, we do not use IV to pick variables; instead we pick by the p-values returned by the GLM function. WOE is introduced here mainly for variable conversion: the idea is to replace each original variable with its corresponding WOE value and use the WOE value as the new input variable for training and testing. The main benefit is that variables which are not normally distributed can be converted into something much closer to a normal distribution.
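As a small illustrative example (made-up counts, not taken from the data above): suppose one segment of a variable contains 40 bad and 160 good records, out of 200 bad and 2000 good records overall. Then WOE_i = ln((40/200) / (160/2000)) = ln(0.20 / 0.08) = ln(2.5) ≈ 0.916, IV_i = (0.20 - 0.08) * 0.916 ≈ 0.110, and summing IV_i over all segments gives the IV of the variable.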

1) The first step is to segment each variable.

2) Then compute the WOE and IV values of each variable.

3) Then merge WOE values.

Briefly, the reason for merging WOE values is that by default every variable is cut into a fixed number of segments, say 10, yet in practice some adjacent segments have very similar WOE values (that is, a similar ability to separate good from bad). Those segments can be merged, so fewer segments are actually needed.

4) Convert each variable to its WOE values.

5) Summary

Putting it together, the steps above can be wrapped into a single routine, as in the sketch below.
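A minimal sketch of such a routine, assuming equal-frequency binning into 10 segments and the proportion-based WOE/IV formulas above (the function names, the smoothing constant, and the column name lable are illustrative choices, not the original code):

# Equal-frequency binning, WOE/IV per segment, and WOE conversion
woe_table <- function(x, y, n_bins = 10) {
  # Equal-frequency cut points; unique() guards against duplicated quantiles
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE))
  bins   <- cut(x, breaks = breaks, include.lowest = TRUE)
  bad_total  <- sum(y == 1)
  good_total <- sum(y == 0)
  tab <- data.frame(bin  = levels(bins),
                    bad  = as.vector(tapply(y == 1, bins, sum)),
                    good = as.vector(tapply(y == 0, bins, sum)))
  # A small constant avoids log(0) for segments with no bad or no good records
  tab$woe <- log(((tab$bad + 0.5) / bad_total) / ((tab$good + 0.5) / good_total))
  tab$iv  <- (tab$bad / bad_total - tab$good / good_total) * tab$woe
  list(breaks = breaks, table = tab, iv = sum(tab$iv))
}

# Replace a raw value by the WOE of the segment it falls into
woe_convert <- function(x, breaks, woe) {
  woe[as.integer(cut(x, breaks = breaks, include.lowest = TRUE))]
}

vars       <- paste0("var", 1:10)
woe_models <- lapply(Data[vars], woe_table, y = Data$lable)
iv_values  <- sapply(woe_models, function(m) m$iv)   # IV of each variable
Data_woe   <- as.data.frame(mapply(function(x, m) woe_convert(x, m$breaks, m$table$woe),
                                   Data[vars], woe_models))
Data_woe$lable <- Data$lable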

3. Training and Testing

Construct the training formula, randomly select about 70% of the data for training and keep the remaining 30% as test data, then train with the GLM function.
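A minimal sketch of this step, assuming the WOE-converted frame Data_woe from the previous step (the seed, the 70/30 split, and the variable list are illustrative):

set.seed(123)                                  # reproducible split
n         <- nrow(Data_woe)
train_idx <- sample(n, size = round(0.7 * n))  # roughly 70% of the rows
train <- Data_woe[train_idx, ]
test  <- Data_woe[-train_idx, ]

# Build the formula lable ~ var1 + ... + var10 and fit a logistic regression
fml <- as.formula(paste("lable ~", paste(paste0("var", 1:10), collapse = " + ")))
fit <- glm(fml, data = train, family = binomial(link = "logit"))
summary(fit)     # the p-values here are what we use to drop weak variables

# Predicted probabilities on the held-out 30%
test$prob <- predict(fit, newdata = test, type = "response")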

4. Generate a scoring configuration table

The resulting configuration table has columns similar to the following:

WOE name, segment index, intercept, coefficient, start value of the variable segment, end value of the variable segment, corresponding scoring weight
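A minimal sketch of assembling such a table from the fitted model and the WOE segments of step 2 (assuming all ten variables were kept in the model; the column layout mirrors the description above and the file name is illustrative):

coefs <- coef(fit)
score_table <- do.call(rbind, lapply(paste0("var", 1:10), function(v) {
  m <- woe_models[[v]]
  k <- length(m$breaks) - 1                    # number of segments of this variable
  data.frame(woe_name     = v,
             index        = seq_len(k),
             intercept    = unname(coefs["(Intercept)"]),
             coefficient  = unname(coefs[v]),
             seg_start    = head(m$breaks, -1),
             seg_end      = tail(m$breaks, -1),
             # scoring weight of a segment = its WOE times the model coefficient
             score_weight = m$table$woe * unname(coefs[v]))
}))
write.csv(score_table, "score_config.csv", row.names = FALSE)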

5. Effect evaluation

The two most critical indicators are in fact coverage and accuracy, and they can also be combined with the KS statistic. In an automated pipeline, for example, you might require accuracy above 95%, or coverage above 90%, or a minimum KS value; which criterion to use depends on your own application.

The evaluation can then be called directly on the predictions for the test set, as in the sketch below.
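A minimal sketch of such an evaluation on the test set from step 3. The definitions of coverage (share of bad records that are flagged) and accuracy (share of flagged records that are actually bad), as well as the 0.5 cut-off, are assumptions to be adapted to your own application:

evaluate <- function(prob, label, cutoff = 0.5) {
  pred     <- as.integer(prob >= cutoff)
  coverage <- sum(pred == 1 & label == 1) / sum(label == 1)         # recall of bads
  accuracy <- sum(pred == 1 & label == 1) / max(sum(pred == 1), 1)  # precision of flags
  # KS: largest gap between the score distributions of the two classes
  pts <- sort(unique(prob))
  ks  <- max(abs(ecdf(prob[label == 1])(pts) - ecdf(prob[label == 0])(pts)))
  c(coverage = coverage, accuracy = accuracy, ks = ks)
}

evaluate(test$prob, test$lable, cutoff = 0.5)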
