Scorecard model analysis (WOE, IV, ROC, information entropy)

Source: Internet
Author: User

The credit scorecard model is a mature prediction method abroad, widely used in credit risk assessment and financial risk control. In essence it is a generalized linear model for a binary target variable, built on model variables that have been discretized and WOE-encoded.

This article focuses on the principles behind WOE and IV for model variables. For convenience, let the target variable be 1 for a defaulting user and 0 for a normal user. WOE (weight of evidence) then expresses how much a given value of the independent variable influences the default ratio. How should this sentence be understood? I will explain it with the table below.

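The WOE formula is as follows:

WOE_i = ln((#bad_i/#bad_T)/(#good_i/#good_T)) = ln((#bad_i/#good_i)/(#bad_T/#good_T)), where #bad_i and #good_i are the counts in group i and #bad_T and #good_T are the totals over all groups.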

Age             #bad    #good    WOE
0-10            50      200      ln((50/100)/(200/1000)) = ln((50/200)/(100/1000))
10-18           20      200      ln((20/100)/(200/1000)) = ln((20/200)/(100/1000))
18-35           5       200      ln((5/100)/(200/1000)) = ln((5/200)/(100/1000))
35-50           15      200      ln((15/100)/(200/1000)) = ln((15/200)/(100/1000))
More than 50    10      200      ln((10/100)/(200/1000)) = ln((10/200)/(100/1000))
Total           100     1000

In the table, Age is the independent variable. Because age is continuous, it has to be discretized first; here it is assumed to be split into 5 groups (how to choose the groups will be covered in a later topic). #bad and #good are the numbers of defaulting and normal users in each of the five groups, and the last column computes the WOE value. As the second, rearranged form of the calculation shows, WOE reflects the difference between the ratio of defaulting to normal users within a group and the same ratio over the whole sample. It is therefore intuitive to say that WOE captures the influence of the independent variable's value on the target variable (the default probability). In addition, the form of the WOE calculation closely resembles the logit transformation of the target variable in logistic regression (logit(p) = ln(p/(1-p))), so the WOE value can be substituted for the original value of the independent variable.
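
The WOE column of the Age table can be reproduced with a few lines of Python. This is only a minimal sketch: the names age_table and woe_by_group are made up for illustration, and it assumes every group has at least one defaulting and one normal user (otherwise the logarithm is undefined).

```python
import math

# Age table from the text: (group, #bad, #good)
age_table = [
    ("0-10", 50, 200),
    ("10-18", 20, 200),
    ("18-35", 5, 200),
    ("35-50", 15, 200),
    ("More than 50", 10, 200),
]

def woe_by_group(rows):
    """Return {group: WOE} with WOE_i = ln((bad_i/bad_T) / (good_i/good_T))."""
    bad_total = sum(bad for _, bad, _ in rows)
    good_total = sum(good for _, _, good in rows)
    return {
        group: math.log((bad / bad_total) / (good / good_total))
        for group, bad, good in rows
    }

woe = woe_by_group(age_table)
for group, value in woe.items():
    print(f"{group:>12}: {value:+.4f}")
```

A positive WOE means the group holds a larger share of the defaulting users than of the normal users; a WOE of 0 (the 10-18 group here) means the group's bad/good ratio equals the overall ratio.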

With WOE covered, let's turn to IV:

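The IV formula is as follows:

IV = sum_i((p1_i - p0_i) * ln(p1_i/p0_i)), where p1_i = #bad_i/#bad_T and p0_i = #good_i/#good_T, so that ln(p1_i/p0_i) is the WOE of group i.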

In fact, IV measures the amount of information carried by a variable. From the formula, it is a weighted sum of the variable's WOE values, and its size indicates how strongly the independent variable influences the target variable. From another point of view, the IV formula is very similar to that of information entropy.
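
As a rough illustration of the weighted sum, the sketch below computes IV = sum((p1 - p0) * ln(p1/p0)) for the Age table given earlier. The function name information_value is hypothetical, and the code again assumes no group has a zero count.

```python
import math

# Age table from the text: (group, #bad, #good)
age_table = [
    ("0-10", 50, 200),
    ("10-18", 20, 200),
    ("18-35", 5, 200),
    ("35-50", 15, 200),
    ("More than 50", 10, 200),
]

def information_value(rows):
    """Sum (p1 - p0) * ln(p1 / p0) over groups, where p1 and p0 are the
    group's shares of all defaulting and all normal users."""
    bad_total = sum(bad for _, bad, _ in rows)
    good_total = sum(good for _, _, good in rows)
    iv = 0.0
    for _, bad, good in rows:
        p1 = bad / bad_total
        p0 = good / good_total
        iv += (p1 - p0) * math.log(p1 / p0)  # weight times WOE
    return iv

iv = information_value(age_table)
print(f"IV = {iv:.4f}")
```

Every term is non-negative, because (p1 - p0) and ln(p1/p0) always have the same sign, so a group adds to IV whether it is riskier or safer than average.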

To really understand what WOE means, however, we need to consider how the effect of a scoring model is evaluated, because everything we do to the independent variables during modeling is ultimately meant to improve the model's effect. In earlier studies I also summarized how to evaluate binary classification models, especially the ROC curve. To describe the meaning of WOE we really need to start from ROC. As before, let's begin with a table.


The data comes from the well-known German credit dataset, and one of its independent variables is used to illustrate the problem. The first column is the value of the independent variable; n is the number of samples for each value; n1 and n0 are the numbers of default and normal samples; p1 and p0 are the proportions of default samples and normal samples that fall on each value (as a share of all default and all normal samples, respectively); cump1 and cump0 are the cumulative sums of p1 and p0; woe is the WOE for each value, ln(p1/p0); and iv is woe*(p1-p0). Summing the iv column (which can be seen as a weighted sum of the WOE values) gives the IV (information value), one of the indicators of how strongly the independent variable influences the target variable (similar to Gini or entropy). Here it is 0.666, which seems a little too large.

The process above studies the influence of one independent variable on the target variable. It can also be viewed as a scoring model with a single independent variable. Going one step further, the value of the independent variable can be used directly as a credit score, provided we assume the independent variable is ordered in some way; that is, the target variable is predicted directly from the ordered independent variable.

It is from this perspective that "evaluating the model's effect" and "selecting and encoding independent variables" can be unified. Selecting suitable independent variables and encoding them properly amounts to choosing and constructing independent variables with high predictive power for the target variable; equivalently, a univariate scoring model built on such a variable should itself perform well.

Take the table above as an example. cump1 and cump0 are, in a sense, exactly the TPR and FPR we use when drawing a ROC curve. For example, suppose the ratings are ordered A12, A11, A14, A13. If A14 is taken as the cutoff, then TPR = cumsum(p1)[3]/sum(p1) and FPR = cumsum(p0)[3]/sum(p0), which are cump1[3] and cump0[3]. From these we can draw the corresponding ROC curve.
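
The same construction can be sketched in Python. Because the German credit table is not reproduced here, the example uses the Age table from the WOE discussion instead, and it assumes the groups are ranked from riskiest to safest by their bad/good odds (the same order as descending WOE). The name roc_points is hypothetical.

```python
from itertools import accumulate

# Age table from the text: (group, #bad, #good)
age_table = [
    ("0-10", 50, 200),
    ("10-18", 20, 200),
    ("18-35", 5, 200),
    ("35-50", 15, 200),
    ("More than 50", 10, 200),
]

def roc_points(rows):
    """Return ROC points (FPR, TPR), one per cutoff, starting at (0, 0)."""
    bad_total = sum(bad for _, bad, _ in rows)
    good_total = sum(good for _, _, good in rows)
    # Assumption: rank groups from riskiest to safest by bad/good odds.
    ranked = sorted(rows, key=lambda r: r[1] / r[2], reverse=True)
    # cump1 is the cumulative share of defaulting users (TPR),
    # cump0 the cumulative share of normal users (FPR).
    cump1 = list(accumulate(bad / bad_total for _, bad, _ in ranked))
    cump0 = list(accumulate(good / good_total for _, _, good in ranked))
    return [(0.0, 0.0)] + list(zip(cump0, cump1))

points = roc_points(age_table)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each cutoff flags every group ranked at or above it as "default", so the k-th point is simply (cump0[k], cump1[k]).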

As can be seen, the ROC curve does not look very good. As covered earlier, the ROC curve has a quantitative index, AUC, which is the area under the curve. This area in effect measures the distance between TPR and FPR. Following the description above, TPR and FPR can be understood from another angle as the conditional distributions of the independent variable (that is, the score produced by some scoring rule) given the 0/1 target variable. For example, TPR, i.e. cump1, is the cumulative distribution of the independent variable (the score) given that the target variable equals 1. When these two conditional distributions are far apart, the independent variable discriminates well on the target variable.
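
To make the "area measures distance" idea concrete, here is a small sketch that integrates the ROC curve with the trapezoidal rule. The p1 and p0 values are derived from the Age table in the WOE discussion (#bad_i/100 and #good_i/1000); ordering the groups from riskiest to safest is an assumption, and auc_from_shares is a hypothetical helper name.

```python
# Class-conditional shares per score level, riskiest level first.
# Derived from the Age table in the text; the ordering is an assumption.
p1 = [0.50, 0.20, 0.15, 0.10, 0.05]  # share of all defaulting users
p0 = [0.20, 0.20, 0.20, 0.20, 0.20]  # share of all normal users

def auc_from_shares(p1, p0):
    """Trapezoidal area under the ROC curve built from cumulative shares."""
    auc = 0.0
    tpr_prev = 0.0
    for share1, share0 in zip(p1, p0):
        tpr = tpr_prev + share1
        # Width of the step in FPR times the mean TPR over that step.
        auc += share0 * (tpr_prev + tpr) / 2.0
        tpr_prev = tpr
    return auc

auc = auc_from_shares(p1, p0)
print(f"AUC = {auc:.3f}, area between ROC curve and diagonal = {auc - 0.5:.3f}")
```

The quantity auc - 0.5 is the area between the ROC curve and the diagonal TPR = FPR, which is one way to read the distance between the two cumulative distributions.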

Since the conditional distribution functions can describe this discriminating power, couldn't the conditional density functions do so as well? This leads to the concepts of IV and WOE. In fact, we can also measure the distance between the two conditional density functions, and that is IV. This can be seen from the IV formula, IV = sum((p1-p0)*ln(p1/p0)), where p1 and p0 are the corresponding density values. The definition of IV is derived from relative entropy, and the shadow of x*ln(x) is still visible in it.
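
The link to relative entropy can be checked numerically: expanding (p1 - p0) * ln(p1/p0) gives p1*ln(p1/p0) + p0*ln(p0/p1), so IV equals the sum of the two relative entropies KL(p1||p0) and KL(p0||p1). The sketch below verifies this on the p1 and p0 values derived from the Age table in the WOE discussion; the function name kl is just an illustrative choice.

```python
import math

# Shares per Age group from the table in the text
# (#bad_i / 100 and #good_i / 1000).
p1 = [0.50, 0.20, 0.05, 0.15, 0.10]  # conditional density given default
p0 = [0.20, 0.20, 0.20, 0.20, 0.20]  # conditional density given normal

def kl(p, q):
    """Relative entropy KL(p || q) = sum p * ln(p / q) for discrete p, q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

iv = sum((a - b) * math.log(a / b) for a, b in zip(p1, p0))
symmetric_kl = kl(p1, p0) + kl(p0, p1)

print(f"IV           = {iv:.6f}")
print(f"KL(p1||p0) + KL(p0||p1) = {symmetric_kl:.6f}")
```

Because relative entropy is not symmetric on its own, summing both directions gives a symmetric measure of how far apart the two conditional densities are.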

To sum up so far: the effect of a scoring model can be evaluated from the "conditional distribution function distance
