Because of my line of work, I tend to focus on the model itself when studying credit scoring. For quite a while I was confused about a few points; this week I finally read an article carefully and understood a little more, so here is a brief summary (this topic cannot be covered in full detail), after which I still have more to learn.
The most common model for credit scoring is logistic regression, a generalized linear model for binary dependent variables. Its theoretical foundation is fairly solid, but different problems naturally call for special handling, and my biggest confusion lay in how categorical independent variables are processed for modeling. Because a scorecard must be produced, the independent variables are usually discretized when the scoring model is built (equal-width binning, equal-frequency binning, or binning with a decision tree). But the model itself cannot directly accept categorical independent variables as input, so they must be processed further. There are two conventional approaches: constructing dummy variables, and target-based variable encoding.
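As a quick illustration of the binning step, here is a minimal pandas sketch; the column names and bin counts are illustrative assumptions, not from the original post:

```python
# Minimal binning sketch; "age"/"default" and the bin counts are assumptions.
import pandas as pd

df = pd.DataFrame({
    "age":     [22, 25, 31, 38, 45, 52, 60, 67, 29, 41],
    "default": [ 1,  1,  0,  0,  1,  0,  0,  0,  1,  0],
})

# Equal-width binning: each bin spans the same range of ages.
df["age_eqwidth"] = pd.cut(df["age"], bins=3)

# Equal-frequency binning: each bin holds roughly the same number of samples.
df["age_eqfreq"] = pd.qcut(df["age"], q=3)

print(df[["age", "age_eqwidth", "age_eqfreq"]])
```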
Dummy variables are the more natural operation. For example, if an independent variable m takes 3 values m1, m2, m3, we can construct two dummy variables M1 and M2: when m takes m1, M1 = 1 and M2 = 0; when m takes m2, M1 = 0 and M2 = 1; when m takes m3, M1 = 0 and M2 = 0. Thus the values of M1 and M2 determine the value of m. No third dummy is constructed for m3, out of concern for information redundancy and multicollinearity. Dummy variables do have drawbacks, however: a credit score cannot be computed for each individual value of the variable, and when the regression model screens variables, a categorical variable may end up only partially retained (some of its dummies kept, others discarded).

The other way to handle categorical variables is to encode them based on the target; in credit scoring the most common choice is WOE encoding. WOE stands for Weight of Evidence and represents the influence that a particular value of a variable has on the default ratio. This is a mouthful to explain, so let us first draw a table. In this table, ID is one of the independent variables under consideration, with three values; the columns labeled 1 and 0 give the numbers of default samples and normal samples respectively, and the last column is the corresponding WOE. In plainer terms: when ID takes the value a1, compute the ratio pp1 of the default samples at that value to all default samples, then the ratio pp0 of the normal samples at that value to all normal samples, and take the natural logarithm of their quotient, ln(pp1/pp0); this is the WOE corresponding to a1. In fact, after a simple transformation, WOE can be seen to measure how the odds of default when the variable takes value ai differ from the overall odds of default. Because of this, it is intuitive to think that WOE captures some effect of the variable's value on the target variable (default probability), so the independent variable can be encoded automatically: when it takes value ai, the encoding is the corresponding WOEi.
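A minimal sketch of the dummy-variable construction just described; the names m/m1/m2/m3 follow the example in the text:

```python
# Dummy-variable sketch for a categorical variable m with values m1, m2, m3.
import pandas as pd

m = pd.Series(["m1", "m2", "m3", "m1", "m3"], name="m")
dummies = pd.get_dummies(m, prefix="m", dtype=int)

# Drop the m3 column, matching the text: (M1, M2) = (0, 0) then uniquely
# identifies m3, avoiding redundancy and multicollinearity.
dummies = dummies.drop(columns=["m_m3"])
print(dummies)
```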
The WOE approach is also very intuitive, and its form is so similar to the logit transformation of the target variable in logistic regression that even I felt the method could simply be adopted as-is. But I was always confused: does this approach carry some deeper meaning, and is it related in nature to the dummy-variable approach?
So I found some textbooks to read, and finally gained some new understanding.
This post introduces the principles behind the WOE and IV of model variables. For convenience, a target value of 1 denotes a defaulting user and a target value of 0 a normal user. WOE (Weight of Evidence) is, in essence, the influence of a particular value of a variable on the default ratio. How should this sentence be understood? I will explain it with the table below.
The WOE formula is as follows (for group $i$ of the variable):

$$\mathrm{WOE}_i = \ln\!\left(\frac{\#bad_i/\#bad_{total}}{\#good_i/\#good_{total}}\right) = \ln\!\left(\frac{p_{1i}}{p_{0i}}\right)$$
Age is an independent variable in the table. Because age is continuous, it needs to be discretized; suppose it is discretized into 5 groups (how to group will be explained in a later post). #bad and #good denote the numbers of defaulting users and normal users in each of the five groups, and the last column is the computed WOE value. From the transformed formula it can be seen that WOE reflects, for each group, the difference between that group's ratio of defaulting users to normal users and the same ratio over all users. It is therefore intuitive to think that WOE captures the influence of the variable's value on the target variable (default probability).
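A minimal sketch of that per-group WOE calculation; the five age groups and their counts are illustrative placeholders, since the original table image is not reproduced here:

```python
# Per-group WOE sketch; group labels and counts are illustrative placeholders.
import numpy as np
import pandas as pd

tbl = pd.DataFrame({
    "group": ["<25", "25-35", "35-45", "45-55", "55+"],
    "bad":   [ 50,    80,      60,      30,      20 ],   # defaulting users
    "good":  [100,   300,     250,     180,     130 ],   # normal users
})

p1 = tbl["bad"]  / tbl["bad"].sum()    # each group's share of all bad users
p0 = tbl["good"] / tbl["good"].sum()   # each group's share of all good users
tbl["woe"] = np.log(p1 / p0)
print(tbl)
```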
In addition, the form of the WOE calculation resembles the logit transformation of the target variable in logistic regression, $\mathrm{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)$, which is another reason the WOE value can be substituted for the original value of the independent variable.
With WOE covered, let's talk about IV:
The IV formula is as follows:

$$\mathrm{IV} = \sum_i (p_{1i} - p_{0i})\,\ln\!\left(\frac{p_{1i}}{p_{0i}}\right)$$
IV measures the amount of information carried by a variable. It is equivalent to a weighted sum of the variable's WOE values, and its magnitude indicates the degree to which the independent variable influences the target variable. From another point of view, the IV formula looks very similar to information entropy.
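Putting the two formulas together, here is a small self-contained helper, reusing the illustrative counts from the sketch above:

```python
# WOE + IV in one helper; the counts are the same illustrative placeholders.
import numpy as np

def woe_iv(bad, good):
    """Return per-group WOE values and the variable's total IV."""
    bad, good = np.asarray(bad, dtype=float), np.asarray(good, dtype=float)
    p1 = bad / bad.sum()     # each group's share of all bad samples
    p0 = good / good.sum()   # each group's share of all good samples
    woe = np.log(p1 / p0)
    iv = np.sum((p1 - p0) * woe)   # weighted sum of the WOE values
    return woe, iv

woe, iv = woe_iv(bad=[50, 80, 60, 30, 20], good=[100, 300, 250, 180, 130])
print(np.round(woe, 3), round(float(iv), 4))
```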
In fact, to understand the meaning of WOE, we must consider how the effectiveness of a scoring model is evaluated, because all the processing we apply to the model's independent variables when modeling is essentially aimed at improving the model's performance.
In earlier posts I also summarized how to evaluate binary classification models, especially with the ROC curve. To describe the meaning of WOE, we really do need to start from the ROC. Again, let us first draw a table.
The data come from the well-known German credit dataset, and one of the independent variables is taken to illustrate the point. The first column is the variable's value; n is the number of samples at each value; n1 and n0 are the numbers of default and normal samples respectively; p1 and p0 are the proportions of default and normal samples respectively; cump1 and cump0 are the cumulative sums of p1 and p0; WOE is the value ln(p1/p0) corresponding to each value of the variable; and IV is woe * (p1 - p0).
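A sketch of how such a table could be built. It assumes the standard UCI statlog file layout for german.data (space-separated, first column = checking-account status A11..A14, last column = label with 1 = good, 2 = bad), which is an assumption worth verifying against your copy:

```python
# Building the n/n1/n0/p1/p0/cump1/cump0/woe/iv table from german.data.
# Column positions assume the standard UCI layout; verify before relying on it.
import numpy as np
import pandas as pd

data = pd.read_csv("german.data", sep=" ", header=None)
df = pd.DataFrame({
    "value": data[0],                     # checking-account status A11..A14
    "bad":   (data[20] == 2).astype(int), # recode: 1 = default, 0 = normal
})

tbl = df.groupby("value")["bad"].agg(n="count", n1="sum")
tbl["n0"] = tbl["n"] - tbl["n1"]
tbl["p1"] = tbl["n1"] / tbl["n1"].sum()
tbl["p0"] = tbl["n0"] / tbl["n0"].sum()
tbl["cump1"] = tbl["p1"].cumsum()
tbl["cump0"] = tbl["p0"].cumsum()
tbl["woe"] = np.log(tbl["p1"] / tbl["p0"])
tbl["iv"] = (tbl["p1"] - tbl["p0"]) * tbl["woe"]
print(tbl)
print("total IV:", tbl["iv"].sum())
```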
Summing the IV column (which, again, can be regarded as a weighted sum of the WOE values) gives the IV (information value), one indicator of the independent variable's effect on the target variable (similar in role to Gini or entropy). Here it is 0.666, which frankly seems a little too large.
The process above studies the effect of a single independent variable on the target variable; in fact it can also be regarded as a scoring model built on a single independent variable. Going further, the variable's values can be used directly as the scores of a credit rating, which requires assuming the independent variable is ordered in some way; that is, the target variable is predicted directly from the ordered independent variable.
It is from this perspective that we can unify "evaluating model performance" with "selecting and encoding independent variables". Selecting appropriate independent variables and encoding them properly is in fact selecting and constructing variables with high predictive power for the target, which is the same as saying that a univariate scoring model built on such a variable performs well.
Take the table above as an example: cump1 and cump0 are, in a sense, the TPR and FPR we use to draw the ROC curve. For example, suppose the values are ordered A12, A11, A14, A13 and A14 is taken as the cutoff; then TPR = cumsum(p1)[3] / sum(p1) and FPR = cumsum(p0)[3] / sum(p0), i.e., cump1[3] and cump0[3]. In this way we can draw the corresponding ROC curve.
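A minimal sketch of this construction: each cutoff over the ordered values contributes one (FPR, TPR) = (cump0, cump1) point. The p1/p0 vectors below are illustrative placeholders, not the actual German-credit numbers:

```python
# ROC from cumulative conditional distributions; p1/p0 are placeholders.
import numpy as np
import matplotlib.pyplot as plt

p1 = np.array([0.45, 0.25, 0.20, 0.10])   # per-value share of bad samples
p0 = np.array([0.15, 0.25, 0.30, 0.30])   # per-value share of good samples

tpr = np.concatenate([[0.0], np.cumsum(p1)])   # cump1, prepended with origin
fpr = np.concatenate([[0.0], np.cumsum(p0)])   # cump0, prepended with origin

# Trapezoidal rule for the area under the piecewise-linear curve (AUC).
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

plt.plot(fpr, tpr, marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")   # the no-skill diagonal
plt.xlabel("FPR (cump0)")
plt.ylabel("TPR (cump1)")
plt.title(f"ROC from cumulative distributions, AUC = {auc:.3f}")
plt.show()
```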
It can be seen that this ROC curve is not very impressive. As learned before, the ROC curve comes with a measurable index, AUC, the area under the curve; this area in effect measures the distance between TPR and FPR.
Following the description above, TPR and FPR can be understood from another angle: as conditional distributions of the independent variable (that is, the score under some scoring rule) given the 0/1 target variable. TPR, i.e., cump1, is the cumulative distribution of the independent variable (the score) when the target variable equals 1. When these two conditional distributions are far apart, the independent variable discriminates the target variable well.
If the conditional distribution functions can describe this discriminating ability, why not the conditional density functions? This leads to the concepts of IV and WOE. In fact, we can also measure the distance between the two conditional density functions, and that is IV. This can be seen from the IV formula, IV = sum((p1 - p0) * ln(p1/p0)), where p1 and p0 are the corresponding density values. The definition of IV comes from relative entropy, in which the shape of x ln x is still visible.
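To make the relative-entropy remark concrete: splitting the factor $(p_{1i} - p_{0i})$ shows that IV is exactly the symmetrized Kullback-Leibler divergence (the Jeffreys divergence) between the two conditional densities, which is where the $x \ln x$ shape comes from:

$$\mathrm{IV} = \sum_i (p_{1i} - p_{0i})\,\ln\frac{p_{1i}}{p_{0i}} = \underbrace{\sum_i p_{1i}\ln\frac{p_{1i}}{p_{0i}}}_{D_{\mathrm{KL}}(P_1\,\|\,P_0)} \;+\; \underbrace{\sum_i p_{0i}\ln\frac{p_{0i}}{p_{1i}}}_{D_{\mathrm{KL}}(P_0\,\|\,P_1)}$$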
So far, we can conclude that the effectiveness of a scoring model can be considered from two angles, "the distance between the conditional distribution functions" and "the distance between the conditional density functions", yielding the two indexes AUC and IV respectively. Both, of course, can also serve as criteria for selecting independent variables, and IV seems to be the more commonly used of the two. And WOE is the main component of IV.
So why encode the independent variables with WOE? There are two main considerations: improving the model's predictive power and improving its interpretability.
First, for an existing scoring rule, such as A12, A11, A14, A13 above, different functional transformations yield different ROC results. However, if the transformation is monotonic, the ROC curve does not actually change. Therefore, to improve the ROC, we must look for a non-monotonic transformation of the scoring rule. The celebrated Neyman-Pearson lemma shows that the ROC-optimal transformation of an existing score is precisely computing its WOE, which appears to be called a "conditional likelihood ratio" transformation.
Using the example above, we sort the scoring rule (that is, the values in the first column) by the computed WOE values, obtaining a new scoring rule.
Here the values are sorted in descending order of WOE (because a larger WOE means a larger default probability), and the ROC curve can be drawn as usual.
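A minimal sketch of this re-scoring step, with placeholder p1/p0 vectors permuted so that the original value order is not already WOE-optimal: sorting by descending WOE and recomputing the AUC shows the improvement.

```python
# Re-sort the scoring rule by WOE (descending) and compare AUCs.
# p1/p0 are illustrative placeholders, not real data.
import numpy as np

p1 = np.array([0.45, 0.20, 0.25, 0.10])   # per-value share of bad samples
p0 = np.array([0.15, 0.30, 0.25, 0.30])   # per-value share of good samples
woe = np.log(p1 / p0)                     # per-value WOE, as defined above

def auc_from_order(order):
    """AUC of the ROC drawn from cumulative sums in the given value order."""
    tpr = np.concatenate([[0.0], np.cumsum(p1[order])])
    fpr = np.concatenate([[0.0], np.cumsum(p0[order])])
    return np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

original = np.arange(len(woe))   # whatever order the raw values had
by_woe = np.argsort(-woe)        # descending WOE: riskiest values first

print("AUC, original order:", round(auc_from_order(original), 4))  # ~0.68
print("AUC, sorted by WOE: ", round(auc_from_order(by_woe), 4))    # ~0.70
```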
As can be seen, after the WOE transformation the model performs better. In fact, WOE could equally be replaced by the default probability; there is no essential difference between the two. One main purpose of encoding an independent variable with WOE is precisely to achieve this "conditional likelihood ratio" transformation and maximize discrimination.
At the same time, WOE is monotonically related to the default probability (linearly so in the log-odds), so WOE encoding can capture nonlinear relationships between the independent variable and the target variable (e.g., U-shaped or inverted-U). On this basis, we can expect the fitted coefficients of the independent variables to be positive; if a negative one appears, we should consider whether it results from multicollinearity among the independent variables.
In addition, WOE encoding gives the independent variables a certain standardized quality: the values within a variable can be directly compared with one another (by comparing WOE), and values across different variables can also be compared directly through WOE. Going further, we can study the variation (fluctuation) of the WOE values within a variable and, combined with the fitted model coefficients, construct a contribution rate and a relative importance for each independent variable.
Roughly speaking, the larger the coefficient and the larger the variance of the WOE values, the larger the variable's contribution rate (similar in spirit to a variance contribution rate); this can be understood intuitively.
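One possible shape of such a contribution measure, sketched under heavy assumptions: the exact formula below (|coefficient| times the standard deviation of the variable's WOE values, normalized) is an illustrative heuristic, and the variable names and numbers are made up, not taken from the post.

```python
# Hedged contribution-rate sketch: |coef| * std(WOE), normalized to sum to 1.
# The variable names, coefficients, and WOE std-devs are all placeholders.
coefs    = {"checking": 0.82, "duration": 0.55, "purpose": 0.31}
woe_stds = {"checking": 0.70, "duration": 0.40, "purpose": 0.25}

raw = {k: abs(coefs[k]) * woe_stds[k] for k in coefs}
total = sum(raw.values())
for name, val in sorted(raw.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} contribution ~ {val / total:.1%}")
```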
To sum up, when building a credit scoring model, the processing of the independent variables (including encoding and screening) is to a large extent based on evaluating the effectiveness of single-variable models. In this evaluation, ROC and IV study the influence of an independent variable on the target variable from different angles, and on that basis we encode the categorical independent variables with WOE values, so that we can understand more intuitively the effect and direction of each independent variable on the target variable, while improving predictive performance.
In this summary, the credit scoring modeling process looks more like an analytical process than a model-fitting one. Accordingly, we seem to spend little effort studying the model's parameters and the like, and instead focus on the relationship between each variable and the target; on that basis we screen and encode the independent variables, then evaluate the model's predictive performance and, in turn, the effectiveness of each independent variable.