This article covers variable selection, model development, and scorecard creation and scaling.
Variable analysis
First, check whether the variables are collinear. If several variables are highly correlated, keep only the one that is most stable and has the highest predictive power. Collinearity is tested with the VIF (variance inflation factor).
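The VIF check can be sketched as follows. This is a minimal illustration using only the standard library, not the article's own implementation: each variable is regressed on all the others by ordinary least squares, and VIF_j = 1 / (1 - R_j^2). The helper names (`solve_linear`, `r_squared`, `vif`) and the sample columns are assumptions made for the example.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(y, predictors):
    """R^2 of a least squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    result = []
    for j, target in enumerate(columns):
        others = [c for i, c in enumerate(columns) if i != j]
        result.append(1.0 / (1.0 - r_squared(target, others)))
    return result


# Hypothetical data: two variables with correlation 0.8.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
vif_values = vif([x1, x2])
print(vif_values)
```

With only two variables, both VIF values equal 1 / (1 - r^2), where r is their correlation coefficient. A larger VIF means the variable is better explained by the others, so it is a stronger candidate for removal.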
Variables are either continuous or categorical. In scorecard modeling, binning is the term for discretizing continuous variables. This step is required to convert the logistic model into the form of a standard scorecard. The methods commonly used in credit scorecard development are equal-width binning, equal-frequency (equal-depth) binning, and optimal binning.
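As a rough sketch of the first two methods, the code below computes bin edges for equal-width and equal-frequency binning and assigns a value to a bin. The function names and the age values are assumptions made for illustration. Optimal binning is not shown, since it depends on the chosen optimization criterion.

```python
def equal_width_edges(values, n_bins):
    """Edges that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(n_bins + 1)]


def equal_frequency_edges(values, n_bins):
    """Edges chosen so each bin holds roughly the same number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple index-based quantiles: the k-th cut sits at position k*n/n_bins.
    inner = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    return [ordered[0]] + inner + [ordered[-1]]


def assign_bin(value, edges):
    """0-based bin index; bins are [e_k, e_k+1), and the last bin is closed."""
    for k in range(len(edges) - 2):
        if value < edges[k + 1]:
            return k
    return len(edges) - 2


# Hypothetical ages of ten customers.
ages = [18, 22, 25, 31, 38, 45, 52, 60, 67, 90]
width_edges = equal_width_edges(ages, 4)
freq_edges = equal_frequency_edges(ages, 5)
print(width_edges)
print(freq_edges)
print([assign_bin(a, freq_edges) for a in ages])
```

Equal-width bins are easy to explain but can leave some bins nearly empty when the data is skewed, whereas equal-frequency bins keep the sample counts balanced.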
Single-factor analysis measures the predictive strength of each variable. The methods used are WOE and IV.
WOE
WOE (weight of evidence) is the evidence weight computed for each group after binning. Here "good" denotes good customers (no default) and "bad" denotes bad customers (default).
$$\mathrm{WOE}_i = \ln\left(\frac{P_{good,i}}{P_{bad,i}}\right) = \ln\left(\frac{\text{good ratio}_i}{\text{bad ratio}_i}\right) = \ln\left(\frac{\#good_i / \#good_T}{\#bad_i / \#bad_T}\right)$$
#good_i is the number of good customers in group i, and #good_T is the total number of good customers; #bad_i and #bad_T are defined in the same way for bad customers.
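The WOE definition translates directly into code. The sketch below is illustrative only; the function name `woe_per_bin` and the counts are hypothetical. It rejects bins with no good or no bad samples, because the logarithm is undefined there.

```python
import math


def woe_per_bin(good_counts, bad_counts):
    """WOE_i = ln((good_i / good_total) / (bad_i / bad_total)) for each bin."""
    good_total = sum(good_counts)
    bad_total = sum(bad_counts)
    result = []
    for good_i, bad_i in zip(good_counts, bad_counts):
        if good_i == 0 or bad_i == 0:
            raise ValueError("every bin needs at least one good and one bad sample")
        result.append(math.log((good_i / good_total) / (bad_i / bad_total)))
    return result


# Hypothetical counts for three bins of one variable.
good = [30, 50, 20]
bad = [10, 50, 40]
woe = woe_per_bin(good, bad)
print(woe)
```

A positive WOE means the bin holds a larger share of the good customers than of the bad ones, a negative WOE means the opposite, and a WOE of zero means the two shares are equal.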
IV
IV (information value) measures the amount of information in a given variable, and the formula is as follows:
$$\mathrm{IV} = \sum_{i=1}^{N} \left(\text{good ratio}_i - \text{bad ratio}_i\right) \cdot \mathrm{WOE}_i$$
N is the number of groups.
IV can be used to represent the predictive power of a variable.
IV          Predictive ability
<0.03       none
0.03~0.09   low
0.1~0.29    medium
0.3~0.49    high
>=0.5       extremely high
The bin structure is adjusted according to the IV value, and WOE and IV are recalculated until IV reaches its maximum, at which point the binning is most effective. General grouping principles: differences between groups should be large and differences within groups small; each group should hold no less than 5% of the samples; each group must contain both good and bad cases.
For example, when binning by age we tend to group into broad categories such as minors, young adults, middle-aged, and elderly, but the result is not necessarily good:

Age     Good   Bad   WOE
<18     50     40    ln((50/330)/(40/220)) = -0.182321556793955
18~30   100    60    ln((100/330)/(60/220)) = 0.105360515657826
30~60   100    80    ln((100/330)/(80/220)) = -0.182321556793955
>60     80     40    ln((80/330)/(40/220)) = 0.287682072451781
ALL     330    220
$$\begin{aligned}
\mathrm{IV} &= \left(\frac{50}{330} - \frac{40}{220}\right)\ln\left(\frac{50/330}{40/220}\right)
+ \left(\frac{100}{330} - \frac{60}{220}\right)\ln\left(\frac{100/330}{60/220}\right) \\
&\quad + \left(\frac{100}{330} - \frac{80}{220}\right)\ln\left(\frac{100/330}{80/220}\right)
+ \left(\frac{80}{330} - \frac{40}{220}\right)\ln\left(\frac{80/330}{40/220}\right) \\
&= 0.0372027069982804
\end{aligned}$$
The IV value shows that the predictive power is low, so it is recommended to readjust the bins.
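The age example can be checked with a few lines of code. The sketch below recomputes WOE and IV from the good and bad counts given for the four age groups; the function name `woe_and_iv` is an assumption made for illustration.

```python
import math


def woe_and_iv(good_counts, bad_counts):
    """Return the per-bin WOE list and the total IV."""
    good_total, bad_total = sum(good_counts), sum(bad_counts)
    woe, iv = [], 0.0
    for g, b in zip(good_counts, bad_counts):
        good_ratio, bad_ratio = g / good_total, b / bad_total
        w = math.log(good_ratio / bad_ratio)
        woe.append(w)
        iv += (good_ratio - bad_ratio) * w
    return woe, iv


# Counts from the age example: <18, 18~30, 30~60, >60.
labels = ["<18", "18~30", "30~60", ">60"]
good = [50, 100, 100, 80]
bad = [40, 60, 80, 40]

woe, iv = woe_and_iv(good, bad)
for label, w in zip(labels, woe):
    print(f"{label:>6}  WOE = {w:+.6f}")
print(f"IV = {iv:.6f}")
```

Rerunning the same function on a different set of bin boundaries is a quick way to compare candidate binnings by their IV.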
Build a model
First, split the data, typically into a 70% training set and a 30% test set. The training set is used to fit the model, and the test set is used to evaluate the fitted model.
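A 70/30 split can be sketched as below. The function name `train_test_split`, the fixed seed, and the stand-in records are assumptions made for the example. Shuffling before cutting avoids a split that follows the original order of the data.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


samples = list(range(100))  # stand-in for 100 customer records
train, test = train_test_split(samples)
print(len(train), len(test))
```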
Logistic regression is generally used to build and train the model, which is then used to predict the samples.
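To make the modeling step concrete, here is a minimal logistic regression fitted by batch gradient descent, using only the standard library. It is a sketch under stated assumptions, not a production fit: the learning rate, the number of epochs, and the toy data (one WOE-encoded variable, label 1 for a good customer) are all hypothetical. In practice a statistics library would normally be used instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1|x) = sigmoid(w0 + sum_k w_k x_k) by batch gradient descent."""
    w = [0.0] * (len(xs[0]) + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = sigmoid(w[0] + sum(wk * xk for wk, xk in zip(w[1:], x)))
            err = p - y  # gradient of the log loss with respect to the linear term
            grad[0] += err
            for k, xk in enumerate(x):
                grad[k + 1] += err * xk
        w = [wk - learning_rate * g / len(xs) for wk, g in zip(w, grad)]
    return w


def predict_proba(w, x):
    """Predicted probability that the sample is a good customer."""
    return sigmoid(w[0] + sum(wk * xk for wk, xk in zip(w[1:], x)))


# Hypothetical training data: one WOE-encoded variable, y = 1 for good.
xs = [[-0.7]] * 4 + [[0.0]] * 4 + [[1.1]] * 4
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0]

w = fit_logistic(xs, ys)
print(w)
print(predict_proba(w, [-0.7]), predict_proba(w, [0.0]), predict_proba(w, [1.1]))
```

Because a higher WOE corresponds to a larger share of good customers, the fitted coefficient on the WOE-encoded variable should come out positive.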
Scorecard score calculation
Odds is the ratio of the probability that a user is good (p) to the probability that the user is bad (1-p).
$$\mathrm{odds} = \frac{p}{1-p}$$
The scorecard's score scale can be defined by expressing the score as a linear function of the log odds:
$$\text{Score}_{total} = A + B \cdot \ln(\mathrm{odds})$$
Note: if odds is instead defined as the ratio of the bad customer probability to the good customer probability, it is the reciprocal of the odds above. Since ln(1/x) = -ln(x), B then carries a minus sign. This is why some sources write this formula with a minus sign before B.
Suppose the score at a chosen odds value $\theta_0$ is $P_0$, and the score at odds $2\theta_0$ is $P_0 + PDO$ (PDO: points to double the odds). Substituting into the formula above gives:
$$\begin{cases} P_0 = A + B\ln(\theta_0) \\ P_0 + PDO = A + B\ln(2\theta_0) \end{cases}$$
Subtracting the first equation from the second gives $PDO = B\ln 2$, so A and B are:
$$\begin{cases} B = \dfrac{PDO}{\ln 2} \\ A = P_0 - B\ln(\theta_0) \end{cases}$$
$P_0$ and $PDO$ are known constants. Substituting the resulting A and B into the score formula gives the scorecard score for any value of the odds.
The odds can be computed from the probability p output by the logistic regression model.
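The scaling step can be sketched as below. The scale values (600 points at odds of 50:1, 20 points to double the odds) are hypothetical and chosen only for illustration, as are the function names.

```python
import math


def scaling_constants(p0, pdo, theta0):
    """Solve P0 = A + B ln(theta0) and P0 + PDO = A + B ln(2 theta0)."""
    b = pdo / math.log(2)
    a = p0 - b * math.log(theta0)
    return a, b


def score_from_probability(p_good, a, b):
    """Score = A + B ln(odds), with odds = p / (1 - p) as good-to-bad odds."""
    odds = p_good / (1.0 - p_good)
    return a + b * math.log(odds)


# Hypothetical scale: 600 points at odds 50:1, 20 points to double the odds.
A, B = scaling_constants(p0=600, pdo=20, theta0=50)
print(A, B)
print(score_from_probability(50 / 51, A, B))    # odds of 50:1
print(score_from_probability(100 / 101, A, B))  # odds of 100:1
```

The two printed scores should differ by PDO, since the second probability corresponds to exactly double the odds of the first.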
At this point, the score of a sample can be calculated.
Score distribution
In practice, we compute the score that corresponds to each bin of each variable. When a new user arrives, the scores of the bins the user falls into are added together, and the base score is added to give the final result.
If one of a user's variables changes and moves from one bin to another, simply replace that variable's score with the score of the new bin and add everything up again to get the new score.
Suppose the model output is p. By the logistic regression formula:
$$p = \frac{1}{1 + e^{-\theta^T x}}$$
Rearranging gives
$$\ln\left(\frac{p}{1-p}\right) = \theta^T x$$
So
$$\begin{aligned}
\text{Score}_{total} &= A + B \cdot (\theta^T x) = A + B \cdot (w_0 + w_1 x_1 + \cdots + w_n x_n) \\
&= (A + B \cdot w_0) + B \cdot w_1 x_1 + \cdots + B \cdot w_n x_n
\end{aligned}$$
Here $w_1, w_2, \ldots, w_n$ are the coefficients of $x_1, x_2, \ldots, x_n$ in the logistic regression, and $w_0$ is the intercept.
$(A + B \cdot w_0)$ is the base score, and $B \cdot w_1 x_1, \ldots, B \cdot w_n x_n$ are the scores assigned to each variable.
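The decomposition can be verified numerically. In the sketch below all numbers are hypothetical, and the inputs are assumed to be WOE-encoded. The score is computed once from the model probability through A + B ln(odds), and once as the base score plus the per-variable scores.

```python
import math

# Hypothetical scaling constants, coefficients, and WOE-encoded inputs.
A, B = 487.0, 28.85
w = [1.2, 0.8, 0.5]  # w0 (intercept), w1, w2
x = [0.3, -0.4]      # x1, x2

# Route 1: probability -> odds -> score.
linear = w[0] + sum(wk * xk for wk, xk in zip(w[1:], x))
p = 1.0 / (1.0 + math.exp(-linear))
score_from_odds = A + B * math.log(p / (1.0 - p))

# Route 2: base score plus one additive term per variable.
base_score = A + B * w[0]
variable_scores = [B * wk * xk for wk, xk in zip(w[1:], x)]
score_from_parts = base_score + sum(variable_scores)

print(score_from_odds, score_from_parts)
```

The two routes agree up to floating-point error, which is what allows a scorecard to be applied by simple addition without evaluating the model.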
In the previous step, each variable was binned into several classes. Assuming the model is fitted on WOE-encoded variables, $x_k$ takes the WOE of the bin the sample falls into, so the score of each bin is the variable's scaled coefficient $B \cdot w_k$ multiplied by that bin's WOE:

Variable     Bin   Score
Base score   -     (A + B*w_0)
x_1          1     (B*w_1) * WOE_11
             2     (B*w_1) * WOE_12
             ...   ...
             i     (B*w_1) * WOE_1i
x_2          1     (B*w_2) * WOE_21
             2     (B*w_2) * WOE_22
             ...   ...
             j     (B*w_2) * WOE_2j
...          ...   ...
x_n          1     (B*w_n) * WOE_n1
             2     (B*w_n) * WOE_n2
             ...   ...
             k     (B*w_n) * WOE_nk
Once these steps are complete, scoring a new user only requires finding the bin that each of the user's variables falls into, taking the corresponding WOE value, and computing the score for each variable with the formula above. Adding the scores of all variables to the base score gives the final score.
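Building the score table and scoring a user can be sketched as follows. The variable names, coefficients, WOE values, and scaling constants are all hypothetical, and the model is assumed to be fitted on WOE-encoded variables, so the score of a bin is B * w_k * WOE.

```python
def build_scorecard(a, b, intercept, coefficients, woe_by_variable):
    """Return (base score, {variable: [score of each bin]})."""
    base = a + b * intercept
    points = {
        name: [b * coefficients[name] * woe for woe in woes]
        for name, woes in woe_by_variable.items()
    }
    return base, points


def total_score(base, points, bins):
    """bins maps each variable name to the index of the user's bin."""
    return base + sum(points[name][idx] for name, idx in bins.items())


# All numbers below are hypothetical.
a, b = 487.0, 28.85
intercept = 1.2
coefficients = {"age": 0.8, "income": 0.5}
woe_by_variable = {"age": [-0.18, 0.11, 0.29], "income": [-0.5, 0.0, 0.4]}

base, points = build_scorecard(a, b, intercept, coefficients, woe_by_variable)

user = {"age": 0, "income": 2}
before = total_score(base, points, user)

user["age"] = 1  # the user moves to the next age bin
after = total_score(base, points, user)
print(before, after)
```

Only the age term changes between the two totals, so updating a score after a bin change is a single table lookup followed by a sum.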
Finally, for feature selection, more dimensions is not always better. A scorecard generally has no more than 15 dimensions. Each variable's weight can be determined from the logistic regression coefficients, and variables with high weights are retained. Among variables whose pairwise correlation coefficient exceeds 0.7, usually only one is retained.
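One way to apply the correlation rule is a greedy filter, sketched below. Variables are visited in descending order of a ranking score, and a variable is skipped if its correlation with an already kept variable exceeds 0.7. Using IV as the ranking score is an assumption made for the example; the function names, the data, and the IV values are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def select_variables(columns, ranking, threshold=0.7, max_vars=15):
    """Keep high-ranked variables; drop any too correlated with a kept one."""
    kept = []
    for name in sorted(columns, key=lambda n: ranking[n], reverse=True):
        if len(kept) >= max_vars:
            break
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical variables: x1 and x2 have correlation 0.8.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "x3": [2.0, 5.0, 1.0, 4.0, 3.0],
}
iv = {"x1": 0.25, "x2": 0.30, "x3": 0.12}

selected = select_variables(columns, iv)
print(selected)
```

Here x1 is dropped because it is highly correlated with x2, which has the higher IV, while x3 is kept despite its lower IV because it is not strongly correlated with x2.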