Getting Started with Credit Scorecard Models (Intelligent Algorithms)

I. Background Introduction

4. Data Collation (Data Cleaning)

A large amount of sampled data must be cleaned before it can actually enter the model. During data processing, pay attention to logical checks: distinguish between "data missing" and "0", infer some values from business logic, look for abnormal data, and assess whether values are genuine. Computing the minimum, maximum, and average values gives a preliminary check on whether the sampled data is random and representative.

Common cleaning steps include: missing-value analysis and treatment, and univariate outlier analysis (LOF analysis or cluster analysis).
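Below is a minimal sketch of these cleaning steps in Python with pandas and scikit-learn. The input file and the column names (age, income) are hypothetical stand-ins, not from the original article.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv("loan_sample.csv")   # hypothetical sampled data

# Min / max / mean give a first check that the sample looks plausible.
print(df[["age", "income"]].describe())

# Distinguish "data missing" (NaN) from a genuine 0 before any treatment.
print(df.isna().sum())                                     # missing-value analysis
df["income"] = df["income"].fillna(df["income"].median())  # one possible treatment
df = df.dropna(subset=["age"])                             # drop rows we cannot infer

# Univariate anomaly analysis via LOF: fit_predict marks outliers as -1.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(df[["age", "income"]])
df = df[labels == 1]                                       # keep inliers only
```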

5. Variable Selection

Variable selection must satisfy both statistical correctness and the explanatory power required by the actual credit-card business.

We generally analyze the univariate statistical distributions and the correlations between variables:


Figure 3. Checking whether each variable's distribution satisfies the Gaussian assumption

Logistic regression also requires examining multicollinearity. Here, because the correlations between the variables are small, we can preliminarily judge that there is no multicollinearity problem; of course, we can also use the VIF (variance inflation factor) to test for multicollinearity after modeling (a sketch of this check follows Figure 4). If multicollinearity exists, two variables may be highly correlated, and dimensionality reduction or removal is required.


Figure 4. Correlation analysis of the variables in each dimension
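As a sketch of the post-hoc VIF check with statsmodels, on hypothetical feature columns (age, income, debt_ratio) and synthetic data; a common rule of thumb treats VIF above roughly 10 as a multicollinearity warning.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(5000, 1500, 500),
})
X["debt_ratio"] = 0.3 * X["income"] + rng.normal(0, 100, 500)  # correlated on purpose

Xc = add_constant(X)  # VIF should be computed with an intercept column present
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
    index=Xc.columns,
)
print(vif)  # large VIF -> candidate for removal or dimensionality reduction
            # (the const row itself can be ignored)
```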

6. Model Building

For the logistic regression method itself, see the earlier classic-algorithm articles in this series; it is not repeated here. SAS also provides integrated tools for this step. One important part of the process deserves illustration:

The Weight of Evidence (WOE) conversion can transform a logistic regression model into the standard scorecard format. The purpose of introducing the WOE conversion is not to improve the quality of the model; rather, some variables should not be included in the model, either because they cannot increase the model's value or because the error associated with their model correlation coefficients is large. In fact, a standard credit scorecard can also be built without the WOE conversion; in that case, the logistic regression model has to handle a larger number of independent variables. Although this increases the complexity of the modeling procedure, the resulting scorecard is the same.

Replace each variable x with WOE(x). For the i-th group of a variable:

WOE_i = ln[ (#bad_i / #bad_total) / (#good_i / #good_total) ]



Figure 5. WOE definition and example

In the table, age is the independent variable. Because age is continuous, it needs to be discretized; assume it is discretized into 5 groups. #bad and #good denote the numbers of defaulting users and normal users in each of the five groups, and the last column is the computed WOE value. From the rearrangement of the formula, we can see that WOE reflects, within each group of the independent variable, the difference between that group's share of all defaulting users and its share of all normal users. So WOE intuitively captures the influence of the independent variable's value on the target variable (the default probability). Moreover, the WOE calculation has the same form as the logit transformation of the target variable in logistic regression (logit(p) = ln(p / (1 - p))), so the WOE value can be substituted for the original value of the independent variable.
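A minimal sketch of the WOE calculation for a discretized variable, mirroring the structure of the age table above; the group boundaries and counts are hypothetical.

```python
import numpy as np
import pandas as pd

tbl = pd.DataFrame({
    "age_group": ["18-25", "26-35", "36-45", "46-55", "56+"],
    "bad":  [50, 40, 30, 20, 10],       # #bad:  defaulting users per group
    "good": [100, 200, 300, 350, 250],  # #good: normal users per group
})

# WOE_i = ln[(#bad_i / #bad_total) / (#good_i / #good_total)]
tbl["woe"] = np.log(
    (tbl["bad"] / tbl["bad"].sum()) / (tbl["good"] / tbl["good"].sum())
)
print(tbl)  # each original age value is then replaced by its group's WOE
```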

One more concept worth adding alongside the WOE conversion: IV (Information Value):


Figure 6. IV Formula definition

In fact, IV measures the amount of information carried by a variable. From the formula, IV = sum_i (#bad_i/#bad_total - #good_i/#good_total) * WOE_i, i.e., a weighted sum of the variable's WOE values, and its magnitude determines the degree of the variable's influence on the target variable. From another point of view, the IV formula is quite similar to the formula for information entropy. IV is an indicator of an independent variable's predictive power; similar indicators include information gain, the Gini coefficient, and so on.
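A sketch of the IV calculation as a WOE-weighted sum over the same hypothetical groups used in the WOE sketch above; the interpretation thresholds in the comment are common rules of thumb, not from the original article.

```python
import numpy as np

bad = np.array([50, 40, 30, 20, 10], dtype=float)       # defaults per group
good = np.array([100, 200, 300, 350, 250], dtype=float)  # normals per group

bad_dist, good_dist = bad / bad.sum(), good / good.sum()
woe = np.log(bad_dist / good_dist)
iv = np.sum((bad_dist - good_dist) * woe)  # IV = sum_i (b_i - g_i) * WOE_i
print(round(iv, 4))  # rule of thumb: IV < 0.02 barely predictive, > 0.3 strong
```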

With the WOE and IV indicators in hand, we can move on to model validation.

7. Model Validation

When collecting data, the cleaned data is split into a modeling sample for building the model and a control (holdout) sample for model validation. The control sample is used to validate the model's overall predictive power and stability. Test indicators for an application scoring model include the K-S value, the ROC curve, and others.

Usually, the quality of a binary classifier can be evaluated with the ROC (Receiver Operating Characteristic) curve and the AUC value.

Many binary classifiers produce a probability prediction rather than just a 0/1 prediction. We can use a cut-off point (for example, 0.5) to decide which predictions are 1 and which are 0. Once the binary predictions are obtained, we can construct a confusion matrix to evaluate the classifier's performance. All training data falls into this matrix, and the numbers on the diagonal represent the correct predictions, i.e., true positives + true negatives. From the matrix we can compute the TPR (true positive rate, or sensitivity) and the TNR (true negative rate, or specificity). We would naturally like both indicators to be as large as possible, but unfortunately there is a trade-off between them. Besides the classifier's training parameters, the choice of cut-off point greatly affects TPR and TNR, and sometimes the cut-off can be chosen according to the specific problem and its requirements (a sketch follows Figure 7).


Figure 7. Definitions of true/false positives and negatives
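A minimal sketch of applying a cut-off, building the confusion matrix, and reading off TPR and TNR with scikit-learn; the labels and probabilities are synthetic stand-ins for real model output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 1000)                    # 1 = default, 0 = normal
y_prob = 0.3 * y_true + rng.random(1000) * 0.7      # fake predicted probabilities

y_pred = (y_prob >= 0.5).astype(int)                 # apply the cut-off point
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # sensitivity: share of defaults caught
tnr = tn / (tn + fp)   # specificity: share of normal accounts kept
print(f"TPR={tpr:.3f}  TNR={tnr:.3f}")  # moving the cut-off trades one for the other
```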

If we choose a series of cut-off points, we obtain a series of TPR and TNR values; connecting the corresponding points forms the ROC curve. The ROC curve helps us understand the performance of a classifier and makes it easy to compare different classifiers. When plotting the ROC curve, it is customary to use 1 - TNR, i.e., the FPR (false positive rate), as the horizontal axis and the TPR as the vertical axis. This is how the ROC curve is formed.

The AUC (Area Under the Curve) is defined as the area under the ROC curve; clearly this value is not greater than 1. Because the ROC curve generally lies above the line y = x, the AUC takes values between 0.5 and 1. The AUC is used as an evaluation criterion because, in many cases, the ROC curve alone does not clearly show which classifier works better, whereas the AUC, as a single number, does: the classifier with the larger AUC is better.
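A sketch of sweeping cut-offs into an ROC curve and computing its AUC with scikit-learn, again on synthetic labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 1000)
y_prob = 0.4 * y_true + rng.random(1000) * 0.6       # fake predicted probabilities

# roc_curve sweeps all cut-offs at once: FPR (= 1 - TNR) vs. TPR
fpr, tpr, thresholds = roc_curve(y_true, y_prob)      # plottable with matplotlib
print("AUC:", roc_auc_score(y_true, y_prob))          # 0.5 = random, 1.0 = perfect
```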

Practical significance of the ROC trade-off curve: it measures the exchange between rejecting good accounts and avoiding bad accounts. The ideal scenario is to reject 100% of bad accounts while rejecting 0% of good accounts, i.e., the model distinguishes good and bad accounts completely and accurately.


Figure 8. ROC curve for good vs. bad customers

The K-S indicator, named after two mathematicians (Kolmogorov and Smirnov), is similar to the trade-off curve: it measures the maximum gap between the cumulative distributions of good accounts and bad accounts. The greater the distance between the good and bad distributions, the higher the K-S indicator and the stronger the model's discriminating power (a sketch follows Figure 9).


Figure 9. K-S indicator chart: another view of the separation between good and bad customers
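A sketch of the K-S statistic via scipy's two-sample Kolmogorov-Smirnov test, using synthetic score distributions for good and bad accounts.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
scores_good = rng.normal(650, 50, 800)   # hypothetical scores of good accounts
scores_bad = rng.normal(600, 50, 200)    # hypothetical scores of bad accounts

# The K-S statistic is the maximum gap between the two cumulative distributions.
ks_stat, p_value = ks_2samp(scores_good, scores_bad)
print(f"K-S = {ks_stat:.3f}")  # larger gap -> stronger discriminating power
```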

Once these indicators are satisfactory, the development process of the scorecard model is basically complete.
