Python Credit Scorecard

Source: Internet
Author: User

1. Project Background

1.1 Basic Concepts of Credit Risk and the Scorecard Model

Credit risk is the risk of economic loss arising from a counterparty's failure to perform its contractual obligations; in other words, it is the possibility that the actual return on a credit extension deviates from the expected return. It is the principal type of financial risk.

In a lending context, a scorecard expresses risk probability as a score: it predicts the probability of default, delinquency, or missed payment over a future period. In general, the higher the score, the lower the risk.

The credit risk measurement system has two parts: the obligor rating model and the facility (debt) rating. The obligor rating model covers the following four areas:

    • Applicant rating model: used to rate new customers in the relevant financing business; applicable to both individual and institutional borrowers; applied at the pre-loan approval stage.
    • Behavioral rating model: used to manage existing customers during the life of the loan, for example predicting delinquency and late-payment behavior; applicable to individual borrowers only.
    • Collection rating model: used to predict and manage collections for existing customers; applicable to individual borrowers only.
    • Fraud rating model: used to predict potential fraudulent behavior by new customers; applicable to both individual and institutional borrowers; applied at the pre-loan approval stage.

This project focuses on the applicant scoring model.

1.2 Data sources

The data for this project comes from the Kaggle competition Give Me Some Credit.

2. Credit card scoring model development

Model development consists of the following stages: data preparation and preprocessing, variable selection, model building, model validation, model evaluation, model deployment, and model monitoring. Each stage is described in turn below.

2.1 Data Preparation and Preprocessing

2.1.1 Obtaining the Data

Data acquisition covers both existing customers and prospective customers. Existing customers are those who already conduct the relevant financing business with the securities firm, including both individual and institutional customers. Prospective customers are those who intend to conduct such business with the firm, mainly institutional customers; including them is a common way to compensate for the small sample of existing customers. These prospective institutional clients include listed companies, bond issuers, companies listed on the New Third Board (NEEQ), companies listed on regional equity trading centers, and non-standard financing institutions.

First, let us look at the available data and its fields.

As noted, the data comes from the Kaggle competition Give Me Some Credit; the training set contains 150,000 records.

| Variable name | Description | Type | Label |
| --- | --- | --- | --- |
| SeriousDlqin2yrs | 90 or more days past due, or worse | Y/N | Y |
| RevolvingUtilizationOfUnsecuredLines | Ratio of balances on credit cards and personal lines of credit to total credit limits | Percentage | X0 |
| age | Age of the borrower at that time | Integer | X1 |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times 30-59 days past due but no worse | Integer | X2 |
| DebtRatio | Debt ratio | Percentage | X3 |
| MonthlyIncome | Monthly income | Real | X4 |
| NumberOfOpenCreditLinesAndLoans | Number of open installment loans (e.g. car loan or mortgage) and lines of credit (e.g. credit cards) | Integer | X5 |
| NumberOfTimes90DaysLate | Number of times the borrower has been 90 or more days past due | Integer | X6 |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit | Integer | X7 |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times in the past two years 60-89 days past due but no worse | Integer | X8 |
| NumberOfDependents | Number of dependents, excluding the borrower | Integer | X9 |

2.1.2 Data preprocessing

The pandas describe() function gives an overview of the whole dataset, as shown in Figure 2.1.

Figure 2.1
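For instance, a summary like Figure 2.1 is produced as below. The DataFrame here is a small stand-in so the snippet runs on its own; the real training file (cs-training.csv on Kaggle) would be loaded with pd.read_csv.

```python
import pandas as pd

# df = pd.read_csv("cs-training.csv")  # the Kaggle training file
# Small stand-in frame for illustration:
df = pd.DataFrame({"age": [25, 40, 55],
                   "MonthlyIncome": [3000.0, None, 8000.0]})
summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary.loc["count"].tolist())  # [3.0, 2.0] -- MonthlyIncome has one missing value
```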

A. Missing value Processing

Missing values in a variable fall into four cases, each handled differently:

    • High missing rate: if a feature is missing for most samples, drop the feature outright; keeping it would only introduce noise.
    • Moderate missing rate, discrete feature: add NaN as a new category of the feature.
    • Moderate missing rate, continuous feature: discretize the feature into bins and add NaN as an extra bin.
    • Low missing rate: fill the missing values, for example with the mean, mode, or median; by training a RandomForest model from sklearn on the complete samples and predicting the missing values; or by Lagrange interpolation.

Inspection shows that two variables, MonthlyIncome and NumberOfDependents, contain missing values. MonthlyIncome has 29,731 missing records, too large a share to delete outright, so we fill them with the random forest method. NumberOfDependents has few missing values with little effect on the data, so those rows are deleted directly.

Use the dropna() function to remove rows with null values and drop_duplicates() to remove duplicate rows.
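A sketch of the random-forest fill followed by the dropna()/drop_duplicates() cleanup might look as follows. The tiny DataFrame is a stand-in for the Kaggle training data, and the column set is trimmed for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fill_monthly_income(df):
    """Fill missing MonthlyIncome by regressing it on the other numeric features."""
    feature_cols = [c for c in df.columns if c != "MonthlyIncome"]
    known = df[df["MonthlyIncome"].notnull()]
    unknown = df[df["MonthlyIncome"].isnull()]
    rf = RandomForestRegressor(n_estimators=10, random_state=0)
    rf.fit(known[feature_cols], known["MonthlyIncome"])
    df = df.copy()
    df.loc[df["MonthlyIncome"].isnull(), "MonthlyIncome"] = rf.predict(unknown[feature_cols])
    return df

# Stand-in frame; the real data is read with pd.read_csv("cs-training.csv")
df = pd.DataFrame({
    "age": [25, 40, 55, 30, 45],
    "DebtRatio": [0.2, 0.5, 0.3, 0.4, 0.1],
    "MonthlyIncome": [3000.0, 8000.0, np.nan, 4000.0, np.nan],
})
df = fill_monthly_income(df)
df = df.dropna().drop_duplicates()  # remove remaining nulls and duplicate rows
print(df["MonthlyIncome"].isnull().sum())  # 0
```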

B. Outlier Handling

After handling missing values, we also need to handle outliers. Outliers are values that deviate significantly from the bulk of the data; for example, an age above 100 or at or below 0 is generally considered an outlier. Outliers are usually found with outlier detection methods, which include univariate outlier detection, local outlier factor (LOF) detection, and clustering-based detection.

In this dataset, we use univariate outlier detection, judging outliers with box plots.

For the age variable, values above 100 or at or below 0 are treated as outliers. The box plot shows that there are few such samples, so they are deleted directly.

Figure 2.2

For RevolvingUtilizationOfUnsecuredLines (credit utilization ratio) and DebtRatio (debt ratio), the box plots are as follows:

Figure 2.3

Since both variables are percentages, values greater than 1 are removed.

For variables X2 (30-59 days past due), X6 (90+ days past due), and X8 (60-89 days past due), the box plots reveal two outlier values, 96 and 98, which are deleted.

Figure 2.4
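The filters described above can be sketched in one function; the column names follow the Kaggle dataset and the frame below is a small stand-in.

```python
import pandas as pd

def remove_outliers(df):
    """Keep age in (0, 100], ratio columns <= 1, and drop the 96/98 sentinel
    values in the past-due count columns."""
    df = df[(df["age"] > 0) & (df["age"] <= 100)]
    df = df[df["RevolvingUtilizationOfUnsecuredLines"] <= 1]
    df = df[df["DebtRatio"] <= 1]
    for col in ["NumberOfTime30-59DaysPastDueNotWorse",
                "NumberOfTimes90DaysLate",
                "NumberOfTime60-89DaysPastDueNotWorse"]:
        df = df[df[col] < 90]
    return df

df = pd.DataFrame({
    "age": [25, 105, 40],
    "RevolvingUtilizationOfUnsecuredLines": [0.3, 0.5, 1.2],
    "DebtRatio": [0.2, 0.4, 0.6],
    "NumberOfTime30-59DaysPastDueNotWorse": [0, 1, 98],
    "NumberOfTimes90DaysLate": [0, 0, 96],
    "NumberOfTime60-89DaysPastDueNotWorse": [0, 0, 0],
})
print(len(remove_outliers(df)))  # 1 -- only the first row passes every filter
```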

2.2 Explore Data

The main purpose of exploratory analysis is to examine how each variable affects the target; in this project we focus on the relationship between customer default and each variable.

2.2.1 Single variable Analysis

First, observe the overall distribution of good and bad customers.

Figure 2.5

Next, look at the effect of age on the default rate. As the figure shows, the default rate declines gradually as age increases.

Figure 2.6
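The age-band default rate behind a figure like Figure 2.6 is a short pandas computation; the age bands and sample rows below are illustrative stand-ins.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 25, 40, 40, 60, 60],          # stand-in sample
    "SeriousDlqin2yrs": [1, 0, 1, 0, 0, 0],   # 1 = default
})
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100])
default_rate = df.groupby("age_band", observed=True)["SeriousDlqin2yrs"].mean()
print(default_rate.tolist())  # [0.5, 0.5, 0.0] -- default rate per age band
```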

Now consider the effect of monthly income on default. Income is bucketed into the levels [0, 5000], [5000, 10000], [10000, 15000], [15000, 20000], and [20000, 100000]. The two figures below show that below 20,000 the default rate decreases as income rises, while above 20,000 it increases with income.

Figure 2.7

Figure 2.8

Next we analyze the relationship between NumberOfDependents (number of dependents) and the target. The default rate increases as family size grows.

Figure 2.9

Figure 2.10

Looking at the relationship between the number of 30-59-day delinquencies and the default rate, the default rate rises as the delinquency count increases.

Figure 2.11

This concludes the univariate analysis for now; the remaining variables can be analyzed in the same way.

2.2.2 Multivariate analysis

Multivariate analysis examines the degree of correlation between variables; in Python, the corr() function computes pairwise correlations. As Figure 2.12 shows, the correlations between variables are small, so multicollinearity is unlikely.

Figure 2.12
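The correlation matrix behind a figure like Figure 2.12 is a single corr() call; a heatmap (for example seaborn's sns.heatmap) then visualizes it. The frame below is a stand-in for the cleaned training data.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 40, 55, 30],
    "DebtRatio": [0.20, 0.50, 0.30, 0.40],
    "MonthlyIncome": [3000, 8000, 6000, 4000],
})
corr = df.corr()     # pairwise Pearson correlations between numeric columns
print(corr.shape)    # (3, 3)
# import seaborn as sns; sns.heatmap(corr, annot=True)  # renders the heatmap
```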

3. Model Building

3.1 Feature Selection

Feature selection matters greatly: good features make for a better model. For Python implementations of common feature selection methods, scikit-learn's documentation is a useful reference. Here we use IV (information value) screening, which is standard for credit scorecards. The detailed IV and WoE calculations are explained at length elsewhere and only summarized here.

3.1.1 Feature Binning

Feature binning discretizes continuous variables, or merges the many levels of a discrete variable into fewer levels. Discrete features are easy to add and remove, which speeds up model iteration; discretization is robust to anomalous data, reducing the influence of outliers on the model; and binned features can be crossed with one another. Moreover, the algorithm chosen here is logistic regression, a generalized linear model with limited expressive power. Discretizing a single variable into N bins gives each bin its own weight, which introduces nonlinearity into the model, strengthens its expressiveness, improves the fit, and reduces the risk of overfitting.

Common binning methods include supervised methods such as Best-KS and ChiMerge (chi-square binning), and unsupervised methods such as equal-frequency, equal-width, and clustering-based binning. Different methods are applied to different variables according to their characteristics.
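As one unsupervised example, equal-frequency binning can be done with pd.qcut; this is only a sketch, and supervised Best-KS or ChiMerge binning requires more machinery.

```python
import pandas as pd

def equal_freq_bin(series, n_bins=5):
    """Equal-frequency binning: each bin receives roughly the same number of samples."""
    return pd.qcut(series, q=n_bins, duplicates="drop")

# Illustrative income values
income = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000])
bins = equal_freq_bin(income, n_bins=5)
print(bins.value_counts().tolist())  # [2, 2, 2, 2, 2] -- two samples per bin
```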

3.1.2 WoE Calculation

Compute the WoE (weight of evidence) of each bin of each variable and use it to evaluate the binning.
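A self-contained sketch of the WoE calculation follows; note that the sign convention (bad-share over good-share here) varies between references, and the toy data is illustrative.

```python
import numpy as np
import pandas as pd

def calc_woe(df, bin_col, target):
    """WoE per bin: ln((bad_i / bad_total) / (good_i / good_total))."""
    stats = df.groupby(bin_col, observed=True)[target].agg(["sum", "count"])
    stats = stats.rename(columns={"sum": "bad"})
    stats["good"] = stats["count"] - stats["bad"]
    bad_share = stats["bad"] / stats["bad"].sum()
    good_share = stats["good"] / stats["good"].sum()
    stats["woe"] = np.log(bad_share / good_share)
    return stats

# Toy data: bin A has 1 bad / 3 good, bin B has 3 bad / 1 good
toy = pd.DataFrame({"bin": list("AAAABBBB"), "bad": [1, 0, 0, 0, 1, 1, 1, 0]})
woe = calc_woe(toy, "bin", "bad")
print(woe["woe"].round(4).tolist())  # [-1.0986, 1.0986]
```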

3.1.3 Calculating the IV value

IV stands for information value, a measure of how much predictive information a feature carries. Figure 3.1 shows the IV of each variable. We treat features with IV below 0.2 as weakly predictive or irrelevant, so five variables, including DebtRatio, are removed.

Figure 3.1
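The IV of a binned variable sums a per-bin term over all bins; a self-contained sketch (with the same illustrative toy data as above) might read:

```python
import numpy as np
import pandas as pd

def calc_iv(df, bin_col, target):
    """IV = sum over bins of (bad_share - good_share) * ln(bad_share / good_share)."""
    stats = df.groupby(bin_col, observed=True)[target].agg(["sum", "count"])
    bad = stats["sum"]
    good = stats["count"] - bad
    bad_share = bad / bad.sum()
    good_share = good / good.sum()
    return float(((bad_share - good_share) * np.log(bad_share / good_share)).sum())

# Toy data: bin A has 1 bad / 3 good, bin B has 3 bad / 1 good
toy = pd.DataFrame({"bin": list("AAAABBBB"), "bad": [1, 0, 0, 0, 1, 1, 1, 0]})
iv = calc_iv(toy, "bin", "bad")
print(round(iv, 4))  # 1.0986
```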

3.2 Model Construction

3.2.1 WoE Substitution

The filtered feature variables are WoE-transformed: each raw value is replaced by the WoE of the bin it falls into, which simplifies the inputs handled by the logistic regression.
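The substitution itself is a lookup from bin to WoE; the age bins and WoE values below are hypothetical, purely for illustration.

```python
import pandas as pd

age_bins = [0, 30, 50, 100]
age_labels = ["young", "middle", "senior"]
age_woe = {"young": 0.35, "middle": -0.10, "senior": -0.55}  # hypothetical WoE values

df = pd.DataFrame({"age": [25, 40, 60]})
# Bin each raw age, then replace the bin label with that bin's WoE
df["age_woe"] = pd.cut(df["age"], bins=age_bins, labels=age_labels).map(age_woe)
print(df["age_woe"].tolist())  # [0.35, -0.1, -0.55]
```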

3.2.2 LR Modeling

Build the model with LogisticRegression from sklearn.linear_model.
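A minimal sketch follows, with a synthetic stand-in for the WoE-encoded training matrix and default labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the WoE-encoded features and the default flag
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression()
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of default per customer
print(proba.shape)  # (90,)
```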

4. Model Checking

Validate the model on the test set. The ROC curve is shown below:

Figure 4.1

The resulting AUC is 0.85, indicating that the model predicts well and its accuracy is high.
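The ROC curve and AUC come from scikit-learn's metrics; the labels and scores below are a tiny illustration, not the project's actual predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])           # ground-truth default flags
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # model's predicted P(default)
auc = roc_auc_score(y_true, y_prob)
fpr, tpr, _ = roc_curve(y_true, y_prob)   # points for plotting the ROC curve
print(round(auc, 2))  # 0.75
```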

5. Credit Score

The model above can output each customer's class (good customer or defaulter), or equivalently the probability that the customer is good or will default. Neither is very intuitive, so we convert the LR model's output into a score in the range 0-999.

From the standard scorecard scaling relationships:

odds = p_good / p_bad

Score = offset + factor * log(odds)

where factor = PDO / log(2) and offset = base_score - factor * log(base_odds); PDO is the number of points that doubles the odds.

From this, the score contribution of each feature bin is obtained:

Figure 5.1

Adding or subtracting each bin's score against the base score gives the final score; the higher the score, the lower the risk of default.
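The probability-to-score mapping can be sketched as below. The choices base_score=600, base_odds=50, and PDO=20 are illustrative assumptions, not values from the original project.

```python
import numpy as np

PDO = 20.0          # points that double the odds (illustrative)
base_score = 600.0  # score assigned at base_odds (illustrative)
base_odds = 50.0    # good-to-bad odds at the base score (illustrative)

factor = PDO / np.log(2)
offset = base_score - factor * np.log(base_odds)

def prob_to_score(p_bad):
    """Map the model's default probability to a score: higher score = lower risk."""
    odds = (1 - p_bad) / p_bad          # good-to-bad odds
    return offset + factor * np.log(odds)

print(round(prob_to_score(1 / 51)))   # odds = 50  -> 600 (the base score)
print(round(prob_to_score(1 / 101)))  # odds = 100 -> 620 (odds doubled, +PDO)
```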

6. Summary

This paper builds a simple credit scoring system through mining and analysis of the Give Me Some Credit data on Kaggle, following the principles of scorecard development from data preprocessing and variable selection through modeling to score creation. The project still has shortcomings: binning should use optimal or chi-square binning to reduce the arbitrariness of manual binning, and besides logistic regression, many other models could be tried.
