Summary
This sample demonstrates how to perform cost-sensitive binary classification in Azure ML Studio to predict credit risk based on information given in a credit application.
Description
Binary Classification: Credit Risk Prediction
This sample demonstrates how to perform cost-sensitive binary classification in Azure ML Studio to predict credit risk based on the information given in a credit application. The classification problem in this experiment is a cost-sensitive one, because the cost of misclassifying the positive samples is five times the cost of misclassifying the negative samples.
In this experiment, we compare two different approaches for generating models to solve this problem:
- Training using the original data set
- Training using a replicated data set
In both approaches, we evaluate the models using the test data set with replication, to ensure that the results are aligned with the cost function. We test two classifiers in both approaches: Two-Class Support Vector Machine and Two-Class Boosted Decision Tree.
Data
We use the German Credit Card data set from the UC Irvine repository.
This data set contains 1000 samples with 20 features and 1 label. Each sample represents a person. The features include both numerical and categorical features. The last column is the label, which denotes the credit risk and has only two possible values: high credit risk = 2, and low credit risk = 1.
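For reference, the raw data can also be loaded directly in R. The URL and file layout below assume the standard UCI copy of the Statlog German credit data; verify them against your own source.
# Load the German credit data from the UCI repository (whitespace-separated,
# 20 feature columns plus the label in column 21).
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
german <- read.table(url, header = FALSE)
dim(german)        # expect 1000 rows and 21 columns
table(german[,21]) # label counts: 1 = low risk, 2 = high risk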
The cost of misclassifying a low risk example as high is 1, whereas the cost of misclassifying a high risk example as low is 5.
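As a concrete illustration of this cost function, the following sketch (our own helper, not part of the sample) computes the total misclassification cost from vectors of actual and predicted labels:
# Total misclassification cost, using the labels 1 = low risk, 2 = high risk.
misclassification.cost <- function(actual, predicted) {
  cost <- ifelse(actual == 2 & predicted == 1, 5,  # high risk scored as low: cost 5
          ifelse(actual == 1 & predicted == 2, 1,  # low risk scored as high: cost 1
                 0))                               # correct predictions cost nothing
  sum(cost)
}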
Data Processing
We started by using the Metadata Editor module to replace the default column names with more meaningful names, obtained from the data set description on the UCI site. The new column names are provided as comma-separated values in the New column names field of Metadata Editor.
Next, we generated the training and test sets used for developing the risk prediction model. We split the original data set into training and test sets of the same size using the Split module. To create sets of equal size, we set the option, Fraction of rows in the first output, to 0.5.
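Outside of ML Studio, an equivalent 50/50 split can be sketched in plain R; the seed and variable names here are ours, not part of the sample:
# Random 50/50 split, analogous to the Split module with the fraction set to 0.5.
set.seed(42)                                          # any fixed seed, for reproducibility
idx      <- sample(nrow(dataset), floor(0.5 * nrow(dataset)))
training <- dataset[idx, ]                            # first output of the split
test     <- dataset[-idx, ]                           # second output of the split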
Generating the New Data Set
Because the cost of underestimating risk is high in the real world, we set the cost of misclassification as follows:
- For high risk cases misclassified as low risk: 5
- For low risk cases misclassified as high risk: 1
To reflect the cost function, we generated a new data set, in which each high risk example is replicated five times, whereas the number of low risk examples is kept as is. We split the data into training and test data sets before replication to prevent the same example from being in both the training and test sets.
To replicate the high risk data, we put the following R code into an Execute R Script module:
dataset <- maml.mapInputPort(1)                  # read the input data set
data.set <- dataset[dataset[,21] == 1,]          # keep the low risk examples (label 1)
pos <- dataset[dataset[,21] == 2,]               # select the high risk examples (label 2)
for (i in 1:5) data.set <- rbind(data.set, pos)  # append the high risk examples five times
row.names(data.set) <- NULL                      # reset row names after rbind
maml.mapOutputPort("data.set")                   # return the replicated data set
Both the training and test data sets are replicated using the Execute R Script module.
Finally, we used the Descriptive Statistics module to compute statistics for all fields of the input data.
Feature Engineering
One of the machine learning algorithms requires that data be normalized. Therefore, we used the Normalize Data module to normalize the ranges of all numeric features, using a tanh transformation. A tanh transformation converts all numeric features to values within a range of 0-1, while preserving the overall distribution of values.
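The exact formula used by the Normalize Data module is internal to Azure ML, but a common form of the tanh transformation (the tanh estimator) can be sketched as follows; treat it as an approximation, not the module's implementation:
# Tanh normalization sketch: squashes a numeric vector into the range 0-1
# while preserving the shape of its distribution.
tanh.normalize <- function(x) {
  0.5 * (tanh(0.01 * (x - mean(x)) / sd(x)) + 1)
}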
The Two-Class Support Vector Machine module handles string features for us, converting them to categorical features and then to binary features having a value of 0 or 1, so there is no need to normalize these features.
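Conceptually, converting a string feature into 0/1 indicator features looks like the sketch below; the column name and values are hypothetical, and the SVM module's internal handling may differ:
# One indicator (0/1) column per category of a string feature.
df <- data.frame(purpose = factor(c("car", "education", "car", "furniture")))
indicators <- model.matrix(~ purpose - 1, data = df)  # drop the intercept
indicators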
Model
In this experiment, we applied two classifiers: Two-Class Support Vector Machine (SVM) and Two-Class Boosted Decision Tree. Because we also used two datasets, we generated a total of four models:
- SVM, trained with original data
- SVM, trained with replicated data
- Boosted decision tree, trained with original data
- Boosted decision tree, trained with replicated data
We used the standard experimental workflow to create, train, and test the models:
- Initialize the learning algorithms, using Two-Class Support Vector Machine and Two-Class Boosted Decision Tree
- Use Train Model to apply the algorithm to the data and create the actual model
- Use Score Model to produce scores using the test examples
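The same three steps can be sketched outside of ML Studio in plain R, here using the e1071 package as a stand-in for the Azure modules; the column name V21 assumes the default names from read.table:
# Initialize/train with svm(), then score the held-out test examples.
library(e1071)
training$V21 <- factor(training$V21)       # treat the label as a class, not a number
model  <- svm(V21 ~ ., data = training)    # initialize and train in one call
scores <- predict(model, newdata = test)   # score the test examples
mean(scores == test$V21)                   # plain, unweighted accuracy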
The following diagram shows a portion of this experiment, in which the original and replicated training sets are used to train the two different SVM models. Train Model is connected to the training set, whereas Score Model is connected to the test set.
In the evaluation stage of the experiment, we computed the accuracy of each of the four models. For this experiment, we used Evaluate Model to compare examples that have the same misclassification cost.
The Evaluate Model module can compute performance metrics for up to two scored models. Therefore, we used one instance of Evaluate Model to evaluate the two SVM models, and another instance of Evaluate Model to evaluate the two boosted decision tree models.
Notice that the replicated test data set is used as the input for Score Model. In other words, the final accuracy scores include the cost for getting the labels wrong.
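Scoring on the replicated test set is equivalent to computing a cost-weighted accuracy on the original test set, as this sketch (our own helper) makes explicit:
# Weighted accuracy: each high risk example (label 2) counts 5 times, which is
# exactly what replicating those rows in the test set achieves.
weighted.accuracy <- function(actual, predicted) {
  w <- ifelse(actual == 2, 5, 1)           # replication factor per example
  sum(w * (actual == predicted)) / sum(w)
}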
Combine Multiple Results
The Evaluate Model module produces a table with a single row that contains various metrics. To create a single set of accuracy results, we first used Add Rows to combine the results into a single table, and then used the following simple R script in the Execute R Script module to add the model name and training approach for each row in the table of results.
dataset <- maml.mapInputPort(1)                     # read the combined metrics table
# Label each of the four result rows with its algorithm and training approach.
a <- matrix(c("SVM", "weighted",
              "SVM", "unweighted",
              "Boosted Decision Tree", "weighted",
              "Boosted Decision Tree", "unweighted"),
            nrow = 4, ncol = 2, byrow = TRUE)
data.set <- cbind(a, dataset)                       # prepend the two label columns
names(data.set)[1:2] <- c("Algorithm", "Training")  # name the new columns
maml.mapOutputPort("data.set")                      # return the annotated table
Finally, we removed the columns with non-relevant metrics using the Project Columns module.
Results
To view the final results of the experiment, you can right-click the output of the last Project Columns module and select Visualize. The first column lists the machine learning algorithm used to generate a model. The second column indicates the type of the training set. The third column contains the cost-sensitive accuracy value.
From these results, you can see that the best accuracy is provided by the model that was created using Two-Class Support Vector Machine and trained on the replicated training data set.