Random Forest algorithm
A random forest is composed of multiple decision trees, and the classification result is obtained by letting those trees vote. Randomness enters the tree-building process in two directions. In the row direction, each tree is trained on data obtained by sampling with replacement (bootstrapping). In the column direction, each split considers only a randomly sampled subset of the features, and the optimal split point is chosen from that subset. Figure 3 illustrates the classification principle of the random forest algorithm: a random forest is a composite model that is still built on decision trees internally. Unlike a single decision tree, the forest classifies by a majority vote over many trees, which makes the algorithm much less prone to overfitting.
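The two sources of randomness and the voting step described above can be sketched in plain Scala (without Spark). This is only an illustrative sketch; the helper names, the sample data, and the `sqrt`-sized feature subset are assumptions, not part of Spark MLlib:

```scala
import scala.util.Random

object RandomForestSketch {
  // Row direction: bootstrap sample of n rows, drawn with replacement.
  def bootstrap[T](rows: Vector[T], rng: Random): Vector[T] =
    Vector.fill(rows.length)(rows(rng.nextInt(rows.length)))

  // Column direction: a random subset of feature indices for one split,
  // here of size sqrt(numFeatures), a common choice for classification.
  def featureSubset(numFeatures: Int, rng: Random): Set[Int] = {
    val k = math.max(1, math.sqrt(numFeatures).round.toInt)
    rng.shuffle((0 until numFeatures).toList).take(k).toSet
  }

  // Majority vote over the predictions of the individual trees.
  def vote(predictions: Seq[Int]): Int =
    predictions.groupBy(identity).maxBy(_._2.size)._1

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    println(bootstrap(Vector(1, 2, 3, 4, 5), rng)) // resampled rows, may repeat
    println(featureSubset(9, rng))                 // 3 of 9 feature indices
    println(vote(Seq(1, 0, 1)))                    // prints 1: two trees out-vote one
  }
}
```

The real MLlib implementation performs these steps internally; the sketch only makes the row/column randomness and the voting explicit.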
Figure 3. Random Forest
Random Forest algorithm case
This section illustrates a concrete application of random forests through a case study. Before lending, a bank needs to evaluate a customer's repayment ability; when the volume of customer data is large, the pressure on credit auditors becomes heavy, so it is natural to want the computer to assist the decision. The random forest algorithm fits this scenario well: historical data can be fed into the algorithm for training, and the resulting model can then classify new customer data. This filters out a large number of customers without repayment ability and greatly reduces the auditors' workload.
Assume that the following credit user history repayment records exist:
Table 2. Credit user historical repayment data

| Record number | Owns property (yes/no) | Marital status (single, married, divorced) | Annual income (unit: 10,000 yuan) | Repayment ability (yes/no) |
|---|---|---|---|---|
| 10001 | No | Married | 10 | Yes |
| 10002 | No | Single | 8 | Yes |
| 10003 | Yes | Single | 13 | Yes |
| ... | ... | ... | ... | ... |
| 11000 | Yes | Single | 8 | No |
Each repayment record above is converted into the LIBSVM format `label index1:feature1 index2:feature2 index3:feature3`. For example, the first record in the table becomes:

0 1:0 2:1 3:10

The fields have the following meanings:

| Field | Meaning |
|---|---|
| Label (repayment ability) | 0 means yes, 1 means no |
| Feature 1 (owns property) | 0 means no, 1 means yes |
| Feature 2 (marital status) | 0 means single, 1 means married, 2 means divorced |
| Feature 3 (annual income) | the actual number |
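The conversion rule above can be sketched as a small helper. The function name and signature here are illustrative assumptions, not part of the original pipeline:

```scala
object LibsvmEncode {
  // Encode one record as "label 1:property 2:marital 3:income" (LIBSVM format).
  // label: 0 = can repay, 1 = cannot; property: 0 = no, 1 = yes;
  // marital: 0 = single, 1 = married, 2 = divorced; income: actual number.
  def encode(canRepay: Boolean, ownsProperty: Boolean,
             marital: Int, income: Int): String = {
    val label = if (canRepay) 0 else 1
    val prop  = if (ownsProperty) 1 else 0
    s"$label 1:$prop 2:$marital 3:$income"
  }

  def main(args: Array[String]): Unit = {
    // Record 10001: can repay, no property, married, annual income 10
    println(encode(canRepay = true, ownsProperty = false, marital = 1, income = 10))
    // prints "0 1:0 2:1 3:10"
  }
}
```

Applying it to record 10001 reproduces the example line shown above.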
After all the records in the table are converted, they are saved as sample_data.txt, which is used to train the random forest. The test data is:
Table 3. Test data

| Owns property (yes/no) | Marital status (single, married, divorced) | Annual income (unit: 10,000 yuan) |
|---|---|---|
| No | Married | 12 |
If the random forest model is trained correctly, this user should be classified as capable of repayment. For convenience of later processing, we encode the record the same way and save it as input.txt with the content:
0 1:0 2:1 3:12
Upload sample_data.txt and input.txt to the /data directory in HDFS with `hadoop fs -put sample_data.txt input.txt /data`, then write the code shown in Listing 9 for validation.
Listing 9. Determine if the customer has repayment ability
```scala
package cn.ml

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.linalg.Vectors

object RandomForestExample {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
      .setAppName("RandomForestExample")
      .setMaster("spark://sparkmaster:7077")
    val sc = new SparkContext(sparkConf)

    // Load the training data in LIBSVM format
    val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/data/sample_data.txt")

    val numClasses = 2
    val featureSubsetStrategy = "auto"
    val numTrees = 3
    // Train the random forest classifier
    val model: RandomForestModel = RandomForest.trainClassifier(
      data,
      Strategy.defaultStrategy("Classification"),
      numTrees,
      featureSubsetStrategy,
      new java.util.Random().nextInt())

    // Load the test data and predict for each point
    val input: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/data/input.txt")
    val predictResult = input.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    // Print the results: execute predictResult.collect() in the spark-shell
    // Save the results to HDFS:
    // predictResult.saveAsTextFile("/data/predictresult")
    sc.stop()
  }
}
```
The above code can either be packaged and submitted to the server with spark-submit, or executed in the spark-shell to view the results. Figure 10 shows the trained RandomForest model, and Figure 11 shows the prediction results of the RandomForest model; as you can see, the predicted results match what was expected.
Figure 10. The trained RandomForest model
Figure 11. Results returned by the collect method
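Beyond eyeballing the collected output, the (label, prediction) pairs returned by predictResult can be scored directly. A minimal sketch of that accuracy computation on a plain Scala collection (the sample pairs here are hypothetical, and no Spark dependency is assumed):

```scala
object PredictionAccuracy {
  // Fraction of (label, prediction) pairs that agree.
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (label, pred) => label == pred }.toDouble / pairs.length

  def main(args: Array[String]): Unit = {
    // Hypothetical collected results: three correct predictions, one wrong.
    val results = Seq((0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 0.0))
    println(accuracy(results)) // prints 0.75
  }
}
```

In a real job, the same computation can be done on the RDD itself with `predictResult.filter(r => r._1 == r._2).count()` divided by the total count.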
Excerpted from "Spark random forest algorithm case study": https://www.ibm.com/developerworks/cn/opensource/os-cn-spark-random-forest/index.html