Spark Random Forest Algorithm Case Study


Random Forest Algorithm

A random forest is a forest composed of multiple decision trees, and the classification result is decided by a vote among those trees. Randomness enters the tree-growing process in both the row direction and the column direction: in the row direction, the training data for each tree is obtained by bootstrap sampling (sampling with replacement); in the column direction, a feature subset is obtained by random sampling, and the optimal split point is then chosen within that subset. Figure 3 shows the classification principle of the random forest algorithm. As the figure suggests, a random forest is a composite model that is still built on decision trees internally; like a single decision tree it performs classification, but because it classifies by the vote of multiple decision trees, the algorithm is not prone to overfitting. A toy sketch of the two sources of randomness follows Figure 3.

Figure 3. Random Forest
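To make the two sources of randomness concrete, here is a minimal sketch in plain Scala (not the Spark MLlib API). The toy data mirrors the credit table below, a single-split "stump" stands in for a full decision tree, and all names are illustrative assumptions:

import scala.util.Random

object RandomForestSketch {
  type Row = Array[Double]
  case class Sample(features: Row, label: Int)

  val rng = new Random(42)

  // Row direction: build each tree's training set by bootstrap
  // sampling (sampling with replacement).
  def bootstrap(data: Seq[Sample]): Seq[Sample] =
    Seq.fill(data.size)(data(rng.nextInt(data.size)))

  // Column direction: draw a random subset of feature indices.
  def featureSubset(numFeatures: Int, subsetSize: Int): Seq[Int] =
    rng.shuffle((0 until numFeatures).toList).take(subsetSize)

  // Majority label among a set of labels (0 if empty, ties broken arbitrarily).
  def majority(labels: Seq[Int]): Int =
    if (labels.isEmpty) 0
    else labels.groupBy(identity).maxBy(_._2.size)._1

  // Stand-in for a full decision tree: a single threshold split ("stump")
  // on one feature chosen from the random subset.
  def trainStump(data: Seq[Sample], features: Seq[Int]): Row => Int = {
    val f = features(rng.nextInt(features.size))
    val threshold = data.map(_.features(f)).sum / data.size
    val left  = majority(data.filter(_.features(f) <= threshold).map(_.label))
    val right = majority(data.filter(_.features(f) >  threshold).map(_.label))
    row => if (row(f) <= threshold) left else right
  }

  def main(args: Array[String]): Unit = {
    // Toy data mirroring the credit example: (property, marital, income) -> label.
    val data = Seq(
      Sample(Array(0.0, 1.0, 10.0), 0), Sample(Array(0.0, 0.0, 8.0), 0),
      Sample(Array(1.0, 0.0, 13.0), 0), Sample(Array(1.0, 0.0, 8.0), 1))
    // Each tree sees a bootstrap sample of rows and a random feature subset.
    val trees = Seq.fill(3) {
      trainStump(bootstrap(data), featureSubset(numFeatures = 3, subsetSize = 2))
    }
    // The forest classifies by majority vote over its trees.
    println(majority(trees.map(_(Array(0.0, 1.0, 12.0)))))
  }
}

The stump keeps the sketch short; MLlib grows full decision trees, but the bootstrap, feature-subset, and voting steps are the same in spirit.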

Random Forest Algorithm Case

This section describes a concrete application of random forests through a case study. In general, a bank needs to evaluate a customer's repayment ability before lending, but when the volume of customer data is large, the pressure on credit auditors becomes heavy, and it is often hoped that a computer can assist the decision. The random forest algorithm suits this scenario: historical data can be fed into the algorithm for training, and the resulting model can then classify new customer data, filtering out the large number of customers without repayment ability and thus greatly reducing the credit auditors' workload.

Assume that the following historical repayment records of credit customers exist:

Table 2. Historical repayment data of credit customers

Record number | Owns property (yes/no) | Marital status (single/married/divorced) | Annual income (unit: 10,000 yuan) | Has repayment ability (yes/no)
10001 | No | Married | 10 | Yes
10002 | No | Single | 8 | Yes
10003 | Yes | Single | 13 | Yes
...... | ...... | ...... | ...... | ......
11000 | Yes | Single | 8 | No

Each repayment record above is converted to the format label index1:feature1 index2:feature2 index3:feature3 (the LIBSVM format that MLlib reads). The fields have the following meanings (a conversion sketch in code follows below):

Label (has repayment ability): 0 means yes, 1 means no
Feature 1 (owns property): 0 means no, 1 means yes
Feature 2 (marital status): 0 means single, 1 means married, 2 means divorced
Feature 3 (annual income): the actual number

For example, the first record in the table (10001) becomes:

0 1:0 2:1 3:10
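The conversion can be sketched as follows, assuming the encodings listed above. The record type, field names, and function name are illustrative, not from the original article:

object LibSVMConversionSketch {
  // Hypothetical record type; field names are illustrative.
  case class CreditRecord(ownsProperty: Boolean, marital: String, income: Int, canRepay: Boolean)

  def toLibSVMLine(r: CreditRecord): String = {
    val label    = if (r.canRepay) 0 else 1       // 0 means yes, 1 means no
    val property = if (r.ownsProperty) 1 else 0   // 0 means no, 1 means yes
    val marital = r.marital match {               // 0 single, 1 married, 2 divorced
      case "single"   => 0
      case "married"  => 1
      case "divorced" => 2
      case other      => sys.error(s"unknown marital status: $other")
    }
    s"$label 1:$property 2:$marital 3:${r.income}"
  }

  def main(args: Array[String]): Unit =
    // Record 10001: no property, married, income 10, can repay.
    println(toLibSVMLine(CreditRecord(ownsProperty = false, marital = "married", income = 10, canRepay = true)))
    // prints: 0 1:0 2:1 3:10
}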

After all the data in the table has been converted in this way, it is saved as sample_data.txt and used to train the random forest. The test data is:

Table 3. Test data

Owns property (yes/no) | Marital status (single/married/divorced) | Annual income (unit: 10,000 yuan)
No | Married | 12

If the random forest model has been trained correctly, the prediction for the above user should be "has repayment ability". For the convenience of later processing, we save the record as input.txt, with the content:

0 1:0 2:1 3:12

Upload sample_data.txt and input.txt to the /data directory in HDFS with hadoop fs -put sample_data.txt input.txt /data, and write the code shown in Listing 9 for validation.

Listing 9. Determine whether the customer has repayment ability
package cn.ml

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object RandomForestExample {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
      .setAppName("RandomForestExample")
      .setMaster("spark://sparkmaster:7077")
    val sc = new SparkContext(sparkConf)

    // Load the training data (LIBSVM format) from HDFS
    val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/data/sample_data.txt")

    val numClasses = 2
    val featureSubsetStrategy = "auto"
    val numTrees = 3

    // Train the random forest classifier
    val model: RandomForestModel = RandomForest.trainClassifier(
      data,
      Strategy.defaultStrategy("Classification"),
      numTrees,
      featureSubsetStrategy,
      new java.util.Random().nextInt())

    // Load the test data and predict its label
    val input: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/data/input.txt")
    val predictResult = input.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }

    // Print the results: run predictResult.collect() in the spark-shell
    // Save the results to HDFS:
    // predictResult.saveAsTextFile("/data/predictresult")

    sc.stop()
  }
}
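As an optional extension that is not in the original listing, the label/prediction pairs can be turned into an error rate with standard RDD operations (assuming the same predictResult and input values as in Listing 9):

// Optional check, not in the original listing: fraction of records whose
// prediction differs from the label stored in the test file.
val testErr = predictResult.filter { case (label, prediction) =>
  label != prediction
}.count().toDouble / input.count()
println(s"Test error = $testErr")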

The above code can either be packaged and submitted to the cluster with spark-submit, or executed in the spark-shell to view the results. Figure 10 shows the trained RandomForest model, and Figure 11 shows the prediction results returned by the model; as you can see, the predicted results are consistent with what we expected.
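For reference, viewing the results in the spark-shell amounts to collecting the label/prediction pairs. Given the single record in input.txt, the expected output (a hedged expectation based on the discussion above, not a captured log) is the label 0.0 paired with a prediction of 0.0, i.e. "has repayment ability":

// Run in the spark-shell after executing the code in Listing 9.
predictResult.collect()
// expected for the single test record: Array((0.0,0.0))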

Figure 10. The trained RandomForest model

Figure 11. Results returned by the collect method

Excerpt from: https://www.ibm.com/developerworks/cn/opensource/os-cn-spark-random-forest/index.html
