Spark Random Forest Algorithm Case Study


Random Forest Algorithm

A random forest is a forest composed of multiple decision trees, and the classification result is decided by a vote among those trees. Randomness enters the tree-growing process in both the row direction and the column direction: in the row direction, the training data for each tree is obtained by bootstrap sampling (sampling with replacement); in the column direction, a feature subset is obtained by random sampling, and the optimal split point is then chosen within that subset. Figure 3 shows the classification principle of the random forest algorithm. As the figure suggests, a random forest is a composite model that is still built on decision trees internally; like a single decision tree it performs classification, but because it classifies by the vote of multiple decision trees, the algorithm is not prone to overfitting. A toy sketch of the two sources of randomness follows Figure 3.

Figure 3. Random Forest
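To make the two sources of randomness concrete, here is a minimal sketch in plain Scala (not the Spark MLlib API). The toy data mirrors the credit table below, a single-split "stump" stands in for a full decision tree, and all names are illustrative assumptions:

import scala.util.Random

object RandomForestSketch {
  type Row = Array[Double]
  case class Sample(features: Row, label: Int)

  val rng = new Random(42)

  // Row direction: build each tree's training set by bootstrap
  // sampling (sampling with replacement).
  def bootstrap(data: Seq[Sample]): Seq[Sample] =
    Seq.fill(data.size)(data(rng.nextInt(data.size)))

  // Column direction: draw a random subset of feature indices.
  def featureSubset(numFeatures: Int, subsetSize: Int): Seq[Int] =
    rng.shuffle((0 until numFeatures).toList).take(subsetSize)

  // Majority label among a set of labels (0 if empty, ties broken arbitrarily).
  def majority(labels: Seq[Int]): Int =
    if (labels.isEmpty) 0
    else labels.groupBy(identity).maxBy(_._2.size)._1

  // Stand-in for a full decision tree: a single threshold split ("stump")
  // on one feature chosen from the random subset.
  def trainStump(data: Seq[Sample], features: Seq[Int]): Row => Int = {
    val f = features(rng.nextInt(features.size))
    val threshold = data.map(_.features(f)).sum / data.size
    val left  = majority(data.filter(_.features(f) <= threshold).map(_.label))
    val right = majority(data.filter(_.features(f) >  threshold).map(_.label))
    row => if (row(f) <= threshold) left else right
  }

  def main(args: Array[String]): Unit = {
    // Toy data mirroring the credit example: (property, marital, income) -> label.
    val data = Seq(
      Sample(Array(0.0, 1.0, 10.0), 0), Sample(Array(0.0, 0.0, 8.0), 0),
      Sample(Array(1.0, 0.0, 13.0), 0), Sample(Array(1.0, 0.0, 8.0), 1))
    // Each tree sees a bootstrap sample of rows and a random feature subset.
    val trees = Seq.fill(3) {
      trainStump(bootstrap(data), featureSubset(numFeatures = 3, subsetSize = 2))
    }
    // The forest classifies by majority vote over its trees.
    println(majority(trees.map(_(Array(0.0, 1.0, 12.0)))))
  }
}

The stump keeps the sketch short; MLlib grows full decision trees, but the bootstrap, feature-subset, and voting steps are the same in spirit.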

Random Forest Algorithm Case

This section describes a concrete application of random forests through a case study. In general, a bank needs to evaluate a customer's repayment ability before lending, but when the volume of customer data is large, the pressure on credit auditors becomes heavy, and it is often hoped that a computer can assist the decision. The random forest algorithm suits this scenario: historical data can be fed into the algorithm for training, and the resulting model can then classify new customer data, filtering out the large number of customers without repayment ability and thus greatly reducing the credit auditors' workload.

Assume that the following historical repayment records of credit customers exist:

Table 2. Historical repayment data of credit customers

Record number | Owns property (yes/no) | Marital status (single/married/divorced) | Annual income (unit: 10,000 yuan) | Has repayment ability (yes/no)
10001 | No | Married | 10 | Yes
10002 | No | Single | 8 | Yes
10003 | Yes | Single | 13 | Yes
...... | ...... | ...... | ...... | ......
11000 | Yes | Single | 8 | No

Each repayment record above is converted to the format label index1:feature1 index2:feature2 index3:feature3 (the LIBSVM format that MLlib reads). The fields have the following meanings (a conversion sketch in code follows below):

Label (has repayment ability): 0 means yes, 1 means no
Feature 1 (owns property): 0 means no, 1 means yes
Feature 2 (marital status): 0 means single, 1 means married, 2 means divorced
Feature 3 (annual income): the actual number

For example, the first record in the table (10001) becomes:

0 1:0 2:1 3:10
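The conversion can be sketched as follows, assuming the encodings listed above. The record type, field names, and function name are illustrative, not from the original article:

object LibSVMConversionSketch {
  // Hypothetical record type; field names are illustrative.
  case class CreditRecord(ownsProperty: Boolean, marital: String, income: Int, canRepay: Boolean)

  def toLibSVMLine(r: CreditRecord): String = {
    val label    = if (r.canRepay) 0 else 1       // 0 means yes, 1 means no
    val property = if (r.ownsProperty) 1 else 0   // 0 means no, 1 means yes
    val marital = r.marital match {               // 0 single, 1 married, 2 divorced
      case "single"   => 0
      case "married"  => 1
      case "divorced" => 2
      case other      => sys.error(s"unknown marital status: $other")
    }
    s"$label 1:$property 2:$marital 3:${r.income}"
  }

  def main(args: Array[String]): Unit =
    // Record 10001: no property, married, income 10, can repay.
    println(toLibSVMLine(CreditRecord(ownsProperty = false, marital = "married", income = 10, canRepay = true)))
    // prints: 0 1:0 2:1 3:10
}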

After all the data in the table has been converted in this way, it is saved as sample_data.txt and used to train the random forest. The test data is:

Table 3. Test data

Owns property (yes/no) | Marital status (single/married/divorced) | Annual income (unit: 10,000 yuan)
No | Married | 12

If the random forest model has been trained correctly, the prediction for the above user should be "has repayment ability". For the convenience of later processing, we save the record as input.txt, with the content:

0 1:0 2:1 3:12

Upload sample_data.txt and input.txt to the /data directory in HDFS with hadoop fs -put sample_data.txt input.txt /data, and write the code shown in Listing 9 for validation.

Listing 9. Determine whether the customer has repayment ability
package cn.ml

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object RandomForestExample {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
      .setAppName("RandomForestExample")
      .setMaster("spark://sparkmaster:7077")
    val sc = new SparkContext(sparkConf)

    // Load the training data (LIBSVM format) from HDFS
    val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/data/sample_data.txt")

    val numClasses = 2
    val featureSubsetStrategy = "auto"
    val numTrees = 3

    // Train the random forest classifier
    val model: RandomForestModel = RandomForest.trainClassifier(
      data,
      Strategy.defaultStrategy("Classification"),
      numTrees,
      featureSubsetStrategy,
      new java.util.Random().nextInt())

    // Load the test data and predict its label
    val input: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/data/input.txt")
    val predictResult = input.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }

    // Print the results: run predictResult.collect() in the spark-shell
    // Save the results to HDFS:
    // predictResult.saveAsTextFile("/data/predictresult")

    sc.stop()
  }
}
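As an optional extension that is not in the original listing, the label/prediction pairs can be turned into an error rate with standard RDD operations (assuming the same predictResult and input values as in Listing 9):

// Optional check, not in the original listing: fraction of records whose
// prediction differs from the label stored in the test file.
val testErr = predictResult.filter { case (label, prediction) =>
  label != prediction
}.count().toDouble / input.count()
println(s"Test error = $testErr")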

The above code can either be packaged and submitted to the cluster with spark-submit, or executed in the spark-shell to view the results. Figure 10 shows the trained RandomForest model, and Figure 11 shows the prediction results returned by the model; as you can see, the predicted results are consistent with what we expected.
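For reference, viewing the results in the spark-shell amounts to collecting the label/prediction pairs. Given the single record in input.txt, the expected output (a hedged expectation based on the discussion above, not a captured log) is the label 0.0 paired with a prediction of 0.0, i.e. "has repayment ability":

// Run in the spark-shell after executing the code in Listing 9.
predictResult.collect()
// expected for the single test record: Array((0.0,0.0))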

Figure 10. The trained RandomForest model

Figure 11. Results returned by the collect method

Excerpt from: https://www.ibm.com/developerworks/cn/opensource/os-cn-spark-random-forest/index.html
