Spark Machine Learning (6): Decision Tree algorithm

Source: Internet
Author: User
Tags: id3

1. Basic knowledge of decision trees

A decision tree is an algorithm that classifies data by a series of rules. Decision trees can be divided into classification trees and regression trees: a classification tree handles discrete variables, while a regression tree handles continuous variables.

A sample generally has many features. Some features play a large role in classification, while others contribute little or nothing. For example, when deciding whether to grant a person a loan, that person's credit record, income, and so on are the main criteria, while gender, marital status, and so on are secondary. Constructing a decision tree means splitting first on the most decisive features and then on less decisive ones, which builds an inverted tree. This tree is the decision tree model, which can then be used to classify new data.

The process of decision tree learning can be divided into three steps: 1) feature selection, i.e., selecting one feature from many as the classification criterion for the current node; 2) decision tree generation, building nodes from top to bottom; 3) pruning, cutting back the decision tree to prevent or eliminate overfitting.

2. Decision Tree Algorithm

The main decision tree algorithms are ID3, C4.5, and CART.

ID3 uses information gain as the criterion for selecting features. Because features with many distinct values (such as an ID number) tend to have larger information gain, the algorithm is biased toward such features. ID3 can only be used on discrete data; its advantage is that no pruning is needed.
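To make the criterion concrete, here is a minimal standalone sketch (plain Scala, no Spark) of entropy and information gain; the toy counts are a hypothetical split of 14 samples (9 positive, 5 negative) into three branches, not data from this article:

```scala
// Entropy of a label distribution: H = -sum(p_i * log2(p_i))
def entropy(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information gain of a split: H(parent) minus the weighted entropy of the children
def infoGain(parent: Seq[Int], children: Seq[Seq[Int]]): Double = {
  val total = parent.sum.toDouble
  entropy(parent) - children.map(ch => (ch.sum / total) * entropy(ch)).sum
}

// Hypothetical split: 14 samples (9 pos, 5 neg) into branches (2,3), (4,0), (3,2)
val gain = infoGain(Seq(9, 5), Seq(Seq(2, 3), Seq(4, 0), Seq(3, 2)))
println(f"information gain = $gain%.4f")
```

ID3 would compute this gain for every candidate feature and split on the one with the largest value.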

C4.5 is similar to ID3; the difference is that it uses the gain ratio instead of information gain as the criterion for selecting features, which corrects ID3's bias toward many-valued features. It can also handle continuous data, but it requires pruning.
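The gain ratio divides the information gain by the split information, i.e., the entropy of the partition sizes themselves, which penalizes splits with many small branches. A minimal standalone sketch (plain Scala, toy counts chosen for illustration):

```scala
// Entropy of a label or size distribution: -sum(p_i * log2(p_i))
def entropy(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

// Gain ratio (C4.5) = information gain / split information
def gainRatio(parent: Seq[Int], children: Seq[Seq[Int]]): Double = {
  val total = parent.sum.toDouble
  val gain = entropy(parent) - children.map(ch => (ch.sum / total) * entropy(ch)).sum
  val splitInfo = entropy(children.map(_.sum)) // entropy of branch sizes 5, 4, 5
  gain / splitInfo
}

// Hypothetical split: 14 samples (9 pos, 5 neg) into branches (2,3), (4,0), (3,2)
println(gainRatio(Seq(9, 5), Seq(Seq(2, 3), Seq(4, 0), Seq(3, 2))))
```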

CART (Classification and Regression Tree) uses the Gini index as the selection criterion. The larger the Gini index, the greater the impurity of the node, and the worse the feature is as a split.
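Gini impurity is 1 minus the sum of squared class probabilities: 0 for a pure node, and 0.5 for a maximally mixed two-class node. A standalone sketch in plain Scala:

```scala
// Gini impurity: 1 - sum(p_i^2); 0 means the node is pure, larger means more mixed
def gini(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  1.0 - counts.map(c => math.pow(c / total, 2)).sum
}

println(gini(Seq(10, 0))) // pure node: 0.0
println(gini(Seq(5, 5)))  // evenly mixed binary node: 0.5
```

CART chooses the split that minimizes the weighted Gini impurity of the resulting children.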

3. MLlib's decision tree algorithm

MLlib's decision tree implementation reuses the random forest (RandomForest) code, but it is not really a random forest, because only one decision tree is actually trained.

The code is as follows:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

/**
  * Created by Administrator on 2017/7/6.
  */
object DecisionTreeTest {
  def main(args: Array[String]): Unit = {
    // Set up the running environment
    val conf = new SparkConf().setAppName("Decision Tree")
      .setMaster("spark://master:7077")
      .setJars(Seq("E:\\Intellij\\Projects\\MachineLearning\\MachineLearning.jar"))
    val sc = new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)

    // Read the sample data (LIBSVM format) and parse it
    val dataRDD = MLUtils.loadLibSVMFile(sc, "hdfs://master:9000/ml/data/sample_dt_data.txt")

    // Split the sample data: 80% training samples, 20% test samples
    val dataParts = dataRDD.randomSplit(Array(0.8, 0.2))
    val trainRDD = dataParts(0)
    val testRDD = dataParts(1)

    // Decision tree parameters
    val numClasses = 5
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    // Build the decision tree model and train it
    val model = DecisionTree.trainClassifier(trainRDD, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

    // Predict on the test samples
    val predictionAndLabel = testRDD.map { point =>
      val score = model.predict(point.features)
      (score, point.label, point.features)
    }
    val showPredict = predictionAndLabel.take(50)
    println("Prediction" + "\t" + "Label" + "\t" + "Data")
    for (i <- 0 until showPredict.length) {
      println(showPredict(i)._1 + "\t" + showPredict(i)._2 + "\t" + showPredict(i)._3)
    }

    // Accuracy calculation
    val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / testRDD.count()
    println("Accuracy = " + accuracy)

    // Save the model, then load it back
    val modelPath = "hdfs://master:9000/ml/model/decision_tree_model"
    model.save(sc, modelPath)
    val sameModel = DecisionTreeModel.load(sc, modelPath)
  }
}

