Spark Machine Learning (6): Decision Tree algorithm

Source: Internet
Author: User
Tags: id3

1. Basic knowledge of decision trees

A decision tree is an algorithm that classifies data by a series of rules. Decision trees can be divided into classification trees and regression trees: a classification tree handles discrete variables, while a regression tree handles continuous variables.

A sample generally has many features. Some features play a large role in classification, while others contribute little or nothing. For example, when deciding whether to grant a person a loan, that person's credit record, income, and so on are the main criteria, while gender, marital status, and so on are secondary. Constructing a decision tree means splitting first on the most decisive features and then on less decisive ones, which builds an inverted tree. This tree is the decision tree model, which can then be used to classify new data.

The process of decision tree learning can be divided into three steps: 1) feature selection, i.e., selecting one feature from many as the classification criterion for the current node; 2) decision tree generation, building nodes from top to bottom; 3) pruning, cutting back the decision tree to prevent or eliminate overfitting.

2. Decision Tree Algorithm

The main decision tree algorithms are ID3, C4.5, and CART.

ID3 uses information gain as the criterion for selecting features. Because features with many distinct values (such as an ID number) tend to have larger information gain, the algorithm is biased toward such features. ID3 can only be used on discrete data; its advantage is that no pruning is needed.
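To make the criterion concrete, here is a minimal standalone sketch (plain Scala, no Spark) of entropy and information gain; the toy counts are a hypothetical split of 14 samples (9 positive, 5 negative) into three branches, not data from this article:

```scala
// Entropy of a label distribution: H = -sum(p_i * log2(p_i))
def entropy(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information gain of a split: H(parent) minus the weighted entropy of the children
def infoGain(parent: Seq[Int], children: Seq[Seq[Int]]): Double = {
  val total = parent.sum.toDouble
  entropy(parent) - children.map(ch => (ch.sum / total) * entropy(ch)).sum
}

// Hypothetical split: 14 samples (9 pos, 5 neg) into branches (2,3), (4,0), (3,2)
val gain = infoGain(Seq(9, 5), Seq(Seq(2, 3), Seq(4, 0), Seq(3, 2)))
println(f"information gain = $gain%.4f")
```

ID3 would compute this gain for every candidate feature and split on the one with the largest value.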

C4.5 is similar to ID3; the difference is that it uses the gain ratio instead of information gain as the criterion for selecting features, which corrects ID3's bias toward many-valued features. It can also handle continuous data, but it requires pruning.
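The gain ratio divides the information gain by the split information, i.e., the entropy of the partition sizes themselves, which penalizes splits with many small branches. A minimal standalone sketch (plain Scala, toy counts chosen for illustration):

```scala
// Entropy of a label or size distribution: -sum(p_i * log2(p_i))
def entropy(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

// Gain ratio (C4.5) = information gain / split information
def gainRatio(parent: Seq[Int], children: Seq[Seq[Int]]): Double = {
  val total = parent.sum.toDouble
  val gain = entropy(parent) - children.map(ch => (ch.sum / total) * entropy(ch)).sum
  val splitInfo = entropy(children.map(_.sum)) // entropy of branch sizes 5, 4, 5
  gain / splitInfo
}

// Hypothetical split: 14 samples (9 pos, 5 neg) into branches (2,3), (4,0), (3,2)
println(gainRatio(Seq(9, 5), Seq(Seq(2, 3), Seq(4, 0), Seq(3, 2))))
```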

CART (Classification and Regression Tree) uses the Gini index as the selection criterion. The larger the Gini index, the greater the impurity of the node, and the worse the feature is as a split.
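Gini impurity is 1 minus the sum of squared class probabilities: 0 for a pure node, and 0.5 for a maximally mixed two-class node. A standalone sketch in plain Scala:

```scala
// Gini impurity: 1 - sum(p_i^2); 0 means the node is pure, larger means more mixed
def gini(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  1.0 - counts.map(c => math.pow(c / total, 2)).sum
}

println(gini(Seq(10, 0))) // pure node: 0.0
println(gini(Seq(5, 5)))  // evenly mixed binary node: 0.5
```

CART chooses the split that minimizes the weighted Gini impurity of the resulting children.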

3. MLlib's decision tree algorithm

MLlib's decision tree implementation reuses the random forest (RandomForest) code, but it is not really a random forest, because only one decision tree is actually trained.

The code is as follows:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

/**
  * Created by Administrator on 2017/7/6.
  */
object DecisionTreeTest {
  def main(args: Array[String]): Unit = {
    // Set up the running environment
    val conf = new SparkConf().setAppName("Decision Tree")
      .setMaster("spark://master:7077")
      .setJars(Seq("E:\\Intellij\\Projects\\MachineLearning\\MachineLearning.jar"))
    val sc = new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)

    // Read the sample data (LIBSVM format) and parse it
    val dataRDD = MLUtils.loadLibSVMFile(sc, "hdfs://master:9000/ml/data/sample_dt_data.txt")

    // Split the sample data: 80% training samples, 20% test samples
    val dataParts = dataRDD.randomSplit(Array(0.8, 0.2))
    val trainRDD = dataParts(0)
    val testRDD = dataParts(1)

    // Decision tree parameters
    val numClasses = 5
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    // Build the decision tree model and train it
    val model = DecisionTree.trainClassifier(trainRDD, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

    // Predict on the test samples
    val predictionAndLabel = testRDD.map { point =>
      val score = model.predict(point.features)
      (score, point.label, point.features)
    }
    val showPredict = predictionAndLabel.take(50)
    println("Prediction" + "\t" + "Label" + "\t" + "Data")
    for (i <- 0 until showPredict.length) {
      println(showPredict(i)._1 + "\t" + showPredict(i)._2 + "\t" + showPredict(i)._3)
    }

    // Accuracy calculation
    val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / testRDD.count()
    println("Accuracy = " + accuracy)

    // Save the model, then load it back
    val modelPath = "hdfs://master:9000/ml/model/decision_tree_model"
    model.save(sc, modelPath)
    val sameModel = DecisionTreeModel.load(sc, modelPath)
  }
}

