Prediction of forest vegetation by decision tree algorithm

Source: Internet
Author: User
Tags: spark, mllib

Algorithm introduction: regression and classification

Regression and classification algorithms are often discussed together because both predict one or more values from one or more input values.
To make predictions, both must learn prediction rules from a set of inputs and outputs; during the learning process they need to be given both the questions and the answers to those questions.
Therefore, regression and classification belong to the class of supervised learning algorithms.

Regression predicts a numerical result, such as a temperature or a grade.
Classification predicts a label or category, such as whether an email is spam, or which race a person belongs to.

This article uses decision trees and random forests, both of which are flexible, widely used algorithms
that can be applied to classification problems as well as regression problems.

Feature recognition

Let's say we want to predict a restaurant's sales for tomorrow from its sales today; using a machine learning algorithm for this is perfectly reasonable.
But before we do anything, note that "today's sales" is a very broad concept that cannot be fed directly to an algorithm.
We first need to extract and identify the features of today's sales that help us forecast tomorrow's sales,
for example:

  • Minimum customer traffic today: 20
  • Maximum customer traffic today: 200
  • Average serving speed today: 3
  • Busiest period (morning, noon, or evening): noon
  • Average customer traffic at the hotel next door: 60

And so on. Features are sometimes called dimensions or predictors, and each feature can be quantified, for example:
the minimum customer traffic today is 20, measured in people;
the average serving speed today is 3, measured per minute.

As a result, today's sales can be reduced to a list of values:
20, 200, 3, noon, 60

These features, arranged in order, form a feature vector; one feature vector can describe the merchant's sales for one day.

How to extract features depends on the actual business scenario: the extracted features should aptly describe the specific business, and the process may even require the involvement of business experts.

You may have noticed that not every value in the feature vector in the example is numeric, such as "noon".
In practical applications this is common, and features are usually divided into two types:

  • Numeric features: can be used directly as numbers; the relative size of the values is meaningful
  • Categorical features: take one of several discrete values; the relative size of the discrete values is meaningless

Training samples

To make predictions, a supervised learning algorithm must be trained on a large dataset that contains not only the input features but also the correct outputs.
The feature vector in the example above cannot be used as a training sample because it lacks the correct output: today's sales.
If we add today's sales, say 2000, to the data, it can be used as a training sample, providing a structured input to the machine learning algorithm:
20, 200, 3, noon, 60, 2000

Note that the output here is numeric.
The difference between regression and classification is that the target of regression is a numeric feature, while the target of classification is a categorical feature.
Not all regression and classification algorithms can handle categorical features or categorical targets; some algorithms can only handle numeric types.

We can convert categorical features to numeric features with appropriate conversion rules, such as:

  • Morning: 1
  • Noon: 2
  • Evening: 3

The numbers 1, 2, and 3 represent morning, noon, and evening respectively, so the training sample can be expressed as:
20, 200, 3, 2, 60, 2000
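
As a minimal sketch of this conversion (the variable names and the mapping are hypothetical, introduced only to illustrate the encoding; they are not part of any dataset):

// Hypothetical sketch: encoding the categorical "busiest period" feature as a number
// before assembling the numeric training sample shown above.
val periodCode = Map("morning" -> 1.0, "noon" -> 2.0, "evening" -> 3.0)

val minTraffic      = 20.0
val maxTraffic      = 200.0
val servingSpeed    = 3.0
val busiestPeriod   = "noon"
val neighborTraffic = 60.0
val sales           = 2000.0   // the target (correct output)

// Produces the sample 20, 200, 3, 2, 60, 2000 from above
val sample: Array[Double] =
  Array(minTraffic, maxTraffic, servingSpeed, periodCode(busiestPeriod), neighborTraffic, sales)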

Note: some algorithms make assumptions about the relative size of numeric features, which can have an impact once a categorical feature is converted to a numeric one, because the relative size of a categorical feature is itself meaningless.

Decision Tree

The decision tree family of algorithms can naturally handle both numeric and categorical features.
What is a decision tree?
In fact, we unconsciously use the kind of reasoning embodied in decision trees in everyday life. For example:
you see a new phone you like, but the old phone still works. Should you buy it?
You might go through a process like this:

1. Is the old phone unbearable to use? If yes, buy the new one.
2. If not, is the new phone so attractive that you cannot resist it? If yes, buy it.
3. If not, is the price of the new phone low enough that you can buy it without thinking? If yes, buy it.
4. If not, is the overall price/performance of the new phone far ahead of the old phone? If yes, buy it.
5. If not, don't buy it.

As you can see, the reasoning process of a decision tree is a series of yes/no questions, which in a program is expressed as a chain of if/else statements.
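
As a minimal sketch (the predicates are hypothetical and simply mirror the list above), the phone-buying reasoning written as if/else might look like this:

// Hypothetical sketch: the phone-buying decision above as a chain of if/else.
def shouldBuyNewPhone(oldPhoneUnbearable: Boolean,
                      newPhoneIrresistible: Boolean,
                      priceLowEnough: Boolean,
                      valueFarAhead: Boolean): Boolean = {
  if (oldPhoneUnbearable) true
  else if (newPhoneIrresistible) true
  else if (priceLowEnough) true
  else if (valueFarAhead) true
  else false
}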
However, the decision tree has a very serious drawback: over-fitting. How should we understand this?
It means a model trained on the training data performs well on that training data but cannot make reasonable predictions on new data.
A model trained by a decision tree may therefore over-fit the training data.

Random Forest

A random forest is composed of a number of random decision trees. What is a random decision tree?
For each decision tree in a random forest, the training data used is a random subset of the full training data, and the decision rules considered at each level are also randomly selected.
The random forest approach embodies collective wisdom: the average prediction of the collective should be more accurate than the prediction of any individual.
This independence comes from the randomness introduced when the random forest is built, which is the key to random forests.

Because each decision tree is built from random data and random decisions, it is worthwhile for a random forest to spend the time building many different decision trees.
Because of this, the trees in the forest are unlikely to over-fit in the same way, since the decisions each tree uses are only a random subset.

The prediction of a random forest is a weighted average of the predictions of all its decision trees.
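
As a rough sketch of the idea (not MLlib's internal implementation), combining the predictions of several trees for a classification problem could be as simple as an unweighted majority vote:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Sketch: combine the predictions of several trained decision trees
// for classification with a simple, unweighted majority vote.
def forestPredict(trees: Seq[DecisionTreeModel], features: Vector): Double = {
  val votes = trees.map(_.predict(features))
  votes.groupBy(identity).maxBy(_._2.size)._1
}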

Program development: the dataset

The example uses the Covtype forest cover dataset, which is public and can be downloaded online.
Each record in the dataset describes a plot of land with 54 features, including elevation, slope, distance to water, shading, and soil type, and gives the target feature: the type of forest vegetation on that plot of land.

Decision trees in Spark MLlib

Here we use MLlib, Spark's machine learning library. MLlib abstracts the feature vector as a LabeledPoint; for an introduction to LabeledPoint, see:
Spark (11) – MLlib API programming: linear regression, KMeans, collaborative filtering demo

The vector in a LabeledPoint is essentially an abstraction over multiple Double values, so a LabeledPoint can only hold numeric features; the sample dataset has already converted its categorical features to numeric ones.
The input to DecisionTree in MLlib is of type LabeledPoint, so after reading the data we need to convert it to LabeledPoint:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val conf = new SparkConf().setAppName("DecisionTree")
val sc = new SparkContext(conf)

// Read data
val rawData = sc.textFile("/spark_data/covtype.data")

// Convert to LabeledPoint
val data = rawData.map { line =>
  val values = line.split(",").map(_.toDouble)
  // init returns all elements except the last one; use them as the feature vector
  val feature = Vectors.dense(values.init)
  // The last value is the target feature; DecisionTree expects labels starting at 0,
  // but in the data they start at 1, so subtract 1
  val label = values.last - 1
  LabeledPoint(label, feature)
}

To be able to evaluate the accuracy of the trained model, we divide the dataset into three parts: a training set, a cross-validation set, and a test set, accounting for 80%, 10%, and 10% respectively:

val Array(trainData, cvData, testData) = data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache()
cvData.cache()
testData.cache()

Let's start by training a decision tree model and see what the results look like.

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

/**
 * Get evaluation metrics
 *
 * @param model the decision tree model
 * @param data  the cross-validation dataset
 */
def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = {
  // Give the feature vector of each sample in the cross-validation set to the model for prediction,
  // and pair the prediction with the original correct target feature
  val predictionsAndLabels = data.map { d =>
    (model.predict(d.features), d.label)
  }
  // Hand the results to MulticlassMetrics, which can evaluate the quality of the
  // classifier's predictions in several different ways
  new MulticlassMetrics(predictionsAndLabels)
}

val model = DecisionTree.trainClassifier(trainData, 7, Map[Int, Int](), "gini", 4, 100)
val metrics = getMetrics(model, cvData)

DecisionTree provides two methods for training a model: trainClassifier and trainRegressor, corresponding to classification and regression problems respectively.
The parameters of trainClassifier are (an annotated version of the call appears after this list):

1. The training dataset.
2. The number of target categories, i.e. how many possible values the result can take.
3. A Map whose keys are indices of categorical features in the feature vector and whose values are the number of categories of each such feature; an empty Map means all features are treated as numeric (the example passes an empty Map for convenience, which would not normally be done in practice).
4. The impurity measure: "gini" or "entropy". Impurity measures the quality of a decision rule: a good rule splits the data into parts that are each as pure as possible, a bad rule does the opposite.
5. The maximum depth of the decision tree; the deeper the tree, the more likely it is to over-fit.
6. The maximum number of bins of the decision tree, i.e. how many candidate decision rules are considered at each level; more bins can give more accurate splits but take more time, and the number of bins must be at least the largest number of categories of any categorical feature.
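
For reference, here is the same call as in the earlier snippet, annotated against the list above (a sketch for readability only):

val model = DecisionTree.trainClassifier(
  trainData,        // 1. training dataset (RDD[LabeledPoint])
  7,                // 2. number of target categories
  Map[Int, Int](),  // 3. categorical feature info; an empty Map treats all features as numeric
  "gini",           // 4. impurity measure: "gini" or "entropy"
  4,                // 5. maximum tree depth
  100               // 6. maximum number of bins
)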

Now let's look at the confusion matrix from the metrics object:

System.out.println(metrics.confusionMatrix.toString())

The result is:

14411.0  6564.0   17.0    1.0    0.0   0.0  317.0
5444.0   22158.0  449.0   22.0   4.0   0.0  43.0
0.0      415.0    3022.0  95.0   0.0   0.0  0.0
0.0      0.0      159.0   112.0  0.0   0.0  0.0
0.0      895.0    34.0    0.0    14.0  0.0  0.0
0.0      422.0    1228.0  108.0  0.0   0.0  0.0
1112.0   28.0     0.0     0.0    0.0   0.0  903.0

Because there are 7 target categories, the confusion matrix is a 7×7 matrix: each row corresponds to an actual target value, and each column corresponds to a predicted target value.
The entry at row i, column j is the number of times a sample whose true category is i was predicted as j, so the elements on the diagonal are the counts of correct predictions, and all other entries are counts of incorrect predictions.
In addition, the accuracy of the model can be represented by a number:

System.out.println(metrics.precision)

The result is 0.6996041965535718, and the accuracy rate is about 69%.

Furthermore, using the concepts of precision and recall, you can view these metrics separately for each target category:

(0 until 7).map(target => (metrics.precision(target), metrics.recall(target))).foreach(println)

The following results are output:

(0.6845289541918755,0.6708390193402664)
(0.7237410535817912,0.7904577464788732)
(0.6385618166526493,0.8483240223463687)
(0.5800865800865801,0.44966442953020136)
(0.0,0.0)
(0.7283950617283951,0.03460410557184751)
(0.6814580031695721,0.4405737704918033)

As you can see, the precision of each category is different, and strangely the precision of the 5th category is 0, so this model is not what we want.
What we want is a model with genuinely high accuracy, so how do we determine the parameters used to train the model?
We can try different combinations of parameters when training the model, record the accuracy produced by each combination, and then select the best model to use:

/**
 * Find the best parameter combination on the training dataset
 *
 * @param trainData the training dataset
 * @param cvData    the cross-validation dataset
 */
def getBestParam(trainData: RDD[LabeledPoint], cvData: RDD[LabeledPoint]): Unit = {
  val evaluations =
    for (impurity <- Array("gini", "entropy");
         depth <- Array(1, 20);
         bins <- Array(10, 300))
      yield {
        val model = DecisionTree.trainClassifier(trainData, 7, Map[Int, Int](), impurity, depth, bins)
        val metrics = getMetrics(model, cvData)
        ((impurity, depth, bins), metrics.precision)
      }
  evaluations.sortBy(_._2).reverse.foreach(println)
}

The result of the execution is:

((entropy,20,300),0.9123611325743918)
((gini,20,300),0.9062140490049623)
((entropy,20,10),0.8948814368378578)
((gini,20,10),0.8902625388485379)
((gini,1,300),0.6352272532151995)
((gini,1,10),0.6349525232232697)
((entropy,1,300),0.4855337488624461)
((entropy,1,10),0.4855337488624461)

We can see that the best combination is the first one, with an accuracy of about 91%. Training the model with these best parameters, we can then look at the precision and recall of each category:

(0.899412455934195,0.9029350698376746)
(0.9193229901269393,0.9203289914928166)
(0.9222857142857143,0.9238694905552376)
(0.8263888888888888,0.8623188405797102)
(0.8294663573085846,0.7663451232583065)
(0.8596802841918295,0.8491228070175438)
(0.9454369869628199,0.9275225011842728)

It looks a lot better now than it was before.

So far we have not used the 10% test dataset. If the 10% cross-validation set is used to determine the best parameters for the model trained on the training set,
then the role of the test set is to give an unbiased evaluation of that best parameter combination: combine the training set with the CV set, train with the chosen parameters, and then evaluate the result on the test set using the same steps as above to confirm that the parameter combination is appropriate.
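
A minimal sketch of that final check, reusing getMetrics from above and the best parameter combination found on the CV set:

// Sketch: retrain with the best parameters on training + cross-validation data,
// then evaluate once on the held-out test set.
val bestModel = DecisionTree.trainClassifier(
  trainData.union(cvData), 7, Map[Int, Int](), "entropy", 20, 300)
val testMetrics = getMetrics(bestModel, testData)
println(testMetrics.precision)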

Random Forest

The use of random forests in spark mllib is similar to decision trees:

import org.apache.spark.mllib.tree.RandomForest

val forest = RandomForest.trainClassifier(
  trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)

Unlike the decision tree example above, an empty Map is no longer passed here.
Map(10 -> 4, 11 -> 40) indicates that the feature at index 10 of the feature vector is categorical with 4 categories, and the feature at index 11 is categorical with 40 categories.
With this, the random forest no longer makes "size" assumptions about the two features at indices 10 and 11, as it would for numeric features.
This is much closer to reality, so the quality of the resulting model should be better.

There are also two additional parameters for the random forest:

1. The number of decision trees in the forest; 20 in the example.
2. How the decision rules at each level are chosen (the feature subset strategy); "auto" in the example.

Evaluating the random forest model with the same process used for the decision tree gives accuracy metrics for measuring the quality of the model.

At this point, the models for the decision tree and random forest algorithms have been built and can be used to make predictions on new data.
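
For example, a minimal sketch of predicting the vegetation type for one plot of land (the feature vector here is simply taken from the test set to illustrate the call; any vector with the same 54 features would do):

// Sketch: predicting with the trained random forest model.
val someLand = testData.first().features            // a 54-feature vector
val predictedClass = forest.predict(someLand) + 1   // add 1 back, since labels were shifted to start at 0
println(predictedClass)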

GitHub Source Address

@ Little Black

