R Language ︱ Decision Tree Family -- Random Forest Algorithm




Every time I think I have climbed high enough to see the mountains as small, I somehow find myself back at the starting point. Experts, please slow your pace and share some thoughts on my notes~

———————————————————————————

The author's note: the article "Should I choose deep learning, random forest, or support vector machine for supervised learning?" (by Sebastian Raschka) mentions that in daily machine learning work or study, when we encounter a supervised learning problem, we should first consider a simple hypothesis space (a simple model family), such as a linear model or logistic regression. If the result is not good, that is, it does not reach your expectations or the evaluation benchmark, then move on to a more complex model.


——————————————————————————————————————————————


First, introduction to random forest theory


1.1 Advantages and disadvantages


Advantages:

(1) No need to worry much about overfitting;

(2) Can handle datasets with a large number of features of unknown importance;

(3) Can estimate which features are more important for the classification;

(4) Has very good resistance to noise;

(5) The algorithm is easy to understand;

(6) Can be processed in parallel.

Disadvantages:

(1) Classification on small or low-dimensional datasets may not give very good results.

(2) Execution is faster than boosting, but much slower than a single decision tree.

(3) There may be many very similar trees, which can drown out some of the correct decisions.
1.2 Construction steps


1. From the original training set, use the bootstrap method to randomly draw K new bootstrap sample sets and construct K classification or regression trees from them; the samples not drawn in each round form the K out-of-bag (OOB) datasets.

2. If there are n features, randomly select mtry features at each node of each tree, compute the amount of information contained in each feature, and choose the feature with the strongest classification ability to split the node.

3. Grow each tree to its maximum extent, without any pruning.

4. The generated trees form the random forest, which is used to classify new data; the classification result is decided by the majority vote of the tree classifiers (a small sketch follows this list).
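Below is a minimal, illustrative sketch of these steps in R, using rpart as the base learner. It is not the internal implementation of the randomForest package, and for brevity it samples the feature subset once per tree rather than at every node, which differs from the real algorithm; the values of K, mtry and the use of the iris data are assumptions made only for the example.

library(rpart)

set.seed(1)
K    <- 25                 # number of trees (illustrative)
n    <- nrow(iris)
mtry <- 2                  # features per tree (the real algorithm re-samples at every node)

forest <- lapply(seq_len(K), function(k) {
  boot_idx <- sample(n, n, replace = TRUE)          # step 1: bootstrap the rows
  oob_idx  <- setdiff(seq_len(n), boot_idx)         # rows left out = out-of-bag
  feats    <- sample(names(iris)[1:4], mtry)        # step 2: random feature subset
  fit <- rpart(Species ~ ., data = iris[boot_idx, c(feats, "Species")],
               control = rpart.control(cp = 0, minsplit = 2))   # step 3: grow fully, no pruning
  list(fit = fit, oob = oob_idx)
})

# step 4: classify new data by majority vote over the trees
predict_forest <- function(forest, newdata) {
  votes <- sapply(forest, function(tree)
    as.character(predict(tree$fit, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}
predict_forest(forest, head(iris))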


1.3 Comparison of random forest and SVM


(1) There is no need to tune many parameters, because a random forest essentially only needs the number of trees to be set, and more trees is generally better, whereas other machine learning algorithms such as SVM have many parameters that need tuning, such as the most suitable kernel function and the regularization penalty (see the sketch after this list).

(2) Classification is simpler and more direct. Both random forests and support vector machines are nonparametric models (their complexity grows with the number of training samples). Compared with a general linear model, training a nonparametric model is more expensive in terms of computation: the more classification trees, the longer it takes to build the random forest model. Similarly, a trained support vector machine has many support vectors; in the worst case every instance of the training set becomes a support vector. Although multi-class support vector machines exist, the traditional implementation for multi-class problems is generally one-vs-all (that is, binary classification applied to multiple classes: to separate K classes, one class is treated as positive and the rest as negative), so we still need to train one support vector machine per class. In contrast, decision trees and random forests handle multi-class problems without any trouble.

(3) Easy to get started with in practice. Training a random forest model is simpler, and you can easily obtain a good and robust model. The complexity of a random forest model grows with the number of training samples and trees. Support vector machines require some tuning work, and in addition their computational cost grows linearly with the number of classes.

(4) On small data SVM does well, while random forests need more data. In my experience, support vector machines have the advantage on small datasets with few extreme values; random forests require more data but generally yield a very good and robust model.
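As a quick illustration of the difference in tuning effort mentioned in (1), here is a hedged sketch using the randomForest and e1071 packages; the package choice and all parameter values are assumptions made for the example (note that e1071's svm actually handles multi-class problems internally via one-against-one voting rather than one-vs-all).

library(randomForest)
library(e1071)          # assumed here for the SVM side of the comparison

# random forest: essentially only the number of trees (and optionally mtry) to set
rf_fit  <- randomForest(Species ~ ., data = iris, ntree = 500)

# SVM: kernel, cost and gamma typically all need tuning
svm_fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1, gamma = 0.25)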



1.5 Comparison of random forest and deep learning


Deep learning needs larger models (and usually more data) than random forests to fit a problem, and deep learning algorithms are usually more time-consuming to train; setting up a neural network model is also more tedious than using a ready-made classifier such as a random forest or support vector machine.

But there is no denying that deep learning has the advantage on more complex problems, such as image classification, natural language processing and speech recognition.

Another advantage is that you do not need to focus as much on feature engineering. In practice, which classifier to choose depends on the amount of data you have, the general complexity of the problem, and the performance you require; this is the kind of judgment you gradually gain with experience as a machine learning practitioner.

For details, see the paper "An Empirical Comparison of Supervised Learning Algorithms".

1.6 The difference between a random forest and a decision tree


The random forest overcomes the shortcoming that a single decision tree easily overfits, and the model's accuracy and stability improve significantly.


Decision tree + bagging = random forest



1.7 Why a random forest does not overfit


When building each decision tree, two points deserve attention: sampling and complete splitting. First come two random sampling processes: the random forest samples both the rows and the columns of the input data. Row sampling is done with replacement, so the sampled set may contain duplicate samples.

If there are N input samples, then N samples are drawn with replacement. As a result, the training input of each tree is not the full set of samples, which makes overfitting relatively unlikely.
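A small illustration of this row sampling (the value of N is only for the example): when N rows are drawn with replacement from N rows, roughly 63% of the distinct rows end up in the bootstrap set and the remaining ~37% are out-of-bag.

set.seed(123)
N    <- 150
boot <- sample(N, N, replace = TRUE)     # draw N rows with replacement

length(unique(boot)) / N                 # fraction of distinct rows drawn, ~0.63
mean(!(seq_len(N) %in% boot))            # fraction of out-of-bag rows, ~0.37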


Next comes column sampling: from the M features, m are selected (m << M). A decision tree is then built on the sampled data using complete splitting, so that every leaf node of the tree either cannot be split any further or contains samples that all belong to the same class. Many decision tree algorithms include an important pruning step, but it is not done here: the two random sampling processes above already guarantee enough randomness, so even without pruning, overfitting does not appear. Every single tree in a random forest is weak, but the combination is very powerful.


The random forest algorithm can be likened to the following: each decision tree is an expert in a narrow field (since we select m out of M features for each tree to learn from), so the random forest contains many experts proficient in different fields. A new problem (new input data) can be looked at from these different angles, and the experts vote to produce the final result.

1.8 The difference between random forest and gradient boosting tree (GBDT)


Random forest: decision tree + bagging = random forest

Gradient boosting tree: decision tree + boosting = GBDT


The difference between the two is the difference between bagging and boosting, as summarized below:

 

Sampling method: bagging uses uniform sampling (with replacement); boosting samples according to the error rate.

Accuracy: bagging is comparatively lower; boosting is higher.

Training set selection: in bagging the selection is random and the training sets of each round are mutually independent; in boosting the training set of each round depends on the learning results of the previous rounds.

Prediction function weights: in bagging all prediction functions carry equal weight; boosting assigns weights to the prediction functions.

Generation order: bagging builds the functions in parallel; boosting generates them sequentially.

Application: for extremely time-consuming algorithms such as neural networks, bagging can save a lot of time through parallel training; an improved AdaBoost method based on the boosting idea performs well in message filtering and text categorization.

Both bagging and boosting can effectively improve classification accuracy, although in some models they may cause degradation (overfitting).

Representative models: random forest (bagging) and gradient boosting tree (boosting).
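To make the contrast concrete, here is a hedged sketch that fits a bagging-based model (randomForest) and a boosting-based model (gbm) on the same data; the mtcars dataset and all parameter values are chosen only for illustration.

library(randomForest)   # bagging-based
library(gbm)            # boosting-based

# bagging: trees are built independently, in parallel, on bootstrap samples
rf_fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500)

# boosting: trees are built sequentially, each one correcting the previous ones
gb_fit <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
              n.trees = 500, interaction.depth = 2, shrinkage = 0.05)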



——————————————————————————————————————————————


Second, random forest importance metrics -- importance score and Gini index


(1) Importance score


The importance score is defined as the average decrease in classification accuracy on the out-of-bag data after the values of a predictor variable are permuted, compared with the accuracy before the permutation.

(1) For each decision tree, use its out-of-bag data for prediction and record the out-of-bag prediction error of each tree: vote1, vote2, ..., voteB;

(2) Randomly permute the values of each predictor variable to form new out-of-bag data, then validate on the permuted out-of-bag data; the error for each tree is: vote11, vote12, ..., vote1B.

(3) For a given predictor variable, its importance is computed as the mean of the differences between the permuted prediction errors and the original errors.

R Language Code:

rf <- randomForest(Species ~ ., data = a, ntree = 100, proximity = TRUE, importance = TRUE)


(2) Gini index


The Gini index indicates the impurity of a node: the higher the Gini index, the lower the purity. The mean decrease in Gini indicates the average reduction in node impurity obtained when the variable is used to split nodes, averaged over all trees, and this mean decrease is used as the measure of the variable's importance.


$\mathrm{Gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$
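As a tiny worked example of this formula, the Gini impurity of a node can be computed in R from its class proportions (the proportions below are made up for illustration):

# Gini impurity of a node from its class proportions p_1, ..., p_n
gini <- function(p) 1 - sum(p^2)

gini(c(1, 0, 0))          # pure node -> 0
gini(c(1/3, 1/3, 1/3))    # evenly mixed node with 3 classes -> ~0.667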


(3) Importance plotting function -- the varImpPlot(rf) function
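A short sketch of how these importance measures are typically extracted and plotted with the randomForest package (the iris model here is only an assumed example):

library(randomForest)

rf <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)

importance(rf, type = 1)   # type = 1: mean decrease in accuracy (permutation importance)
importance(rf, type = 2)   # type = 2: mean decrease in Gini
varImpPlot(rf)             # plots both measures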




——————————————————————————————————————————————



Third, random forest model practice in R


3.1 Points to note about the random forest model


The difference between classification tasks and regression prediction tasks in the model:

The random forest model distinguishes classification from regression prediction by the type of the dependent variable: if the dependent variable is a factor, a classification task is performed; if the dependent variable is a continuous variable, a regression prediction task is performed (see the sketch below).
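A minimal sketch of this behaviour, assuming the built-in iris data:

library(randomForest)

# dependent variable is a factor -> classification
class_rf <- randomForest(Species ~ ., data = iris)
class_rf$type        # "classification"

# dependent variable is numeric -> regression
reg_rf <- randomForest(Sepal.Length ~ ., data = iris)
reg_rf$type          # "regression"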


Requirements for the data structure in the model:

The randomForest function requires the original data to be organized as a data frame or matrix in which each word is a named column (variable). In text mining this means the word frequencies (long-format data) have to be converted into variables (wide-format data), which can be done with the dcast function from the reshape2 or data.table packages (see the sketch below). For a concrete case, see the blog: R language ︱ supervised sentiment analysis notes, section 4.1.
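A hedged sketch of that long-to-wide conversion, using a made-up term-frequency table (the column names doc, word and freq are assumptions for the example):

library(data.table)   # dcast is also available in reshape2

# hypothetical long-format term frequencies: one row per (document, word) pair
tf_long <- data.table(doc  = c(1, 1, 2, 2, 2),
                      word = c("good", "bad", "good", "nice", "bad"),
                      freq = c(2, 1, 1, 3, 2))

# wide format: one row per document, one column per word, 0 where a word is absent
tf_wide <- dcast(tf_long, doc ~ word, value.var = "freq", fill = 0)
tf_wide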

The two parameters of a random forest (see the sketch after this list):

Number of candidate features k (mtry):
the larger k is, the stronger each individual tree, but also the higher the correlation between the trees.
Number of decision trees m (ntree):
the larger m is, the better the model generally performs, but the larger the amount of computation becomes.
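A brief sketch of exploring these two parameters on iris with the randomForest package (all values here are illustrative): tuneRF searches over mtry and reports the out-of-bag error for each value, and plotting a fitted forest shows the error as a function of the number of trees.

library(randomForest)

set.seed(42)
# search over the number of candidate features (mtry), reporting the OOB error
tuneRF(iris[, -5], iris$Species, ntreeTry = 200, stepFactor = 1.5, improve = 0.01)

# effect of the number of trees: OOB error as a function of ntree
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
plot(rf)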


Decision-tree-related packages in R:
Single decision tree: rpart / tree / C50
Random forest: randomForest / ranger
Gradient boosting tree: gbm / xgboost
Tree visualization: rpart.plot


3.2 Model Fitting


This article takes the iris dataset built into R as an example, fitting the model with Species as the dependent variable and the other variables as independent variables; since Species is already a factor, no type conversion is necessary.


> data <- iris
> library(randomForest)
> system.time(randomModel <- randomForest(Species ~ ., data = data, importance = TRUE, proximity = FALSE, ntree = 100))
   user  system elapsed 
      0       0       0 
> print(randomModel)

Call:
 randomForest(formula = Species ~ ., data = data, importance = TRUE, proximity = FALSE, ntree = 100) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          2        48        0.04
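A quick usage sketch with the fitted model above (predicting back on iris is only for illustration; in practice you would predict on held-out data):

pred <- predict(randomModel, newdata = iris)
table(pred, iris$Species)                                    # in-sample confusion table

head(predict(randomModel, newdata = iris, type = "prob"))    # per-class probabilities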
  
