Paper 56: Algorithms in machine learning: random forest, a combination of decision tree models (Random Forest)

Source: Internet
Author: User

Our group meets on Friday to discuss an interesting topic: training on images with SVM and random forests, with the goal of establishing the intrinsic connections between image features. This kind of model training deserves careful study, so the introductory material we need to prepare is given below:

[For background on decision trees, see: http://blog.csdn.net/holybin/article/details/22914417]

In machine learning, a random forest is made up of many decision trees; because the trees are built with randomized procedures, the ensemble is called a random forest. The trees are grown independently of one another. When a test sample enters the forest, every tree classifies it, and the class that receives the most votes across all trees is the final result (in the standard algorithm each tree's vote counts equally; a weighted vote is a variant). All trees are trained with the same parameters but on different training sets, and the classifier's error is estimated with the OOB (out-of-bag) approach. In short, a random forest is a classifier containing multiple decision trees whose output class is the mode of the classes output by the individual trees. Random forests can handle attributes with discrete values, as the ID3 algorithm does, as well as attributes with continuous values, as the C4.5 algorithm does. Random forests can also be used for unsupervised clustering and for anomaly detection.
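The voting step described above can be sketched in a few lines. This is only an illustration, assuming each tree has already produced one label for the sample and that every vote counts equally; the name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees.

    Assumption: every tree has equal weight. Ties go to the label
    that appears first in the list.
    """
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]

# One label per tree for a single test sample.
votes = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(votes))  # cat
```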

Building a random forest

Building the forest takes two basic steps: random sampling and complete splitting.

(1) Random sampling

The first step consists of two random sampling processes: the random forest samples the input data by rows and by columns.

Row sampling is done with replacement, so the sampled set may contain duplicate samples. If the input has N samples, N samples are also drawn. These N sampled rows form the training set at the root node of one decision tree. Because the input of each tree is not the full set of distinct samples, over-fitting is less likely.
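A minimal sketch of the row sampling, assuming samples are addressed by index; the function names are hypothetical. Rows that are never drawn for a tree are that tree's out-of-bag (OOB) rows. For large N, roughly 63% of the distinct rows end up in each bag, since the chance that a given row is never drawn approaches 1/e.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, in_bag):
    """Rows never drawn for this tree; they can serve as its validation rows."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]

rng = random.Random(0)          # fixed seed only to make the run repeatable
in_bag = bootstrap_indices(10, rng)
oob = out_of_bag(10, in_bag)
print("in bag:", sorted(in_bag))
print("out of bag:", oob)
```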

For column sampling, m features are selected from the M features (m << M): when each sample has M attributes, then each time a node of the decision tree needs to be split, m attributes are drawn at random from the M attributes, with m << M.
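The column sampling can be sketched as follows, assuming features are addressed by index 0..M-1 and drawn without replacement at each node; `sample_features` is a hypothetical helper.

```python
import random

def sample_features(M, m, rng):
    """Pick m distinct feature indices out of M, without replacement."""
    if not 0 < m <= M:
        raise ValueError("need 0 < m <= M")
    return rng.sample(range(M), m)

rng = random.Random(0)
# A fresh subset is drawn every time a node is about to be split.
for node in range(3):
    print("node", node, "candidate features:", sample_features(100, 10, rng))
```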

(2) Complete splitting

A decision tree is then grown on the sampled data by complete splitting, so that every leaf node either cannot be split any further or contains only samples of a single category. To split a node, one attribute is chosen as the split attribute from the m attributes produced by the column sampling above, using some criterion (for example, information gain).
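As an illustration of the information gain criterion, the sketch below computes it for a categorical attribute: the entropy of the labels minus the size-weighted entropy of the labels within each attribute value. The function names are hypothetical, and other criteria (such as Gini impurity) could be substituted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of labels minus the weighted entropy after grouping by value."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = [0, 0, 1, 1]
print(information_gain(["a", "a", "b", "b"], labels))  # attribute separates the classes
print(information_gain(["a", "b", "a", "b"], labels))  # attribute carries no information
```

The attribute with the larger gain would be chosen as the split attribute for the node.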

Every node is split in this way during tree construction until no further split is possible (if the attribute selected for a node is the one just used to split its parent, the node is treated as a leaf and is not split further).

We use LearnUnprunedTree(X, Y) to denote the procedure that generates an unpruned decision tree, abbreviated LUT(X, Y) below:

--------------------------------------------------------------------------------------------------------------- ------------------------------------
LearnUnprunedTree(X, Y)
Input:
    X is an R x M matrix; Xij is the j-th feature of the i-th sample.
    Y is an R x 1 vector; Yi is the category label of the i-th sample.
Output:
    An unpruned tree
If all rows of X are identical, or all labels in Y are the same, or R < 2:
    Generate a leaf node whose category is the most common category in Y.
Otherwise:
    Select m features at random from the M features.
    Among these m features, let p be the one with the maximum information gain.
    If the values of feature p are categorical (e.g. gender: "male", "female"):
        For each value v of p:
            Let Xv be the samples whose feature p has value v, and Yv their categories.
            Childv = LUT(Xv, Yv)
        Return a tree node that splits on feature p, with one child per distinct value of p; the v-th child is Childv = LUT(Xv, Yv).
    If the values of feature p are continuous (e.g. temperature, length):
        Let t be the best split threshold.
        Let Xlo be the samples with feature p < t, and Ylo their categories.
        Childlo = LUT(Xlo, Ylo)
        Let Xhi be the samples with feature p >= t, and Yhi their categories.
        Childhi = LUT(Xhi, Yhi)
        Return a tree node that splits on feature p with 2 children, Childlo = LUT(Xlo, Ylo) and Childhi = LUT(Xhi, Yhi).

--------------------------------------------------------------------------------------------------------------- ------------------------------------
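The LearnUnprunedTree procedure can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: it assumes all features are numeric (the categorical branch is omitted), represents leaves as plain labels and internal nodes as dicts, and stops early when none of the sampled features yields a positive information gain. All names are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(col, y):
    """Return (gain, t) for the best binary split col < t versus col >= t."""
    best = (0.0, None)
    base = entropy(y)
    for t in sorted(set(col))[1:]:   # skip the minimum: the "< t" side would be empty
        lo = [yi for v, yi in zip(col, y) if v < t]
        hi = [yi for v, yi in zip(col, y) if v >= t]
        gain = base - (len(lo) * entropy(lo) + len(hi) * entropy(hi)) / len(y)
        if gain > best[0]:
            best = (gain, t)
    return best

def learn_unpruned_tree(X, y, m, rng):
    """Grow an unpruned tree on numeric features; leaves are plain labels."""
    majority = Counter(y).most_common(1)[0][0]
    if len(y) < 2 or len(set(y)) == 1 or all(row == X[0] for row in X):
        return majority
    features = rng.sample(range(len(X[0])), m)   # column sampling at this node
    gain, p, t = 0.0, None, None
    for j in features:
        g, tj = best_threshold([row[j] for row in X], y)
        if g > gain:
            gain, p, t = g, j, tj
    if p is None:        # simplification: no sampled feature separates the labels
        return majority
    lo = [i for i, row in enumerate(X) if row[p] < t]
    hi = [i for i, row in enumerate(X) if row[p] >= t]
    return {
        "feature": p,
        "threshold": t,
        "lo": learn_unpruned_tree([X[i] for i in lo], [y[i] for i in lo], m, rng),
        "hi": learn_unpruned_tree([X[i] for i in hi], [y[i] for i in hi], m, rng),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, dict):
        tree = tree["lo"] if row[tree["feature"]] < tree["threshold"] else tree["hi"]
    return tree

X = [[1.0, 5.0], [2.0, 5.0], [8.0, 5.0], [9.0, 5.0]]
y = ["a", "a", "b", "b"]
tree = learn_unpruned_tree(X, y, m=2, rng=random.Random(0))
print(tree)
print(predict(tree, [3.0, 5.0]))
```

In a forest, this function would be called once per tree on a bootstrap sample of the rows, with m smaller than the number of features.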
First, the procedure above grows the tree without pruning. Many decision tree algorithms include a pruning step to avoid over-fitting. In a random forest, however, the two random sampling processes already inject randomness, so over-fitting is unlikely even without pruning; this is one of the advantages of random forests.

Second, each decision tree produced this way has very limited classification ability on its own (choosing only m of the M features limits what each tree can learn), but the ability is greatly strengthened once the trees are combined into a forest. This resembles the idea in AdaBoost of combining weak classifiers into a strong one. Note that AdaBoost combines its classifiers by a weighted vote, whereas a standard random forest uses an unweighted majority vote.

Finally, a random forest has 2 parameters that must be set manually. One is the number of trees in the forest, which is generally recommended to be large. The other is the size of m, for which the recommended value is the square root of M.
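The square-root rule for m can be written as a one-line helper. The sketch below assumes the result is rounded down and never falls below 1; the text does not say how to round, so that choice and the name `default_m` are assumptions.

```python
import math

def default_m(M):
    """Number of features to try per split: floor(sqrt(M)), at least 1."""
    return max(1, math.isqrt(M))

for M in (1, 10, 100, 1000):
    print(M, "features ->", default_m(M), "candidates per split")
```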

Advantages of random forests
Summarized as follows:
(1) Well suited to multi-class classification problems; training and prediction are fast, and performance on data sets is good;
(2) Tolerant of faults in the training data: it is an effective method for estimating missing data, maintains accuracy even when a large proportion of the data is missing, and handles large data sets effectively;
(3) Able to handle very high-dimensional data without feature selection, i.e. it can process thousands of input variables without deleting any;
(4) Produces an internal unbiased estimate of the generalization error as the forest is built;
(5) Can detect interactions between features and estimate feature importance during training;
(6) Not prone to over-fitting;
(7) Simple to implement and easy to parallelize.
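The internal error estimate in item (4) is the out-of-bag (OOB) estimate: each sample is classified only by the trees whose bootstrap sample did not contain it. A minimal sketch, assuming each tree is a callable that maps a row to a label and that the in-bag row indices of every tree were recorded during training; the name `oob_error` and the stub trees are hypothetical.

```python
from collections import Counter

def oob_error(trees, in_bag_sets, X, y):
    """Fraction of samples misclassified by the vote of trees that never saw them."""
    wrong = counted = 0
    for i, (row, label) in enumerate(zip(X, y)):
        votes = [tree(row) for tree, bag in zip(trees, in_bag_sets) if i not in bag]
        if not votes:        # sample was in every bag: no OOB prediction available
            continue
        counted += 1
        if Counter(votes).most_common(1)[0][0] != label:
            wrong += 1
    return wrong / counted if counted else float("nan")

# Stub "trees" standing in for trained decision trees.
trees = [lambda r: "a" if r[0] < 5 else "b", lambda r: "b"]
in_bag_sets = [{0, 1}, {2, 3}]          # row indices each tree was trained on
X = [[1], [2], [8], [9]]
y = ["a", "a", "b", "b"]
print(oob_error(trees, in_bag_sets, X, y))
```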

Code and resources

1. The original author, Leo Breiman: http://stat-www.berkeley.edu/users/breiman/randomforests/

The page has his paper and his overview of the paper.

2. Mahout: http://mahout.apache.org/

3. Andy Liaw and Matthew Wiener's R code: http://cran.r-project.org/web/packages/randomForest/

Google Code version: code.google.com/p/randomforest-matlab/

4. An example of mixed MATLAB and Fortran programming that I found online earlier; I compiled it and it runs fine in MATLAB: pan.baidu.com/share/link?shareid=1552171579&uk= 2383340416

5. OpenCV also has a random forest implementation: http://blog.csdn.net/holybin/article/details/25708347
