Pattern Recognition Learning Notes (30) -- Random Forest

Source: Internet
Author: User

Introduction

Pattern recognition is a data-driven discipline, so every pattern recognition problem faces the same issue: the randomness of the data. Every method is trained on a specific set of samples, but that set is only a random draw from all possible samples. The real world contains far more cases than anyone can enumerate, and what we collect is at best a small fraction of them. This is why data is said to be the one thing machine learning always lacks: with enough training data, the results can be striking. The results of pattern recognition and machine learning methods are therefore inevitably affected by this randomness, and any classifier we train is to some degree accidental, especially when samples are scarce.

A decision tree is grown greedily, considering only the locally optimal split at each step, so it is affected by this randomness more severely. This is one reason some decision trees generalize so poorly.

To counter this randomness, statisticians proposed early on a strategy called the bootstrap. The basic idea is to resample the existing sample repeatedly, with replacement, to produce multiple sample subsets. The repeated sampling simulates the randomness of the data, and the effect of that randomness is then incorporated into the final output. The bootstrap idea was later carried into pattern recognition, where it inspired a family of methods such as random forest, Bagging, and AdaBoost. This post looks at random forest.
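To make the bootstrap idea concrete, here is a minimal sketch in Python. It resamples a small data set with replacement many times and looks at how much the sample mean varies across resamples. The data values, the function name `bootstrap_means`, and the number of resamples are all made up for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n items with replacement; duplicates are expected.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    return means

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # made-up observations
means = bootstrap_means(data)
print("sample mean:", statistics.fmean(data))
print("spread of the mean across resamples:", statistics.stdev(means))
```

The spread printed at the end is a rough picture of how much the estimate would move if a different random sample had been collected, which is exactly the randomness discussed above.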

Basic ideas

Random forest is built on decision trees. The word "forest" in the name suggests a large number of something growing together, and that something is the decision tree: a random forest builds many decision trees to form a "forest", then makes its decision by letting the trees vote, usually taking the class with the most votes. Like the C4.5 algorithm, random forest can handle both discrete and continuous features, which the ID3 algorithm cannot.
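The voting step by itself is small enough to sketch directly. The snippet below is only an illustration: `majority_vote` is a hypothetical helper name, and the list of tree outputs is invented.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    # most_common(1) returns [(label, count)] for the top label;
    # on a tie, the label seen first in `predictions` wins.
    return Counter(predictions).most_common(1)[0][0]

tree_outputs = ["cat", "dog", "cat", "cat", "dog"]  # hypothetical outputs of 5 trees
print(majority_vote(tree_outputs))  # cat
```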

Specific practices

1) Decide the number of decision trees to build, t, according to actual needs;

2) Apply bootstrap sampling to the sample data to generate multiple sample subsets. Bootstrap sampling is the familiar high-school probability setup of drawing with replacement: draw one sample at random from the n samples, put it back, and repeat n times to obtain n samples. Some samples may be drawn more than once, but that does not matter;

3) Randomly select the features used to build the decision tree: at each node, randomly pick m features from all candidate features as the candidates for splitting at that node, then compare their information gain to choose the feature that best splits the training samples;

4) Using the features selected above, build one decision tree from each resampled sample set, treating it as the training set;

5) Once the planned number of decision trees has been built, let the trees vote on the output; the decision with the most votes is the final decision of the random forest.
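The five steps can be sketched end to end in plain Python. This is a minimal illustration under several assumptions, not a reference implementation: features are numeric and split by a threshold at the midpoint between neighbouring values, information gain is computed from entropy, class labels are not tuples (tuples mark internal nodes), and all function names and the toy data are made up.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y, features):
    """Among `features`, return (gain, feature, threshold) with the highest information gain."""
    base, best = entropy(y), None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint threshold (an assumption of this sketch)
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def build_tree(X, y, m, rng):
    if len(set(y)) == 1:
        return y[0]  # pure node becomes a leaf
    features = rng.sample(range(len(X[0])), m)  # step 3: m random candidate features
    split = best_split(X, y, features)
    if split is None:  # the chosen features are all constant here
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], m, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], m, rng))

def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are (feature, threshold, left, right)
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def train_forest(X, y, n_trees, m, seed=0):
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):  # step 1: n_trees plays the role of t
        idx = [rng.randrange(n) for _ in range(n)]  # step 2: bootstrap sample
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))  # step 4
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_tree(tree, row) for tree in forest)  # step 5: vote
    return votes.most_common(1)[0][0]

# Toy data: class 0 has small values in every feature, class 1 has large ones.
X = [[1.0, 0.5, 1.5, 0.2], [0.3, 1.8, 0.9, 1.1], [1.7, 0.1, 0.4, 1.9], [0.8, 1.2, 1.6, 0.6],
     [9.0, 8.5, 9.5, 8.2], [8.3, 9.8, 8.9, 9.1], [9.7, 8.1, 8.4, 9.9], [8.8, 9.2, 9.6, 8.6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(X, y, n_trees=25, m=2)
print(predict_forest(forest, [1.0, 1.0, 1.0, 1.0]))
print(predict_forest(forest, [9.0, 9.0, 9.0, 9.0]))
```

On data this cleanly separated, almost every tree is a single split, so the example mainly shows how the pieces fit together. A bootstrap sample that happens to contain only one class yields a one-leaf tree, and the vote is what absorbs such accidents.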

Note that the steps above sample in two ways: 1) they sample the training samples, and 2) they sample the features. This lowers the correlation between the trees, which makes the final vote more reliable. The procedure also involves two user-specified parameters: 1) the number of trees, and 2) the number of candidate features per node. In practice both have to be chosen from experience and the problem at hand. Generally, more trees is better, since a vote with more voters is fairer. The number of candidate features should be neither too large nor too small; a common choice is the square root of the total number of candidate features.
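The claim that more trees help can be illustrated with a small calculation. The sketch below assumes an idealized case that real forests only approximate: t trees that err independently, each correct with the same probability p, on a two-class problem. The function name `majority_accuracy` and the value p = 0.6 are chosen only for illustration.

```python
from math import comb

def majority_accuracy(p, t):
    """Probability that a majority of t independent voters is right (t odd)."""
    # Sum the binomial probabilities of more than half the voters being right.
    return sum(comb(t, k) * p**k * (1 - p)**(t - k)
               for k in range(t // 2 + 1, t + 1))

for t in (1, 11, 101):
    print(t, round(majority_accuracy(0.6, t), 3))
```

Under these assumptions the accuracy of the vote rises with t whenever p is above 0.5. Trees trained on overlapping data are correlated, so the real gain is smaller, which is why the two sampling steps matter.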

Pros and Cons analysis

Advantages:

1) Handles both discrete and continuous features;

2) Resistant to overfitting;

3) Reduces the effect of chance in the sample, giving a more stable result;

4) Results can remain reliable even when the sample data distribution is very uneven;

5) Handles high-dimensional data;

6) Fast training and prediction;

7) After training, indicates clearly which features are highly discriminative;

8) Simple and efficient.

Disadvantages:

1) The number of candidate features has to be chosen, which is hard to get right without experience;

2) Although the independence of the trees helps prevent overfitting, this cannot be guaranteed when the data is mixed with noise;

3) Can be misled by sample data whose feature attributes take many candidate values.


