Introduction to Random Forest Algorithms (Python)

Random forest is a very flexible machine learning method with many applications, from marketing to healthcare. It can be used to model customer acquisition and retention in marketing, or to predict a patient's disease risk and susceptibility in medicine.

Random forests can be used for both classification and regression problems, can handle a large number of features, and can help estimate which of the variables used to model the data are important.

This article is about how to build a random forest model using Python.

1 What is a random forest

Random forests can be used for almost any kind of prediction problem (including nonlinear ones). It is a relatively new machine learning strategy (it came out of Bell Labs in the 1990s) that can be used for almost anything. It belongs to the broad class of ensemble learning methods in machine learning.

1.1 Ensemble Learning

Ensemble learning combines multiple models to solve a single prediction problem. It works by generating several classifier models, each of which learns and makes predictions independently. Those predictions are then combined into a single prediction, so the result is as good as or better than that of any individual classifier.

Random forest is a branch of ensemble learning because it relies on an ensemble of decision trees. For more on ensemble learning in Python, see the scikit-learn documentation.
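
As a minimal sketch of the idea (an illustration assuming scikit-learn, not code from the original article), a VotingClassifier combines several independently trained models into one prediction:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three independent models; the ensemble combines their votes into one prediction.
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier()),
    ('nb', GaussianNB()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))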

1.2 Random Decision Tree

We know that random forests aggregate other models, but what kind of models? As the name suggests, a random forest aggregates classification (or regression) trees. A decision tree is composed of a series of decisions that can be used to classify an observation in a dataset.
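
For a concrete picture, here is a small hedged sketch (assuming scikit-learn) that fits a shallow tree and prints the series of decisions it learned:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree and print its sequence of decisions (threshold splits).
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))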

1.3 Random Forest

The random forest algorithm automatically creates groups of random decision trees. Since these trees are generated randomly, most of them (even 99.9%) will not be meaningful for your classification or regression problem.

1.4 Voting

So what is the benefit of generating even tens of thousands of bad models? Well, there isn't one. But what is useful is that a small number of very good decision trees are generated along with them.

When you want to make a prediction, the new observation travels down each decision tree from the top and is assigned a predicted value or label. Once every tree in the forest has reported its prediction, the predictions are tallied, and the vote of all the trees is returned as the final prediction.

In short, the 99.9% of trees that are irrelevant make predictions that are all over the map, and those predictions cancel each other out. The predictions of the minority of good trees rise above that noise and yield a good overall prediction.
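
A hedged sketch of this voting process (assuming scikit-learn, whose RandomForestClassifier exposes its individual trees through the estimators_ attribute):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Send one new observation down every tree and collect each tree's vote.
sample = X[100].reshape(1, -1)
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("per-tree votes:", votes)

# The forest combines the trees' predictions (scikit-learn averages their
# predicted probabilities) into the final answer.
print("forest prediction:", forest.predict(sample))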

2 Why Use It?

Random forests are the Leatherman (multi-tool) of machine learning methods. You can throw almost anything at them and they will do a serviceable job. They do such a good job of estimating inferred mappings that they do not need as much parameter tuning as an SVM (which is great for anyone on a tight deadline).

2.1 An Example Mapping

Random forests can learn a mapping without any deliberate manual transformation of the data. Take the function f(x) = log(x) as an example.

We will use Python to generate the analysis data in Rodeo, Yhat's own interactive environment; Rodeo installers for Mac, Windows, and Linux are available for download.

First, let's generate some data and add noise to it.

import numpy as np
import pylab as pl

# Draw 1,000 points uniformly from [1, 100] and add Gaussian noise to log(x).
x = np.random.uniform(1, 100, 1000)
y = np.log(x) + np.random.normal(0, .3, 1000)

pl.scatter(x, y, s=1, label="log(x) with noise")
pl.plot(np.arange(1, 100), np.log(np.arange(1, 100)), c="b", label="log(x) true function")
pl.xlabel("x")
pl.ylabel("f(x) = log(x)")
pl.legend(loc="best")
pl.title("A Basic Log Function")
pl.show()

The following results are obtained:

If we build a basic linear model that uses x to predict y, we end up with a straight line that, at best, bisects the log(x) curve. If we use a random forest instead, it approximates the log(x) curve much more closely, producing something that looks far more like the true function.

Of course, you could also argue that the random forest fits the log(x) function a little too closely. Either way, this shows that random forests are not limited to linear problems.
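
As a minimal sketch of how such a fit could be produced (assuming scikit-learn's RandomForestRegressor and reusing the data-generation recipe from above):

import numpy as np
import pylab as pl
from sklearn.ensemble import RandomForestRegressor

# Same noisy log(x) data as above.
x = np.random.uniform(1, 100, 1000)
y = np.log(x) + np.random.normal(0, .3, 1000)

# scikit-learn expects a 2-D feature array, hence the reshape.
rf = RandomForestRegressor(n_estimators=100)
rf.fit(x.reshape(-1, 1), y)

grid = np.arange(1, 100).reshape(-1, 1)
pl.scatter(x, y, s=1, label="log(x) with noise")
pl.plot(grid, np.log(grid), c="b", label="log(x) true function")
pl.plot(grid, rf.predict(grid), c="r", label="random forest fit")
pl.legend(loc="best")
pl.title("Random Forest Approximating log(x)")
pl.show()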

3 How to Use It

3.1 Feature Selection

One of the best use cases for random forests is feature selection. A byproduct of trying many variations of decision trees is that you can check which variables perform best or worst in each tree.

When some trees use a variable and others do not, you can compare how much information is gained or lost by including it. Well-implemented random forest tools will do this for you, so all you need to do is know which method or parameter to look at.

In the example below, we try to figure out which variables are most important for distinguishing red wine from white wine.
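
A hedged sketch of feature selection via a forest's feature_importances_ attribute, using scikit-learn's built-in wine dataset as a stand-in (it distinguishes grape cultivars rather than red vs. white wine):

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Stand-in data: scikit-learn's built-in wine dataset.
wine = load_wine()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(wine.data, wine.target)

# feature_importances_ is a byproduct of fitting the forest.
importances = pd.Series(clf.feature_importances_, index=wine.feature_names)
print(importances.sort_values(ascending=False))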

3.2 Classification

Random forests are also very good at classification problems. They can be used to make predictions across multiple possible target classes, and with some calibration they can output class probabilities as well. One thing you should watch out for is overfitting.

Random forests are prone to overfitting, especially when the dataset is relatively small. Be skeptical when your model makes "too good" predictions on the test set. One way to avoid overfitting is to use only relevant features in the model, for example via the feature selection described above.
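
A minimal sketch (assuming scikit-learn) of multi-class prediction with probability output, plus a quick train-versus-test accuracy comparison as an overfitting check:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Hard class predictions and per-class probabilities (averaged over the trees).
print(clf.predict(X_test[:3]))
print(clf.predict_proba(X_test[:3]))

# A large gap between train and test accuracy suggests overfitting.
print(clf.score(X_train, y_train), clf.score(X_test, y_test))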

3.3 Regression

Random forests can also be used for regression problems.

I have found that, unlike some other methods, random forests are very good at handling a mix of categorical and continuous variables.
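
As a hedged illustration with hypothetical toy data, here is a sketch of a regression that mixes a categorical variable with a continuous one; the categorical column is one-hot encoded before fitting:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data mixing one categorical and one continuous variable.
rng = np.random.RandomState(0)
n = 500
df = pd.DataFrame({
    'group': rng.choice(['a', 'b', 'c'], n),  # categorical
    'x': rng.uniform(0, 10, n),               # continuous
})
y = np.log1p(df['x']) + df['group'].map({'a': 0.0, 'b': 2.0, 'c': -1.0})
y = y + rng.normal(0, 0.2, n)

# Trees split on numeric thresholds, so one-hot encode the categorical first.
X = pd.get_dummies(df, columns=['group'])
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(reg.predict(X.iloc[:5]))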

4 A Simple Python Example

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Load the iris data into a DataFrame and randomly flag ~75% of rows for training.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

# Split into train and test sets based on the random flag.
train, test = df[df['is_train'] == True], df[df['is_train'] == False]

# Fit the forest on the four feature columns.
features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

# Cross-tabulate actual vs. predicted species on the held-out rows.
preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])

Here are the results you should see. Because we select the training rows at random, your actual results will differ slightly each time.

5 Conclusion

Random forests are quite easy to get started with. As with any other modeling method, though, you should watch out for overfitting. If you are interested in using random forests in R, check out the randomForest package.

