Scikit-learn and the Regression Tree

Principle of the Regression Algorithm

The CART (Classification and Regression Tree) algorithm is one of the most mature decision tree algorithms, and its range of application is very wide: it can be used for both classification and regression.
Prediction theory is generally based on regression, and CART is a decision tree method for implementing a regression algorithm; it has characteristics that many other global regression algorithms lack.
When a regression model is created, the samples are divided into observation values and output values, both of which are continuous. Unlike a classification function, there are no class labels; a predictive model reflecting the trend of the curve is built purely from the features of the dataset. In this case, the optimal partitioning rule of the original classification tree no longer works. For prediction, CART uses the minimum residual variance (squared residuals minimization) to determine the optimal partition of the regression tree; this criterion seeks the smallest error variance between the subtree and the sample points after partitioning. The decision tree divides the dataset into many data sub-models and then applies linear regression to model each of them. If a data subset is still difficult to fit after a split, it is split again. With this method, each leaf node of the prediction tree is a linear regression model. These linear regression models reflect the patterns contained in the sample set (observation set), and such a tree is also called a model tree. CART therefore supports not only global prediction but also the prediction of local patterns, as well as the ability to find patterns in the whole or to combine local patterns into a whole. This combination of the whole and its patterns is of great value for predictive analysis, so the CART decision tree algorithm is very widely applied in prediction.
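As a quick illustration of this behavior, here is a minimal sketch using scikit-learn's DecisionTreeRegressor (the data and parameters below are assumed for illustration, not taken from the article):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 100)[:, np.newaxis]               # continuous observations
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)  # continuous outputs, no class labels

# Each split is chosen by squared-error (residual variance) minimization;
# with max_depth=3 the tree has at most 2**3 = 8 leaves, so the prediction
# is a piecewise approximation of the underlying sine curve.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.get_n_leaves())  # at most 8
```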
The following is an introduction to the algorithm flow of CART:
(1) Decision tree main function: the main function of the decision tree is recursive. Its job is to grow each branch node of the decision tree according to the CART rules and to terminate the algorithm when the stopping condition is met.
A. Input the dataset and the labels to be modeled.
B. Use the minimum residual variance to determine the optimal partition of the regression tree and create the feature's partition node -- the minimum-residual-variance subfunction.
C. Divide the node's dataset into two parts -- the binary dataset-splitting subfunction.
D. Construct new left and right nodes from the result of the binary split, as the two branches of tree growth.
E. Check whether the recursive termination condition is met.
F. Recursively perform the above steps, taking the datasets and labels assigned to the new nodes as input.
(2) Minimum residual variance subfunction: computes the optimal partition variance, the splitting column, and the splitting value for each column of the dataset.
(3) Binary dataset subfunction: divides the dataset into two parts according to the given splitting column and splitting value, and returns both parts.

Minimum Residual Variance Method
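The recursive flow above, driven by the minimum-residual-variance split criterion, can be sketched as follows. This is a simplified illustration; the names `create_tree`, `choose_best_split`, and `bin_split`, and the stopping thresholds, are hypothetical, not a specific library's API:

```python
import numpy as np


def bin_split(data, feat, value):
    """Binary dataset subfunction: split rows on column `feat` at `value`."""
    left = data[data[:, feat] <= value]
    right = data[data[:, feat] > value]
    return left, right


def choose_best_split(data, min_samples=4, min_var_gain=1e-3):
    """Minimum residual variance subfunction: pick the (feature, value) pair
    that minimizes total residual variance; return (None, leaf_value) when
    the termination condition is met."""
    y = data[:, -1]
    if len(np.unique(y)) == 1 or len(data) < min_samples:
        return None, y.mean()
    base_var = y.var() * len(y)  # total squared deviation before splitting
    best_var, best_feat, best_val = np.inf, None, y.mean()
    for feat in range(data.shape[1] - 1):        # every feature column...
        for value in np.unique(data[:, feat]):   # ...and every sample point
            left, right = bin_split(data, feat, value)
            if len(left) == 0 or len(right) == 0:
                continue
            cur_var = left[:, -1].var() * len(left) + right[:, -1].var() * len(right)
            if cur_var < best_var:
                best_var, best_feat, best_val = cur_var, feat, value
    if base_var - best_var < min_var_gain:       # too little improvement
        return None, y.mean()
    return best_feat, best_val


def create_tree(data):
    """Main function: grow branch nodes recursively (steps A-F)."""
    feat, val = choose_best_split(data)
    if feat is None:                             # termination: return a leaf
        return val
    left, right = bin_split(data, feat, val)
    return {"feat": feat, "val": val,
            "left": create_tree(left), "right": create_tree(right)}


demo = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0],
                 [10.0, 5.0], [11.0, 5.0], [12.0, 5.0]])
print(create_tree(demo))  # splits at x <= 3; each leaf predicts its side's mean
```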

In a regression tree the dataset is continuous, and continuous data is processed differently from discrete data: discrete data is partitioned according to each value a feature takes, whereas for a continuous feature the optimal split point must be computed. Computing the linear correlation on a continuous dataset is nevertheless very simple, and the algorithm derives from the least squares method.
In the minimum residual variance method, the mean and the total variance of the data column are obtained first. The total variance can be computed in two ways:
(1) Compute the mean, compute each data point's squared deviation from the mean, and sum over the n points.
(2) Compute the variance var and take var_sum = var * n, where n is the number of points in the dataset.
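A quick check with illustrative values (not from the article) confirms that the two computations agree:

```python
import numpy as np

data = np.array([1.0, 2.0, 4.0, 7.0])
n = len(data)

# Way 1: squared deviation of each point from the mean, summed over n points.
mean = data.mean()
total_1 = np.sum((data - mean) ** 2)

# Way 2: the (population) variance times the number of points.
total_2 = data.var() * n

print(total_1, total_2)  # both equal 21.0: the same total squared deviation
```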
The selection process for each best branching feature is therefore as follows:
(1) First let the best variance be infinitely large: bestVar = inf.
(2) Iterate over all feature columns and over every sample point of each feature column (a double loop), splitting the dataset in two at each sample point.
(3) Compute the total variance currentVar after the binary split; if currentVar < bestVar, then bestVar = currentVar.
The function returns the optimal branching feature column, the branching feature value (for a continuous feature this is the split point), and the left and right branch datasets to the main program.

Model Tree

Using CART for prediction means setting the leaf nodes to a series of piecewise linear functions; these piecewise linear functions are a simulation of the source-data curve, and a tree whose leaves hold such linear models is called a model tree. The model tree has many excellent properties, including the following characteristics.
In general, the overall repeatability of a sample set is not very high, but local patterns often repeat; that is, history is not a simple repetition, yet it does repeat itself. These local models are therefore more useful for forecasting the future than one overall prediction.
A model gives the range of the data it covers, which may be a time range or a spatial range; it also gives a trend of change, either a curve or a straight line, depending on the regression algorithm used. These factors give the model tree strong explanatory power.
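As a toy illustration of a model tree, the sketch below uses one fixed split and fits a separate linear model (via np.polyfit) in each leaf. The data, the split point, and the function names are assumptions made for this example:

```python
import numpy as np

# Piecewise-linear data: slope 1 left of zero, slope -2 to the right.
x = np.linspace(-4, 4, 200)
y = np.where(x < 0, x + 1, 1 - 2 * x)

split = 0.0                       # assume the split point is already known
left, right = x < split, x >= split

# Each leaf stores a degree-1 (linear) model instead of a constant value.
left_model = np.polyfit(x[left], y[left], 1)     # approx. slope 1, intercept 1
right_model = np.polyfit(x[right], y[right], 1)  # approx. slope -2, intercept 1


def model_tree_predict(v):
    """Route the query to a leaf, then evaluate that leaf's linear model."""
    coeffs = left_model if v < split else right_model
    return np.polyval(coeffs, v)


print(model_tree_predict(-2.0))  # evaluated by the left leaf's linear model
```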
Traditional regression methods, whether linear or nonlinear, do not contain information as rich as the model tree's, so the model tree attains higher prediction accuracy.

Scikit-learn Implementation

#!/usr/bin/python
# Created by Lixin 20161118
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor


def plotfigure(X, X_test, y, yp):
    plt.figure()
    plt.scatter(X, y, c="k", label="data")
    plt.plot(X_test, yp, c="r", label="max_depth=4", linewidth=2)
    plt.xlabel("data")
    plt.ylabel("target")
    plt.title("Decision Tree Regression")
    plt.legend(loc="upper right")
    plt.show()
    # plt.savefig('./res.png', format='png')


x = np.linspace(-5, 5, 200)
siny = np.sin(x)                             # noiseless sine curve
X = x[:, np.newaxis]                         # shape (200, 1), as scikit-learn expects
y = siny + np.random.rand(len(siny)) * 1.5   # add uniform noise

clf = DecisionTreeRegressor(max_depth=4)
clf.fit(X, y)

X_test = np.arange(-5.0, 5.0, 0.05)[:, np.newaxis]
yp = clf.predict(X_test)

plotfigure(X, X_test, y, yp)
