Regression Prediction Analysis (RANSAC, Polynomial Regression, Residual Plots, Random Forest)


This article uses the Boston house-price dataset to introduce several methods for regression prediction analysis. Through this article you can learn:

1. Visualizing the important features of a dataset
2. Estimating the coefficients of a regression model
3. Using RANSAC to fit a robust regression model
4. How to evaluate a regression model
5. Polynomial regression
6. Decision tree regression
7. Random forest regression

Dataset download address: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data

Data feature description: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names
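The examples below read a prepared Data/train.csv with lowercase column names. If you work directly from the raw UCI file instead, a minimal loading sketch might look like this (column names taken from housing.names; the file is whitespace-separated with no header row):

    import pandas as pd

    # column names as documented in housing.names
    cols = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
    # housing.data is whitespace-separated and has no header row
    data = pd.read_csv("housing.data", header=None, sep=r"\s+", names=cols)
    print(data.shape)  # (506, 14)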


Getting to know the basic information of the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# set the seaborn style
sns.set(style="whitegrid", context="notebook")

if __name__ == "__main__":
    # read the dataset
    data = pd.read_csv("Data/train.csv")
    # print the first 5 rows of the csv
    print(data.head(5))

I. Visualization of data features

Exploratory data analysis (EDA) is an important step before training a machine-learning model. By plotting with Python's third-party libraries Pandas and Seaborn, we can discover anomalies in the data, examine how the data are distributed, and study the correlations between features.

Because of limited screen space, we select four independent variables and the dependent variable for analysis: INDUS (proportion of non-retail business acres per town), NOX (nitric oxide concentration, parts per 10 million), RM (average number of rooms per dwelling), LSTAT (percentage of lower-status population), and MEDV (median home value, in units of 1000 USD).

1. Scatter plots

    # the columns to include in the scatter-plot matrix
    cols = ["lstat", "indus", "nox", "rm", "medv"]
    # draw the scatter-plot matrix with seaborn
    sns.pairplot(data[cols], height=1.5)
    plt.show()


By drawing scatter plots of the features we can inspect the pairwise relationships between variables. The diagonal shows the histogram of each variable, from which we can see its distribution. The histogram of MEDV (house price) shows that prices roughly follow a normal distribution but contain several outliers in the region above 40. The scatter plot of RM (number of rooms) against MEDV (fourth row, fifth column) looks linear, while the other three variables have nonlinear relationships with MEDV.
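We can quickly verify the outlier claim by counting the points above the cut-off (a one-line sketch, assuming the lowercase medv column of our train.csv):

    # count the houses with medv greater than 40 (the outliers visible in the histogram)
    print((data["medv"] > 40).sum())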

Note: training a linear regression model does not require the independent or dependent variables to be normally distributed; the normality assumption only applies to certain statistical and hypothesis tests.

2. Correlation coefficient matrix

Besides scatter plots, the relationships between variables can also be discovered through correlation coefficients. The correlation matrix is commonly built from Pearson correlation coefficients (Pearson product-moment correlation coefficient, Pearson's r), which measure the linear relationship between two features. Pearson's correlation coefficient lies in the range [-1, 1]: r = 1 means two variables are perfectly positively correlated, r = 0 means they are uncorrelated, and r = -1 means they are perfectly negatively correlated. In fact, the correlation matrix is a standardized covariance matrix, as the sketch below illustrates.
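Here is a minimal sketch (with two made-up toy arrays) showing that Pearson's r is just the covariance divided by the product of the standard deviations:

    import numpy as np

    # two toy feature vectors, purely for illustration
    a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    b = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

    # Pearson's r as the standardized covariance
    cov_ab = np.cov(a, b)[0, 1]  # sample covariance
    r_manual = cov_ab / (np.std(a, ddof=1) * np.std(b, ddof=1))

    # the same value straight from the correlation matrix
    r_numpy = np.corrcoef(a, b)[0, 1]
    print(r_manual, r_numpy)  # both are approximately 0.853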

    # compute the correlation-coefficient matrix
    cm = np.corrcoef(data[cols].values.T)
    # set the font scale
    sns.set(font_scale=1.5)
    # draw the correlation heat map
    hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt=".2f",
                     annot_kws={"size": 15}, yticklabels=cols, xticklabels=cols)
    plt.show()


From the correlation matrix we can see that LSTAT has the strongest correlation with MEDV (-0.74), followed by RM. This agrees with what the earlier scatter plots showed.

II. Common linear regression algorithms

We analyze the linear relationship between RM and MEDV.

1. Linear regression

    # get the feature and the target variable
    X = data["rm"]
    Y = data["medv"]
    from sklearn.linear_model import LinearRegression
    # create a linear-regression model object
    linear = LinearRegression()
    X = np.array(X).reshape(-1, 1)
    Y = np.array(Y).reshape(-1, 1)
    # train the model
    linear.fit(X, Y)
    # draw the data points
    plt.scatter(X, Y, c="blue")
    # draw the fitted line
    pred_y = linear.predict(X)
    plt.plot(X, pred_y, c="red")
    plt.show()

A straight line is fitted to the relationship between RM and MEDV. Looking at the point cloud, we can see many outliers on the periphery, and outliers severely affect a linear regression model. The RANSAC algorithm below removes the influence of these outliers.

2. RANSAC: fitting a robust regression model

RANSAC (RANdom SAmple Consensus) is an algorithm that estimates the parameters of a mathematical model from a sample dataset containing abnormal data, using only the valid samples. It fits the regression model on a subset of the data, the so-called inliers.

The workflow of the RANSAC algorithm is as follows (a simplified sketch follows the list):

1. Randomly sample a number of points from the dataset as the inlier set and fit the model to them.

2. Test the remaining data against the model obtained in the previous step, and add the points that fall within a predefined tolerance to the inlier set.

3. Refit the model using the complete inlier set.

4. Estimate the error of the model on the inlier set.

5. Terminate if the model's performance reaches a user-defined threshold or a predetermined number of iterations has been run; otherwise repeat from the first step.
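To make the five steps concrete, here is a minimal, simplified sketch of the loop in plain NumPy and scikit-learn (scoring a candidate by its inlier count is our own simplification, not how sklearn's RANSACRegressor is implemented):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def simple_ransac(X, Y, min_samples=50, residual_threshold=5.0,
                      max_trials=100, seed=0):
        rng = np.random.RandomState(seed)
        best_model, best_inliers, best_score = None, None, -1
        for _ in range(max_trials):
            # step 1: randomly sample a candidate inlier set and fit a model
            idx = rng.choice(len(X), size=min_samples, replace=False)
            model = LinearRegression().fit(X[idx], Y[idx])
            # step 2: points whose residual is within the tolerance become inliers
            residuals = np.abs(Y - model.predict(X)).ravel()
            inliers = residuals < residual_threshold
            # step 3: refit the model on the full inlier set
            model = LinearRegression().fit(X[inliers], Y[inliers])
            # step 4: estimate the model's quality (here simply the inlier count)
            score = inliers.sum()
            if score > best_score:
                best_model, best_inliers, best_score = model, inliers, score
        # step 5: stop after max_trials and keep the best model found
        return best_model, best_inliers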

    from sklearn.linear_model import RANSACRegressor
    """
    max_trials: maximum number of iterations
    min_samples: minimum number of randomly sampled points
    residual_metric: computes the absolute distance between the fitted line and a
                     sample point (this parameter exists in older scikit-learn
                     versions; newer versions use an absolute loss by default)
    residual_threshold: the tolerance; points whose residual is below this value
                        are added to the inlier set
    """
    ransac = RANSACRegressor(LinearRegression(), max_trials=100, min_samples=50,
                             residual_metric=lambda x: np.sum(np.abs(x), axis=1),
                             residual_threshold=5.0, random_state=0)
    ransac.fit(X, Y)
    inlier_mask = ransac.inlier_mask_
    outlier_mask = np.logical_not(inlier_mask)
    line_x = np.arange(3, 10, 1)
    line_y_ransac = ransac.predict(line_x[:, np.newaxis])
    plt.scatter(X[inlier_mask], Y[inlier_mask], c="blue", marker="o", label="inliers")
    plt.scatter(X[outlier_mask], Y[outlier_mask], c="lightgreen", marker="s", label="outliers")
    plt.plot(line_x, line_y_ransac, color="red")
    plt.xlabel("Number of rooms")
    plt.ylabel("House price")
    plt.legend(loc="upper left")
    plt.show()


III. Evaluating the performance of a linear regression model

1. Residual plots

A residual plot shows the differences, or vertical distances, between the true values and the predicted values; the regression model is evaluated through these differences. Residual plots serve as a graphical analysis for evaluating a regression model: they reveal outliers, and let us check whether the model is linear and whether the errors are randomly distributed.

    # get the feature and the target variable
    X = data["rm"]
    Y = data["medv"]
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    X = np.array(X).reshape(-1, 1)
    Y = np.array(Y).reshape(-1, 1)
    # split the dataset into a training set and a test set
    train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)
    # create a linear-regression model object and train it on the training set only
    linear = LinearRegression()
    linear.fit(train_x, train_y)
    train_y_pred = linear.predict(train_x)
    test_y_pred = linear.predict(test_x)
    # residual = predicted value minus true value
    plt.scatter(train_y_pred, train_y_pred - train_y, c="blue", marker="o", label="training data")
    plt.scatter(test_y_pred, test_y_pred - test_y, c="lightgreen", marker="s", label="test data")
    plt.legend(loc="upper left")
    plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color="red")
    plt.xlim([-10, 50])
    plt.xlabel("Predicted values")
    plt.ylabel("Residuals")
    plt.show()


The best model would predict with residuals of exactly 0, which is unlikely to happen in practice. For a good model, however, we expect the errors to be randomly distributed and the residuals to fluctuate around the y = 0 line. Residual plots also reveal outliers: the points that deviate far from y = 0, which we can also flag programmatically, as sketched below.
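A quick sketch for flagging such points (the cut-off of 10 is arbitrary, purely for illustration):

    # flag the test points whose absolute residual exceeds 10
    resid = (test_y_pred - test_y).ravel()
    print(test_x[np.abs(resid) > 10])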

2. Mean squared error (MSE)

Mean squared error (MSE) is the mean of the squared differences between the true values and the predicted values. The formula is as follows:
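MSE = (1/n) * Σ_{i=1..n} (y_i - ŷ_i)^2, where y_i is the true value and ŷ_i the predicted value of the i-th sample, and n is the number of samples.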


    from sklearn.metrics import mean_squared_error
    print("train MSE: %.3f" % mean_squared_error(train_y, train_y_pred))
    # train MSE: 46.037
    print("test MSE: %.3f" % mean_squared_error(test_y, test_y_pred))
    # test MSE: 35.919

Besides the mean squared error, the performance of the model can also be measured with the mean absolute error, as sketched below.
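A minimal sketch using scikit-learn's mean_absolute_error, reusing the predictions from above:

    from sklearn.metrics import mean_absolute_error
    # the mean of the absolute differences between true and predicted values
    print("train MAE: %.3f" % mean_absolute_error(train_y, train_y_pred))
    print("test MAE: %.3f" % mean_absolute_error(test_y, test_y_pred))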

3. Coefficient of determination R^2

In some cases the coefficient of determination R^2 is very useful; it can be viewed as a standardized version of the MSE: R^2 is the fraction of the response variance captured by the model. For the training set, R^2 lies in the range [0, 1]; for the test set, R^2 may be negative. The closer R^2 is to 1, the better the model performs. R^2 is computed as follows:
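R^2 = 1 - SSE/SST = 1 - Σ_{i=1..n} (y_i - ŷ_i)^2 / Σ_{i=1..n} (y_i - μ_y)^2, where SSE is the sum of squared errors, SST the total sum of squares, and μ_y the mean of the observed values.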


    from sklearn.metrics import r2_score
    print("train R^2: %.3f" % r2_score(train_y, train_y_pred))
    # train R^2: 0.473
    print("test R^2: %.3f" % r2_score(test_y, test_y_pred))
    # test R^2: 0.485
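Since R^2 is a standardized MSE, the two metrics can be cross-checked numerically (a quick sketch reusing the variables above; np.var uses the biased variance, which matches the R^2 definition):

    # verify: R^2 = 1 - MSE / Var(y)
    mse = mean_squared_error(test_y, test_y_pred)
    print(1 - mse / np.var(test_y))  # matches r2_score(test_y, test_y_pred)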
IV. Nonlinear regression

1. Polynomial regression

When a straight line cannot fit the data well, polynomial regression can be used: polynomial terms of the feature are added, while the model stays linear in its coefficients. A polynomial model of degree d has the form:
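y = w_0 + w_1 x + w_2 x^2 + ... + w_d x^d

With d = 2 this is the quadratic model used below.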


The scatter plots of the four features against the house price showed earlier that the relationship between LSTAT and the house price is nonlinear. We now fit the relationship between LSTAT and the house price with both linear regression and polynomial regression and compare how R^2 changes.

    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.metrics import r2_score
    # linear regression
    X = np.array(data["lstat"]).reshape(-1, 1)
    Y = np.array(data["medv"]).reshape(-1, 1)
    linear = LinearRegression()
    linear.fit(X, Y)
    linear_y = linear.predict(X)
    print("linear r^2: %.3f" % r2_score(Y, linear_y))
    # linear r^2: 0.546
    # polynomial regression
    qua_linear = LinearRegression()
    # set the maximum degree of x to 2
    quadratic = PolynomialFeatures(degree=2)
    X_quad = quadratic.fit_transform(X)
    qua_linear.fit(X_quad, Y)
    qua_linear_y = qua_linear.predict(X_quad)
    print("polynomial r^2: %.3f" % r2_score(Y, qua_linear_y))
    # polynomial r^2: 0.641
    # plot the data points
    plt.scatter(X, Y, label="data", marker="o", color="blue")
    x_linear = np.arange(np.min(X), np.max(X), 1)[:, np.newaxis]
    # linear regression line
    plt.plot(x_linear, linear.predict(x_linear), label="linear regression",
             color="green", lw=2, linestyle="-")
    # polynomial fit
    plt.plot(x_linear, qua_linear.predict(quadratic.fit_transform(x_linear)),
             label="polynomial regression", color="red", lw=2, linestyle="-")
    plt.xlabel("lstat (% lower-status population)")
    plt.ylabel("medv (house price)")
    plt.legend(loc="upper right")
    plt.show()


2. Feature transformation

Besides polynomial regression, nonlinear relationships can also be handled by transforming the features. For the relationship between LSTAT and MEDV, we can take the logarithm of LSTAT and the square root of MEDV, and then fit a linear regression on the transformed values.

    # feature transformation
    x_log = np.log(X)
    y_sqrt = np.sqrt(Y)
    linear = LinearRegression()
    linear.fit(x_log, y_sqrt)
    print("r^2: %.3f" % r2_score(y_sqrt, linear.predict(x_log)))
    # r^2: 0.692
    plt.scatter(x_log, y_sqrt, label="data", marker="o", color="blue")
    lin_x = np.arange(np.min(x_log), np.max(x_log), 1)[:, np.newaxis]
    plt.plot(lin_x, linear.predict(lin_x), label="linear fit", linestyle="-", color="red")
    plt.show()

The figure above shows that the transformation turned the originally nonlinear relationship into a linear one, and the resulting R^2 is better than that of the polynomial regression.

3. Random forest

A random forest is an ensemble algorithm composed of multiple decision trees. It reduces the variance of the model, so a random forest usually generalizes better than a single decision tree. It is insensitive to outliers in the dataset and does not require extensive parameter tuning.
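The outline at the top also mentions decision tree regression; a single tree is the building block of the forest. Here is a minimal single-tree sketch (assuming X and Y are still the lstat/medv arrays from the polynomial-regression step; max_depth=3 is a hypothetical choice to keep the tree small):

    from sklearn.tree import DecisionTreeRegressor

    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, Y)
    # sort by x so the stepwise prediction plots as a clean line
    sort_idx = X.ravel().argsort()
    plt.scatter(X, Y, c="blue", marker="o", label="data")
    plt.plot(X[sort_idx], tree.predict(X[sort_idx]), color="red", lw=2,
             label="decision tree fit")
    plt.xlabel("lstat (% lower-status population)")
    plt.ylabel("medv (house price)")
    plt.legend(loc="upper right")
    plt.show()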

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # split the dataset into a training set and a test set
    train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.3)
    """
    n_estimators: the number of estimators (trees)
    criterion: the objective to optimize ("mse" here; newer scikit-learn
               versions call it "squared_error")
    """
    forest = RandomForestRegressor(n_estimators=100, criterion="mse", n_jobs=1)
    forest.fit(train_x, train_y.ravel())
    print("train r2: %.3f" % r2_score(train_y, forest.predict(train_x)))
    # train r2: 0.929
    print("test r2: %.3f" % r2_score(test_y, forest.predict(test_x)))
    # test r2: 0.227
From these results we can see that the random forest has overfitted: the training R^2 is high while the test R^2 is low. A more stable way to estimate generalization performance is cross-validation, sketched below.
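A quick sketch with scikit-learn's cross_val_score (assuming X and Y are still the lstat/medv arrays; 5 folds is an arbitrary but common choice):

    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validation gives a more stable R^2 estimate than a single split
    scores = cross_val_score(RandomForestRegressor(n_estimators=100), X, Y.ravel(),
                             scoring="r2", cv=5)
    print("cv r2: %.3f +/- %.3f" % (scores.mean(), scores.std()))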





