Dataset
The Leaning Tower of Pisa is one of Italy's most popular tourist attractions. Over the past several hundred years the tower has been slowly tilting to one side, reaching a tilt angle of about 5.5 degrees, which corresponds to a deviation of nearly 3 meters at the top. The annual dataset pisa.csv records the tilt of the tower from 1975 to 1987, where lean is the angle of deviation. In this task, we will use linear regression to estimate the tilt rate and interpret the model's coefficients and statistics.
```python
import pandas as pd
import matplotlib.pyplot as plt

pisa = pd.DataFrame({"year": range(1975, 1988),
                     "lean": [2.9642, 2.9644, 2.9656, 2.9667, 2.9673,
                              2.9688, 2.9696, 2.9698, 2.9713, 2.9717,
                              2.9725, 2.9742, 2.9757]})
print(pisa)
```
```
      lean  year
0   2.9642  1975
1   2.9644  1976
2   2.9656  1977
3   2.9667  1978
4   2.9673  1979
5   2.9688  1980
6   2.9696  1981
7   2.9698  1982
8   2.9713  1983
9   2.9717  1984
10  2.9725  1985
11  2.9742  1986
12  2.9757  1987
```
```python
plt.scatter(pisa["year"], pisa["lean"])
plt.show()
```
Fit the Linear Model
From the scatter plot we can see that a straight line fits the data nicely. We have previously used linear regression to analyze wine quality and the stock market; in this task, we will learn to understand the key statistical concepts behind it. statsmodels is a Python library for rigorous statistical analysis, and for linear models it provides a rich set of statistical methods and appropriate evaluation tools. The sm.OLS class fits a linear model, using least squares as the optimization method. OLS() does not automatically add an intercept to the model, but you can add a column of 1s to the features yourself to produce the intercept.
```python
import statsmodels.api as sm

y = pisa.lean  # target
X = pisa.year  # feature
X = sm.add_constant(X)  # add a column of 1s as the constant term

# OLS -- ordinary least squares fit
linear = sm.OLS(y, X)       # define the model
linearfit = linear.fit()    # fit the model
print(linearfit.summary())
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                   lean   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.987
Method:                 Least Squares   F-statistic:                     904.1
Date:                        Mon, Apr   Prob (F-statistic):           6.50e-12
Time:                        13:30:20   Log-Likelihood:                 83.777
No. Observations:                  13   AIC:                            -163.6
Df Residuals:                      11   BIC:                            -162.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.1233      0.061     18.297      0.000         0.988     1.258
year           0.0009    3.1e-05     30.069      0.000         0.001     0.001
==============================================================================
Omnibus:                        0.310   Durbin-Watson:                   1.642
Prob(Omnibus):                  0.856   Jarque-Bera (JB):                0.450
Skew:                           0.094   Prob(JB):                        0.799
Kurtosis:                         2.1   Cond. No.                     1.05e+06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 1.05e+06. This might indicate that there
are strong multicollinearity or other numerical problems.
```
Define A Basic Linear Model
The printed summary contains a lot of information about the model. To make sense of these statistics, we need to look carefully at the standard linear regression model. In the model below, e_i is the difference between the predicted value and the true value; our default assumption is that the model's errors are normally distributed with mean 0:

y_i = b0 + b1 * x_i + e_i,  where e_i ~ N(0, sigma^2)
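As a quick sanity check on this model, the coefficients can also be obtained directly from the least-squares solution of the design-matrix equation. A minimal sketch using numpy (the data values are copied from the dataset above):

```python
import numpy as np

year = np.arange(1975, 1988, dtype=float)
lean = np.array([2.9642, 2.9644, 2.9656, 2.9667, 2.9673, 2.9688, 2.9696,
                 2.9698, 2.9713, 2.9717, 2.9725, 2.9742, 2.9757])

# design matrix with a column of 1s for the intercept
X = np.column_stack([np.ones_like(year), year])

# least-squares solution of X @ beta = lean
beta, *_ = np.linalg.lstsq(X, lean, rcond=None)
print(beta)  # intercept and slope, matching const and year in the summary
```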
Calculate the model's residuals:
```python
# our predicted values of y
yhat = linearfit.predict(X)
print(yhat)
```
```
[ 2.96377802  2.96470989  2.96564176  2.96657363  2.96750549  2.96843736
  2.96936923  2.9703011   2.97123297  2.97216484  2.9730967   2.97402857
  2.97496044]
```
```python
residuals = yhat - y
print(residuals)
```
```
0    -0.000422
1     0.000310
2     0.000042
3    -0.000126
4     0.000205
5    -0.000363
6    -0.000231
7     0.000501
8    -0.000067
9     0.000465
10    0.000597
11   -0.000171
12   -0.000740
Name: lean, dtype: float64
```
Histogram of residuals
- We previously used histograms to show the distribution of data; now we can also plot the distribution of the residuals to see whether it looks normal (there are also many formal statistical tests for normality):
plt.hist(residuals, bins=5)
- Since our dataset has only 13 samples, this histogram is not very informative, although the middle bin is the tallest with 4 samples.
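One such formal normality check is the Shapiro-Wilk test from scipy. A minimal sketch, run on the residual values copied from the output above:

```python
import numpy as np
from scipy.stats import shapiro

# residuals of the Pisa fit, copied from the printed output above
residuals = np.array([-0.000422, 0.000310, 0.000042, -0.000126, 0.000205,
                      -0.000363, -0.000231, 0.000501, -0.000067, 0.000465,
                      0.000597, -0.000171, -0.000740])

# the W statistic is close to 1 for roughly normal data;
# a large p-value means we cannot reject normality
stat, p = shapiro(residuals)
print(stat, p)
```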
Sum of Squares
Many statistical measures of a linear regression model depend on three sums of squares: the sum of squared errors (SSE), the regression sum of squares (RSS), and the total sum of squares (TSS).
Sum of squared errors (SSE): the sum of the squared differences between the true and predicted values.
Regression sum of squares (RSS): the sum of the squared differences between the predicted values and the mean, where the mean is the mean of the true values. If you simply set every prediction to the mean of the observations, the RSS would be zero, but that would not make for a useful model. Rather, a large RSS together with a small SSE indicates a good model.
Total sum of squares (TSS): the sum of the squared differences between the observations and the mean of the observations; this is proportional to the variance of the training set.
TSS = RSS + SSE: total variance of the data = variance explained by the model + variance of the residuals.
```python
import numpy as np

# sum of the (predicted - observed) squared
SSE = np.sum((yhat - y.values)**2)
print(SSE)
```
```
1.9228571428562889e-06
```
```python
# average y
ybar = np.mean(y.values)

# sum of the (mean - predicted) squared
RSS = np.sum((ybar - yhat)**2)
print(RSS)
```
```
0.00015804483516480448
```
```python
# sum of the (mean - observed) squared
TSS = np.sum((ybar - y.values)**2)
print(TSS)
```
```
0.00015996769230769499
```
```python
print(TSS - RSS - SSE)
```
```
3.42158959043e-17
```
R-squared
- The coefficient of determination, also called R-squared, measures the strength of the linear relationship. It tells us what fraction of the total variance of the data is explained by the model:
- As mentioned before, a low SSE and a high RSS indicate a good model fit, and R-squared captures exactly this; its value lies between 0 and 1.
```python
R2 = RSS / TSS
print(R2)
```
```
0.987979715684
```
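Since TSS = RSS + SSE, an equivalent way to compute R-squared is 1 - SSE/TSS. A self-contained sketch of this identity, refitting the line with numpy.polyfit for illustration:

```python
import numpy as np

year = np.arange(1975, 1988, dtype=float)
lean = np.array([2.9642, 2.9644, 2.9656, 2.9667, 2.9673, 2.9688, 2.9696,
                 2.9698, 2.9713, 2.9717, 2.9725, 2.9742, 2.9757])

# fit a degree-1 polynomial (a straight line) to the data
slope, intercept = np.polyfit(year, lean, 1)
yhat = intercept + slope * year

SSE = np.sum((lean - yhat) ** 2)
TSS = np.sum((lean - lean.mean()) ** 2)
print(1 - SSE / TSS)  # ~0.988, matching RSS/TSS above
```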
T-distribution
Statistical tests can tell us whether the tilt of the tower is related to the year; a common test of statistical significance is Student's t-test, which is based on the t-distribution. Similar to the normal distribution, it is bell-shaped, but it has a lower peak. The t-test is used to compare means in small samples (sample size less than 30). It uses t-distribution theory to infer the probability of a difference occurring, and thus to determine whether the difference between two means is significant.
```python
from scipy.stats import t

# all values between -3 and 3
x = np.linspace(-3, 3, 100)

# compute the pdf with 3 degrees of freedom
print(t.pdf(x=x, df=3))
```
```
[ 0.02297204  0.02441481  0.02596406  0.02762847  0.0294174   0.031341
  ...
  0.02762847  0.02596406  0.02441481  0.02297204]
```
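Using the t-distribution, we can sketch how the significance of the year coefficient would be checked: the t statistic is the coefficient divided by its standard error, and the two-sided p-value comes from the t survival function with n - 2 degrees of freedom. The coefficient and standard error below are approximate values taken from the regression summary above:

```python
from scipy.stats import t

coef = 0.0009319   # year coefficient (approximate, from the summary)
std_err = 3.1e-05  # its standard error (from the summary)
df = 13 - 2        # observations minus fitted parameters

t_stat = coef / std_err
p_value = 2 * t.sf(abs(t_stat), df)
print(t_stat, p_value)  # t ~ 30, p-value far below 0.05
```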
The Tower of Pisa: Statistical Significance Test