The tower of Pisa--statistical significance test

Source: Internet
Author: User
Tags statsmodels

Dataset

The Leaning Tower of Pisa is one of Italy's largest tourist attractions. Over the past hundreds of years the tower has been slowly leaning aside, culminating in a 5.5-degree tilt angle that deviates nearly 3 meters at the top level. The annual data pisa.csv documents the tilt of the measuring tower from 1975 to 1987, where lean represents the angle of deviation. In this task, we will try to use linear regression to estimate the tilt rate and interpret its coefficients and statistics.

                         "lean": [2.9642, 2.9644, 2.9656, 2.9667, 2.9673, 2.9688, 2.9696,                                   2.9698, 2.9713, 2.9717, 2.9725, 2.9742, 2.9757]})print(pisa)‘‘‘      lean  year0   2.9642  19751   2.9644  19762   2.9656  19773   2.9667  19784   2.9673  19795   2.9688  19806   2.9696  19817   2.9698  19828   2.9713  19839   2.9717  198410  2.9725  198511  2.9742  198612  2.9757  1987‘‘‘plt.scatter(pisa["year"], pisa["lean"])

Fit the Linear Model

From the scatter plot we can see that the curve can be used to fit the data nicely. Before we used linear regression to analyze the quality of wine and the stock market, but in this task, we will learn how to understand the key statistical concepts. statsmodels is a library of rigorous statistical analysis in Python, and for linear models, Statsmodels provides a sufficient number of statistical methods and appropriate evaluation methods. SM. The OLS class is used to fit the linear model, and the optimization method is the least squares. OLS () does not automatically add a intercept to the model, but you can add a column of properties yourself so that the value is 1 to produce the intercept.

Import Statsmodels.api as Smy = pisa.lean # targetx = pisa.year # featuresx = Sm.add_constant (X) # Add a column of 1 s as the constant term# OLS--Ordinary Least squares fitlinear = Sm. OLS (Y, X) # Fit Modellinearfit = Linear.fit () print (Linearfit.summary ())" "OLS Regression Results ======================================== ======================================Dep. Variable:lean r-squared:0.988model:ols Adj.                R-squared:0.987method:least Squares f-statistic:904.1date: Mon, APR Prob (f-statistic): 6.50e-12time:13:30:20 Log-likelihood : 83.777No.                            Observations:13 AIC: -163.6df residuals:11 BIC: -162.4DF model:1covariance Type:nonrobust ======================================== ======================================Coef std Err t p>|t| [95.0% Conf. Int.]------------------------------------------------------------------------------const         1.1233 0.061 18.297 0.000 0.988 1.258year 0.0009 3.1e-05 30.069 0.000                        0.001 0.001==============================================================================omnibus:                0.310 Durbin-watson:1.642prob (Omnibus): 0.856 Jarque-bera (JB): 0.450skew:0.094 Prob (JB): 0.799kurtosis:2.1                     Cond. No. 1.05E+06==============================================================================WARNINGS:[1] Standard Errors assume the covariance matrix of the Errors is correctly specified. [2] The condition number is large, 1.05e+06. This might indicate, there arestrong multicollinearity or other numerical problems. ' 
Define A Basic Linear Model
    • When printing summary, there is a lot of information about the model, in order to find out these statistical indicators, we need to carefully study the standard linear regression model, the following model of EI is the difference between the predicted value and the real value, meaning that our default model error is normally distributed, the average value is 0.:

    • Calculate the residuals for the Model (residuals):

# Our predicted values of yyhat = linearfit.predict(X)print(yhat)‘‘‘[ 2.96377802  2.96470989  2.96564176  2.96657363  2.96750549  2.96843736  2.96936923  2.9703011   2.97123297  2.97216484  2.9730967   2.97402857  2.97496044]‘‘‘residuals = yhat - y‘‘‘‘pandas.core.series.Series‘>)0    -0.0004221     0.0003102     0.0000423    -0.0001264     0.0002055    -0.0003636    -0.0002317     0.0005018    -0.0000679     0.00046510    0.00059711   -0.00017112   -0.000740Name: lean, dtype: float64‘‘‘
Histogram of residuals
    • We used the histogram (histograms) to show the distribution of the data, and now we can also show the distribution of the residuals to see if it satisfies the normal distribution (there are many statistical tests to check the normal distribution):
plt.hist(residuals, bins=5)

    • Since there are only 13 samples of our dataset, this histogram does not make much sense, although there are 4 samples in the middle of the highest
Sum of Squares

Statistical measurements of many linear regression models are dependent on three square measurements: Error (SSE), Regression sum of squares (RSS), and total Sum of squares (TSS).

    • Error (SSE): Sum of squares of difference between true and predicted values

    • Regression Sum of squares (RSS): The sum of squares of the difference of the mean of the predicted and true values, where the mean is the mean value of the true value. If you set the forecast value to the mean of the observation, the RSS is very low, but that doesn't make any sense. Instead is a big RSS and a small SSE represents a good model.

    • Total Sum of squares (TSS): The sum of squares between observations and the mean of the observed values is probably the variance of the training set.

      Tss=rss+sse: variance of total data = variance of the Model + variance of the residuals

importas np# sum the (predicted - observed) squaredSSE = np.sum((yhat-y.values)**2)‘‘‘1.9228571428562889e-06‘‘‘# Average yybar = np.mean(y.values)# sum the (mean - predicted) squaredRSS = np.sum((ybar-yhat)**2)‘‘‘0.00015804483516480448‘‘‘# sum the (mean - observed) squaredTSS = np.sum((ybar-y.values)**2)‘‘‘0.00015996769230769499‘‘‘print(TSS-RSS-SSE)‘‘‘3.42158959043e-17‘‘‘
R-squared
    • Linear judgment (coefficient of determination) is also called r-squared, which is used to measure linear dependence. It is a number used to tell us the variance of the model in the total variance of the data:

    • Before mentioning a low SSE and a high RSS representation of a good model fitting, this r-squared represents this meaning, between 0 and 1.
R2 = RSS/TSSprint(R2)‘‘‘0.987979715684‘‘‘
T-distribution

Statistical tests indicate that the tilt of the tower is related to the year, and a common statistical significance test is student T-test. The basis of this test is the T distribution. Similar to normal distribution, they are bell-shaped but have a lower peak. The T test is a test method for the two mean difference in small samples (sample size less than 30). It uses the T distribution theory to infer the probability of the difference occurring, thus determining whether the difference between the two averages is significant.

From scipy.stats import t# all values between-3 and 3x = Np.linspace ( -3,3,100) # Compute the PDF with 3 degrees of freedom Print (T.pdf (x=x, df=3))" "[0.02297204 0.02441481 0.02596406 0.02762847 0.0294174 0.0313410.03341025 0.03563701 0.03803403 0.04061509 0.04339497 0.046389520.04961567 0.05309149 0.05683617 0.06086996 0.0652142 0.069891160.07492395 0.08033633 0.08615245 0.09239652 0.0990924 0.106263040.11392986 0.12211193 0.13082504 0.14008063 0.14988449 0.160235370.17112343 0.18252859 0.1944188 0.20674834 0.21945618 0.232464640.2456783 0.2589835 0.27224841 0.28532401 0.29804594 0.310237480.32171351 0.33228555 0.34176766 0.34998293 0.35677032 0.361991280.36553585 0.36732769 0.36732769 0.36553585 0.36199128 0.356770320.34998293 0.34176766 0.33228555 0.32171351 0.31023748 0.298045940.28532401 0.27224841 0.2589835 0.2456783 0.23246464 0.219456180.20674834 0.1944188 0.18252859 0.17112343 0.16023537 0.149884490.14008063 0.13082504 0.12211193 0.11392986 0.10626304 0.09909240.09239652 0.08615245 0.08033633 0.07492395 0.06989116 0.06521420.06086996 0.05683617 0.05309149 0.04961567 0.04638952 0.043394970.04061509 0.03803403 0.03563701 0.03341025 0.031341 0.02941740.02762847 0.02596406 0.02441481 0.02297204]" "

The tower of Pisa--statistical significance test

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.