Python Data Statistical Analysis

Source: Internet
Author: User
Keywords: Python, data analysis
1. Common function library
  The stats module of the scipy package and the statsmodels package are commonly used data analysis tools in Python. scipy.stats used to contain a models submodule, which was later removed; it was rewritten and became the now-independent statsmodels package.

scipy's stats module contains basic tools such as the t-test, normality tests, and the chi-square test. statsmodels provides more systematic statistical modeling, including linear models, time series analysis, datasets, graphics tools, and more.

2. Normality test of small sample data
(1) Purpose

The Shapiro-Wilk test checks whether a set of small-sample data conforms to the normal distribution. The larger the W statistic, the more closely the data follow a normal distribution; for small samples, the probability corresponding to the statistic is traditionally estimated from a lookup table. Since the null hypothesis is that the data are normally distributed, a p-value below the chosen significance level means the data are judged not to be normal.

A normality test is often the first step in data analysis: whether the data are normal determines which analysis and forecasting methods can be used afterwards. When the data are not normally distributed, various transformations can be applied to convert the non-normal data into an approximately normal distribution, after which the corresponding statistical methods can be used for the next step.

(2) Example

from scipy import stats
import numpy as np
 
np.random.seed(12345678)
x = stats.norm.rvs(loc=5, scale=10, size=80)  # loc is the mean, scale is the standard deviation
print(stats.shapiro(x))
# Running result: (0.9654011726379395, 0.029035290703177452)
(3) Result analysis

 The returned p-value = 0.029035290703177452 is less than the usual significance level of 5%, so the null hypothesis is rejected: x is judged not to follow a normal distribution. Note that x was in fact drawn from a normal distribution, so this particular rejection is a false positive of the kind that occurs about 5% of the time at this significance level.
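When a sample fails the normality test, a transformation can sometimes help, as mentioned in the Purpose above. A minimal sketch (an addition, not from the original article) using a log transform on a hypothetical right-skewed sample:

```python
from scipy import stats
import numpy as np

np.random.seed(12345678)
# A log-normal sample is strongly right-skewed, so Shapiro-Wilk rejects it
x = stats.lognorm.rvs(s=1, size=80)
print(stats.shapiro(x))

# After a log transform the sample is normal again, so the p-value
# is typically far above the 5% significance level
print(stats.shapiro(np.log(x)))
```

The same idea applies to other transformations (square root, Box-Cox, etc.); the right choice depends on the shape of the data.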

3. Test whether the sample follows a certain distribution
(1) Purpose

 The Kolmogorov-Smirnov test checks whether sample data follow a given distribution; it is only suitable for continuous distributions. The following example uses it to test for the normal distribution.

(2) Example

from scipy import stats
import numpy as np
 
np.random.seed(12345678)
x = stats.norm.rvs(loc=0, scale=1, size=300)
print(stats.kstest(x, 'norm'))
# Run result: KstestResult(statistic=0.0315638260778347, pvalue=0.9260909172362317)
(3) Result analysis

 We generate 300 random numbers from the standard normal distribution N(0, 1) and use the K-S test with the null hypothesis that x comes from a normal distribution. The returned p-value = 0.9260909172362317 is greater than the usual significance level of 5%, so we cannot reject the hypothesis that x follows a normal distribution. This does not prove that x is normally distributed; it only means there is insufficient evidence to conclude otherwise, so the hypothesis is retained. If the p-value were below the significance level, we would reject the hypothesis and conclude that x does not follow a normal distribution, though even then the rejection would carry a small probability of being a false positive.
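One detail worth noting (an addition, not from the original): `kstest(x, 'norm')` compares against the standard normal N(0, 1). A sample with a different mean or scale must either be standardized first or tested with explicit distribution parameters via the `args` argument:

```python
from scipy import stats
import numpy as np

np.random.seed(12345678)
y = stats.norm.rvs(loc=5, scale=10, size=300)

# Against the default N(0, 1) the test rejects, even though y is normal
print(stats.kstest(y, 'norm'))
# Passing (loc, scale) through args tests against N(5, 10) instead
print(stats.kstest(y, 'norm', args=(5, 10)))
```

Forgetting `args` is a common source of spurious rejections with this function.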

4. Test for homogeneity of variance
(1) Purpose

 Variance reflects how far a set of data deviates from its mean. A homogeneity-of-variance test checks whether two or more groups of data differ in their degree of dispersion around their respective means. Homogeneous variance is also a prerequisite for many tests and algorithms.

(2) Example

from scipy import stats
import numpy as np
 
np.random.seed(12345678)
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=25, scale=9, size=500)
print(stats.levene(rvs1, rvs2))
# Run result: LeveneResult(statistic=1.6939963163060798, pvalue=0.19337536323599344)
(3) Result analysis

 The returned p-value = 0.19337536323599344 is greater than the chosen significance level (here 5%), so the two groups of data are considered to have homogeneous variance.
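As a side note (an addition to the original), scipy.stats also provides Bartlett's test for the same question. It is more powerful when the groups really are normally distributed, while Levene's test is the more robust choice when they may not be. A sketch on the same two samples:

```python
from scipy import stats
import numpy as np

np.random.seed(12345678)
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=25, scale=9, size=500)

# Bartlett's test assumes normality; with these normal samples it is
# a valid alternative to Levene's test
print(stats.bartlett(rvs1, rvs2))
```

The result is interpreted the same way: a p-value above the significance level means the equal-variance hypothesis is retained.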

5. Graphical description of correlation
(1) Purpose

The most common correlation analysis of two variables starts with a graphical description: plot one variable on the horizontal axis and the other on the vertical axis as a scatter plot. The direction and strength of the correlation can then be seen directly. A linear positive correlation generally runs from lower left to upper right, while a negative correlation runs from upper left to lower right; some nonlinear relationships can also be spotted this way.

(2) Example

import statsmodels.api as sm
import matplotlib.pyplot as plt
 
data = sm.datasets.ccard.load_pandas().data
plt.scatter(data['INCOMESQ'], data['INCOME'])
plt.show()


(3) Result analysis

 The figure shows a clear positive correlation trend.

6. Correlation analysis of normal data
(1) Purpose

 The Pearson correlation coefficient is a statistic measuring the degree of linear correlation between two variables. It is used to analyze the correlation between two continuous, normally distributed variables, and is often applied to the relationships among independent variables and between independent and dependent variables.

(2) Example

from scipy import stats
import numpy as np
 
np.random.seed(12345678)
a = np.random.normal(0, 1, 100)
b = np.random.normal(2, 2, 100)
print(stats.pearsonr(a, b))
# Running result: (-0.034173596625908326, 0.73571128614545933)
(3) Result analysis

The first value in the result is the correlation coefficient, which indicates the degree of linear correlation and ranges over [-1, 1]. The closer its absolute value is to 1, the stronger the linear relationship between the two variables; the closer to 0, the weaker. When two variables are completely linearly uncorrelated, the coefficient is 0. The second value is the p-value; statistically, when p-value < 0.05 the correlation between the two variables is considered significant.
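For contrast with the two independent samples above, a hypothetical construction where `b` depends linearly on `a` drives the coefficient close to 1:

```python
from scipy import stats
import numpy as np

np.random.seed(12345678)
a = np.random.normal(0, 1, 100)
# b is a linear function of a plus small noise, so |r| should be near 1
b = 3 * a + np.random.normal(0, 0.5, 100)
r, p = stats.pearsonr(a, b)
print(r, p)
```

Here the p-value falls far below 0.05 and the coefficient is strongly positive, the opposite of the independent case above.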