1. Common function library
The stats module of the scipy package and the statsmodels package are commonly used data analysis tools in Python. scipy.stats once had a models submodule, which was later removed; it was rewritten and became the now independent statsmodels package.
scipy.stats contains basic tools such as the t-test, normality tests, and the chi-square test. statsmodels provides more systematic statistical models, including linear models, time series analysis, datasets, plotting tools, and more.
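As a quick sketch of this division of labor (the numbers below are illustrative, not from any real dataset): scipy.stats answers a single hypothesis-test question, while statsmodels fits a full model object.
import numpy as np
from scipy import stats
import statsmodels.api as sm
np.random.seed(12345678)
x = np.random.normal(0.5, 1.0, 50)
print(stats.ttest_1samp(x, popmean=0))  # scipy.stats: a basic one-sample t-test
y = 2 * x + np.random.normal(0, 1, 50)
model = sm.OLS(y, sm.add_constant(x)).fit()  # statsmodels: a full linear model
print(model.params)  # fitted intercept and slope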
2. Normality test of small sample data
(1) Purpose
The Shapiro-Wilk test is used to check whether a small sample of data follows a normal distribution. The larger the W statistic, the more consistent the data is with a normal distribution; large W values are typical of small samples, and the probability of the statistic traditionally has to be estimated from a table. Since the null hypothesis is that the data follows a normal distribution, a p-value below the chosen significance level means the data is not normally distributed.
A normality test is the first step of data analysis: whether the data is normal determines which analysis and forecasting methods can be used afterwards. When the data is not normally distributed, we can apply a transformation to convert it to a normal distribution and then proceed with the corresponding statistical methods.
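For example, right-skewed data often becomes approximately normal after a log transform; a minimal sketch (lognormal data is used here purely for illustration):
from scipy import stats
import numpy as np
np.random.seed(12345678)
x = stats.lognorm.rvs(s=1, size=80)  # right-skewed data
print(stats.shapiro(x))          # small p-value: not normal
print(stats.shapiro(np.log(x)))  # large p-value: consistent with normality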
(2) Example
from scipy import stats
import numpy as np
np.random.seed(12345678)
x = stats.norm.rvs(loc=5, scale=10, size=80)  # loc is the mean, scale is the standard deviation
print(stats.shapiro(x))
# Running result: (0.9654011726379395, 0.029035290703177452)
(3) Result analysis
The returned p-value = 0.029035290703177452 is less than the usual significance level (5%), so we reject the null hypothesis: x does not follow a normal distribution.
3. Test whether a sample follows a given distribution
(1) Purpose
The Kolmogorov-Smirnov (K-S) test checks whether sample data follows a given distribution; it is only suitable for continuous distributions. The following example uses it to test for normality.
(2) Example
from scipy import stats
import numpy as np
np.random.seed(12345678)
x = stats.norm.rvs(loc=0, scale=1, size=300)
print(stats.kstest(x, 'norm'))
# Run result: KstestResult(statistic=0.0315638260778347, pvalue=0.9260909172362317)
(3) Result analysis
We generate 300 random numbers from the standard normal distribution N(0, 1) and use the K-S test to check whether they follow a normal distribution, under the hypothesis that x comes from a normal distribution. The returned p-value = 0.9260909172362317 is greater than the usual significance level (5%), so we cannot reject the hypothesis that x follows a normal distribution. This does not prove that x is normally distributed; it only means there is insufficient evidence that it is not, so we accept the hypothesis. If the p-value were less than the chosen significance level, we would reject the hypothesis and conclude that x does not follow a normal distribution (bearing in mind that such a rejection still carries a risk of error equal to the significance level).
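One practical caveat: stats.kstest(x, 'norm') compares the sample against the standard normal N(0, 1). For data with a different mean or scale, the distribution parameters must be passed via args, otherwise even perfectly normal data would be rejected. A sketch (note that estimating the parameters from the same sample makes the p-value only approximate):
from scipy import stats
import numpy as np
np.random.seed(12345678)
y = stats.norm.rvs(loc=5, scale=10, size=300)
print(stats.kstest(y, 'norm', args=(y.mean(), y.std(ddof=1))))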
4. Test for homogeneity of variance
(1) Purpose
Variance reflects how far a set of data deviates from its mean. The homogeneity-of-variance test checks whether two or more groups of data differ in how much they deviate from their respective means. Homogeneous variances are also a prerequisite for many tests and algorithms.
(2) Example
from scipy import stats
import numpy as np
np.random.seed(12345678)
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=25, scale=9, size=500)
print(stats.levene(rvs1, rvs2))
# Run result: LeveneResult(statistic=1.6939963163060798, pvalue=0.19337536323599344)
(3) Result analysis
The returned p-value = 0.19337536323599344 is greater than the specified significance level (taken as 5%), so the two groups of data are considered to have homogeneous variances.
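A typical use of this conclusion is choosing the form of a subsequent two-sample t-test: with homogeneous variances the standard pooled t-test applies, otherwise Welch's variant. A sketch continuing the example above (reusing rvs1 and rvs2):
print(stats.ttest_ind(rvs1, rvs2, equal_var=True))  # pooled t-test, valid here
# with unequal variances, set equal_var=False for Welch's t-test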
5. Describing correlation graphically
(1) Purpose
The most common graphical tool for two-variable correlation analysis is the scatter plot: one variable on the horizontal axis and the other on the vertical axis. From the plot, the direction and strength of the correlation can be seen intuitively. A linear positive correlation generally forms a pattern running from lower left to upper right; a negative correlation runs from upper left to lower right; some nonlinear relationships can also be observed in the plot.
(2) Example
import statsmodels.api as sm
import matplotlib.pyplot as plt
data = sm.datasets.ccard.load_pandas().data
plt.scatter(data['INCOMESQ'], data['INCOME'])
plt.show()
(3) Result analysis
A clear positive correlation trend can be seen from the figure.
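The visual impression can also be checked numerically with the correlation coefficient introduced in the next section; for instance (reusing the data frame loaded above):
from scipy import stats
print(stats.pearsonr(data['INCOMESQ'], data['INCOME']))  # strongly positive coefficient, small p-value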
6. Correlation analysis of normal data
(1) Purpose
The Pearson correlation coefficient is a statistic that reflects the degree of linear correlation between two variables. It is used to analyze the correlation between two continuous, normally distributed variables, and is often applied to correlations among independent variables or between independent and dependent variables.
(2) Example
from scipy import stats
import numpy as np
np.random.seed(12345678)
a = np.random.normal(0, 1, 100)
b = np.random.normal(2, 2, 100)
print(stats.pearsonr(a, b))
# Running result: (-0.034173596625908326, 0.73571128614545933)
(3) Result analysis
The first value of the returned result is the correlation coefficient, which indicates the degree of linear correlation. Its range is [-1, 1]: the closer its absolute value is to 1, the stronger the linear correlation between the two variables; the closer it is to 0, the weaker. When two variables are completely linearly uncorrelated, the correlation coefficient is 0. The second value is the p-value; statistically, when p-value < 0.05, the correlation between the two variables is considered significant.
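For contrast, two variables constructed to be linearly related yield a coefficient close to 1 and a p-value near 0; a minimal sketch:
from scipy import stats
import numpy as np
np.random.seed(12345678)
a = np.random.normal(0, 1, 100)
b = 2 * a + np.random.normal(0, 0.5, 100)  # b depends linearly on a
print(stats.pearsonr(a, b))  # correlation coefficient near 1, p-value near 0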