1. Common function library
The stats module of the scipy package and the statsmodels package are commonly used data analysis tools in Python. scipy.stats once had a models submodule, which was later removed; it was rewritten and became the now independent statsmodels package.
scipy.stats contains basic tools such as the t-test, normality tests, and the chi-square test. statsmodels provides more systematic statistical models, including linear models, time series analysis, datasets, plotting tools, and more.
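As a quick sketch of this division of labor (the numbers below are illustrative, not from any real dataset): scipy.stats answers a single hypothesis-test question, while statsmodels fits a full model object.
import numpy as np
from scipy import stats
import statsmodels.api as sm
np.random.seed(12345678)
x = np.random.normal(0.5, 1.0, 50)
print(stats.ttest_1samp(x, popmean=0))  # scipy.stats: a basic one-sample t-test
y = 2 * x + np.random.normal(0, 1, 50)
model = sm.OLS(y, sm.add_constant(x)).fit()  # statsmodels: a full linear model
print(model.params)  # fitted intercept and slope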
2. Normality test of small sample data
(1) Purpose
The Shapiro-Wilk test is used to check whether a small sample of data follows a normal distribution. The larger the W statistic, the more consistent the data is with a normal distribution; large W values are typical of small samples, and the probability of the statistic traditionally has to be estimated from a table. Since the null hypothesis is that the data follows a normal distribution, a p-value below the chosen significance level means the data is not normally distributed.
A normality test is the first step of data analysis: whether the data is normal determines which analysis and forecasting methods can be used afterwards. When the data is not normally distributed, we can apply a transformation to convert it to a normal distribution and then proceed with the corresponding statistical methods.
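For example, right-skewed data often becomes approximately normal after a log transform; a minimal sketch (lognormal data is used here purely for illustration):
from scipy import stats
import numpy as np
np.random.seed(12345678)
x = stats.lognorm.rvs(s=1, size=80)  # right-skewed data
print(stats.shapiro(x))          # small p-value: not normal
print(stats.shapiro(np.log(x)))  # large p-value: consistent with normality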
(2) Example
from scipy import stats
import numpy as np
np.random.seed(12345678)
x = stats.norm.rvs(loc=5, scale=10, size=80)  # loc is the mean, scale is the standard deviation
print(stats.shapiro(x))
# Running result: (0.9654011726379395, 0.029035290703177452)
(3) Result analysis
The returned p-value = 0.029035290703177452 is less than the usual significance level (5%), so we reject the null hypothesis: x does not follow a normal distribution.
3. Test whether a sample follows a given distribution
(1) Purpose
The Kolmogorov-Smirnov (K-S) test checks whether sample data follows a given distribution; it is only suitable for continuous distributions. The following example uses it to test for normality.
(2) Example
from scipy import stats
import numpy as np
np.random.seed(12345678)
x = stats.norm.rvs(loc=0, scale=1, size=300)
print(stats.kstest(x, 'norm'))
# Run result: KstestResult(statistic=0.0315638260778347, pvalue=0.9260909172362317)
(3) Result analysis
We generate 300 random numbers from the standard normal distribution N(0, 1) and use the K-S test to check whether they follow a normal distribution, under the hypothesis that x comes from a normal distribution. The returned p-value = 0.9260909172362317 is greater than the usual significance level (5%), so we cannot reject the hypothesis that x follows a normal distribution. This does not prove that x is normally distributed; it only means there is insufficient evidence that it is not, so we accept the hypothesis. If the p-value were less than the chosen significance level, we would reject the hypothesis and conclude that x does not follow a normal distribution (bearing in mind that such a rejection still carries a risk of error equal to the significance level).
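One practical caveat: stats.kstest(x, 'norm') compares the sample against the standard normal N(0, 1). For data with a different mean or scale, the distribution parameters must be passed via args, otherwise even perfectly normal data would be rejected. A sketch (note that estimating the parameters from the same sample makes the p-value only approximate):
from scipy import stats
import numpy as np
np.random.seed(12345678)
y = stats.norm.rvs(loc=5, scale=10, size=300)
print(stats.kstest(y, 'norm', args=(y.mean(), y.std(ddof=1))))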
4. Test for homogeneity of variance
(1) Purpose
Variance reflects how far a set of data deviates from its mean. The homogeneity-of-variance test checks whether two or more groups of data differ in how much they deviate from their respective means. Homogeneous variances are also a prerequisite for many tests and algorithms.
(2) Example
from scipy import stats
import numpy as np
np.random.seed(12345678)
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=25, scale=9, size=500)
print(stats.levene(rvs1, rvs2))
# Run result: LeveneResult(statistic=1.6939963163060798, pvalue=0.19337536323599344)
(3) Result analysis
The returned p-value = 0.19337536323599344 is greater than the specified significance level (taken as 5%), so the two groups of data are considered to have homogeneous variances.
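A typical use of this conclusion is choosing the form of a subsequent two-sample t-test: with homogeneous variances the standard pooled t-test applies, otherwise Welch's variant. A sketch continuing the example above (reusing rvs1 and rvs2):
print(stats.ttest_ind(rvs1, rvs2, equal_var=True))  # pooled t-test, valid here
# with unequal variances, set equal_var=False for Welch's t-test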
5. Describing correlation graphically
(1) Purpose
The most common graphical tool for two-variable correlation analysis is the scatter plot: one variable on the horizontal axis and the other on the vertical axis. From the plot, the direction and strength of the correlation can be seen intuitively. A linear positive correlation generally forms a pattern running from lower left to upper right; a negative correlation runs from upper left to lower right; some nonlinear relationships can also be observed in the plot.
(2) Example
import statsmodels.api as sm
import matplotlib.pyplot as plt
data = sm.datasets.ccard.load_pandas().data
plt.scatter(data['INCOMESQ'], data['INCOME'])
plt.show()
(3) Result analysis
A clear positive correlation trend can be seen from the figure.
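The visual impression can also be checked numerically with the correlation coefficient introduced in the next section; for instance (reusing the data frame loaded above):
from scipy import stats
print(stats.pearsonr(data['INCOMESQ'], data['INCOME']))  # strongly positive coefficient, small p-value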
6. Correlation analysis of normal data
(1) Purpose
The Pearson correlation coefficient is a statistic that reflects the degree of linear correlation between two variables. It is used to analyze the correlation between two continuous, normally distributed variables, and is often applied to correlations among independent variables or between independent and dependent variables.
(2) Example
from scipy import stats
import numpy as np
np.random.seed(12345678)
a = np.random.normal(0, 1, 100)
b = np.random.normal(2, 2, 100)
print(stats.pearsonr(a, b))
# Running result: (-0.034173596625908326, 0.73571128614545933)
(3) Result analysis
The first value of the returned result is the correlation coefficient, which indicates the degree of linear correlation. Its range is [-1, 1]: the closer its absolute value is to 1, the stronger the linear correlation between the two variables; the closer it is to 0, the weaker. When two variables are completely linearly uncorrelated, the correlation coefficient is 0. The second value is the p-value; statistically, when p-value < 0.05, the correlation between the two variables is considered significant.
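For contrast, two variables constructed to be linearly related yield a coefficient close to 1 and a p-value near 0; a minimal sketch:
from scipy import stats
import numpy as np
np.random.seed(12345678)
a = np.random.normal(0, 1, 100)
b = 2 * a + np.random.normal(0, 0.5, 100)  # b depends linearly on a
print(stats.pearsonr(a, b))  # correlation coefficient near 1, p-value near 0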