I. Descriptive statistics
Descriptive statistics uses tabulation and classification, graphs, and computed summary measures to describe the central tendency, dispersion, skewness, and kurtosis of data.
1. Missing value imputation. Common methods: deletion, mean imputation, nearest-neighbor imputation, ratio regression, and decision trees.
2. Normality test: many statistical methods require the values to follow, or approximately follow, a normal distribution, so normality must be tested beforehand. Common methods: the nonparametric K-S test, P-P plots, Q-Q plots, the W test, and the method of moments.
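As a minimal sketch of the ideas in this section, the Python below fills missing values with the mean and then computes the mean, standard deviation, skewness, and excess kurtosis. The data and the function names `fill_missing_with_mean` and `describe` are hypothetical, and the moment-based skewness and kurtosis formulas (dividing by n) are one common convention assumed here, not something the text prescribes.

```python
import statistics

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def describe(values):
    """Return mean, sample standard deviation, skewness, and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments in population form (dividing by n); an assumed convention.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "stdev": statistics.stdev(values),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

raw = [2.0, None, 4.0, 6.0, None, 8.0]  # hypothetical data with gaps
filled = fill_missing_with_mean(raw)
summary = describe(filled)
print(filled)
print(summary)
```

A skewness near 0 and an excess kurtosis near 0 are what the method of moments looks for when checking normality, although a formal test is still needed for a decision.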
II. Hypothesis testing
1. Parametric tests
A parametric test tests key population parameters (such as the mean, proportion, variance, or correlation coefficient) when the form of the population distribution is known (the population is generally required to be normal).
1) U test. Conditions of use: the sample size n is large and the sample values follow a normal distribution.
2) t test. Conditions of use: the sample size n is small and the sample values follow a normal distribution.
A. One-sample t test: infers whether the population mean μ from which the sample was drawn differs from a known population mean μ0 (often a theoretical or standard value).
B. Paired-samples t test: used when the population mean is unknown and the two samples can be paired, with the two members of each pair similar in every condition that might affect the treatment effect.
C. Two independent-samples t test: used when it is not possible to find two samples similar enough in every respect to be paired for comparison.
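The three t tests above differ mainly in how the standard error is formed. The sketch below computes only the t statistics, using the Python standard library. The data and function names are hypothetical, and the independent-samples version assumes equal population variances (pooled variance). Turning a statistic into a p-value needs the t distribution, which is left out here.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for H0: population mean equals mu0."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.fmean(sample) - mu0) / se

def paired_t(before, after):
    """Paired test: a one-sample t test on the within-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return one_sample_t(diffs, 0.0)

def independent_t(x, y):
    """Two independent samples, assuming equal variances (pooled estimate)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.fmean(x) - statistics.fmean(y)) / se

# Hypothetical measurements.
t_one = one_sample_t([5.1, 4.9, 5.0, 5.2, 4.8], mu0=5.0)
t_pair = paired_t(before=[10, 12, 9, 11], after=[12, 14, 10, 13])
t_ind = independent_t([1, 2, 3], [4, 5, 6])
print(t_one, t_pair, t_ind)
```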
2. Nonparametric tests
Nonparametric tests do not depend on whether the population distribution is known and often do not target population parameters. Instead they test general hypotheses about the population, such as whether two population distributions have the same location or whether the population distribution is normal.
Applicable cases: ordinal data, whose distribution form is generally unknown.
A. The data are continuous, but the form of the population distribution is unknown or non-normal;
B. The population distribution is normal and the data are continuous, but the sample size is very small, for example 10 or fewer.
The main methods include the chi-square test, rank-sum test, binomial test, runs test, and K-S test.
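To illustrate why rank-based methods need no distributional assumption, here is a minimal sketch of the rank-sum statistic: the two samples are pooled, ranked (ties get the average rank), and the ranks of the first sample are summed. The function names and data are hypothetical, and the conversion to the Mann-Whitney U value is an assumed, commonly used form of the same statistic.

```python
def rank_sum(x, y):
    """Sum of the ranks of x within the pooled sample (average ranks for ties)."""
    pooled = sorted(list(x) + list(y))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i+1 .. j (1-based) share the same value; use their mean rank.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(ranks[v] for v in x)

def mann_whitney_u(x, y):
    """U statistic for x derived from its rank sum."""
    n = len(x)
    return rank_sum(x, y) - n * (n + 1) / 2

w = rank_sum([1, 3], [3, 5])            # pooled ranks: 1, 2.5, 2.5, 4
u = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(w, u)
```

Only the ordering of the values enters the calculation, which is why the method suits ordinal data.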
III. Reliability analysis
Checks the credibility of a measurement, for example the trustworthiness of a survey questionnaire.
Classification:
1. External reliability: the consistency of a scale when it is measured at different times; the common method is test-retest reliability.
2. Internal reliability: whether each scale measures a single concept, and how consistent the items that make up the scale are with one another; the common method is split-half reliability.
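Split-half reliability can be sketched as follows: the items are split into two halves, each respondent's half scores are correlated, and the correlation is adjusted for the shortened test length. The odd/even split, the Spearman-Brown adjustment, the function names, and the scores are all assumptions chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_reliability(scores):
    """scores: one row per respondent, one column per questionnaire item."""
    first_half = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r = pearson(first_half, second_half)
    return 2 * r / (1 + r)  # Spearman-Brown correction to full length

# Hypothetical answers of four respondents to four items.
answers = [[1, 2, 1, 2], [3, 3, 3, 3], [4, 5, 4, 5], [2, 2, 2, 2]]
reliability = split_half_reliability(answers)
print(round(reliability, 3))
```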
IV. Contingency table analysis
Used to analyze whether discrete or categorical variables are associated.
For a two-dimensional table, the chi-square test can be used; for a three-dimensional table, Mantel-Haenszel stratified analysis can be used.
Contingency table analysis also includes the chi-square test for paired count data and association tests for ordinal variables.
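For the two-dimensional case, the chi-square statistic compares each observed count with the count expected under independence. The sketch below is a minimal illustration with hypothetical counts. The p-value shortcut via `math.erfc` is valid only for 1 degree of freedom (a 2x2 table) and is included as an assumption-laden convenience.

```python
import math

def chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical 2x2 table: rows are groups, columns are outcomes.
stat, dof = chi_square([[10, 20], [20, 10]])
# With 1 degree of freedom the statistic is a squared standard normal,
# so the upper-tail probability is erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(stat, dof, p_value)
```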
V. Correlation analysis
Studies whether some dependence exists between phenomena and, where it does, examines the direction and strength of that dependence.
1. Simple correlation: the correlation between two factors, that is, the study involves only one independent variable and one dependent variable;
2. Multiple correlation: the correlation among three or more factors, that is, the study involves two or more independent variables and a dependent variable;
3. Partial correlation: when a phenomenon is related to several others, the correlation between two of the variables with the remaining variables held constant.
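The difference between simple and partial correlation is easiest to see numerically. This sketch computes the Pearson coefficient for two variables and then the first-order partial correlation that holds a third variable fixed, using the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The variable names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Simple (Pearson) correlation between two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y with z held constant."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data: study hours, exam score, and a third variable (grade level).
hours = [1, 2, 3, 4, 5]
score = [2, 1, 4, 3, 5]
level = [1, 1, 2, 2, 3]
r_simple = pearson(hours, score)
r_partial = partial_correlation(hours, score, level)
print(r_simple, r_partial)
```

When the third variable is uncorrelated with both of the others, the partial correlation equals the simple correlation.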
VI. Analysis of variance
Conditions of use: the samples must be mutually independent random samples; each sample comes from a normally distributed population; the population variances are equal.
Classification:
1. One-way analysis of variance: the experiment has only one influencing factor, or it has several but only the relationship between one factor and the response variable is analyzed.
2. Multi-factor analysis of variance with interaction: an experiment has several influencing factors; the analysis covers the relationship between these factors and the response variable and also takes into account the relationships among the factors themselves.
3. Multi-factor analysis of variance without interaction: analyzes the relationship between several influencing factors and the response variable, where the factors do not affect one another or their mutual influence is ignored.
4. Analysis of covariance: traditional analysis of variance has an obvious drawback: it cannot control certain random factors in the analysis, which affects the accuracy of the results. Analysis of covariance combines linear regression with analysis of variance: it analyzes the variance of the adjusted main effects after removing the influence of the covariates.
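For the one-way case, the F statistic is the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch with hypothetical groups; the function name `one_way_anova` is an assumption, and the p-value (which needs the F distribution) is omitted.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical response values under three levels of one factor.
f_stat, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_stat, df_b, df_w)
```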
VII. Regression analysis
Classification:
1. Simple linear regression analysis: only one independent variable X is related to the dependent variable Y; X and Y must both be continuous variables, and the dependent variable Y or its residuals must follow a normal distribution.
2. Multiple linear regression analysis
Conditions of use: analyzes the relationship between several independent variables X and the dependent variable Y; X and Y must both be continuous variables, and the dependent variable Y or its residuals must follow a normal distribution.
1) Variable selection: methods for choosing the optimal regression equation include the all-models method (Cp method), stepwise regression, forward selection, and backward elimination.
2) Model diagnostics:
A. Residual test: the differences between the observed and estimated values should follow a normal distribution.
B. Influential point detection: common approaches are the standard error method and the Mahalanobis distance method.
C. Collinearity diagnostics:
• Diagnostic methods: tolerance, variance inflation factor (VIF), eigenvalue method, condition index (CI), variance proportion
• Remedies: increase the sample size, or use a different regression such as principal component regression or ridge regression.
3. Logistic regression analysis
The linear regression model requires the dependent variable to be a continuous, normally distributed variable and the independent and dependent variables to be linearly related. The logistic regression model places no requirement on the distribution of the dependent variable and is generally used when the dependent variable is discrete.
Classification:
Logistic regression models are either conditional or unconditional; the difference between them is whether parameter estimation uses conditional probability.
4. Other regression methods: nonlinear regression, ordinal regression, probit regression, weighted regression, etc.
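For the simple linear regression case in this section, the least-squares slope and intercept have a closed form. The sketch below fits the line and returns the residuals, which are what a residual normality check would examine. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    r_squared = 1 - ss_res / ss_tot  # share of variance explained by the line
    return slope, intercept, r_squared, residuals

# Hypothetical continuous X and Y.
slope, intercept, r_squared, residuals = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept, r_squared)
```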
VIII. Cluster analysis
Classifies sample individuals or indicator variables according to their characteristics, seeking a statistic that reasonably measures the similarity between objects.
1. Classification by nature:
Q-type cluster analysis: classifies samples (also called sample clustering), using distance coefficients as the statistic that measures similarity, such as Euclidean distance, extreme distance, and absolute distance.
R-type cluster analysis: classifies indicators (also called indicator clustering), using similarity coefficients as the statistic that measures similarity, such as the correlation coefficient and the contingency coefficient.
2. Classification by method:
1) Hierarchical clustering (also called system clustering): suitable for small-sample clustering or indicator clustering; indicators are generally clustered with this method.
2) Stepwise clustering: suitable for large-sample clustering.
3) Other clustering methods: two-step clustering, K-means clustering, etc.
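K-means clustering, named in the list above, is a Q-type method that uses Euclidean distance. The sketch below is a deliberately simple version: it assumes the first k points are used as starting centroids (real implementations usually randomize this), and the points and the function name `k_means` are hypothetical.

```python
import math

def k_means(points, k, iterations=20):
    """Cluster tuples of coordinates into k groups by Euclidean distance."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, clusters = k_means(points, 2)
print(centroids)
```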
IX. Discriminant analysis
1. Discriminant analysis: builds a discriminant function from a set of samples whose classes are already known, so that the number of misclassified cases is minimized, and then, for a given new sample, determines which population it comes from.
2. Differences from cluster analysis:
1) Cluster analysis can classify both samples and indicators, whereas discriminant analysis applies only to samples.
2) Cluster analysis does not know the categories in advance, nor how many there are, whereas discriminant analysis must know in advance both the categories and how many there are.
3) Cluster analysis classifies samples directly, without classified historical data, whereas discriminant analysis needs classified historical data to build the discriminant function before samples can be classified.
3. Classification:
1) Fisher discriminant analysis:
Classification by distance: a sample is assigned to the class it is closest to; suitable for two-class discrimination.
Classification by probability: a sample is assigned to the class it most probably belongs to; suitable for multi-class discrimination.
2) Bayes discriminant analysis:
Bayes discriminant analysis is more complete and advanced than Fisher discriminant analysis. It handles multi-class discrimination and also takes the distribution of the data into account, so it is used more often.
X. Principal component analysis
A set of mutually correlated indicators is transformed into a new set of mutually independent indicator variables, and a few of these new variables are used to summarize the main information contained in the original indicators.
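A minimal sketch of the idea for two correlated indicators: the covariance matrix is 2x2, so its eigenvalues and leading eigenvector can be written in closed form, and the first principal component is the projection onto that eigenvector. Restricting to two variables, the function name `pca_2d`, and the data are assumptions made to keep the example dependency-free.

```python
import math

def pca_2d(points):
    """Return (direction, explained_ratio, scores) of the first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector for the larger eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    scores = [(p[0] - mx) * direction[0] + (p[1] - my) * direction[1] for p in points]
    return direction, lam1 / (lam1 + lam2), scores

# Hypothetical, strongly correlated indicators.
direction, explained, scores = pca_2d([(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)])
print(direction, explained)
```

An explained ratio close to 1 means the single new variable carries almost all the information in the two original ones.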
XI. Factor analysis
A multivariate statistical method that aims to find latent factors hidden in multivariate data, which cannot be observed directly but influence or govern the measurable variables, and to estimate how strongly the latent factors affect the measurable variables and how the latent factors correlate with one another.
Comparison with principal component analysis:
Same: both can sort out the internal structural relationships among multiple original variables.
Different: principal component analysis focuses on synthesizing the information in the original variables, whereas factor analysis focuses on explaining the relationships among the original variables and is a more in-depth statistical method than principal component analysis.
Use:
1) Reduce the number of variables in the analysis.
2) Classify the original variables by exploring the correlations among them.
XII. Time series analysis
A statistical method for processing dynamic data; it studies the statistical regularities of random data sequences in order to solve practical problems. A time series usually consists of four components: trend, seasonal variation, cyclical fluctuation, and irregular fluctuation.
Main methods: moving average filtering and exponential smoothing, ARIMA models, vector ARIMA models, ARIMAX models, vector autoregressive models, ARCH-family models
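The two simplest methods in the list, moving average filtering and exponential smoothing, can be sketched in a few lines. The window length, the smoothing factor `alpha`, the choice to start the smoothed series at the first observation, and the data are all assumptions for illustration.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive observations."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; alpha in (0, 1] weights the newest value."""
    smoothed = [series[0]]  # assumed start: the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [1, 2, 3, 4, 5]  # hypothetical series
ma = moving_average(sales, window=3)
es = exponential_smoothing([10, 20, 30], alpha=0.5)
print(ma, es)
```

Both filters damp the irregular fluctuation so that the trend component is easier to see.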
XIII. Survival analysis
A statistical method for studying the distribution of survival time and the relationship between survival time and related factors.
1. Content:
1) Describe the survival process, that is, study the distribution of survival time.
2) Compare survival processes, that is, study and compare the survival time distributions of two or more groups.
3) Analyze risk factors, that is, study the effect of risk factors on the survival process.
4) Build a mathematical model, that is, express the dependence of survival time on the related risk factors as a mathematical formula.
2. Methods:
1) Statistical description: includes computing the quantiles, median, and mean of survival time, estimating the survival function, and judging the distribution of survival time graphically; no statistical inference is drawn from the data.
2) Nonparametric tests: test whether the survival curves are the same across the levels of a grouping variable, with no requirement on the distribution of survival time, and test the effect of risk factors on survival time.
A. Product-limit method (PL method)
B. Life table method (LT method)
3) Semiparametric model regression analysis: under specific assumptions, builds a regression equation of survival time on several risk factors; the representative method is Cox proportional hazards regression.
4) Parametric model regression analysis: when survival time is known to follow a specific parametric model, fits that model to analyze more precisely how the variables vary together.
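The product-limit method (commonly known as the Kaplan-Meier estimator) multiplies, at each observed event time, the fraction of at-risk subjects who survive it. The sketch below is a minimal version with hypothetical follow-up data, where `events` marks 1 for an observed event and 0 for a censored subject. The function name is an assumption.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with an observed event."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up times (months) and event indicators.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)
```

Censored subjects never lower the curve directly; they only shrink the at-risk count for later event times.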
XIV. Canonical correlation analysis
Correlation analysis generally analyzes the relationship between two variables, whereas canonical correlation analysis is a statistical method for analyzing the correlation between two sets of variables (for example, 3 academic ability indicators and 5 school performance indicators).
The basic idea of canonical correlation analysis is similar to that of principal component analysis: it reduces the many linear correlations between the variables of one set and those of another to simple linear correlations between a few pairs of composite variables, and these few pairs capture almost all the linear correlation information contained in the original variable sets.
XV. ROC analysis
The ROC curve is drawn from a series of different binary classification thresholds (cutoff values or decision thresholds), with the true positive rate (sensitivity) on the vertical axis and the false positive rate (1 - specificity) on the horizontal axis.
Use:
1. The ROC curve makes it easy to examine the ability to identify disease at any threshold.
2. Select the best diagnostic cutoff value. The closer the ROC curve is to the upper-left corner, the higher the accuracy of the test.
3. Compare the ability of two or more diagnostic tests to identify disease; the area under the ROC curve is generally used to reflect the accuracy of a diagnostic system.
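A minimal sketch of how the curve and its area can be computed: every distinct score is tried as a threshold, the false and true positive rates are recorded, and the area is obtained with the trapezoid rule. The labels (1 = diseased, 0 = healthy), the scores, and the function names are hypothetical, and the input is assumed to contain at least one positive and one negative case.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, from (0, 0) to (1, 1)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Predict positive when the score reaches the threshold.
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 0, 1, 0]          # hypothetical true disease status
scores = [0.9, 0.8, 0.7, 0.1]  # hypothetical test results
curve = roc_points(labels, scores)
auc = area_under_curve(curve)
print(curve, auc)
```

An area of 1.0 corresponds to a curve that passes through the upper-left corner, and 0.5 corresponds to a test no better than chance.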
XVI. Other analysis methods
Multiple response analysis, distance analysis, item analysis, correspondence analysis, decision tree analysis, neural networks, system equations, Monte Carlo simulation, etc.
Summary of data analysis methods