Data analysis: probability and statistics basics


I. Overview of data analysis
1. The concept of data analysis
In plain terms, data analysis means extracting the information you want from a large volume of data. A more professional answer: data analysis is the science and art of collecting, processing and organizing data for a purpose, and of analysing and interpreting it with statistical and data mining techniques. A more practical answer: from an industry perspective, data analysis is the process of collecting, organizing, processing and analysing data around an industry goal, and distilling valuable information from it.
Understand data analysis from three aspects: objectives, methods and results.
2. The concept of data mining
Data mining is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in large, incomplete, noisy, fuzzy and random data sets. It is a broad interdisciplinary field that draws on machine learning, mathematical statistics, neural networks, databases, pattern recognition, rough sets, fuzzy mathematics and other related technologies.
3. The nature of business data analysis and prediction

Data analysis is tightly integrated with the business and serves business decisions: anticipate future developments, identify problems early, optimize the business, and choose the best decision option.

4. Eight levels of data analysis
General reports, ad hoc queries, multidimensional analysis, alerts, statistical analysis, forecasting, predictive modeling, optimization.
5. How big data extends traditional small data
(1) The difference between big data and small data starts with a change of mindset: abandon the pursuit of causation and focus instead on correlation. That is, it is enough to know "what" without needing to know "why". This overturns centuries of established habits of thought and poses new challenges to human cognition and to the way we make sense of the world.
(2) Another important difference lies in how the data are used. In the past, data largely stayed in the past: talking about the data really meant using past data to describe the past. The core of big data, by contrast, is prediction. Big data creates unprecedented quantifiable dimensions for human life and turns data from a description of the past into a driver of the present. Prediction helps an enterprise in two directions:
A. At the macro level, forecasting trends and analysing the enterprise's overall potential;
B. At the micro level, analysing individuals precisely so the enterprise can do personalized, targeted marketing.
(3) In terms of structure, big data is embodied more in the integration of massive unstructured data itself with the methods for processing it.
Principles for distinguishing big data from small data:
A. The amount of data
B. Types and formats of data
C. Speed of data processing
D. Complexity of data
(4) The basis of analysis differs. Big data analysis can only be done on the basis of large-scale data, and this requires a process in which quantitative change produces qualitative change. Technological innovation laid the methodological foundation, and the new ways of living and working built on the Internet allowed information to accumulate to the point where that change became possible; many analyses can only be completed on that basis.

6. Clarify the purpose of data analysis
The key to data analysis is to set goals, professionally called being "targeted". A clear objective is the premise of data analysis, and grasping the purpose of the analysis is the key to its success. Only with a deep understanding of the purpose can we build a complete analysis framework and train of thought, because different analysis objectives call for different analysis methods.
7. The process of data analysis
Define the purpose and content of the analysis → data collection → data preprocessing → data analysis → data presentation → report writing
8. Differences and links between statistical analysis and data mining
Links: both are rooted in basic statistical theory, and data mining frequently uses statistical analysis methods such as principal component analysis and regression analysis.
Differences: data mining is an extension and development of statistical analysis methods. Statistical analysis usually starts from an assumption or judgment and then uses data analysis techniques to verify whether that assumption holds. Data mining does not need any prior assumptions about the intrinsic relationships in the data; instead, the algorithms in the data mining tool automatically discover the relationships and laws hidden in the data. In prediction, statistical analysis usually produces an explicit functional relationship, whereas data mining sometimes does not yield a definite functional form from its results, does not indicate which variables matter, and lacks interpretability, as with neural networks. In practice, statistical analysis and data mining cannot be separated.
9. CRISP-DM
CRISP-DM (Cross-Industry Standard Process for Data Mining) is the cross-industry standard process for data mining. The CRISP-DM model provides a complete process description for a KDD project, dividing it into six distinct phases whose order is not completely fixed. It is a methodology for organizing data mining projects.
10. SEMMA
SEMMA is SAS's methodology for implementing data mining projects. It expands the data preparation and modeling steps of the CRISP-DM method.
Sample ─ data sampling
Explore ─ data exploration, analysis and processing
Modify ─ problem clarification, data adjustment and technique selection
Model ─ model development and knowledge discovery
Assess ─ comprehensive interpretation and evaluation of the model and knowledge

11. Roles and responsibilities of different people in data analysis
A large data analysis project involves industry and academic experts, business experts, data analysts and IT staff. Business experts provide the business objectives and business understanding as well as current marketing and feedback information; academic experts provide the latest research results in the relevant field and dimensions for analysis; data analysts are responsible for data understanding, cleaning and modeling; and IT staff provide data support and support for project implementation.


II. Descriptive statistical analysis
1. Measurement scales of data
Nominal scale, ordinal scale, interval scale and ratio scale.


2. Central tendency of data
Central tendency in statistics refers to the degree to which a set of data clusters around a central value; it reflects where the center of the data lies. A measure of central tendency looks for a representative or central value of the data.
Common indicators: mean, median, mode.
The mean is sensitive to extreme values; the median and mode are not affected by extreme values.
3. Dispersion of data
Dispersion in statistics refers to the degree to which a set of data is spread away from a central value; it reflects how far the individual values lie from the center, and it shows, from the other side, how representative the measure of central tendency is.
Common indicators: range, quartile deviation, mean absolute deviation, variance, standard deviation, coefficient of variation.
Range = maximum − minimum
Quartile deviation = (third quartile − first quartile) / 2
Mean absolute deviation
Variance and standard deviation (for roughly normal data, about 68% of values fall within one standard deviation of the mean, about 95% within two, and the remaining 5% farther out)
Coefficient of variation (used to compare dispersion between samples of different scales: the smaller the coefficient of variation, the more representative the mean).
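As an illustration of the central tendency and dispersion indicators above, here is a minimal Python sketch using only the standard library's statistics module; the data values are invented for demonstration:

```python
import statistics as st

data = [12, 15, 15, 18, 20, 22, 25, 30, 35, 60]     # illustrative values only

# Central tendency
mean = st.mean(data)
median = st.median(data)
mode = st.mode(data)                                 # most frequent value

# Dispersion
data_range = max(data) - min(data)                   # range = maximum - minimum
q1, q2, q3 = st.quantiles(data, n=4)                 # quartiles
quartile_deviation = (q3 - q1) / 2                   # (third quartile - first quartile) / 2
mad = sum(abs(x - mean) for x in data) / len(data)   # mean absolute deviation
variance = st.pvariance(data)                        # population variance
std_dev = st.pstdev(data)                            # population standard deviation
cv = std_dev / mean                                  # coefficient of variation

print(mean, median, mode)
print(data_range, quartile_deviation, mad, variance, std_dev, cv)
```

The last value printed, the coefficient of variation, is the one to compare across samples measured on different scales.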
4. Patterns of data distribution

The shape formed when a set or series of values is plotted on a chart; for example, the normal distribution.

The shape of a data distribution is measured mainly with the normal distribution as the reference.

Indicators: skewness, kurtosis

(1) Skewness (asymmetry of the data distribution)
Positive skew: mean > median > mode. Negative skew: mean < median < mode.
Skewness can be calculated in several ways; in Excel it is given by the SKEW function.
SK = 0: the distribution is symmetrical.
SK > 0: positive skew; the larger the value, the greater the degree of positive skew.
SK < 0: negative skew; the smaller the value, the greater the degree of negative skew.
(2) Kurtosis
The kurtosis coefficient; in Excel it is given by the KURT function.
K = 0: standard (normal) kurtosis.
K < 0: platykurtic (flat-topped) distribution.
K > 0: leptokurtic (sharp-peaked) distribution.
(3) For moderately skewed data, the distance between the median and the mean is approximately one third of the distance between the mode and the mean, i.e. mean − mode ≈ 3 × (mean − median); knowing any two of the three, the third can be deduced.
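A small sketch of the skewness and kurtosis indicators using scipy (its kurtosis function reports excess kurtosis, matching the K = 0 convention above); the sample values are made up:

```python
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 9, 14])   # illustrative, right-skewed values

sk = stats.skew(data)        # SK > 0 indicates positive (right) skew
k = stats.kurtosis(data)     # excess kurtosis: 0 for a normal distribution
print(f"skewness = {sk:.3f}, excess kurtosis = {k:.3f}")

# For moderately skewed data: mean - mode is roughly 3 * (mean - median)
mean, median = data.mean(), np.median(data)
print("mean - median =", mean - median)
```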
5. Common charts: bar chart, pie chart, line chart, box plot, histogram, stem-and-leaf plot.



III. Sampling estimation
1. Concepts of random experiment, random event and random variable
Random experiment: an experiment that can be repeated under the same conditions, whose possible outcomes are all known in advance, but whose actual outcome cannot be predicted before the experiment.
Random event: a set made up of some of the basic outcomes of a random phenomenon.
Random variable: a variable used to represent the outcome of a random phenomenon.
2. Concepts of population and sample
Population: the whole of the objects under study is called the population.

Sample: in general, n individuals drawn from the population according to a certain rule for observation or experiment; these n individuals are called a sample of the population.

3. The theoretical basis of sampling estimation
Sampling estimation uses the information obtained from a sample survey and, according to the laws of random variables revealed by probability theory, applies statistical methods to estimate certain quantitative characteristics of the population. Sampling estimation rests on the law of large numbers and the central limit theorem. The law of large numbers shows that the sample mean tends toward the population mean; the central limit theorem describes the probability that the difference between the sample mean and the population mean falls within a given range.
4. Normal distribution and the three major derived distributions
(1) Normal distribution

Characteristics of the normal distribution:
A. The normal distribution has two parameters, the mean μ and the standard deviation σ, and is written N(μ, σ²). The mean μ determines the central position of the normal curve, and the standard deviation σ determines how steep or flat the curve is: the smaller σ is, the steeper the curve; the larger σ is, the flatter the curve.
B. Standardization (the u-transform): to simplify description and application, a normal variable is often transformed as u = (X − μ)/σ, so that u follows the standard normal distribution N(0, 1). μ is the location parameter of the normal distribution and describes where its central tendency lies; the distribution is completely symmetrical about the axis x = μ, and the mean, median and mode of a normal distribution are all equal to μ.
C. σ describes the dispersion of a normal distribution: the larger σ, the more spread out the data; the smaller σ, the more concentrated the data. σ is also called the shape parameter of the normal distribution: the larger σ, the flatter the curve; conversely, the smaller σ, the taller the curve.
D. The 3σ rule: P(μ − σ < X ≤ μ + σ) = 68.3%, P(μ − 2σ < X ≤ μ + 2σ) = 95.4%, P(μ − 3σ < X ≤ μ + 3σ) = 99.7%.
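A brief sketch (assuming scipy is available; μ and σ are arbitrary example values) that applies the standardization u = (X − μ)/σ and checks the 3σ probabilities listed above:

```python
from scipy.stats import norm

mu, sigma = 100.0, 15.0                 # example parameters of N(mu, sigma^2)

x = 130.0
u = (x - mu) / sigma                    # standardization: u follows N(0, 1)
print("u =", u)

# 3-sigma rule: P(mu - k*sigma < X <= mu + k*sigma) for k = 1, 2, 3
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(f"P(|X - mu| <= {k} sigma) = {p:.1%}")    # about 68.3%, 95.4%, 99.7%
```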
(2) Chi-square distribution
The chi-square distribution is derived from the normal distribution: if several independent random variables each follow the standard normal distribution, the sum of their squares follows a chi-square distribution, with degrees of freedom equal to the number of variables.


(3) t distribution
If X follows the standard normal distribution, Y follows a chi-square distribution with n degrees of freedom, and X and Y are independent, then X / √(Y/n) follows a t distribution with n degrees of freedom.

(4) F distribution
If X and Y are independent chi-square variables with m and n degrees of freedom respectively, then (X/m) / (Y/n) follows an F distribution with (m, n) degrees of freedom.

Uses of the three major distributions:
Chi-square distribution: often used for goodness-of-fit tests.
F distribution: mostly used in analysis of variance, covariance analysis and regression analysis.
t distribution: used when information is insufficient, for example when the population variance is unknown; the t statistic is commonly used for estimating and testing the population mean.
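To make the three derived distributions concrete, here is a sketch of looking up critical values for each with scipy.stats; the degrees of freedom and significance levels are arbitrary illustrative choices:

```python
from scipy.stats import chi2, t, f

# Chi-square: critical value for a goodness-of-fit test with 5 degrees of freedom, alpha = 0.05
chi2_crit = chi2.ppf(0.95, 5)

# F: critical value used in an analysis of variance with (3, 20) degrees of freedom, alpha = 0.05
f_crit = f.ppf(0.95, 3, 20)

# t: two-sided critical value for testing a mean with n - 1 = 9 degrees of freedom, alpha = 0.05
t_crit = t.ppf(0.975, 9)

print(chi2_crit, f_crit, t_crit)
```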
5. Forms of sample organization: simple random sampling, stratified sampling, cluster sampling, multistage sampling.
6. Determining the necessary sample size
The necessary sample size is the minimum number of sample units that must be drawn so that the sampling error does not exceed the given allowable error. If the sample is too large, the error will be smaller but the workload of the survey will increase, costing time and effort and losing the advantage of sampling; if the sample is too small, the error becomes large and the sampling survey loses its meaning. An appropriate sample size should therefore be chosen.
7. Factors affecting the necessary sample size: the population variance (standard deviation σ), the allowable error range, the confidence level (1 − α), and the sampling method and form of sample organization.
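A hedged sketch of the usual sample size formula for estimating a mean, n = (z(α/2) · σ / E)², where E is the allowable error; the σ, E and confidence values below are assumptions chosen only for illustration:

```python
import math
from scipy.stats import norm

sigma = 12.0        # assumed population standard deviation
E = 2.0             # allowable (maximum) sampling error
confidence = 0.95   # confidence level 1 - alpha

z = norm.ppf(1 - (1 - confidence) / 2)    # two-sided z value for the confidence level
n = math.ceil((z * sigma / E) ** 2)       # round up to the next whole unit

print("necessary sample size:", n)        # larger sigma, higher confidence or smaller E -> larger n
```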
8. Sampling mean error
The sampling mean error is the standard deviation of the sampling mean (the standard error). It reflects the average degree of deviation between the sample mean and the population mean, and equals the standard deviation of the means of all possible samples drawn from the population.

9. Characteristics, advantages and disadvantages of point estimation and interval estimation
Parameter estimation uses a sample to infer the unknown parameters of the population distribution, or to estimate some function of those parameters. There are two basic types: point estimation and interval estimation.
Point estimation: estimates the parameter with a single value. Characteristics: a sample statistic is designed according to the structure of the population parameter, and the observed value of that statistic is taken directly as the estimate of the population parameter. Advantages: simple and intuitive in principle. Disadvantages: it gives no indication of the estimation error, nor of the probability that the error stays within a given range.
Interval estimation: an interval estimate must have three elements: the estimated value, the range of sampling error, and the degree of probability guarantee. Characteristics: instead of giving the population parameter directly, it estimates upper and lower limits, i.e. a range within which the parameter lies, together with a stated probability guarantee. Advantages: both accuracy and reliability are made explicit. Disadvantages: accuracy and reliability conflict; for a given sample, widening the confidence interval raises the reliability but lowers the accuracy.
10. Interval estimation of the population mean and the population proportion
Interval estimation of the population mean: when the population standard deviation σ is known (or the sample is large), the confidence interval is x̄ ± z(α/2) · σ/√n; when σ is unknown and the sample is small, it is x̄ ± t(α/2, n−1) · s/√n.

Interval estimation of the population proportion: for a large sample, p̂ ± z(α/2) · √(p̂(1 − p̂)/n).
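A minimal sketch of both interval estimates under the formulas above; the sample size, mean, standard deviation and proportion are invented for illustration:

```python
import math
from scipy.stats import norm, t

alpha = 0.05

# Interval estimate of a population mean, sigma unknown, small sample (t distribution)
n, x_bar, s = 25, 52.3, 6.1
t_val = t.ppf(1 - alpha / 2, n - 1)
half_width = t_val * s / math.sqrt(n)
print("mean CI:", (x_bar - half_width, x_bar + half_width))

# Interval estimate of a population proportion, large sample (z distribution)
n_p, p_hat = 400, 0.32
z_val = norm.ppf(1 - alpha / 2)
half_width_p = z_val * math.sqrt(p_hat * (1 - p_hat) / n_p)
print("proportion CI:", (p_hat - half_width_p, p_hat + half_width_p))
```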


11. Significance and applications of the central limit theorem
The core of the central limit theorem is that, as long as n is large enough, a normalized sum of independent, identically distributed random variables can be treated as a normal variable. This makes it useful for solving many practical problems, and it also helps explain why the frequency distributions of many natural populations show a bell-shaped curve; for this reason the normal distribution became the most important distribution in probability theory, which is the primary merit of the theorem. Second, the central limit theorem plays an important role in other areas, such as parameter (interval) estimation, hypothesis testing and sampling surveys. Further, it paves the way for applying mathematical statistics: the key to inferring a population from a sample is knowing the sampling distribution of the sample statistic, and the central limit theorem shows that, as long as the sample size is large enough, the sample mean of an unknown population is approximately normally distributed. Thus, provided enough random sample data can be obtained by repeated observation, almost all such problems can be handled with mathematical statistics, which in turn opened up the field of statistical methods that dominates modern inferential statistics.
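A small simulation sketch of the point above: sample means drawn from a clearly non-normal population behave approximately normally once the sample size is large enough (the exponential population and the sizes used here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # strongly skewed, non-normal population

n, repeats = 50, 2_000
sample_means = [rng.choice(population, size=n).mean() for _ in range(repeats)]

# The sampling distribution of the mean should be close to N(mu, sigma^2 / n)
print("population mean:             ", population.mean())
print("mean of sample means:        ", np.mean(sample_means))
print("theoretical standard error:  ", population.std() / np.sqrt(n))
print("observed std of sample means:", np.std(sample_means))
```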

12. Number of possible samples
The number of possible samples refers to the total number of distinct samples that can be drawn from the population. It depends on the method of drawing and on the sample size: sampling with replacement versus sampling without replacement.

IV. Hypothesis testing
1. Basic concept and basic idea of hypothesis testing
Hypothesis testing draws conclusions about a population parameter by testing a sample statistic; the purpose is to judge whether the difference between the sample statistic and the hypothesized population parameter is significant.
Basic ideas: (1) proof by contradiction; (2) the small-probability principle. An assumption is made about the population parameter; then, using the statistics computed from the sample data and reasoning by contradiction, we check whether a small-probability event has occurred in a single sample. If it has, we reject the hypothesis about the population parameter.
2. The role of hypothesis testing in data analysis
When the population is unknown, historical experience is used to make a conjecture about the population, and sample statistics are then used to test that conjecture. The principles and methods of hypothesis testing are one of the cornerstones of data analysis.
3. Basic steps of hypothesis testing: (1) state the null hypothesis; (2) choose the test statistic; (3) determine the rejection region; (4) compute the value of the statistic from the sample, compare it with the critical value, and make a decision.
4. Relationship between hypothesis testing and interval estimation
Hypothesis testing uses sample data to test an assumption about the population, while interval estimation uses sample data to estimate a population parameter; the two are essentially the same. At the same significance level, the results of hypothesis testing and interval estimation are consistent.
5. Two types of errors in hypothesis testing
(1) Type I error: rejecting the null hypothesis when it is actually true.
(2) Type II error: accepting the null hypothesis when it is actually false.
When the significance level α is fixed in advance, the probability of committing a Type I error does not exceed α. For a given sample size, the probabilities of the two types of errors move in opposite directions. In general, the probability of a Type I error is controlled first; common values of α are 0.01, 0.05 and 0.1.
6. Hypothesis testing using the P-value
(1) Meaning of the P-value
The P-value is the probability, assuming the null hypothesis is true, of obtaining the observed sample result or one more extreme. If the P-value is very small, it means that such a result would be very unlikely if the null hypothesis were true; since it has nevertheless occurred, the small-probability principle gives us reason to reject the null hypothesis, and the smaller the P-value, the stronger the grounds for rejection. In short, the smaller the P-value, the more significant the result. Whether a result should be called "significant", "moderately significant" or "highly significant" must be judged from the size of the P-value together with the practical problem at hand.

(2) Calculating the P-value
In general, let X denote the test statistic. When H0 is true, the observed value C of the statistic can be computed from the sample data, and the P-value is then obtained from the specific distribution of the test statistic X. Specifically:
For a left-tailed test, the P-value is the probability that the test statistic X is less than the observed value C: p = P{X < C}.
For a right-tailed test, the P-value is the probability that the test statistic X is greater than the observed value C: p = P{X > C}.
For a two-tailed test, the P-value is twice the probability that the test statistic X falls in the tail beyond the observed value C: p = 2P{X > C} (when C lies in the right tail of the distribution) or p = 2P{X < C} (when C lies in the left tail). If X follows a normal or t distribution, the density curve is symmetric about the vertical axis, so the P-value can be written p = P{|X| > |C|}.
(3) Making a decision with the P-value
Once the P-value has been computed, the test result is obtained by comparing it with the given significance level α:
If α > P-value, the null hypothesis is rejected at significance level α.
If α ≤ P-value, the null hypothesis is accepted at significance level α.
In practice, when α equals the P-value, i.e. the observed statistic C is exactly at the critical value, it is prudent to increase the sample size and test again with a new sample.
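As a sketch of these rules for a two-sided test whose statistic follows the standard normal distribution under H0 (the observed value C and α below are made up):

```python
from scipy.stats import norm

alpha = 0.05
c = 2.31                                  # illustrative observed value of the test statistic

# Two-sided P-value: p = P{|X| > |C|} = 2 * P{X > |C|}
p_value = 2 * (1 - norm.cdf(abs(c)))

if alpha > p_value:
    print(f"p = {p_value:.4f}: reject the null hypothesis at level {alpha}")
else:
    print(f"p = {p_value:.4f}: do not reject the null hypothesis at level {alpha}")
```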

7. The Z test statistic
The Z test is also called the u test. When the null hypothesis is true, the test statistic follows the standard normal distribution. It is generally used for large samples (n > 30): (1) testing the mean of a single normal population; (2) testing the difference between the means of two normal populations.
Applicable conditions:
(1) the hypothesized population mean is known;
(2) the sample mean and the standard error of the sample mean can be obtained;
(3) the sample comes from a normal or approximately normal population.
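A minimal sketch of a one-sample Z test of a mean under these conditions; the hypothesized mean, known σ and sample figures are assumptions for illustration:

```python
import math
from scipy.stats import norm

mu0 = 500.0             # hypothesized population mean (H0: mu = mu0)
sigma = 20.0            # known population standard deviation
n, x_bar = 64, 506.5    # illustrative sample size and sample mean
alpha = 0.05

z = (x_bar - mu0) / (sigma / math.sqrt(n))   # Z test statistic
p_value = 2 * (1 - norm.cdf(abs(z)))         # two-sided P-value

print(f"z = {z:.3f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```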




8. The t test statistic
When the null hypothesis is true, the test statistic follows a t distribution; it is typically used when the population standard deviation is unknown.
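A corresponding sketch of a one-sample t test, used when the population standard deviation is unknown, with scipy's ttest_1samp; the sample values are invented:

```python
import numpy as np
from scipy import stats

sample = np.array([9.8, 10.2, 10.4, 9.9, 10.6, 10.1, 10.3, 9.7])   # illustrative measurements
mu0 = 10.0                                  # hypothesized population mean

res = stats.ttest_1samp(sample, mu0)        # two-sided one-sample t test
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```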
