The encyclopedia gives a fairly accurate definition of data analysis: it is the process of analyzing large amounts of collected data with appropriate statistical methods, extracting useful information, forming conclusions, and then studying and summarizing the data in detail. In short, it is the purposeful collection and analysis of data in order to turn it into information.
Types of data analysis
1. Exploratory data analysis
The data initially obtained is messy; organizing it into charts and tables reveals the relationships between variables.
2. Model selection analysis
Exploratory data analysis suggests one or more candidate data models; these models are then re-examined and refined through further analysis.
3. Inference analysis
Mathematical statistics are generally used to make inferences about the reliability and precision of the chosen model or estimate.
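To make the three stages concrete, here is a minimal sketch in Python; the synthetic data and variable names are illustrative assumptions, not part of the original text. It walks through exploratory plotting, fitting a simple linear model, and a basic inferential check of that model.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Synthetic data standing in for "messy" collected data (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.5 * x + rng.normal(0, 3, 100)

# 1. Exploratory analysis: chart the data to look for a relationship.
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Exploratory scatter plot")
plt.show()

# 2. Model selection: the scatter suggests a linear model, so fit one.
result = stats.linregress(x, y)
print(f"slope={result.slope:.2f}, intercept={result.intercept:.2f}")

# 3. Inference: assess the reliability of the fitted model (p-value, R^2).
print(f"p-value={result.pvalue:.3g}, R^2={result.rvalue**2:.2f}")
```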
Data analysis process
1. Identify information needs
Identifying information needs is the primary precondition for an effective data analysis process: it gives data collection and analysis a clear goal.
2. Data collection
Understanding data collection means understanding the original form of the data: when, under what conditions, and in what format it is generated, along with its content, length, and constraints. This knowledge helps data analysts control the production and collection process in a more targeted way and avoid problems caused by violating collection rules; it also deepens the analyst's understanding of the data, especially of abnormal changes in it.
In the data collection stage, data analysts need to pay close attention to anomalies in how data is produced and collected; doing so largely avoids the problem of "garbage in, garbage out".
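As a concrete illustration of catching collection-rule violations early, the sketch below checks incoming records against expected format and range constraints before they enter the pipeline. The field names and rules are hypothetical assumptions chosen only for illustration.

```python
from datetime import datetime

# Hypothetical collection rules: required fields, formats, and value ranges.
REQUIRED_FIELDS = {"user_id", "event_time", "amount"}

def validate_record(record: dict) -> list[str]:
    """Return a list of rule violations for one collected record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "event_time" in record:
        try:
            datetime.fromisoformat(record["event_time"])
        except ValueError:
            problems.append("event_time is not ISO-8601")
    if "amount" in record and not (0 <= record["amount"] <= 1_000_000):
        problems.append("amount outside expected range")
    return problems

# Example: one well-formed record and one that breaks the rules.
print(validate_record({"user_id": 1, "event_time": "2024-01-01T10:00:00", "amount": 42}))
print(validate_record({"user_id": 2, "event_time": "yesterday", "amount": -5}))
```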
3. Data storage
In the data storage stage, data analysts need to understand the internal mechanisms and workflow of data storage, above all what processing is applied to the raw data and what data is ultimately produced. Because data changes dynamically and is updated iteratively while in storage, its timeliness, completeness, validity, consistency, and accuracy can be compromised by software, hardware, or environmental problems, which in turn causes problems in later data applications.
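A minimal sketch of such checks, assuming a pandas DataFrame and hypothetical column names, might monitor the completeness, consistency, and timeliness of stored data:

```python
import pandas as pd

# Hypothetical stored table; in practice this would be read from the warehouse.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, None, 25.0, -3.0],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02", "2023-01-01"]),
})

# Completeness: share of missing values per column.
print(df.isna().mean())

# Consistency: duplicated keys and values that violate business rules.
print("duplicate order_ids:", df["order_id"].duplicated().sum())
print("negative amounts:", (df["amount"] < 0).sum())

# Timeliness: how stale the most recent update is.
print("days since last update:", (pd.Timestamp("2024-01-10") - df["updated_at"].max()).days)
```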
4. Data extraction
Data extraction is the process of pulling data out of storage; its core questions are where, when, and how to extract.
In the data extraction stage, data analysts first need basic extraction skills. The common SELECT ... FROM statement is a necessary SQL skill for querying and extraction, but even this seemingly simple work has several levels, as the sketch after this list illustrates:
The first level is the ability to extract data conditionally from a single table; WHERE is the basic conditional clause.
The second level is the ability to extract data across database tables; different JOINs have different uses.
The third level is optimizing SQL statements: reducing wasted analyst time and system resource consumption by optimizing nesting, the order of filtering logic, the number of scans, and so on.
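Here is a minimal sketch of the three levels using Python's built-in sqlite3 module; the orders and customers tables and their schema are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 15.0), (12, 1, 40.0);
""")

# Level 1: conditional extraction from a single table (WHERE).
rows = conn.execute("SELECT id, amount FROM orders WHERE amount > 30").fetchall()

# Level 2: extraction across tables (JOIN).
rows = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
""").fetchall()

# Level 3: optimization, e.g. filter early and index the join key
# to reduce scanned rows instead of piling up nested subqueries.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
print(rows)
```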
5. Data mining
Data mining is the key to extracting value when faced with massive amounts of data. The basic principles of algorithm selection are:
There is no best algorithm, only the most suitable one; algorithm selection should balance accuracy, operability, understandability, and applicability.
No single algorithm can solve every problem, but being proficient in one algorithm can solve many problems.
The hardest part of working with mining algorithms is tuning: the same algorithm needs different parameter settings in different scenarios, and practice is the main way to build tuning experience.
In the data mining stage, data analysts need the relevant skills: first, the basic principles and common sense of data mining, statistics, and mathematics; second, proficiency with at least one data mining tool, such as Clementine, SAS, or R, or, for those with a programming background, writing the code directly; third, familiarity with the commonly used mining algorithms and with the application scenarios and differences of each.
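As a minimal sketch of the "one well-mastered algorithm, tuned per scenario" point, the example below uses scikit-learn (an assumption on my part; the text itself names Clementine, SAS, and R) to fit one widely applicable algorithm and tune its parameters by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A bundled dataset stands in for business data (illustrative only).
X, y = load_breast_cancer(return_X_y=True)

# One algorithm, tuned per scenario: the same model family needs
# different parameter settings on different data.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [3, None]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```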
6. Data analysis
Analyzing data means processing, organizing, and analyzing the collected data to turn it into information. Commonly used methods include:
The old seven tools, namely the Pareto chart, cause-and-effect diagram, stratification, check sheet, scatter diagram, histogram, and control chart (a Pareto chart sketch appears at the end of this subsection);
The new seven tools, namely the relations diagram, systematic (tree) diagram, matrix diagram, KJ (affinity) method, program evaluation and review technique, PDPC method, and matrix data diagram;
Compared with data mining, data analysis leans more toward business application and interpretation. After a mining algorithm reaches a conclusion, the key to driving business operations is explaining what the results, their credibility, and their significance actually mean for the business, and feeding the mining results back in a way the business can understand and act on.
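As one example from the old seven tools, a Pareto chart can be drawn with matplotlib; the defect categories and counts below are made-up illustrative data, not results from the text.

```python
import matplotlib.pyplot as plt

# Hypothetical defect counts by category, sorted in descending order.
categories = ["scratch", "misalignment", "crack", "stain", "other"]
counts = [120, 80, 35, 20, 10]

cumulative = [sum(counts[: i + 1]) / sum(counts) * 100 for i in range(len(counts))]

fig, ax1 = plt.subplots()
ax1.bar(categories, counts)                      # frequency of each cause
ax1.set_ylabel("count")

ax2 = ax1.twinx()                                # cumulative-percentage line
ax2.plot(categories, cumulative, marker="o", color="tab:red")
ax2.set_ylabel("cumulative %")
ax2.set_ylim(0, 110)

plt.title("Pareto chart of defect causes")
plt.show()
```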
7. Data visualization
Raw numbers are hard to read at a glance, not just for ordinary people but for data analysts themselves; this is where the power of data visualization comes in. Apart from advanced analysis such as data mining, one of the routine jobs of many data analysts is monitoring and observing data.
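A minimal monitoring sketch with pandas and matplotlib (the daily metric below is synthetic, invented for illustration) shows how a chart reveals a change that a table of raw numbers would hide:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily metric with a level shift in the last week (illustrative).
dates = pd.date_range("2024-01-01", periods=30, freq="D")
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(100, 5, 23), rng.normal(80, 5, 7)])
series = pd.Series(values, index=dates)

ax = series.plot(marker="o")
ax.axhline(series.iloc[:23].mean(), linestyle="--", label="baseline mean")
ax.set_ylabel("daily metric")
ax.legend()
plt.title("Daily metric monitoring")
plt.show()
```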
8. Data application
Data application is a direct manifestation of the value of data. This process requires data analysts to have data communication capabilities, business promotion capabilities, and project work capabilities.
Data communication ability. In-depth data reports and concise data conclusions make it easier for the business to understand and accept the findings.
Business promotion ability. Building on the business's understanding of the data, the analyst pushes for implementation so that data-based recommendations are actually put into practice.
Project work ability. Data projects advance step by step; whether it is a data analysis project or a data product project, the analyst needs to be able to plan, lead, organize, and control the work.
Appendix: common methods of data analysis
1. Descriptive statistical analysis
This includes describing the basic characteristics of the sample and analyzing the frequency distribution and percentages of each variable to understand how the sample is distributed.
In addition, means and standard deviations are used to describe dimensions such as market orientation, competitive advantage, and organizational performance, in order to understand how managers in the sampled companies perceive these variables; t-tests and correlation analysis are then used to test the influence of background variables.
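A minimal sketch with pandas and SciPy, using made-up survey-style data (the column names and scale are assumptions for illustration), covers the descriptive statistics and a t-test on a background variable:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic survey responses (illustrative); scores on a 1-7 scale.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "firm_size": rng.choice(["small", "large"], size=200),
    "market_orientation": rng.normal(5.0, 1.0, 200).clip(1, 7),
})

# Frequencies/percentages and mean/standard deviation per group.
print(df["firm_size"].value_counts(normalize=True))
print(df.groupby("firm_size")["market_orientation"].agg(["mean", "std"]))

# Independent-samples t-test for the influence of the background variable.
small = df.loc[df["firm_size"] == "small", "market_orientation"]
large = df.loc[df["firm_size"] == "large", "market_orientation"]
print(stats.ttest_ind(small, large))
```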
2. Cronbach's α reliability analysis
Reliability refers to the consistency, stability, and dependability of test results; internal consistency is generally used to express a test's reliability. The higher the reliability coefficient, the more consistent, stable, and dependable the results.
Cronbach's α reliability analysis is conducted on the measurement items of each research variable to assess the internal consistency of the measurement dimensions. In general, a Cronbach's α above 0.7 indicates high reliability and below 0.35 indicates low reliability (Guilford, 1965), while 0.5 is the lowest acceptable level (Nunnally, 1978).
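Cronbach's α can be computed directly from the item scores using the standard formula α = k/(k−1) · (1 − Σ item variances / variance of the total score); the sketch below applies it to made-up data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scores for one scale."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Illustrative 5-item scale answered by 100 respondents.
rng = np.random.default_rng(3)
latent = rng.normal(0, 1, (100, 1))
scores = latent + rng.normal(0, 0.8, (100, 5))   # items share a common factor
print(round(cronbach_alpha(scores), 2))
```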
3. Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA)
These are used to test the convergent validity and discriminant validity of the items in each dimension, because reliability alone is not enough: a highly reliable measure may still be completely or partially invalid, so validity must also be tested.
Validity refers to whether an instrument measures what it was designed to measure. Convergent validity is judged from the loading of each item on the factor of the concept being measured; discriminant validity is tested by using factor analysis to compute the correlation coefficients between theoretically related concepts and checking whether the 95% confidence interval of each correlation contains 1.0. If it does not contain 1.0, discriminant validity is supported (Anderson, 1987).
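A brief EFA sketch, assuming the third-party factor_analyzer package is available (an assumption; any EFA-capable tool would serve), extracts factors from synthetic item responses and inspects the loadings used to judge convergent and discriminant validity:

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # assumed third-party package

# Synthetic responses to 6 items driven by 2 underlying concepts (illustrative).
rng = np.random.default_rng(4)
f1, f2 = rng.normal(0, 1, (2, 300))
data = pd.DataFrame({
    "a1": f1 + rng.normal(0, 0.5, 300), "a2": f1 + rng.normal(0, 0.5, 300),
    "a3": f1 + rng.normal(0, 0.5, 300), "b1": f2 + rng.normal(0, 0.5, 300),
    "b2": f2 + rng.normal(0, 0.5, 300), "b3": f2 + rng.normal(0, 0.5, 300),
})

# Exploratory factor analysis with varimax rotation.
fa = FactorAnalyzer(n_factors=2, rotation="varimax")
fa.fit(data)

# High loadings on the intended factor suggest convergent validity;
# low cross-loadings on the other factor suggest discriminant validity.
print(pd.DataFrame(fa.loadings_, index=data.columns))
```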
4. Structural equation modeling (SEM)
Structural equation modeling combines factor analysis and path analysis and incorporates the simultaneous equations of econometrics. It can handle multiple dependent variables at the same time, allows both independent and dependent variables to contain measurement error, estimates factor structures and the relationships among factors, permits more flexible measurement models, and estimates the fit of the entire model (Bollen and Long, 1993), making it suitable for testing the causal relationships in an overall model.
Model parameters are estimated by maximum likelihood (ML). Model fit is judged against preliminary fit criteria, overall model fit, and the fit of the model's internal structure (Bagozzi and Yi, 1988).
To evaluate overall model fit, this study uses the χ² (chi-square)/df (degrees of freedom) ratio, the goodness-of-fit index (GFI), the root mean square residual (RMSR), the root mean square error of approximation (RMSEA), and other indicators; the fit of the model's internal structure is examined against the criteria of Bagozzi and Yi (1988), checking whether the estimated parameters reach statistical significance.
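The overall-fit indicators named above follow standard formulas; for example, RMSEA can be computed from the model χ², its degrees of freedom, and the sample size, as in this small sketch (the numeric values are illustrative, not results from the study):

```python
import math

def rmsea(chi2: float, df: int, n: float) -> float:
    """Root mean square error of approximation from chi-square, df, and N."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Illustrative values for a fitted structural equation model.
chi2, df, n = 185.4, 87, 240

print("chi2/df:", round(chi2 / df, 2))         # a ratio below about 3 is often taken as acceptable
print("RMSEA:", round(rmsea(chi2, df, n), 3))  # values at or below 0.08 are often taken as acceptable
```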