Data preprocessing
1. Data audit: check the data for errors
Raw data:
Completeness: whether any unit or individual that should have been surveyed is missing.
Accuracy: whether the data contain errors or abnormal values (outliers).
Outliers: if the value is a recording error, correct it; if it is a genuine value, keep it.
Second-hand data:
Applicability: identify the source, the statistical definitions (caliber), and the background material of the data to determine whether they meet the needs of the analysis.
Timeliness: for time-sensitive problems, data that lag too far behind are of little value to the research.
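A minimal sketch of the completeness and accuracy checks, using only the Python standard library. The sample values are made up, and the 1.5 * IQR rule is an assumed way of flagging candidate outliers; these notes do not prescribe a particular rule.

```python
import statistics

# Hypothetical survey records; None marks a missing observation.
ages = [23, 25, None, 24, 26, 22, 250, 27, None, 24]

# Completeness: how many observations are missing?
missing = sum(1 for v in ages if v is None)

# Accuracy: flag candidate outliers with the 1.5 * IQR rule (an assumed rule).
present = [v for v in ages if v is not None]
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < low or v > high]

print("missing:", missing)
print("candidate outliers:", outliers)
```

A flagged value still has to be reviewed by hand: correct it if it is a recording error, keep it if it is genuine.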
2. Data filtering
Tools: Excel and SPSS. Plenty of tutorials are available online, so the details are not repeated here.
3. Data sorting
1) Arrange the data in a certain order to reveal obvious features or trends;
2) Sorting also makes it easier to check and correct the data and to reclassify or group them.
Collation and display of qualitative data
After preprocessing, the data need to be further classified and grouped.
Qualitative data: categorical data and ordinal (sequential) data
1. Collation and display of categorical data
Categorical data: data that sort things into categories.
Collation: list the categories, calculate the frequency (count), the proportion, and the ratio for each category, and build a frequency distribution table.
Objective: to gain a preliminary understanding of the data and their characteristics.
Tools: Excel and SPSS for preliminary data analysis; these tools are already very capable.
Charts: bar chart, Pareto chart, pie chart, doughnut chart
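The collation step for categorical data can be sketched in a few lines of Python. The brand labels below are hypothetical sample data, and the table layout is just one reasonable choice.

```python
from collections import Counter

# Hypothetical categorical observations (e.g. preferred brand).
brands = ["A", "B", "A", "C", "B", "A", "D", "A", "C", "B"]

counts = Counter(brands)
total = sum(counts.values())

# Frequency distribution table: (category, frequency, proportion),
# sorted from the most to the least frequent category.
table = [(cat, n, n / total) for cat, n in counts.most_common()]

print(f"{'category':<10}{'frequency':>10}{'proportion':>12}")
for cat, n, share in table:
    print(f"{cat:<10}{n:>10}{share:>12.1%}")
```

Sorting the categories by descending frequency is also the order a Pareto chart plots them in.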
2. Collation and display of ordinal data
Ordinal (sequential) data: see the Baidu Encyclopedia entry:
https://baike.baidu.com/item/%E9%A1%BA%E5%BA%8F%E6%95%B0%E6%8D%AE/9210375?fr=aladdin
Collation: list the categories, calculate the frequency (count), the proportion, and the ratio for each category, build a frequency distribution table, and also calculate the cumulative frequency (or cumulative proportion).
Objective: to gain a preliminary understanding of the data and their characteristics.
Tools: Excel and SPSS for preliminary data analysis; these tools are already very capable.
Charts: bar chart, Pareto chart, pie chart, doughnut chart, and the cumulative frequency (or cumulative proportion) distribution chart.
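Because the categories of ordinal data have a meaningful order, frequencies can be accumulated along that order. A minimal sketch, assuming a hypothetical five-level satisfaction scale and made-up answers:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ordinal scale, listed from lowest to highest level.
levels = ["very dissatisfied", "dissatisfied", "neutral",
          "satisfied", "very satisfied"]
answers = ["neutral", "satisfied", "satisfied", "dissatisfied",
           "very satisfied", "neutral", "satisfied",
           "very dissatisfied", "neutral", "satisfied"]

counts = Counter(answers)
freqs = [counts[level] for level in levels]   # frequencies in scale order
cum_freqs = list(accumulate(freqs))           # upward cumulative frequency
total = cum_freqs[-1]
cum_props = [c / total for c in cum_freqs]    # cumulative proportion

for level, f, cf, cp in zip(levels, freqs, cum_freqs, cum_props):
    print(f"{level:<18}{f:>4}{cf:>4}{cp:>8.0%}")
```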
Collation and display of numerical data
The collation and charting methods for qualitative data also apply to numerical data, which in addition has some methods of its own.
1. Data grouping: observe the distribution characteristics of the data
Single-variable-value grouping: suitable for discrete variables with few distinct values.
Class-interval grouping: suitable for continuous variables, or variables with many distinct values.
Example: the grouping method and the steps for building the table
Step 1: Determine the number of groups. Grouping is mainly meant to reveal the characteristics of the data, so the number of groups depends on the data themselves.
Step 2: Determine the class width of each group. Class width = upper limit of the group - lower limit of the group. The width can be chosen as (maximum of all data - minimum of all data) / number of groups.
Step 3: Build the frequency distribution table according to the grouping.
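The three steps can be sketched as follows. The sample values, the choice of 4 groups, rounding the width up to a multiple of 5, and treating each group as "lower limit included, upper limit excluded" are all assumptions made for this illustration.

```python
import math

# Hypothetical continuous observations (e.g. daily sales).
data = [141, 159, 166, 172, 177, 182, 188, 196, 203, 214,
        143, 160, 167, 173, 177, 183, 189, 196, 203, 215]

# Step 1: choose the number of groups (assumed for this sample).
k = 4

# Step 2: class width = (max - min) / k, rounded up to a multiple of 5.
lo, hi = min(data), max(data)
width = math.ceil((hi - lo) / k / 5) * 5
start = (lo // 5) * 5          # round the first lower limit down

# Step 3: count the observations in each group [lower, upper).
counts = [0] * k
for x in data:
    idx = min((x - start) // width, k - 1)   # guard: keep the maximum in the last group
    counts[idx] += 1

for i, n in enumerate(counts):
    lower = start + i * width
    print(f"{lower}-{lower + width}: {n}")
```

Rounding the width and the first lower limit to convenient values keeps the table readable; the exact quotient from Step 2 is only a starting point.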
2. Display of numerical data
Grouped data: histogram
Ungrouped data: stem-and-leaf plot, box plot
Stem-and-leaf plot: shows the shape of the distribution of the original data and its dispersion (whether it is symmetric, how concentrated it is, whether there are outliers).
Tools: Excel and SPSS are both very convenient
Box plot: drawn from five values of a data set: the maximum, the minimum, the median, and the two quartiles.
Time series data, line chart: shows how the data change over time.
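For readers without Excel or SPSS at hand, a text stem-and-leaf plot is easy to sketch in Python. The scores are hypothetical two-digit values; the tens digit is used as the stem and the units digit as the leaf.

```python
from collections import defaultdict

# Hypothetical two-digit observations.
scores = [52, 67, 71, 58, 63, 75, 79, 64, 88, 61, 93, 70]

# Stem = tens digit, leaf = units digit; sorting first keeps leaves ordered.
stems = defaultdict(list)
for x in sorted(scores):
    stems[x // 10].append(x % 10)

# One text line per stem, including stems that have no leaves.
lines = [f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}"
         for stem in range(min(stems), max(stems) + 1)]
print("\n".join(lines))
```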
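The five values behind a box plot can be computed with the standard library. This is a sketch on hypothetical data; quartile conventions differ between tools, and the "inclusive" method used here is only one of them, so Excel or SPSS may report slightly different quartiles.

```python
import statistics

# Hypothetical observations.
values = [52, 58, 61, 63, 64, 67, 70, 71, 75, 79, 88, 93]

# Lower quartile, median, upper quartile (inclusive interpolation method).
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five = (min(values), q1, median, q3, max(values))
print(five)
```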
Charts for multivariate data: scatter plot, bubble chart, radar chart
Scatter plot: shows the relationship between two variables
Bubble chart: shows the relationship among three variables
Radar chart: shows the relationships among multiple variables
Tools: Excel and SPSS are both very convenient
Data mining: statistical analysis (I: data collation and representation)