# Data mining--statistical analysis (I: Data collation and representation)

Source: Internet
Author: User

Data preprocessing

1. Data audit: check the data for errors

Raw data:

Completeness: whether any of the units or items under investigation are missing.

Accuracy: whether the data contain errors or abnormal values.

Outliers: if an outlier is a recording error, correct it; if it is a genuine value, keep it.

Secondary data:

Applicability: identify the source, statistical definitions, and background of the data to determine whether they meet the needs of the analysis.

Timeliness: for time-sensitive problems, data that lag too far behind have little research value.
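The audit checks above can be sketched in a few lines of Python. This is only an illustration: the `audit` function, the sample `ages` list, and the 1.5 × IQR fence used to flag candidate outliers are assumptions, not something the original notes prescribe.

```python
from statistics import quantiles

def audit(values):
    """Return indices of missing entries and (index, value) pairs of candidate outliers."""
    # Completeness check: None marks a missing record in this sketch.
    missing = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    # Accuracy check: flag values outside the 1.5 * IQR fences (an assumed rule).
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [(i, v) for i, v in enumerate(values)
                if v is not None and not (low <= v <= high)]
    return missing, outliers

# Made-up sample: one missing entry and one value that looks like a typing error.
ages = [23, 25, None, 24, 26, 240, 22, 27]
missing, outliers = audit(ages)
print("missing at:", missing)
print("candidate outliers:", outliers)
```

A flagged value still has to be judged by hand: a recording error gets corrected, a genuine value stays.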

2. Data filtering

Tools: Excel or SPSS. Plenty of material on this is available online, so it is not repeated here.

3. Data sorting

1) Arranging the data in a chosen order helps reveal obvious features or trends;

2) It also makes it easier to correct, reclassify, and group the data.

Organizing and displaying qualitative data

After preprocessing, the data need to be further classified and grouped.

Qualitative data: categorical data and ordinal data

1. Organizing and charting categorical data

Categorical data: data that sort things into categories.

Organizing: list the categories, calculate the frequency (count) and the relative frequency, proportion, or ratio of each category, and build a frequency distribution table.

Objective: to get a preliminary understanding of the data and their characteristics.

Tools: Excel or SPSS for preliminary data analysis; both automate most of this work.

Charts: bar chart, Pareto chart, pie chart, doughnut chart
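As a rough sketch of the frequency distribution table described above, the following Python snippet counts each category and its proportion. The `frequency_table` helper and the `drinks` sample are made up for illustration.

```python
from collections import Counter

def frequency_table(categories):
    """Return (category, frequency, proportion) rows, most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    # most_common() sorts by descending count, the order a Pareto chart uses.
    return [(cat, n, n / total) for cat, n in counts.most_common()]

# Made-up survey answers.
drinks = ["tea", "coffee", "tea", "juice", "coffee", "tea", "water", "tea"]
table = frequency_table(drinks)
for cat, n, share in table:
    print(f"{cat:<8}{n:>3}{share:>8.1%}")
```

The rows can feed a bar or pie chart directly; sorting by descending frequency is what distinguishes a Pareto chart from a plain bar chart.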

2. Organizing and charting ordinal data

Ordinal data: categorical data whose categories have a natural order (see Baidu Encyclopedia for a fuller definition).

Organizing: list the categories, calculate the frequency (count) and the relative frequency, proportion, or ratio of each category, build the frequency distribution table, and calculate the cumulative frequency (or cumulative relative frequency).

Objective: to get a preliminary understanding of the data and their characteristics.

Tools: Excel or SPSS for preliminary data analysis; both automate most of this work.

Charts: bar chart, Pareto chart, pie chart, doughnut chart, and cumulative frequency (or cumulative relative frequency) distribution chart.
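Because ordinal categories have an order, cumulative frequencies are meaningful. The sketch below assumes a hypothetical four-level rating scale (`LEVELS`) and made-up responses; it is one possible way to build the table, not a prescribed method.

```python
from collections import Counter
from itertools import accumulate

# Assumed ordered scale, lowest to highest (hypothetical).
LEVELS = ["poor", "fair", "good", "excellent"]

def cumulative_table(responses, levels=LEVELS):
    """Return (level, frequency, cumulative frequency, cumulative proportion) rows."""
    counts = Counter(responses)
    freqs = [counts[level] for level in levels]   # keep the scale's order
    cum = list(accumulate(freqs))                 # running totals
    total = cum[-1]
    return [(level, f, c, c / total) for level, f, c in zip(levels, freqs, cum)]

ratings = ["good", "fair", "good", "excellent", "poor", "good", "fair", "excellent"]
cum_table = cumulative_table(ratings)
for level, f, c, share in cum_table:
    print(f"{level:<10}{f:>3}{c:>4}{share:>8.1%}")
```

Each row reads as "this many responses at or below this level", which is what a cumulative frequency chart plots.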

Organizing and displaying numerical data

Numerical data can use the organizing and charting methods for qualitative data, and there are also some methods specific to them.

1. Data grouping: used to observe the distribution of the data

Single-value grouping: each distinct value forms a group; suitable for discrete variables with few distinct values.

Class-interval grouping: values are grouped into intervals; suitable for continuous variables or variables with many distinct values.

Example: class-interval grouping and building the table

Step 1: Determine the number of groups. Grouping exists to show the features of the data, so the right number depends on the data themselves.

Step 2: Determine the class width of each group. Class width = upper limit of the group - lower limit of the group. A starting value for the width: (maximum of all data - minimum of all data) / number of groups.

Step 3: Build the frequency distribution table from the groups.
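The three steps can be sketched as follows. The `group_by_interval` helper, the sample `scores`, and the choices to round the width up to a whole number and to treat each interval as closed on the left are assumptions made for this illustration; it expects numeric data whose maximum exceeds its minimum.

```python
import math

def group_by_interval(data, k):
    """Return (lower limit, upper limit, frequency) for k equal-width groups."""
    lo, hi = min(data), max(data)
    # Step 2: width = (max - min) / number of groups, rounded up (an assumption).
    width = math.ceil((hi - lo) / k)
    counts = [0] * k
    for x in data:
        # Each group is [lower, upper); the maximum is clamped into the last group.
        idx = min(int((x - lo) // width), k - 1)
        counts[idx] += 1
    # Step 3: assemble the frequency distribution table.
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(k)]

# Step 1: choose the number of groups (5 here, for made-up exam scores).
scores = [52, 58, 61, 65, 67, 70, 72, 75, 78, 81, 84, 88, 91, 95, 99]
groups = group_by_interval(scores, k=5)
for lower, upper, n in groups:
    print(f"[{lower}, {upper}): {n}")
```

With these scores the width is ceil(47 / 5) = 10, so the first group runs from 52 up to, but not including, 62.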

2. Charting numerical data

Grouped data: histograms

Ungrouped data: stem-and-leaf plots, box plots

Stem-and-leaf plot: shows the shape and spread of the original data (whether the distribution is symmetric, how concentrated it is, and whether outliers exist).

Tools: Excel and SPSS both handle this conveniently.
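To make the idea concrete, here is a minimal text-based stem-and-leaf plot in Python. It assumes non-negative integers, uses the tens as the stem and the units as the leaf, and prints only stems that occur; the `stem_and_leaf` helper and the sample data are illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted leaves (units)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 23 -> stem 2, leaf 3
        stems[stem].append(leaf)
    return {s: stems[s] for s in sorted(stems)}

data = [12, 15, 21, 23, 23, 28, 34, 35, 47]
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Unlike a histogram, every original value can still be read back from the plot.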

Box plot: drawn from five summary values of a data set: the maximum, the minimum, the median, and the two quartiles.
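The five values behind a box plot can be computed with Python's standard library. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles`, so Excel or SPSS may report slightly different quartiles for the same data.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # The "inclusive" method treats the data as the whole population (an assumed choice).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median(values), q3, max(values)

# Made-up sample.
summary = five_number_summary([2, 4, 4, 5, 7, 8, 9, 10, 12])
print(summary)
```

The box spans the two quartiles, the line inside it marks the median, and the whiskers extend toward the minimum and maximum.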

Time series data, line chart: shows how values change over time.

Charts for multivariate data: scatter plots, bubble charts, radar charts

Scatter plot: shows the relationship between two variables.

Bubble chart: shows the relationship among three variables.

Radar chart: shows the relationships among multiple variables.

Tools: Excel and SPSS both handle these conveniently.
