Purpose of Data Cleaning

Source: Internet
Author: User
Data cleaning is an indispensable step in the data analysis process, and the quality of its results directly affects model performance and the final conclusions. In practice, data cleaning usually takes up 50% to 80% of the analysis process.

The purpose of data cleaning can be viewed from two perspectives:

The first is to solve data quality problems; the second is to make the data more suitable for mining, display, and analysis.

1. Solve data quality problems
Solving data quality problems means ensuring the integrity, uniqueness, authority, legality, and consistency of the data.

We will look at each point separately.

Data integrity
For example, attributes such as gender and age are missing from records about people.
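
A minimal sketch of an integrity check, assuming records are stored as plain Python dictionaries and a missing attribute is represented by `None`. The sample records and the `count_missing` helper are hypothetical, introduced only for illustration.

```python
# Hypothetical person records; None marks a missing attribute.
people = [
    {"name": "Ann", "gender": "F", "age": 34},
    {"name": "Bob", "gender": None, "age": 41},
    {"name": "Cy", "gender": "M", "age": None},
    {"name": "Di", "gender": None, "age": None},
]


def count_missing(records, fields):
    """Return how many records lack a value for each field."""
    missing = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            # .get() also treats an absent key as missing.
            if record.get(field) is None:
                missing[field] += 1
    return missing


missing_counts = count_missing(people, ["gender", "age"])
print(missing_counts)
```

Counting missing values per field is usually the first step; whether to fill them in or drop the records depends on the analysis.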

Uniqueness of data
For example, data from different sources may be duplicated, such as repeated serial numbers in the basic information of this data set. This can happen when the same data is entered twice.
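
One way to enforce uniqueness is to keep only the first record for each serial number. This is a sketch under the assumption that the serial number is a reliable key; the field name `serial_no` and the helper `drop_duplicates` are hypothetical.

```python
def drop_duplicates(records, key):
    """Keep the first record seen for each key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


rows = [
    {"serial_no": 1001, "name": "Ann"},
    {"serial_no": 1002, "name": "Bob"},
    {"serial_no": 1001, "name": "Ann"},  # the same entry recorded twice
]
deduped = drop_duplicates(rows, "serial_no")
print(deduped)
```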

Authority of data
For example, the same indicator is available from multiple sources, and the values differ.
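
When sources disagree, one possible policy is to rank the sources by trust and keep the value from the highest-ranked one. The sketch below assumes such a ranking exists; the source names, the ranking, and the values are all made up for illustration.

```python
# Hypothetical ranking: a lower number means a more trusted source.
SOURCE_PRIORITY = {"official_statistics": 0, "partner_feed": 1, "web_scrape": 2}


def most_authoritative(observations):
    """Pick the observation that comes from the most trusted source."""
    return min(observations, key=lambda obs: SOURCE_PRIORITY[obs["source"]])


# The same indicator reported with different values by three sources.
observations = [
    {"source": "web_scrape", "value": 1210},
    {"source": "official_statistics", "value": 1185},
    {"source": "partner_feed", "value": 1192},
]
chosen = most_authoritative(observations)
print(chosen)
```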

Legality of data
For example, the data contradicts common sense, such as an age greater than 150 years.
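
A legality rule like the age example can be written as a simple range check. In this sketch the upper bound of 150 comes from the example above, while the lower bound of 0 and the function name `is_legal_age` are assumptions.

```python
def is_legal_age(age, lower=0, upper=150):
    """Return True if age is a number within the plausible range."""
    return isinstance(age, (int, float)) and lower <= age <= upper


ages = [25, 151, -3, 78]
legal = [a for a in ages if is_legal_age(a)]
illegal = [a for a in ages if not is_legal_age(a)]
print(legal, illegal)
```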

Data consistency
For example, differently named indicators from different sources actually mean the same thing, or the same indicator means different things in different sources.
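
For the case where different names carry the same meaning, a common approach is to map every source-specific name to one canonical name. The mapping below is hypothetical and would have to be built by someone who knows what each source actually means.

```python
# Hypothetical mapping from source-specific field names to canonical names.
CANONICAL_NAMES = {"sex": "gender", "gender": "gender", "yrs": "age", "age": "age"}


def harmonize(record):
    """Rename fields to canonical names; unknown fields are kept unchanged."""
    return {CANONICAL_NAMES.get(name, name): value for name, value in record.items()}


source_a = {"sex": "F", "yrs": 34}
source_b = {"gender": "F", "age": 34}
print(harmonize(source_a) == harmonize(source_b))
```

The opposite case, where one name hides different meanings, cannot be fixed by renaming alone and needs the definitions from each source.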

2. Make the data more suitable for mining, display and analysis
From this perspective, data cleaning is more of an engineering task, and it is not the main focus here.
Still, several methods can make the data more suitable for mining, display, and analysis.

Dimensionality too high: not suitable for mining
Idea: dimensionality reduction. Methods include but are not limited to:
PCA
Random forest (selecting features by importance)
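
To illustrate the idea behind PCA, the sketch below reduces two-dimensional points to one dimension by projecting them onto the first principal component, which is estimated by power iteration on the covariance matrix. It is a toy version in plain Python with made-up data, not a substitute for a library implementation; the function names are hypothetical.

```python
import math


def first_principal_component(points, iterations=100):
    """Estimate the first principal component of 2-D points by power iteration."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Entries of the 2x2 sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)

    # Repeatedly multiply a start vector by the covariance matrix and renormalize.
    # Assumption: the data is not constant and the start vector (1, 1) is not
    # exactly orthogonal to the principal direction, otherwise norm becomes 0.
    vx, vy = 1.0, 1.0
    for _ in range(iterations):
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
    return (vx, vy), centered


def project(centered, direction):
    """Project centered 2-D points onto a unit direction, giving 1-D values."""
    return [x * direction[0] + y * direction[1] for x, y in centered]


# Made-up points that lie close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, centered = first_principal_component(points)
reduced = project(centered, direction)
print(direction, reduced)
```

Each point is now described by a single number instead of two, while most of the variation in the data is preserved.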

Dimensionality too low: not suitable for mining
Idea: abstraction. Methods include but are not limited to:
Summaries of various kinds: average, total, maximum, minimum, etc.
Discretization of various kinds: clustering, custom grouping, etc.
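
Both kinds of abstraction can be sketched in a few lines: a summary that turns raw values into aggregate features, and a custom grouping that turns a numeric age into a category. The cut points 18 and 60, the group labels, and the sample values are hypothetical.

```python
def summarize(values):
    """Derive aggregate features from a list of raw values."""
    return {
        "total": sum(values),
        "average": sum(values) / len(values),
        "maximum": max(values),
        "minimum": min(values),
    }


def discretize_age(age):
    """Custom grouping of a numeric age (cut points are an assumption)."""
    if age < 18:
        return "minor"
    if age < 60:
        return "adult"
    return "senior"


purchases = [120, 80, 200]
summary = summarize(purchases)
groups = [discretize_age(age) for age in [12, 35, 71]]
print(summary, groups)
```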

Irrelevant information: wastes storage
Solution: remove the field

Field redundancy
One field is calculated from other fields, which can produce a correlation coefficient of 1 or cause principal component analysis to behave abnormally.
Solution: remove the field
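
A redundant field that is a linear function of another can be spotted by computing the Pearson correlation coefficient between the two. This sketch uses made-up prices and a hypothetical threshold of 0.999; fields derived in a nonlinear way would need a different check.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


price = [10.0, 12.5, 8.0, 20.0]
price_with_tax = [p * 1.1 for p in price]  # field derived from another field

r = pearson(price, price_with_tax)
redundant = abs(r) > 0.999  # hypothetical threshold
print(r, redundant)
```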

Multiple indicators with different units or scales
For example, GDP and the per capita income of urban residents differ greatly in magnitude.
Solution: normalization. Methods include but are not limited to:
Min-max normalization
Zero-mean (z-score) normalization
Decimal scaling
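
The three normalization methods can be sketched as follows. This is a plain-Python illustration with made-up income values; the zero-mean version uses the population standard deviation, which is one common choice, and none of the functions guard against constant input.

```python
import math


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def zero_mean(values):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    power = 0
    while largest / 10 ** power >= 1:
        power += 1
    return [v / 10 ** power for v in values]


incomes = [200, 400, 600, 800, 1000]
scaled = min_max(incomes)
standardized = zero_mean(incomes)
shifted = decimal_scaling(incomes)
print(scaled, standardized, shifted)
```

After any of these transformations, indicators measured in very different units can be compared or combined on a similar scale.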