Data cleaning is an indispensable step in the data analysis process, and the quality of its results directly affects model performance and the final conclusions. In practice, data cleaning usually takes up 50%-80% of the analysis effort.
The purpose of data cleaning can be viewed from two perspectives: the first is to solve data quality problems, and the second is to make the data more suitable for mining, display, and analysis.
1. Solve data quality problems
Solving data quality problems means ensuring the integrity, uniqueness, authority, legality, and consistency of the data.
We will look at each point separately.
Data integrity
For example, attributes such as gender or age are missing from a person's record.
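An integrity check of this kind can be sketched in a few lines. The records, the `REQUIRED_FIELDS` tuple, and the `missing_fields` helper below are hypothetical and only illustrate the idea; missing values are assumed to be stored as `None`.

```python
# Hypothetical records; None marks a missing value (an assumption).
people = [
    {"id": 1, "gender": "F", "age": 34},
    {"id": 2, "gender": None, "age": 51},
    {"id": 3, "gender": "M", "age": None},
]

REQUIRED_FIELDS = ("gender", "age")  # assumed list of required attributes


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or None in a record."""
    return [field for field in required if record.get(field) is None]


# Map the id of each incomplete record to the fields it lacks.
incomplete = {}
for person in people:
    missing = missing_fields(person)
    if missing:
        incomplete[person["id"]] = missing

print(incomplete)
```

Whether an incomplete record is then dropped, filled in, or flagged depends on the analysis; the sketch only locates the gaps.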
Uniqueness of data
For example, data from different sources is duplicated, such as the same serial number appearing more than once in the basic information. This may be caused by the data being entered twice.
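A minimal de-duplication sketch, assuming each row carries a `serial` key that should be unique. The rows and the `drop_duplicates` helper are illustrative, and keeping the first occurrence is only one possible policy.

```python
# Hypothetical rows; the third repeats the serial number of the first.
rows = [
    {"serial": "A001", "name": "Li"},
    {"serial": "A002", "name": "Wang"},
    {"serial": "A001", "name": "Li"},  # entered twice
]


def drop_duplicates(rows, key="serial"):
    """Keep the first row seen for each key value, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


deduped = drop_duplicates(rows)
print([row["serial"] for row in deduped])
```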
The authority of the data
For example, the same indicator has data from multiple sources, and the values differ, so it is unclear which source to trust.
Legality of data
For example, the data obtained contradicts common sense, such as an age greater than 150 years.
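A legality check is usually a simple range rule. The sketch below uses the 150-year limit from the example above; the lower bound of 0 and the `is_legal_age` helper are assumptions added for illustration.

```python
MAX_AGE = 150  # upper limit taken from the example in the text
MIN_AGE = 0    # assumed lower limit


def is_legal_age(age):
    """Return True if age is a number inside the plausible range."""
    return isinstance(age, (int, float)) and MIN_AGE <= age <= MAX_AGE


ages = [23, 151, -4, 67]
illegal = [age for age in ages if not is_legal_age(age)]
print(illegal)
```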
Data consistency
For example, indicators from different sources have different names but the same actual meaning, or an indicator with the same name has different meanings in different sources.
2. Make the data more suitable for mining, display and analysis
From this perspective, data cleaning is more engineering-oriented, and it is not our main focus this time.
The common problems that make data unsuitable for mining, display, and analysis, and the methods for handling them, are listed below.
Dimensionality too high: not suitable for mining
Idea: dimensionality reduction. Methods include but are not limited to:
PCA
Random forest
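To make the idea of dimensionality reduction concrete, here is a minimal standard-library sketch of PCA that extracts only the first principal component by power iteration on the covariance matrix. In practice a library routine would normally be used; the function name, the sample points, and the iteration count are all illustrative assumptions. The random forest approach is not shown.

```python
from math import sqrt


def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component.

    Uses power iteration, so it assumes the data has non-zero variance
    and that the start vector is not orthogonal to the leading direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the direction: d columns become 1.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical 2-D points lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(points)
print(direction, scores)
```

Because the sample points lie on a single line, one component captures all of their variance, so the two original columns can be replaced by the one column of scores.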
Dimensionality too low: not suitable for mining
Idea: abstraction. Methods include but are not limited to:
Aggregation: average, total, maximum, minimum, etc.
Discretization, clustering, custom grouping, etc.
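Both kinds of abstraction can be sketched briefly: aggregating raw values into summary features, and discretizing a numeric value into custom groups. The customer data, the `summarize` and `age_group` helpers, and the bin edges are hypothetical.

```python
from statistics import mean

# Hypothetical purchase amounts per customer.
purchases = {
    "c1": [20.0, 35.0, 5.0],
    "c2": [300.0, 120.0],
}


def summarize(values):
    """Derive summary features from a list of raw values."""
    return {
        "total": sum(values),
        "average": mean(values),
        "maximum": max(values),
        "minimum": min(values),
    }


features = {cid: summarize(vals) for cid, vals in purchases.items()}


def age_group(age):
    """Custom grouping of a numeric age; the bin edges are illustrative."""
    if age < 18:
        return "minor"
    if age < 60:
        return "adult"
    return "senior"


print(features["c1"], age_group(42))
```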
Irrelevant information: removing it reduces storage
Solution: remove the field
Field redundancy
One field is calculated from other fields, which can make the correlation coefficient equal to 1 or cause principal component analysis to behave abnormally.
Solution: remove the field
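One way to spot such a redundant field is to compute the Pearson correlation between columns and look for values very close to 1. This sketch covers only the case where one field is a linear function of another; the `pearson` helper and the price columns are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical columns: the second is computed directly from the first.
price = [10.0, 12.0, 9.0, 15.0]
price_with_tax = [p * 1.1 for p in price]

r = pearson(price, price_with_tax)
is_redundant = abs(r) > 0.999  # illustrative threshold
print(r, is_redundant)
```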
Multiple indicators with different units
For example, the magnitudes of GDP and the per capita income of urban residents differ too much.
Solution: Normalization, including but not limited to:
Min-max normalization
Zero-mean (z-score) normalization
Decimal scaling
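The three normalization methods can be sketched with the standard library as follows. The function names and sample values are illustrative, and each function assumes its input is not constant (and, for decimal scaling, not all zeros).

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def zero_mean(values):
    """Shift to mean 0 and scale to population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |v| < 1."""
    biggest = max(abs(v) for v in values)
    j = 0
    while biggest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))
print(zero_mean([10, 20, 30]))
print(decimal_scaling([120, -45, 999]))
```

After any of these transformations, indicators measured in very different units, such as GDP and per capita income, fall into comparable ranges.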