Data quality analysis

Source: Internet
Author: User

Transferred from: http://www.tipdm.org/ganhuofenxiang/1026.jhtml

Data quality analysis is an important part of data mining. Wrong assumptions and bad data are major reasons why data mining results turn out biased. Data mining practitioners often say "garbage in, garbage out": if the data loaded is garbage, the computed results are garbage too. We often attach great importance to the algorithm and neglect the data itself. The algorithm matters, but complete, high-quality data matters more than a good algorithm. Given the same data quality and reasonable feature selection, the results of different algorithms usually do not differ very much.

For this reason, data mining modeling is usually preceded by data preparation. This article focuses on one part of that preparation: data quality analysis.

The main task of data quality analysis is to check the raw data for dirty data. Dirty data generally means data that does not meet requirements and cannot be used directly for modeling and analysis. It mainly includes:

Missing values

Outliers

Inconsistent values

Duplicate data and data containing special symbols (such as #, ¥, *)

Causes of missing values

Missing data mainly takes two forms: entire records that are missing, and values that are missing for a field within a record. There are many reasons for missing data:

Some information is temporarily unavailable, or the cost of obtaining it is too high.

Information is omitted. There are two cases. One is human factors: the value was considered unimportant at entry time, someone forgot to fill it in, or the data was misunderstood. The other is physical failure of data acquisition equipment, storage media, or transmission media.

The attribute value does not exist. In some cases a missing value does not imply an error in the data, such as the spouse's name of an unmarried person or the fixed income of a child.

Effects of missing values

Missing data not only hinders a correct understanding of the business but also lowers the quality of modeling. A lot of useful information can be lost during modeling, and some models, such as SVM, cannot handle missing values at all. In the end, missing data can confuse the modeling process and lead to unreliable output.
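Before deciding how to handle missing values, it usually helps to know how many there are in each field. The following is a minimal sketch in plain Python, not a definitive implementation. The sample records, the field names, and the choice to treat `None` and empty strings as missing are all assumptions made for illustration.

```python
# Hypothetical sample records; None and empty strings are treated as missing.
records = [
    {"name": "A", "age": 34, "spouse": "B"},
    {"name": "C", "age": None, "spouse": ""},
    {"name": "D", "age": 8, "spouse": None},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_summary(rows, fields):
    """Return {field: (missing_count, missing_ratio)} for each field."""
    total = len(rows)
    summary = {}
    for field in fields:
        count = sum(1 for row in rows if is_missing(row.get(field)))
        summary[field] = (count, count / total if total else 0.0)
    return summary


summary = missing_summary(records, ["name", "age", "spouse"])
for field, (count, ratio) in summary.items():
    print(f"{field}: {count} missing ({ratio:.0%})")
```

A high count does not by itself mean the data is wrong. The missing spouse values here may simply belong to unmarried people, which is the "attribute value does not exist" case.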

Outlier analysis

Outlier analysis checks whether the data contains entry errors or unreasonable values. An outlier, also called an abnormal value, is an individual value in the sample that deviates markedly from the rest of the observations. If outliers are included in calculations and analysis without being screened, they can adversely affect the results. Analyzing the causes of outliers is often an opportunity to identify problems and improve decision-making. Common analysis methods are simple statistical analysis, the 3σ rule, and box plot analysis.

Simple statistical analysis

The most commonly used statistics are the maximum and minimum values, which show whether a variable's values go beyond a reasonable range. For example, a maximum height of 5 meters means the variable contains abnormal data (even people with gigantism are not that tall). A government official's resume showing a first job at age 12, or a customer age of 199, are outliers as well.

Note that whether an outlier is truly "abnormal" has to be judged against the business background that produced it. Airline data, for example, may contain records where the ticket price is null, the minimum fare is 0, or the minimum discount rate is 0, while the total distance flown is greater than 0. At first glance such data looks problematic, but business knowledge shows that these records are usually produced by frequent flyer point redemptions.
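The range check above can be sketched in a few lines of Python. The bounds used here (0.4 to 2.8 meters for height, 0 to 120 for age) are assumptions chosen for illustration. In practice they would come from business knowledge, and the flagged values are candidates for review rather than automatic deletions.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs for values outside the closed range [low, high]."""
    return [(i, v) for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]


# Hypothetical sample data.
heights_m = [1.62, 1.75, 5.0, 1.80]
ages = [25, 41, 199, 33]

# The minimum and maximum alone already hint at a problem.
print("height min/max:", min(heights_m), max(heights_m))
print("age min/max:", min(ages), max(ages))

# Assumed plausible ranges; adjust them to the business context.
height_flags = out_of_range(heights_m, 0.4, 2.8)
age_flags = out_of_range(ages, 0, 120)
print(height_flags)
print(age_flags)
```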

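3σ rule

If the data follows a normal distribution, the 3σ rule defines an outlier as a value that deviates from the mean by more than three standard deviations. Under a normal distribution, values that far from the mean occur with a probability of only about 0.3%, so they are treated as rare events.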
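As a rough illustration of the 3σ criterion, the sketch below flags values more than three sample standard deviations from the mean. The data is invented, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption. The source does not say which to use.

```python
import statistics


def three_sigma_outliers(values):
    """Return values whose distance from the mean exceeds 3 standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed choice)
    return [v for v in values if abs(v - mean) > 3 * sd]


# Hypothetical measurements: twenty values near 10 and one extreme value.
values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
          11, 9, 10, 12, 8, 10, 11, 9, 10, 10, 100]

flagged = three_sigma_outliers(values)
print(flagged)
```

Because the extreme value itself inflates both the mean and the standard deviation, this check needs a reasonably large sample. With very few observations no value can be three standard deviations from the mean.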

Box plot analysis

Box plot analysis shows the distribution of the data as it is, without assuming that the data follows a specific distribution and without placing any restrictive requirements on it. Its criterion for outliers is based on the quartiles and the interquartile range, which makes it fairly robust. Up to 25% of the data can lie arbitrarily far away without greatly disturbing the quartiles, so outliers have little influence on the criterion.
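The text does not spell out the exact criterion. A common convention, assumed here, is to flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the lower and upper quartiles and IQR = Q3 - Q1. The sketch below applies that convention with the standard library. The data, the multiplier 1.5, and the "inclusive" quartile method are illustrative choices.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the usual convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical sample with one extreme value.
data = [7, 8, 9, 10, 10, 11, 12, 13, 95]

outliers, (lower, upper) = iqr_outliers(data)
print("bounds:", lower, upper)
print("outliers:", outliers)
```

Unlike the mean and standard deviation, the quartiles barely move when the extreme value is changed, which is the robustness property described above.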

Consistency analysis

In data mining, inconsistent data arises mainly during data integration: the data comes from different sources, and redundantly stored data is not updated consistently. For example, if a user's address is stored in two tables and only one of them is updated when the address changes, the two tables then hold inconsistent data.
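The two-table address example can be checked mechanically by comparing the values stored under the same key. The sketch below assumes each table is a simple mapping from user ID to address. The table names and contents are hypothetical.

```python
# Two hypothetical tables keyed by user ID, each storing the user's address.
orders = {1: "12 Elm St", 2: "9 Oak Ave", 3: "4 Pine Rd"}
profiles = {1: "12 Elm St", 2: "77 Lake Dr", 4: "1 Hill Ct"}


def find_inconsistent(table_a, table_b):
    """Return {key: (value_a, value_b)} for keys present in both tables with different values."""
    shared_keys = table_a.keys() & table_b.keys()
    return {key: (table_a[key], table_b[key])
            for key in sorted(shared_keys)
            if table_a[key] != table_b[key]}


conflicts = find_inconsistent(orders, profiles)
for user_id, (addr_a, addr_b) in conflicts.items():
    print(f"user {user_id}: {addr_a!r} vs {addr_b!r}")
```

The check only reports that the two copies disagree. Deciding which copy is current still requires extra information, such as an update timestamp.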

Duplicate data and data containing special symbols

If you encounter duplicate data or data containing special symbols, first find out the cause. In most cases such records are simply deleted.
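A minimal sketch of how such records might be located before deciding whether to delete them. It assumes rows are tuples of strings and treats only the symbols named earlier (#, ¥, *) as special. Both the sample rows and that symbol set are illustrative.

```python
import re

# Symbols treated as "special" here; this set is an assumption.
SPECIAL = re.compile(r"[#¥*]")


def find_duplicates(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


def find_special(rows):
    """Return indices of rows in which any string field contains a special symbol."""
    return [index for index, row in enumerate(rows)
            if any(isinstance(v, str) and SPECIAL.search(v) for v in row)]


# Hypothetical sample rows.
rows = [
    ("u1", "Beijing", "120"),
    ("u2", "Shang#hai", "85"),
    ("u1", "Beijing", "120"),
    ("u3", "Shenzhen", "¥300"),
]

duplicate_rows = find_duplicates(rows)
special_rows = find_special(rows)
print(duplicate_rows, special_rows)
```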

That concludes this introduction to the common methods of data quality analysis. I hope it helps. Pay attention to data quality analysis, and do not let your data become useless.
