[Introduction to Data Mining] Data Quality

Source: Internet
Author: User

Data quality
The data used in data mining is usually collected for other purposes, or collected without any explicit purpose in mind. Data quality therefore cannot be controlled at the source. To cope with data quality problems, data mining focuses on two things: detecting and correcting data quality problems (data cleaning), and using algorithms that can tolerate low-quality data.
Measurement and data collection problems
Perfect data almost never exists in practice. To discuss data quality problems, we first define measurement errors and data collection errors, and then consider several aspects of measurement error: noise, artifacts, bias, precision, and accuracy. Next we discuss quality problems that involve both measurement and data collection: outliers, missing and inconsistent values, and duplicate data.
Measurement error refers to problems caused by the measurement process, such as a recorded value differing from the actual value. Data collection error refers to errors such as omitted data objects or attribute values, or the inappropriate inclusion of other data objects; for example, specimens of a similar species may be mixed into the data for the species being studied. Both measurement and data collection errors may be systematic or random.
Noise is the random component of measurement error. For example, Figure 2-5 shows a time series after random noise has been added. With enough noise, the original data may even be masked.
Figure 2-6 shows three sets of data points before and after noise points were added.
The term noise is typically used for data that has a temporal or spatial component. In such cases, signal or image processing techniques can reduce the noise, but it is very difficult to eliminate it completely. Data mining therefore emphasizes robust algorithms, which produce acceptable results even under noise interference. Data errors may also result from more deterministic phenomena, such as a group of records that all contain the same error in the same place. Such deterministic distortion is called an artifact.
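As a small illustration of robustness (not from the book; the numbers are hypothetical): a single noisy measurement can drag the mean far from the true level, while a robust statistic like the median barely moves.

```python
from statistics import mean, median

# Five clean measurements around 10.0, plus one corrupted (noisy) reading
clean = [10.0, 10.2, 9.9, 10.1, 10.0]
noisy = clean + [42.0]

print(mean(noisy))    # pulled well above 10 by the single noisy value
print(median(noisy))  # stays near 10: the median is robust to the noise
```

This is why noise-tolerant methods often replace means with medians or use trimming: one bad value cannot dominate the result.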
Precision: the closeness of repeated measurements to one another, commonly measured by the standard deviation. Bias: a systematic variation of the measured value from the quantity being measured. Suppose we have a standard weight of 1 gram and, to evaluate the precision and bias of a new balance, we weigh it 5 times, obtaining {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is 1.001, so the bias is 0.001. The precision, measured by the standard deviation, is 0.013.
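The bias and precision computation above can be reproduced directly; a minimal sketch using Python's standard library:

```python
from statistics import mean, stdev

# Five repeated weighings of a standard 1-gram weight (values from the text)
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.0

bias = mean(measurements) - true_value  # systematic error: mean minus true value
precision = stdev(measurements)         # spread of the repeated measurements

print(f"bias = {bias:.3f}")            # 0.001
print(f"precision = {precision:.3f}")  # 0.013
```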
Accuracy: the closeness of the measured value to the true value. Accuracy depends on both precision and bias. One important aspect of accuracy is the use of significant digits: report a measurement or calculation result using only as many digits as the precision of the data justifies.
An outlier is a data object whose characteristics differ from those of most other data objects in the data set, or an attribute value that is unusual relative to the typical values of that attribute. Outliers are also called anomalous objects or anomalous values. Note the distinction between noise and outliers: an outlier can be a legitimate data object or value. Therefore, unlike noise, outliers may themselves be objects of interest.
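One simple (and by no means the only) way to flag candidate outliers is a z-score test; the data set and threshold below are illustrative assumptions, not from the book. With only a handful of points, a low threshold is needed, since a single extreme value also inflates the standard deviation.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

data = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 55.0]
print(zscore_outliers(data))  # [55.0]
```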
Missing values
It is common for an object to be missing one or more attribute values; sometimes the information was simply never collected. In any case, missing values should be taken into account during data analysis. Common strategies for handling missing values:

  • Delete the data objects or attributes
  • Estimate the missing values
  • Ignore the missing values during analysis
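The three strategies above can be sketched in plain Python (the records here are hypothetical):

```python
from statistics import mean

records = [
    {"height": 1.75, "weight": 68.0},
    {"height": None, "weight": 72.5},   # missing height
    {"height": 1.82, "weight": None},   # missing weight
]

# 1. Delete data objects that have any missing attribute value
complete = [r for r in records if None not in r.values()]

# 2. Estimate a missing value, e.g. with the mean of the observed values
known = [r["height"] for r in records if r["height"] is not None]
imputed = [dict(r, height=r["height"] if r["height"] is not None else mean(known))
           for r in records]

# 3. Ignore missing values during the analysis itself
avg_weight = mean(r["weight"] for r in records if r["weight"] is not None)
```

Which strategy is appropriate depends on how much data is missing and whether the missingness is random; imputation, for instance, can bias results if the missing values are systematically different.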

Inconsistent values
The data may contain inconsistent values, such as an account number and password that were entered incorrectly. Whatever the cause of the inconsistency, it is important to detect and, where possible, correct such values.
Duplicate data
A data set may contain data objects that are duplicates of one another. Duplicates are usually detected and removed, but two issues must be addressed first: if two objects actually represent the same object, their corresponding attribute values may nonetheless differ, and these inconsistent values must be resolved; and care must be taken not to merge two objects that are similar but not duplicates. The term deduplication is often used to refer to this process.
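A minimal sketch of exact-match deduplication (the key fields and records are assumptions for illustration); resolving conflicting attribute values between matched records, as discussed above, is a separate step:

```python
def deduplicate(records, key):
    """Keep the first occurrence of each record, matching on the fields in `key`."""
    seen, unique = set(), []
    for r in records:
        k = tuple(r[f] for f in key)
        if k not in seen:
            seen.add(k)
            unique.append(r)
    return unique

people = [
    {"name": "Alice", "email": "alice@example.com", "age": 30},
    {"name": "Alice", "email": "alice@example.com", "age": 31},  # same person, inconsistent age
    {"name": "Bob",   "email": "bob@example.com",   "age": 25},
]
unique = deduplicate(people, key=("name", "email"))  # 2 records remain
```

Real deduplication often needs fuzzy matching (e.g. on normalized names or edit distance) rather than exact keys, precisely to avoid treating near-identical but distinct objects as duplicates.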
