Database basics: Why do we need to preprocess data?


Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data because they are typically huge (often several gigabytes or more) and because most of their contents come from multiple heterogeneous sources. Low-quality data will lead to low-quality mining results. How can the data be preprocessed to improve its quality, and thus the quality of the mining results? How can the data be preprocessed so that the mining process is more efficient and easier to perform?

There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may also be applied. For example, normalization can improve the accuracy and efficiency of mining algorithms that involve distance measurements. Data reduction can shrink the data size by aggregating, eliminating redundant features, or clustering. These techniques are not mutually exclusive and may be used together. For example, data cleaning can involve transformations to correct erroneous data, such as converting all entries for a date field to a common format. These data processing techniques, when applied before mining, can substantially improve the overall quality of the mined patterns and/or reduce the time required for the actual mining.
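To make the normalization point concrete, here is a minimal sketch of min-max scaling on a small table; the column names and values are hypothetical, and pandas is assumed purely for illustration.

```python
import pandas as pd

# Hypothetical sales attributes with very different numeric ranges.
df = pd.DataFrame({
    "price": [12.0, 850.0, 99.0, 430.0],   # spans roughly 12-850
    "units_sold": [3, 1, 7, 2],            # spans a much smaller range
})

# Min-max normalization rescales each attribute to [0, 1], so a distance
# metric is not dominated by whichever attribute has the largest raw range.
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```

After rescaling, price no longer dominates a Euclidean distance simply because it spans a larger numeric range than units_sold.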

This article introduces the basic concepts of data preprocessing and presents descriptive data summarization as the foundation for data preprocessing. Descriptive data summarization helps us study the general characteristics of the data and identify noise or outliers, and is therefore useful for successful data cleaning and data integration. The data preprocessing methods are organized as follows: data cleaning, data integration and transformation, and data reduction. Concept hierarchies can be used as an alternative form of data reduction, where low-level data (such as raw values for age) are replaced with higher-level concepts such as youth, middle-aged, or senior. This form of data reduction is discussed together with data discretization techniques, which can automatically generate concept hierarchies from numerical data.
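As a hedged illustration of replacing low-level values with higher-level concepts, the following sketch discretizes hypothetical age values into the categories mentioned above; the bin edges and labels are assumptions for illustration, not values prescribed by the text.

```python
import pandas as pd

# Hypothetical raw ages; the cut points 20 and 50 are illustrative assumptions.
ages = pd.Series([13, 25, 42, 67, 35, 78])

# Replace low-level numeric values with higher-level concepts
# ("youth", "middle_aged", "senior"), forming a simple concept hierarchy.
labels = pd.cut(ages, bins=[0, 20, 50, 120],
                labels=["youth", "middle_aged", "senior"])
print(labels)
```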

Why Preprocess the Data?

Imagine that you are a manager at AllElectronics, responsible for analyzing the company's sales data for your department. You immediately set out on this task, carefully inspecting the company's database and data warehouse, and identifying and selecting the attributes or dimensions to be included in the analysis, such as item, price, and units_sold. Alas! You notice that many tuples have no recorded value for some of these attributes. For your analysis, you would like to know whether each purchased item was advertised as on sale, but you discover that this information was never recorded. In addition, users of your database system have reported errors, unusual values, and inconsistencies in some transaction records.

In other words, the data you wish to analyze with data mining techniques is incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values that deviate from the expected), and inconsistent (for example, containing discrepancies in the department codes used to classify items). Welcome to the real world!

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest, such as customer information in sales transaction data, may not always be available. Other data may not be included simply because it was not considered important at the time of entry. Relevant data may not be recorded because of a misunderstanding or because of equipment malfunctions. Data that was inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data's history or of modifications to it may have been overlooked. Missing data, particularly tuples with missing values for some attributes, may need to be inferred.
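For example, one simple way to infer missing values is to fill them in from a column statistic; the sketch below uses hypothetical transaction data, and the mean/constant fill strategy is just one of several possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions with missing values; names are illustrative.
sales = pd.DataFrame({
    "item": ["TV", "CD player", None, "TV"],
    "price": [499.0, np.nan, 70.0, 499.0],
})

# Fill a missing numeric attribute with the column mean and a missing
# categorical attribute with a constant placeholder.
sales["price"] = sales["price"].fillna(sales["price"].mean())
sales["item"] = sales["item"].fillna("unknown")
print(sales)
```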

There are several possible reasons for the data to contain noise (incorrect attribute values). The instruments used to collect the data may be faulty, human or computer errors may occur during data entry, and errors can arise during data transmission. These may be due to technology limitations, such as a limited buffer size for coordinating synchronized data transfer. Incorrect data may also result from inconsistencies in naming conventions or in the data codes used, or from inconsistent formats for input fields such as dates. Duplicate tuples also require data cleaning.
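One common way to smooth such noisy numeric values is binning, where each value is replaced by the mean of its bin; the sketch below uses hypothetical price values and equal-frequency bins as illustrative assumptions, not a method prescribed by this section.

```python
import numpy as np

# Hypothetical noisy prices to be smoothed.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Partition the sorted values into three equal-frequency bins and replace
# each value with the mean of its bin, which smooths out local noise.
bins = np.sort(prices).reshape(3, 3)
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)
```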

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they will not trust the results of any data mining applied to them. Furthermore, dirty data can throw the mining procedure into confusion, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run the data through some data cleaning routines. Section 2.3 discusses methods for cleaning the data.

Getting back to your task at AllElectronics, suppose you would like to include data from multiple sources in your analysis. This would involve integrating multiple databases, data cubes, or files, that is, data integration. Attributes representing the same concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be named customer_id in one database and cust_id in another. Naming inconsistencies may also occur for attribute values. For example, the same person's name may be registered as "Bill" in one database, "William" in a second, and "B" in a third. Furthermore, you may notice that some attributes can be derived from others, such as yearly income. Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must be taken to avoid redundancies during data integration. Typically, data cleaning and data integration are performed as preprocessing steps when preparing data for the data warehouse. Additional data cleaning can then be performed to detect and remove redundancies that may have resulted from the integration.
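The following sketch illustrates these integration issues with two hypothetical sources: resolving the renamed key, removing duplicate tuples, and using a correlation check to flag a derived attribute. The data, column names, and specific steps are all assumptions made for illustration.

```python
import pandas as pd

# Two hypothetical sources that name the customer key differently.
branch_a = pd.DataFrame({"customer_id": [1, 2], "name": ["Bill", "Ann"]})
branch_b = pd.DataFrame({"cust_id": [2, 3], "name": ["Ann", "Joe"]})

# Resolve the schema-level naming inconsistency before integrating.
branch_b = branch_b.rename(columns={"cust_id": "customer_id"})

# Integrate the sources and remove exact duplicate tuples created by the union.
combined = pd.concat([branch_a, branch_b], ignore_index=True).drop_duplicates()
print(combined)

# A derived attribute (e.g., yearly income computed from monthly income) is
# redundant; a correlation close to 1 flags it as a candidate for removal.
income = pd.DataFrame({"monthly_income": [3000, 4200, 5100],
                       "yearly_income": [36000, 50400, 61200]})
print(income["monthly_income"].corr(income["yearly_income"]))
```

Note that value-level inconsistencies such as "Bill" versus "William" are not resolved by dropping exact duplicates; they require a separate entity-matching step.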
