Why does basic database analysis require data preprocessing?

Source: Internet
Author: User

Data preprocessing is very important, but doing it well seems difficult.


Today's real-world databases are highly susceptible to noise, missing values, and inconsistent data, because they are typically huge (often several gigabytes or more) and commonly originate from multiple heterogeneous sources. Low-quality data leads to low-quality mining results. "How can the data be preprocessed to improve data quality and, thereby, the quality of the mining results? How can the data be preprocessed so as to make the mining process more efficient and easier?"

 

There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may also be applied; for example, normalization can improve the accuracy and efficiency of mining algorithms that involve distance measurements. Data reduction can shrink the data size by, for instance, aggregating, eliminating redundant features, or clustering. These techniques are not mutually exclusive and may be used together. For example, data cleaning can involve transformations to correct wrong data, such as converting all entries of a date field to a common format. Applied before mining, these data processing techniques can substantially improve the overall quality of the mining results and/or reduce the time required for the actual mining.
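The date-format correction mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the text; the list of input formats and the `to_iso` helper are assumptions:

```python
from datetime import datetime

# Illustrative set of date formats that might coexist in raw data.
FORMATS = ["%Y/%m/%d", "%d-%m-%Y", "%B %d, %Y"]

def to_iso(date_string):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_string, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_string!r}")

print(to_iso("2004/03/25"))      # 2004-03-25
print(to_iso("25-03-2004"))      # 2004-03-25
print(to_iso("March 25, 2004"))  # 2004-03-25
```

Converting every entry of a date field through such a routine is an example of a transformation performed as part of data cleaning.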

 

This section describes the basic concepts of data preprocessing and the descriptive data summarization that serves as its foundation. Descriptive data summarization helps us study the general characteristics of the data and identify noise or outliers, and is therefore useful for successful data cleaning and data integration. The data preprocessing methods are organized as follows: data cleaning, data integration and transformation, and data reduction. Concept hierarchies can be used as an alternative form of data reduction, in which low-level data (such as raw age values) are replaced by higher-level concepts (such as young, middle-aged, or old). For this form of data reduction, we discuss the use of data discretization techniques, by which concept hierarchies can be generated automatically from numerical data.
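The age example above can be sketched as a tiny discretization routine that climbs one level of a concept hierarchy. The cut-off points below are illustrative assumptions, not values from the text:

```python
def age_concept(age):
    """Map a raw age value to a higher-level concept.
    The thresholds (30 and 60) are assumed for illustration."""
    if age < 30:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "old"

raw_ages = [13, 22, 35, 52, 70]
print([age_concept(a) for a in raw_ages])
# ['young', 'young', 'middle-aged', 'middle-aged', 'old']
```

Replacing the five raw values with three concept labels reduces the number of distinct values the mining algorithm must handle, which is exactly the sense in which discretization is a form of data reduction.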

Why data preprocessing?

 

Imagine that you are a manager at AllElectronics, responsible for analyzing the company's sales data for your department. You immediately set to work, carefully reviewing the company's database and data warehouse and identifying the attributes or dimensions that should be included in the analysis, such as item, price, and units_sold. Then you notice that many tuples have no recorded value for several attributes. For your analysis, you would like to know whether each purchased item was advertised as on sale, but you find that this information was never recorded. In addition, users of your database system have reported errors, unusual values, and inconsistencies in some transaction records. In other words, the data you wish to analyze with data mining techniques is incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors or values that deviate from the expected), and inconsistent (for example, containing discrepancies in the department codes used to categorize items). Welcome to the real world!

 

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest, such as customer information for sales transaction data, may not always be available. Other data may not have been included simply because it was not considered important at the time of entry. Relevant data may not have been recorded due to a misunderstanding or because of equipment malfunctions. Data that was inconsistent with other records may have been deleted. Furthermore, the recording of the data's history or of modifications to it may have been overlooked. Missing data, particularly tuples with missing values for some attributes, may need to be inferred.

 

Data can be noisy (contain incorrect attribute values) for several reasons. The instruments collecting the data may be faulty. Human or computer errors may occur at data entry, and errors can also occur in data transmission. There may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer. Incorrect data may also result from inconsistencies in the naming conventions or data codes used, or from inconsistent formats for input fields (such as dates). Duplicate tuples also require cleaning.

 

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data is dirty, they are unlikely to trust the results of any mining applied to it. Furthermore, dirty data can cause confusion in the mining procedure, resulting in unreliable output. Although most mining routines have some mechanism for dealing with incomplete or noisy data, they are not always robust; instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run the data through data cleaning routines. Section 2.3 discusses methods for cleaning data.

Getting back to your task at AllElectronics, suppose the analysis is to include data from multiple sources. This would involve integrating multiple databases, data cubes, or files, that is, data integration. Attributes representing the same concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be customer_id in one database and cust_id in another. Naming inconsistencies may also occur in attribute values: the same first name could be registered as "Bill" in one database, "William" in a second, and "B." in a third. Furthermore, you may notice that some attributes can be derived from others (such as annual income). A large amount of redundant data may slow down or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must be taken during data integration to avoid redundancy. Typically, data cleaning and data integration are performed as preprocessing steps when preparing data for a data warehouse. Additional cleaning can be performed afterwards to detect and remove redundancies that may have resulted from integration.
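The schema-level naming inconsistencies described above (customer_id vs. cust_id) can be reconciled mechanically, after which value-level conflicts ("Bill" vs. "William") become visible. A minimal sketch, assuming two invented record sources and a hand-written attribute map:

```python
from collections import defaultdict

# Invented example records from two sources naming the same attribute differently.
SOURCE_A = [{"customer_id": 1, "name": "Bill"}]
SOURCE_B = [{"cust_id": 1, "name": "William"}]

# Map source-specific names onto one canonical attribute name.
ATTRIBUTE_MAP = {"customer_id": "cust_id"}

def canonicalize(record):
    return {ATTRIBUTE_MAP.get(k, k): v for k, v in record.items()}

# Collect every observed value per customer and attribute.
merged = defaultdict(lambda: defaultdict(set))
for record in map(canonicalize, SOURCE_A + SOURCE_B):
    key = record["cust_id"]
    for attr, value in record.items():
        merged[key][attr].add(value)

# Attributes with more than one value flag inconsistencies that
# data cleaning must still resolve ("Bill" vs. "William").
conflicts = {attr for attr, values in merged[1].items() if len(values) > 1}
print(conflicts)  # {'name'}
```

Note that the attribute map only handles schema-level integration; resolving which conflicting value is correct is a separate data cleaning decision.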

 

Getting back to your data, suppose you decide to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classification, or clustering. Such methods provide better results if the data to be analyzed has been normalized, that is, scaled proportionally to a specific range such as [0.0, 1.0]. Your customer data, for example, contain the attributes age and annual salary. The annual salary attribute usually takes much larger values than age. Therefore, if the attributes are left unnormalized, distance measurements taken on annual salary will generally outweigh those taken on age. Furthermore, it would be useful for your analysis to obtain aggregate information, such as sales per customer region, something that is not part of any precomputed data cube in your data warehouse. You soon realize that data transformation operations, such as normalization and aggregation, are additional preprocessing procedures that contribute to the success of the mining process.
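Min-max normalization onto [0.0, 1.0], as described above, can be sketched in a few lines. The customer values below are invented for illustration:

```python
# Invented customer attributes: note the far wider range of salaries.
ages     = [23, 35, 52, 61]
salaries = [28_000, 54_000, 91_000, 130_000]

def min_max(values):
    """Scale a list of values proportionally onto [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

norm_ages, norm_salaries = min_max(ages), min_max(salaries)

def distance(i, j):
    """Euclidean distance between customers i and j after scaling.
    Both attributes now contribute on a comparable footing."""
    return ((norm_ages[i] - norm_ages[j]) ** 2
            + (norm_salaries[i] - norm_salaries[j]) ** 2) ** 0.5

print(round(distance(0, 1), 3))  # ≈ 0.406
```

Without the scaling step, the raw salary gap of 26,000 would dwarf the age gap of 12 in the same Euclidean computation, which is exactly the weighting problem normalization removes.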

 

As you consider the data further, you wonder: "The data set I selected for analysis is huge and is sure to slow down the mining process. Is there a way to reduce the size of my data set without jeopardizing the data mining results?" Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. There are a number of data reduction strategies, including data aggregation (e.g., building a data cube), attribute subset selection (e.g., removing irrelevant attributes through correlation analysis), dimensionality reduction (e.g., using encoding schemes such as minimum-length encoding or wavelets), and numerosity reduction (e.g., "replacing" the data by smaller representations such as clusters or parametric models). Generalization with concept hierarchies can also be used to "shrink" the data, by replacing low-level concepts with higher-level ones; for customer location, for example, city can be replaced by region or province_or_state. A concept hierarchy organizes concepts into varying levels of abstraction. Data discretization is a form of data reduction that is very useful for the automatic generation of concept hierarchies from numerical data.
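Attribute subset selection through correlation analysis, mentioned above, can be illustrated with the annual-income example: an attribute derived from another is (near-)perfectly correlated with it and can be dropped. A minimal sketch with invented values and an illustrative redundancy threshold:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: annual_income is derived from monthly_income,
# so the two attributes carry redundant information.
monthly_income = [2_000, 3_500, 5_000, 7_200]
annual_income  = [12 * m for m in monthly_income]

redundant = abs(pearson(monthly_income, annual_income)) > 0.95  # threshold is illustrative
print(redundant)  # True: keep only one of the pair
```

In a real pipeline the threshold and the choice of which attribute to keep would be domain decisions; the point here is only that a simple correlation test can flag derivable attributes for removal.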

 

This concludes the overview of the data preprocessing steps discussed here. Note that the above categorization is not mutually exclusive; for example, the removal of redundant data may be seen as a form of data cleaning as well as of data reduction.

 

In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. (Neural networks and nearest-neighbor classification are discussed in Chapter 6, and clustering in Chapter 7.) Because high-quality decisions must be based on high-quality data, data preprocessing is an important step in the knowledge discovery process. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can pay high returns in decision making.
