Data Integration in Big Data Preprocessing

Source: Internet
Author: User
Keywords: data integration, data integration definition, big data preprocessing
Data processing often involves data integration, that is, combining data from multiple sources (such as databases, data cubes, and ordinary files) into a unified data set that provides a complete data foundation for the processing that follows.

In the data integration process, the following issues need to be considered.

1. Schema integration issues

The problem of schema integration is how to match real-world entities from multiple data sources with each other, which is known as the entity identification problem.

For example, how can we determine whether "custom_id" in one database and "custome_number" in another database refer to the same attribute of the same entity?

Databases and data warehouses usually contain metadata, which can help avoid errors in schema integration.
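
As a minimal sketch of how such metadata might be used, the example below renames source-specific column names to a shared schema before matching records. The source labels ("db_a", "db_b"), the unified name "customer_id", and the sample records are assumptions made for illustration, not part of any particular system.

```python
# Hypothetical metadata: for each source, map its column names to a shared schema.
# The source labels and the unified name "customer_id" are illustrative assumptions.
SCHEMA_MAP = {
    "db_a": {"custom_id": "customer_id"},
    "db_b": {"custome_number": "customer_id"},
}


def to_unified(source, record):
    """Rename a record's keys to the unified schema; unmapped keys are kept as-is."""
    mapping = SCHEMA_MAP[source]
    return {mapping.get(key, key): value for key, value in record.items()}


# Sample records (made up for the example).
row_a = to_unified("db_a", {"custom_id": 17, "name": "Ann"})
row_b = to_unified("db_b", {"custome_number": 17, "city": "Oslo"})

# Once both records share a key name, they can be compared on it.
same_entity = row_a["customer_id"] == row_b["customer_id"]
print(row_a, row_b, same_entity)
```

In practice the mapping would come from the metadata of the databases or data warehouse rather than being written by hand, and matching on an identifier alone may not be enough to confirm that two records describe the same entity.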

2. Redundancy

Redundancy is another problem that often occurs in data integration. If an attribute can be deduced from other attributes, then this attribute is redundant.

For example, the average monthly income attribute in a customer data table is redundant, because it can be calculated from the monthly income attribute. In addition, inconsistent attribute naming can also cause redundancy in the integrated data set.
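
One way to check whether an attribute is derivable is to recompute it from the attributes it supposedly depends on and compare. The sketch below assumes a hypothetical customer table in which "monthly_income" holds a list of monthly values and "avg_monthly_income" holds the stored average; both field names and all values are invented for illustration.

```python
from statistics import mean

# Hypothetical customer rows; field names and values are illustrative assumptions.
customers = [
    {"monthly_income": [3000, 3200, 3100], "avg_monthly_income": 3100},
    {"monthly_income": [4000, 4400], "avg_monthly_income": 4200},
]


def is_derivable(rows, tolerance=1e-9):
    """Return True if the stored average matches the recomputed one in every row."""
    return all(
        abs(mean(row["monthly_income"]) - row["avg_monthly_income"]) <= tolerance
        for row in rows
    )


redundant = is_derivable(customers)
print("avg_monthly_income is redundant:", redundant)
```

If the check holds for every row, the stored average carries no information beyond the monthly values and could be dropped from the integrated data set.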

Correlation analysis can help detect some kinds of data redundancy.

For example, given two attributes A and B, the relationship between these two attributes can be analyzed based on the values of these two attributes.

If the correlation value between the two attributes is r>0, there is a positive correlation between them, that is, as A increases, B also tends to increase. The larger the value of r, the stronger the positive relationship between attributes A and B.

If r=0, there is no linear correlation between attributes A and B; this is often read as the two being independent, although strictly speaking a zero correlation does not rule out a nonlinear relationship. If r<0, there is a negative correlation between attributes A and B, that is, as A increases, B tends to decrease. The larger the absolute value of r, the stronger the negative relationship between attributes A and B.
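
The text does not name a specific correlation measure. Assuming the Pearson correlation coefficient is meant, a minimal sketch of computing r for two attributes looks like this; the sample values and the 0.95 redundancy threshold are illustrative assumptions.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_a * var_b)


# Made-up values for attributes A and B; B rises whenever A rises.
A = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pearson_r(A, B)
REDUNDANCY_THRESHOLD = 0.95  # hypothetical cut-off, chosen only for this example
likely_redundant = abs(r) >= REDUNDANCY_THRESHOLD
print(r, likely_redundant)
```

A value of |r| close to 1 suggests that one attribute may be largely predictable from the other and is a candidate for removal, but the threshold is a judgment call that depends on the data.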

3. Data value conflict detection and elimination

For the same real-world entity, attribute values from different data sources may differ. The reasons may be differences in representation, scale, or encoding.

For example, the weight attribute uses the metric system in one system, but the imperial system in another system; the price attribute uses different currency units in different locations. These semantic differences bring many problems to data integration.
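
A common way to eliminate such conflicts is to convert every value to one agreed unit during integration. The sketch below normalizes weights to kilograms; the record layout ("weight", "unit") and the sample values are assumptions for illustration. Currency conversion would follow the same pattern but needs exchange rates from an external source, so it is left out here.

```python
# 1 international avoirdupois pound is defined as exactly 0.45359237 kg.
LB_TO_KG = 0.45359237


def weight_in_kg(value, unit):
    """Convert a weight to kilograms; only 'kg' and 'lb' are handled in this sketch."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown weight unit: {unit!r}")


# Hypothetical records from two systems that use different units.
records = [
    {"item": "A", "weight": 2.5, "unit": "kg"},
    {"item": "B", "weight": 10.0, "unit": "lb"},
]

# Build the integrated records with a single, explicit unit in the field name.
normalized = [
    {"item": rec["item"], "weight_kg": weight_in_kg(rec["weight"], rec["unit"])}
    for rec in records
]
print(normalized)
```

Raising an error for unknown units, rather than passing the value through, keeps an unrecognized representation from silently entering the integrated data set.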