Keywords: data integration, data integration definition, big data preprocessing
Data processing often involves data integration, that is, combining data from multiple sources, such as databases, data cubes, and ordinary files, into a unified data set that provides a complete foundation for subsequent processing.
The following issues need to be considered during data integration.
1. Schema integration issues
The problem of schema integration is how to match real-world entities from multiple data sources with each other, which involves entity identification.
For example, how can we determine whether "custom_id" in one database and "custome_number" in another database refer to the same attribute of the same entity?
Databases and data warehouses usually contain metadata, which can help avoid errors in schema integration.
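As a rough illustration of how metadata can support schema matching, the sketch below proposes candidate attribute pairs when the names look alike and the declared data types agree. The schemas, the type labels, the helper names, and the 0.5 threshold are all assumptions made for this example, and name similarity is only one possible heuristic; the candidates it returns still need human review.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_matches(schema_a, schema_b, threshold=0.5):
    """Pair attributes whose names are similar and whose declared types agree.

    Each schema is a dict mapping attribute name -> type label, i.e. a very
    small stand-in for the metadata a database would provide (assumption).
    """
    pairs = []
    for name_a, type_a in schema_a.items():
        for name_b, type_b in schema_b.items():
            score = name_similarity(name_a, name_b)
            if type_a == type_b and score >= threshold:
                pairs.append((name_a, name_b, score))
    # Most similar pairs first, so a reviewer sees the best candidates on top.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical metadata for two sources.
schema_a = {"custom_id": "int", "name": "str"}
schema_b = {"custome_number": "int", "full_name": "str"}

matches = candidate_matches(schema_a, schema_b)
for name_a, name_b, score in matches:
    print(f"{name_a} <-> {name_b} (similarity {score:.2f})")
```

Requiring the types to agree is what keeps "custom_id" from being paired with "full_name" here; richer metadata such as value ranges or key constraints could be added as further filters.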
2. Redundancy
Redundancy is another problem that often occurs in data integration. If an attribute can be deduced from other attributes, then this attribute is redundant.
For example, the average monthly income attribute in a customer data table is redundant because it can be calculated from the monthly income attribute. In addition, inconsistent attribute naming can also cause redundancy in the integrated data set.
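One simple way to confirm that an attribute is derivable is to recompute it from the other attributes and compare. The sketch below does this for the average monthly income example; the record layout, the field names, and the helper names are assumptions made for illustration only.

```python
import math


def is_redundant(records, derived_key, derive):
    """Return True if derived_key can be recomputed from each record's other attributes."""
    return all(math.isclose(rec[derived_key], derive(rec)) for rec in records)


def mean_income(rec):
    """Recompute the average from the stored monthly income values."""
    incomes = rec["monthly_income"]
    return sum(incomes) / len(incomes)


# Hypothetical customer records.
customers = [
    {"monthly_income": [3000, 3200, 3400], "avg_monthly_income": 3200.0},
    {"monthly_income": [5000, 5000, 5600], "avg_monthly_income": 5200.0},
]

redundant = is_redundant(customers, "avg_monthly_income", mean_income)
print("avg_monthly_income is redundant:", redundant)
```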
Correlation analysis can help detect some data redundancy.
For example, given two attributes A and B, the relationship between them can be analyzed based on their values.
If the correlation value r between the two attributes satisfies r>0, the attributes are positively correlated, that is, B tends to increase as A increases. The larger the value of r, the stronger the positive relationship between attributes A and B.
If r=0, attributes A and B are uncorrelated, meaning there is no linear relationship between them (this is weaker than full statistical independence). If r<0, attributes A and B are negatively correlated, that is, B tends to decrease as A increases. The larger the absolute value of r, the stronger the negative relationship between attributes A and B.
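The text does not say which correlation measure is meant. Assuming it is the Pearson correlation coefficient, a minimal sketch of the computation looks like this; the function name and the sample values are illustrative only.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical attribute values: annual income is derived from monthly income,
# so the two attributes are perfectly positively correlated.
monthly_income = [3000, 4200, 5100, 6800]
annual_income = [12 * x for x in monthly_income]
r_positive = pearson_r(monthly_income, annual_income)

# A made-up attribute that falls as income rises gives a negative r.
discount_rate = [0.20, 0.15, 0.12, 0.05]
r_negative = pearson_r(monthly_income, discount_rate)

print(f"r (monthly vs annual income): {r_positive:.3f}")
print(f"r (monthly income vs discount): {r_negative:.3f}")
```

A value of r very close to 1 or -1 suggests that one of the two attributes may be redundant, although the decision to drop it still depends on what the attribute means.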
3. Data value conflict detection and elimination
For the same real-world entity, attribute values from different data sources may differ. This can be caused by differences in representation, scale, or encoding.
For example, the weight attribute may use the metric system in one system but the imperial system in another, and the price attribute may use different currency units in different locations. These semantic differences cause many problems for data integration.
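A common way to handle such conflicts is to convert every value to one canonical unit before comparing sources. The sketch below does this for the weight example; the unit labels, the function names, the sample values, and the 0.01 kg tolerance are assumptions made for illustration. Currency is left out because it would also require exchange rates for a chosen date.

```python
LB_TO_KG = 0.45359237  # international avoirdupois pound, exact by definition


def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight to kilograms; only 'kg' and 'lb' are handled in this sketch."""
    unit = unit.lower()
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown weight unit: {unit!r}")


def values_conflict(a, b, tolerance_kg=0.01):
    """Return True if two (value, unit) weights still differ after normalization."""
    return abs(to_kilograms(*a) - to_kilograms(*b)) > tolerance_kg


# Hypothetical values for the same entity taken from two sources.
source_a = (70.0, "kg")
source_b = (154.32, "lb")

print("conflict after normalization:", values_conflict(source_a, source_b))
```

Comparing the raw numbers 70.0 and 154.32 would report a conflict, while comparing the normalized values shows the two sources agree to within the tolerance.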