Introduction to Data Integration

Last Update:2020-06-17 Source: Internet

Author: User

Keywords: data integration, data integration techniques, data integration meaning


Data integration content

Definition: Data integration combines interrelated, distributed, heterogeneous data sources so that users can access them in a transparent manner.

Integration means maintaining overall data consistency across the data sources and improving the efficiency of information sharing and use.

Transparent access means that users do not need to care how data is retrieved from the heterogeneous data sources; they only need to decide which data to access and in what way.

Difficulties in data integration:

(1) Heterogeneity: The data sources being integrated are usually developed independently, and their data models are heterogeneous, which makes integration difficult. This heterogeneity shows up mainly in data semantics, in how data with the same semantics is represented, and in the environments in which the data sources are used.

(2) Distribution: The data sources are located in different places and rely on the network to transmit data, which raises issues of network transmission performance and security.

(3) Autonomy: Each data source is highly autonomous. It can change its own structure and data without notifying the integration system, which challenges the robustness of the data integration system.

Data integration methods:

1. Schema (pattern) integration method:

When the integration system is built, the data views of the individual data sources are integrated into a global schema, so that users can transparently access the data of every source through that global schema. The global schema describes the structure, semantics, and operations of the data that the sources share. Users submit requests directly against the global schema. The data integration system processes these requests and converts them into requests that each data source can execute against its own local data view. The defining characteristic of schema integration is that it gives users transparent data access directly.

Schema integration needs to solve two basic problems: constructing the mapping between the global schema and each data source's data view, and processing user queries posed against the global schema.
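The two problems above can be illustrated with a minimal sketch. The source names (`crm`, `billing`), the attribute names, and both helper functions are hypothetical and exist only for illustration; a real system would also have to rewrite predicates and joins, not just field names.

```python
# Hypothetical mapping: global attribute -> local field name, per data source.
MAPPINGS = {
    "crm": {"customer_name": "cust_nm", "city": "town"},
    "billing": {"customer_name": "name", "city": "city_name"},
}


def rewrite_query(global_fields, source):
    """Translate global attribute names into one source's local field names."""
    mapping = MAPPINGS[source]
    missing = [f for f in global_fields if f not in mapping]
    if missing:
        raise KeyError(f"{source} has no mapping for: {missing}")
    return [mapping[f] for f in global_fields]


def to_global(row, source):
    """Rename the keys of a local result row back to global attribute names."""
    reverse = {local: glob for glob, local in MAPPINGS[source].items()}
    return {reverse[k]: v for k, v in row.items()}
```

Here `rewrite_query` stands for the request-conversion step and `to_global` for presenting results in terms of the global schema.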

Federated databases and middleware integration are the two typical schema integration methods.

The federated database is an early schema integration method. The data sources in a federated database each share part of their own schema to form a federated schema. Federated database systems fall into two categories according to their degree of integration: tightly coupled and loosely coupled.

Tightly coupled: A tightly coupled federated database system uses a unified global schema and maps the data model of each data source to it, which resolves the heterogeneity between data sources. This approach offers a higher degree of integration and requires less user involvement; its disadvantages are that the algorithm for constructing the global schema is complex and that it scales poorly.

Loosely coupled: A loosely coupled federated database system has no global schema and uses federated schemas instead. It provides a unified query language and leaves many heterogeneity problems for users to resolve. Loose coupling gives a lower degree of data integration, but the data sources keep strong autonomy, the system copes well with dynamic change, and no global schema has to be maintained.

The middleware integration method is another typical schema integration method, and it also uses a global schema. G. Wiederhold proposed the earliest framework for middleware-based integration. Unlike a federated database, a middleware system can integrate not only structured data sources but also semi-structured or unstructured ones, such as web information. The TSIMMIS system, developed by Garcia-Molina and others at Stanford University, is a typical middleware integration system.

A typical middleware-based data integration system (Figure 2) consists mainly of middleware and wrappers. Each data source has its own wrapper, and the middleware interacts with each data source through that wrapper. Users send query requests to the middleware in terms of the global schema. The middleware processes each request, converts it into sub-queries that the individual data sources can handle, and optimizes this process to increase the concurrency of query processing and reduce response time. A wrapper encapsulates one specific data source, converts its data model to the common model used by the system, and provides a consistent access mechanism. The middleware sends each sub-query to the appropriate wrapper; the wrapper interacts with the data source it encapsulates, executes the sub-query, and returns the result to the middleware.
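The interaction between middleware and wrappers described above might be sketched as follows. This is a simplified illustration, not the TSIMMIS design: the class names, the `query(predicate)` interface, and the in-memory stand-ins for data sources are all assumptions, and the "sub-query" here is simply the same predicate sent to every wrapper.

```python
from concurrent.futures import ThreadPoolExecutor


class ListWrapper:
    """Wrapper around an in-memory list of dicts (stand-in for a database)."""

    def __init__(self, rows, field_map):
        self.rows = rows
        self.field_map = field_map  # local field name -> global attribute name

    def query(self, predicate):
        # Convert each local row to the common model, then filter.
        converted = ({self.field_map[k]: v for k, v in r.items()} for r in self.rows)
        return [r for r in converted if predicate(r)]


class CsvTextWrapper:
    """Wrapper around comma-separated text (stand-in for a non-database source)."""

    def __init__(self, text, columns):
        self.text = text
        self.columns = columns  # global attribute names, in column order

    def query(self, predicate):
        lines = (line for line in self.text.splitlines() if line)
        rows = (dict(zip(self.columns, line.split(","))) for line in lines)
        return [r for r in rows if predicate(r)]


class Mediator:
    """Middleware: fans a query out to every wrapper and merges the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        # Run the sub-queries concurrently; map() keeps the wrapper order.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda w: w.query(predicate), self.wrappers))
        return [row for part in parts for row in part]
```

A caller only ever sees global attribute names: `Mediator([...]).query(lambda r: r["city"] == "Oslo")` returns matching rows from every source in the common model, regardless of how each source stores them.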

Middleware focuses on processing and optimizing global queries. Compared with a federated database system, its advantages are that it can integrate non-database data sources, offers good query performance, and preserves strong autonomy. Its disadvantage is that it is usually read-only, whereas a federated database supports both reading and writing.


2. Data copy method:

This method copies the data of each data source to other related data sources, maintaining overall data consistency across the sources and improving the efficiency of information sharing and use.

Data warehouse: This approach copies data from the various data sources to a single place, a data warehouse. Users then access the data warehouse directly, as they would an ordinary database.
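As a rough sketch of the data warehouse approach, the example below copies rows from several source databases into one warehouse table using Python's standard `sqlite3` module. The `orders` table, its columns, and the helper names are assumptions made for illustration; a production load would also handle incremental updates and schema differences.

```python
import sqlite3


def make_source(rows):
    """Create a hypothetical in-memory source database with an `orders` table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return db


def load_warehouse(sources):
    """Copy every source's rows into one warehouse table, tagged by source name."""
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE orders (source TEXT, id INTEGER, amount REAL)")
    for name, db in sources.items():
        rows = db.execute("SELECT id, amount FROM orders").fetchall()
        wh.executemany(
            "INSERT INTO orders VALUES (?, ?, ?)",
            [(name, order_id, amount) for order_id, amount in rows],
        )
    wh.commit()
    return wh
```

After loading, a single query such as `SELECT source, SUM(amount) FROM orders GROUP BY source` runs against the warehouse alone, without touching the original sources.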

Data heterogeneity issues:

The distinction between syntactic and semantic heterogeneity can be traced back to differences in how the data sources were modeled. When the data sources share the same entity-relationship model and differ only in their naming rules, the heterogeneity between them is purely syntactic. When the data sources were modeled with different granularity, different relationships between entities, or different semantic representations of field data, semantic heterogeneity inevitably arises, which makes data integration considerably harder.

Syntactic heterogeneity: generally refers to differences in naming rules and data types between the source data and the destination data. For databases, naming rules cover table names and field names. Syntactic heterogeneity is relatively simple to handle: field-to-field and record-to-record mappings are enough to resolve name conflicts and data type conflicts. Such mappings are straightforward and relatively easy to implement.
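A field-to-field mapping of this kind can be sketched as a small rule table. The field names and the `FIELD_RULES` structure below are hypothetical; the point is only that each source field is renamed and its value converted to the destination type.

```python
# Hypothetical rule table: source field -> (destination field, type converter).
FIELD_RULES = {
    "CUST_ID": ("customer_id", int),
    "CUST_NM": ("customer_name", str),
    "BAL": ("balance", float),
}


def map_record(source_record):
    """Apply field-to-field renaming and type conversion to one record."""
    return {
        dest: convert(source_record[src])
        for src, (dest, convert) in FIELD_RULES.items()
    }
```

Applying `map_record` to every record of a source table gives the record-to-record mapping.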

Semantic heterogeneity: common cases involve field splitting, field merging, field data format conversion, and moving fields between records.
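Three of these transformations are illustrated in the sketch below; moving fields between records is not shown. The record layout (`full_name`, `street`, `city`, `joined`) and the assumed source date format (DD/MM/YYYY) are invented for the example.

```python
from datetime import datetime


def split_name(record):
    """Field splitting: `full_name` becomes `first_name` and `last_name`."""
    first, _, last = record["full_name"].partition(" ")
    return {"first_name": first, "last_name": last}


def merge_address(record):
    """Field merging: `street` and `city` become a single `address` field."""
    return {"address": f"{record['street']}, {record['city']}"}


def convert_date(record):
    """Format conversion: a DD/MM/YYYY string becomes ISO 8601 (YYYY-MM-DD)."""
    parsed = datetime.strptime(record["joined"], "%d/%m/%Y")
    return {"joined": parsed.date().isoformat()}


def transform(record):
    """Run all three steps and collect their output into one destination record."""
    out = {}
    for step in (split_name, merge_address, convert_date):
        out.update(step(record))
    return out
```

Unlike the syntactic case, each step here has to inspect and reshape the field contents rather than just rename or cast them, which is why semantic heterogeneity is harder to resolve.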

3. Comprehensive integration method:

The schema integration method provides users with a global data view and a unified access interface, and is highly transparent. However, it does not exchange data between the data sources, and users often need to access several data sources at once, so it requires good network performance.

Data copy method: Before a user uses a given data source, data that the user may need from other data sources is copied over in advance. The user then only needs to access one data source, or a few, which greatly improves the efficiency with which the system handles user requests. However, data replication usually involves a delay, so it is difficult to guarantee real-time consistency between data sources.

To overcome the limitations of both methods, they are usually combined; this is the so-called comprehensive (integrated) method. It generally aims to improve the performance of a middleware-based system: it still offers users a virtual schema view, and it also replicates data that is commonly used across data sources. For simple access requests, the comprehensive method uses replicated data wherever possible, satisfying the user's cross-site request from the local data source or a single data source. For complex requests that cannot be served from replicated data, it falls back on the virtual view.
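The replica-first, virtual-view-fallback behavior described above might look like the following sketch. The `HybridIntegrator` class, its key-value interface, and the use of plain callables as remote sources are simplifying assumptions; staleness of the replica is not modeled.

```python
class HybridIntegrator:
    """Serve requests from replicated data first, then from remote sources."""

    def __init__(self, replica, remote_sources):
        self.replica = replica                # locally copied, commonly used data
        self.remote_sources = remote_sources  # callables standing in for wrappers
        self.remote_calls = 0                 # counts accesses to remote sources

    def lookup(self, key):
        # Simple request: answer from the replicated data when possible.
        if key in self.replica:
            return self.replica[key]
        # Otherwise fall back to the virtual view and ask each source in turn.
        for fetch in self.remote_sources:
            self.remote_calls += 1
            value = fetch(key)
            if value is not None:
                return value
        return None
```

The `remote_calls` counter makes the trade-off visible: lookups served from the replica cost no network access, while everything else pays for one or more remote calls.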

4. Other technologies:

Grid technology:

The data analysis and computation carried out by scientific research institutes have become increasingly complex and now require multiple devices and systems to work together. Grid technology addresses this by providing what amounts to a virtual, giant supercomputer system. The ultimate goal of data grid technology is to establish an integrated architecture and environment for storing, managing, accessing, transmitting, and serving massive data in a heterogeneous, distributed environment. In short, data grid technology mainly addresses unified access to, and management of, distributed and heterogeneous mass storage resources in a wide-area environment, which helps with the problem that massive data is difficult to organize and process. Data grid technology builds on computational grid technology and has considerable research and application value for science involving large data sets: it provides a supporting platform for large-scale, data-intensive, or collaborative scientific applications and research carried out across a wide area.

Ontology technology:

An ontology is an explicit description of the concepts in a given domain and the relationships between them. It is a key technology of the Semantic Web. Ontology technology can express the semantics of data clearly and supports automatic reasoning based on description logic, which offers a new approach to semantic heterogeneity and should be of great significance for heterogeneous data integration.

Combining ontology technology with middleware: This approach adopts a middleware architecture, supports virtual views or collections of views, and does not store the actual data held in any of the heterogeneous databases. To better resolve semantic heterogeneity, an ontology library is introduced into the middleware.

Application layer: Provides users with an interface for accessing the databases.

Middleware layer: The middleware layer hides the distribution and heterogeneity of the data sources at a higher level. From the user's point of view, all data appears to be local and in the same service domain; processing specific query requests and returning results is the responsibility of the middleware layer. The middleware consists of three main parts: the mediator, the wrappers, and the ontology library. The mediator in turn includes four functional components: a query generator, a query decomposition engine, a query execution engine, and result processing.

Data source layer: Each data source manages its own data locally.
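To give a feel for how an ontology library can help with semantic heterogeneity, the sketch below uses a tiny concept hierarchy so that a query for a general concept also matches records labeled with more specific ones. The hierarchy, the record layout, and both functions are hypothetical, and the transitive subclass check is only a very simplified stand-in for description logic reasoning.

```python
# Hypothetical ontology: each concept maps to its direct parent concept.
SUBCLASS_OF = {
    "Laptop": "Computer",
    "Desktop": "Computer",
    "Computer": "Product",
    "Phone": "Product",
}


def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or is a transitive subclass of it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)  # None once the root is passed
    return False


def query_by_concept(records, ancestor):
    """Select records whose `type` falls under the requested concept."""
    return [r for r in records if is_a(r["type"], ancestor)]
```

With this in place, sources that label items as "Laptop" or "Desktop" can both answer a query posed in terms of "Computer", without either source changing its own vocabulary.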

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
