N Ways of Data Integration

Source: Internet
Author: User

From what I have seen in a number of enterprises, quite a few information systems have been rolled out over the past several years: ERP, PDM, CSM, DSERP and so on, often seven or eight in total. These have raised the level of information management to some extent, but they have also brought a new problem. Much enterprise data has to be maintained in several systems, and the data in different systems is often inconsistent, so the systems need to be integrated. Because their architectures differ, the current approach is mainly data-level integration.

In my view, data integration can be divided by timeliness into two kinds: real-time and non-real-time. The current approach is non-real-time: for the data that needs to be consolidated, one system exports it in XML format and another system processes it on a schedule. Non-real-time integration is easy to implement; its drawback is that it cannot integrate the systems seamlessly in real time. Real-time data integration can be achieved either by integrating directly at the database layer or through a service-oriented architecture (SOA). For products from different vendors, opening database interfaces to other vendors is generally hard to accept, and even opening interfaces between the products of a single company is difficult to get accepted. My personal feeling is that the future trend is mainly to use SOA for data integration.
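The non-real-time exchange described above can be sketched roughly as follows. This is a minimal illustration, not any particular product's format: the record fields, the `export_to_xml` and `import_from_xml` helpers, and the in-memory `target_store` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def export_to_xml(records):
    """Source system: serialize a list of dict records into an XML string."""
    root = ET.Element("records")
    for rec in records:
        item = ET.SubElement(root, "record")
        for key, value in rec.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def import_from_xml(xml_text, target):
    """Target system: parse the XML and upsert each record by its 'id' field."""
    for item in ET.fromstring(xml_text):
        rec = {child.tag: child.text for child in item}
        target[rec["id"]] = rec
    return target


# Hypothetical material records exported by one system.
exported = export_to_xml([
    {"id": "M001", "name": "bolt", "unit": "pcs"},
    {"id": "M002", "name": "nut", "unit": "pcs"},
])

# The receiving system would run this step on a timer (cron job, scheduler, etc.).
target_store = import_from_xml(exported, {})
```

Between two scheduled runs the receiving system does not see changes made in the source, which is the lack of real-time integration mentioned above.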

SOA has been a hot topic in the industry for the past two years. Many companies, such as IBM, SAP and Oracle, offer their own solutions, and the range of options is dazzling and hard to choose from. After its acquisition of BEA, however, Oracle's combined strength in servers and databases gives its solution a slight edge over those of other companies.

Here is some information I have gathered about Oracle's real-time data integration solutions, which I would like to share.

Real-time data integration typically involves two steps: first, integrating the data into the SOA architecture to form usable information; second, publishing that information in a way that conforms to SOA specifications. Specifically, real-time data integration can follow one of four patterns:

The first pattern processes and integrates data in the middleware layer, and the consolidated data is published through the middleware layer's standard interfaces.

The middle tier contains a virtual data service layer. This layer connects to the various data sources in the data layer through JDBC, file adapters, application adapters and so on, and maps the data entities in those sources to tables in the middleware's virtual data layer. The tables in the virtual data layer hold only metadata; they do not store actual production data. Users can define data mappings on the virtual data layer through a visual graphical interface and specify how the data is processed and integrated; this processing logic is generally stored in files or in a database. The defined data can be published in a variety of ways, such as Web services, JDBC and data objects. When a user accesses data in the virtual data layer through the middleware, the virtual data layer first extracts the detail data from each data source according to the defined logic, the middleware then processes it according to the processing logic defined at design time, and finally the middleware returns the processed data in the format required by the calling interface.
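As a rough sketch of this idea, the following Python example defines a view that stores no data of its own, only the list of sources and the combining logic, and pulls rows from every source concurrently when it is queried. The `VirtualView` class, the two fetch functions and their sample rows are hypothetical stand-ins for real adapters; this is not how any specific middleware product implements its virtual layer.

```python
from concurrent.futures import ThreadPoolExecutor


# Two stand-in data sources. In practice these would sit behind JDBC,
# file adapters or application adapters; here they are plain functions.
def fetch_erp_orders():
    return [{"customer_id": 1, "amount": 120.0},
            {"customer_id": 2, "amount": 80.0},
            {"customer_id": 1, "amount": 40.0}]


def fetch_crm_customers():
    return [{"customer_id": 1, "name": "Acme"},
            {"customer_id": 2, "name": "Globex"}]


class VirtualView:
    """Holds only metadata: which sources to read and how to combine them."""

    def __init__(self, sources, combine):
        self.sources = sources    # name -> fetch function
        self.combine = combine    # processing logic defined at design time

    def query(self):
        # Pull detail rows from every source concurrently, at request time.
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(fn) for name, fn in self.sources.items()}
            rows = {name: fut.result() for name, fut in futures.items()}
        return self.combine(rows)


def total_by_customer(rows):
    """Join orders to customers and sum the order amounts per customer name."""
    names = {c["customer_id"]: c["name"] for c in rows["customers"]}
    totals = {}
    for order in rows["orders"]:
        name = names[order["customer_id"]]
        totals[name] = totals.get(name, 0.0) + order["amount"]
    return totals


view = VirtualView(
    {"orders": fetch_erp_orders, "customers": fetch_crm_customers},
    total_by_customer,
)
result = view.query()
```

Every call to `query()` goes back to the sources and does the join in the middle tier, so the result is always current, but all detail rows travel to the middleware first.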

The advantages of using the virtual data service layer are:

1. Processing takes place on the middleware server, so data processing is relatively flexible and the application is loosely coupled to the underlying data.

2. When a request involves multiple underlying data sources, the underlying data access can be performed in a concurrent manner.

3. Thanks to the flexibility of middleware, data can be exposed through various kinds of interfaces, which greatly facilitates application development.

4. All data is fetched from the data sources in real time, which keeps it current.

The problem with this pattern is that data processing is done in the middleware layer. First, data has to be transferred from the sources to the middleware layer. Second, middleware is generally built on a Java EE architecture, whose strength is not data processing. This is not serious when the data volume is small, but when the data volume is very large, this mechanism is bound to cause efficiency problems.

The second pattern processes data in the data source layer; the consolidated data is then published to the middleware layer through standard interfaces, and the middleware layer is responsible for data access.

This is the approach generally recommended by database vendors and ETL vendors. Following the user's business requirements, the data transformation process is first designed in the data source layer with an ETL tool, and the transformation logic is then published as a Web service. The transformed data is also published as a Web service, and both services are registered in the middleware layer. When a front-end user needs data, two Web services have to be invoked. The first is the transformation service, which calls the corresponding ETL tool to process and integrate the data and stores the consolidated data in a temporary table. The second is the data service, which reads the processed data directly from the temporary table. This pattern differs from the first in that it moves data processing into the data source layer, which has the following advantages:

1. ETL tools are built for data integration and are well suited to large data volumes, so efficiency is high when the data volume is large.

2. Integrating in the data source layer makes full use of the database's processing power; after all, databases are the experts at data processing.

3. With the change data capture feature of the E-LT tool, data can be processed incrementally.

4. Data transformation and data retrieval are loosely coupled, so processing can be asynchronous.

The problems with this pattern are:

1. Because data processing depends on the processing power of the database, every data source must have a relational database system, whereas in the first pattern the middleware does the processing and places no restriction on the data sources.

2. The application's process design has to call two Web services, one for the transformation and one for reading the data, which is somewhat superfluous when the data volume is very small.
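The two-service flow of this second pattern can be sketched as follows, with an in-memory SQLite database standing in for the data source layer and two plain functions standing in for the Web services. The table names, the sample rows and the `transform_service` and `data_service` functions are assumptions for illustration; a real deployment would use an ETL tool and actual service endpoints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('Acme', 120.0), ('Globex', 80.0), ('Acme', 40.0);
    CREATE TABLE tmp_customer_totals (customer TEXT PRIMARY KEY, total REAL);
""")


def transform_service(db):
    """Service 1: run the transformation inside the database (set-based SQL)."""
    with db:  # commit on success, roll back on error
        db.execute("DELETE FROM tmp_customer_totals")
        db.execute("""
            INSERT INTO tmp_customer_totals (customer, total)
            SELECT customer, SUM(amount) FROM orders GROUP BY customer
        """)


def data_service(db):
    """Service 2: return whatever the last transformation left in the table."""
    cur = db.execute(
        "SELECT customer, total FROM tmp_customer_totals ORDER BY customer")
    return cur.fetchall()


# The caller has to invoke both services, in this order.
transform_service(conn)
totals = data_service(conn)
```

Because the aggregation runs as SQL inside the database, only the summarized rows leave the data source layer. The price is the extra call: `data_service` returns stale or empty results if `transform_service` has not been run first.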

The third pattern integrates data from the data layer into an ODS or data warehouse and then publishes the processed data to the middleware layer through standard interfaces.

To ensure a global view of enterprise data, we can create a global operational database, the ODS (operational data store), which stays synchronized in real time with the other data sources in the enterprise through change data capture (CDC). When data in a data source changes, CDC captures the changed data and synchronizes it to the ODS database through an ETL tool or by other means, such as a master data management tool.
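The synchronization step can be pictured with a small sketch. It assumes that each source exposes its changes as an ordered list of events carrying a sequence number, and that the ODS can be modeled as a dictionary keyed by record ID; the event layout and the `apply_changes` function are hypothetical and do not reflect the log format of any real CDC tool.

```python
def apply_changes(ods, change_log, last_seq):
    """Apply change events newer than last_seq to the ODS; return the new position."""
    for event in change_log:
        if event["seq"] <= last_seq:
            continue  # already synchronized in an earlier run
        if event["op"] == "delete":
            ods.pop(event["key"], None)
        else:  # "insert" and "update" are both treated as upserts
            ods[event["key"]] = event["row"]
        last_seq = event["seq"]
    return last_seq


# Hypothetical change events captured from a source system, in commit order.
change_log = [
    {"seq": 1, "op": "insert", "key": "C1", "row": {"name": "Acme"}},
    {"seq": 2, "op": "insert", "key": "C2", "row": {"name": "Globex"}},
    {"seq": 3, "op": "update", "key": "C1", "row": {"name": "Acme Corp"}},
    {"seq": 4, "op": "delete", "key": "C2"},
]

ods = {}
position = apply_changes(ods, change_log[:2], 0)     # first sync: events 1-2
position = apply_changes(ods, change_log, position)  # later sync: only 3-4 applied
```

Remembering the last applied sequence number is what makes the processing incremental: each run touches only the rows that changed since the previous run.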

The last point is how the data is published. The middleware layer is responsible for data access, and the data in the ODS can be wrapped as Web services published on the middleware layer. When a front-end business process needs integrated data, it accesses the data in the ODS directly. If the integration is more complex, we can, according to the user's business needs, use an ETL tool or other tools (as in the second pattern) to process data from the unified model layer into a summary data layer, and then access the data from the summary data layer.

The fourth pattern uses a data grid to bring data from the data layer into the middle tier. The middleware is responsible for processing and integrating the data, which is then published in a standard way. It is very similar to the first pattern: data processing, integration and publishing all take place in the middleware layer. The only difference is that data grid technology adds an object cache layer to the middle tier. When a client accesses data, the process is the same as in the first pattern, except that the data is read from the data grid cache in the middleware layer. This reduces data source access and network transfer time, so access is much faster, which resolves the problems of the first pattern to some extent. However, data processing still occurs in the middleware layer, so if the middleware's processing capacity is limited, the efficiency of the system will be limited too.
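The effect of the cache layer can be illustrated with a single-process read-through cache. A real data grid is distributed across a cluster and handles replication and eviction, none of which is modeled here; the `ReadThroughCache` class and the `backend` dictionary are assumptions made for the sketch.

```python
class ReadThroughCache:
    """Object cache in front of a slower data source (a stand-in for a data grid)."""

    def __init__(self, loader):
        self.loader = loader    # function that reads one key from the data source
        self.store = {}
        self.source_reads = 0   # counts how often the data source is actually hit

    def get(self, key):
        if key not in self.store:
            self.source_reads += 1
            self.store[key] = self.loader(key)
        return self.store[key]

    def invalidate(self, key):
        # Must be called when the source row changes, or readers see stale data.
        self.store.pop(key, None)


# Hypothetical back-end table.
backend = {"M001": {"name": "bolt"}, "M002": {"name": "nut"}}
grid = ReadThroughCache(lambda key: backend[key])

first = grid.get("M001")    # miss: loaded from the back end
second = grid.get("M001")   # hit: served from the cache
```

Only the first read reaches the back end; later reads are served from memory in the middle tier. Any joining or aggregation of the cached objects still has to be done by the middleware.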

The advantages of this pattern are:

1. The system scales well; the scalability of the data grid layer determines the scalability of the whole system.

2. When a single machine's processing capacity is insufficient, performance can be greatly improved by clustering.

3. The front-end data is truly loosely coupled from the back-end data sources; the data grid is responsible for interacting with the various back-end data sources.

Its problems are:

1. Data still has to be processed and consolidated in the middleware layer.

2. If the application is already in production, it has to be modified to use the interface provided by the data grid.

Each of the four patterns above has its own scope of application. In general, the closer data processing is to the bottom layer, the higher the efficiency and the lower the flexibility; the higher up it is, the lower the efficiency and the greater the flexibility. In fact, no data integration pattern is inherently good or bad. The key is the business requirements: a pattern is good enough as long as it meets them.
