N Ways of Data Integration

Last Update:2020-09-21 Source: Internet

Author: User


At the companies I know, there has been no shortage of systems introduced over the past few years in the course of corporate informatization. A single company may run seven or eight systems, such as ERP, PDM, CSM, and DSERP, which has improved the level of corporate information management to a certain extent. But this creates another problem: much of an enterprise's data has to be maintained in several systems, and inconsistencies between those systems occur often, so the systems need to be integrated. Because the system architectures differ, the current approach is mainly data-level integration.
To summarize, data integration can be divided by timeliness into two types: real-time and non-real-time. The current scheme is non-real-time: for data that must be shared between systems, one system periodically exports the data in XML format, and another system periodically processes it. Non-real-time integration is easier to implement, but the downside is that it cannot integrate the systems seamlessly in real time. Real-time data integration can be achieved by integrating directly at the database layer or through a service-oriented architecture (SOA). For products from different vendors, opening up database interfaces is generally not acceptable to the other vendors; even between a single company's own products, opening up interfaces is hard to get accepted. I personally feel that the future trend is mainly to use SOA to achieve data integration.
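The non-real-time scheme described above (one system exports XML, another imports it later) can be sketched roughly as follows. This is only an illustration: the function names, the record layout, and the element names are assumptions, not part of any system mentioned in this article.

```python
import xml.etree.ElementTree as ET


def export_to_xml(records):
    """Source system: serialize records (a list of dicts) to an XML string."""
    root = ET.Element("records")
    for rec in records:
        item = ET.SubElement(root, "record", id=str(rec["id"]))
        for field, value in rec.items():
            if field != "id":
                ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


def import_from_xml(xml_text, target):
    """Target system: upsert each exported record into `target`, keyed by id."""
    root = ET.fromstring(xml_text)
    for item in root.findall("record"):
        target[item.get("id")] = {child.tag: child.text for child in item}
    return len(target)


# Hypothetical data: parts maintained in one system and copied to another.
erp_parts = [{"id": 1, "name": "bolt", "price": 0.5}]
pdm_parts = {}
import_from_xml(export_to_xml(erp_parts), pdm_parts)
```

In practice both steps would run on a schedule, so a change made in the source system only becomes visible in the target system after the next export and import cycle. That delay is the cost of this approach.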
SOA has been a hot topic in the industry over the past two years. Many companies, such as IBM, SAP, and Oracle, offer their own solutions. The number of solutions is dazzling, and choosing among them is not easy. However, after Oracle acquired BEA, its combined strength in application servers and databases gives its solution considerable advantages over those of other companies.
The following is some information I collected about Oracle's real-time data integration solution, which I would like to share.
Real-time data integration generally consists of two processes: one integrates and processes data according to the needs of the SOA architecture to form usable information, and the other publishes that information in a manner that conforms to the SOA specification. Depending on how these two processes are carried out, real-time data integration can be divided into the following four modes:
The first mode processes and integrates data in the middleware layer and publishes the integrated data through the standard interfaces of the middleware layer.
The middle layer contains a virtual data service layer, which connects to the various data sources of the data layer through JDBC, file adapters, application adapters, and so on, and maps the data entities in those sources to tables in the middleware's virtual data layer. The tables in the virtual data layer hold only metadata and do not store actual production data. Users can use a visual graphical interface to define data mapping relationships on the virtual data layer and to perform data processing and integration. This data processing logic is generally stored in files or in a database. The defined data can be published in various ways, such as web services, JDBC, and data objects. When a user accesses data in the virtual data layer through the middleware, the virtual data layer first extracts the detailed data to be processed from each data source according to the defined logic, the middleware then processes the data according to the logic defined at design time, and finally the middleware returns the processed data in the format required by the calling interface.
The advantages of using a virtual data service layer are:
1. All processing takes place on the middleware server. Data processing is therefore relatively flexible, and the application and the underlying data are loosely coupled.
2. When a request involves multiple underlying data sources, the sources can be accessed concurrently.
3. Thanks to the flexibility of middleware, data can be exposed through external interfaces in various ways, which greatly facilitates the development of applications.
4. All data is taken from the data source in real time to ensure the timeliness of the data.
The problem with this mode is that data processing is carried out in the middleware layer. First, the data has to be transferred from the data sources to the middleware layer. Second, middleware is generally built on a J2EE architecture, whose strength is not data processing; when the data volume is very large, this implementation mechanism is bound to have efficiency problems.
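As a rough illustration of the first mode, the sketch below models a virtual view that stores only metadata (which sources to read and which key to join on), fetches rows from both sources concurrently at request time, and joins them in the middleware. The two fetch functions, the `VirtualView` class, and the sample rows are all hypothetical stand-ins for real adapters and data.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-ins for two back-end systems; in practice these would be
# JDBC, file, or application adapters.
def fetch_erp_orders():
    return [{"order_id": 1, "part_no": "P-1", "qty": 10},
            {"order_id": 2, "part_no": "P-2", "qty": 3}]


def fetch_pdm_parts():
    return [{"part_no": "P-1", "part_name": "bolt"},
            {"part_no": "P-2", "part_name": "nut"}]


class VirtualView:
    """Holds only metadata: which sources to read and how to join them."""

    def __init__(self, left, right, key):
        self.left, self.right, self.key = left, right, key

    def query(self):
        # Pull detail rows from both sources concurrently, at request time.
        with ThreadPoolExecutor(max_workers=2) as pool:
            left_future = pool.submit(self.left)
            right_future = pool.submit(self.right)
            left_rows = left_future.result()
            right_rows = right_future.result()
        # Apply the design-time logic (here, an inner join) in the middleware.
        lookup = {row[self.key]: row for row in right_rows}
        return [{**row, **lookup[row[self.key]]}
                for row in left_rows if row[self.key] in lookup]


order_view = VirtualView(fetch_erp_orders, fetch_pdm_parts, key="part_no")
rows = order_view.query()
```

Every call to `query()` moves all detail rows into the middleware process before joining them, which is exactly the transfer and processing cost described above when the data volume grows.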
The second mode processes the data at the data source layer and then publishes the integrated data to the middleware layer through a standard interface; the middleware layer is responsible for data access.
This processing method is generally recommended by database vendors or ETL vendors. Based on the user's business requirements, the data conversion process is first designed at the data source layer with ETL tools, and the conversion logic is then published as a web service. The data is also published as web services, and all of these services are registered with the middleware layer. When a front-end user needs data, two web services have to be called. The first is the transformation web service, which calls the corresponding ETL tool to integrate and process the data and stores the integrated data in a temporary table. The second is the data service, which retrieves the processed data directly from the temporary table. The difference from the first mode is that data processing is placed in the data source layer. Its advantages are:
1. ETL tools are built for data integration and suit the integration of large amounts of data, so they are very efficient at large data volumes.
2. Integrating at the data source layer makes full use of the processing power of the database, which is, after all, specialized for data processing.
3. Incremental data can be processed by relying on the change data capture function of the E-LT tool.
4. Data conversion and data acquisition are loosely coupled, allowing asynchronous processing.
The problems with this model are:
1. Since data processing depends on the processing capabilities of the database, at least one of the data sources must be a relational database system. In the first mode, by contrast, the middleware is responsible for data processing, so the data sources are not restricted.
2. In the process design of the application, the web service has to be called twice, once for conversion and once for reading the data. When the data volume is very small, this is somewhat superfluous.
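The two-call pattern of the second mode can be sketched with an in-memory SQLite database standing in for the relational data source. The table names, the columns, and the `transform_service` and `data_service` functions are assumptions made for illustration; a real deployment would expose them as web services backed by an ETL tool.

```python
import sqlite3


def setup_source(conn):
    """Create hypothetical source tables with a few sample rows."""
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER, part_no TEXT, qty INTEGER);
        CREATE TABLE parts (part_no TEXT, unit_price REAL);
        INSERT INTO orders VALUES (1, 'P-1', 10), (2, 'P-2', 3), (3, 'P-1', 5);
        INSERT INTO parts VALUES ('P-1', 0.5), ('P-2', 2.0);
    """)


def transform_service(conn):
    """First call: run the conversion inside the database and stage the result."""
    conn.executescript("""
        DROP TABLE IF EXISTS tmp_part_totals;
        CREATE TABLE tmp_part_totals AS
            SELECT o.part_no,
                   SUM(o.qty) AS total_qty,
                   SUM(o.qty * p.unit_price) AS total_value
            FROM orders o JOIN parts p ON p.part_no = o.part_no
            GROUP BY o.part_no;
    """)


def data_service(conn):
    """Second call: read the already-processed rows from the staging table."""
    return conn.execute(
        "SELECT part_no, total_qty, total_value "
        "FROM tmp_part_totals ORDER BY part_no"
    ).fetchall()


conn = sqlite3.connect(":memory:")
setup_source(conn)
transform_service(conn)   # call 1: conversion
totals = data_service(conn)  # call 2: data reading
```

The join and aggregation run entirely inside the database engine, and the caller only receives the small staged result. The price is the extra round trip, which is why the pattern feels heavy for very small data sets.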
The third mode consolidates the data scattered across the data layer into an ODS or data warehouse for integrated processing and then publishes the processed data to the middleware layer through a standard interface.
To provide the enterprise with a global data view, we can establish a global operational data store (ODS) that is synchronized with the other data sources in the enterprise through change data capture (CDC). When data in a source changes, CDC captures the changed data and synchronizes it to the ODS database through ETL tools or other means (such as master data management tools).
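The synchronization idea can be sketched as follows. Real change data capture typically reads database logs or uses triggers; here a simple in-memory change log stands in for that mechanism, and the `SourceSystem` and `OdsSync` classes are hypothetical names used only for this illustration.

```python
class SourceSystem:
    """Stand-in for a source database that records every change in a log."""

    def __init__(self):
        self.rows = {}
        self.change_log = []  # entries: (sequence, operation, key, value)

    def upsert(self, key, value):
        self.rows[key] = value
        self.change_log.append((len(self.change_log) + 1, "upsert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.change_log.append((len(self.change_log) + 1, "delete", key, None))


class OdsSync:
    """Applies only the changes captured since the previous sync to the ODS."""

    def __init__(self, source, ods):
        self.source, self.ods, self.last_seq = source, ods, 0

    def run(self):
        applied = 0
        for seq, op, key, value in self.source.change_log:
            if seq <= self.last_seq:
                continue  # already synchronized in an earlier run
            if op == "upsert":
                self.ods[key] = value
            else:
                self.ods.pop(key, None)
            self.last_seq = seq
            applied += 1
        return applied


erp = SourceSystem()
ods = {}
sync = OdsSync(erp, ods)
erp.upsert("P-1", {"name": "bolt"})
erp.upsert("P-2", {"name": "nut"})
first_run = sync.run()
erp.delete("P-2")
second_run = sync.run()
```

Because each run applies only the changes recorded after the last synchronized sequence number, the ODS stays current without re-copying the full source tables.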
The last point is how the data is published. In this mode, the middleware layer is responsible for data access. The data in the ODS can be encapsulated as a web service and published on the middleware layer. When a front-end business process requires integrated data, it can access the data in the ODS directly. If the data integration is more complicated, we can use ETL tools or other tools (as in the second mode) to process the data in the unified model layer according to the user's business needs, place the results in a summary data layer, and then access the data from the summary data layer.

The fourth mode uses a data grid to integrate the data of the data layer in the middle layer. The middleware is responsible for data processing and integration and then publishes the data in a standard way. It is very similar to the first mode: the integration, processing, and publishing of data all take place in the middleware layer. The only difference is that data grid technology adds an object caching layer to the middle layer. When a client accesses data, the process is no different from the first mode, but the data to be accessed is cached in the middleware layer by the data grid. This reduces the time spent on data source access and network transmission and greatly speeds up access, which addresses the shortcomings of the first mode to a certain extent. Data processing still occurs in the middleware layer, however, so if the middleware's processing capacity is limited, the efficiency of the system will be limited as well.
Advantages of this model:
1. The system has good scalability. The scalability of the data grid layer determines the scalability of the entire system.
2. When the processing power of the machine is insufficient, the performance can be greatly improved by clustering.
3. It truly achieves loose coupling between front-end data and back-end data sources. The data grid is responsible for the interaction with the various back-end data sources.
Its problems are:
1. Data is still processed and sorted in the middleware layer.
2. If the application is already online, it has to be modified to use the interface provided by the data grid.
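The caching effect of the fourth mode can be illustrated with a minimal read-through cache. A real data grid distributes its entries across cluster nodes; this single-process sketch only shows how repeated requests avoid the back-end source. The `ReadThroughCache` class and the sample back end are assumptions made for illustration.

```python
class ReadThroughCache:
    """Serves cached objects and loads from the source only on a miss."""

    def __init__(self, loader):
        self._loader = loader
        self._entries = {}
        self.source_reads = 0  # counts how often the back end was hit

    def get(self, key):
        if key not in self._entries:
            self._entries[key] = self._loader(key)
            self.source_reads += 1
        return self._entries[key]

    def invalidate(self, key):
        """Drop a cached entry so the next read goes back to the source."""
        self._entries.pop(key, None)


# Hypothetical back-end data source.
backend = {"P-1": {"name": "bolt"}, "P-2": {"name": "nut"}}
cache = ReadThroughCache(lambda key: backend[key])

cache.get("P-1")
cache.get("P-1")  # served from the cache, no second source read
reads_after_two_gets = cache.source_reads
```

The application has to call `cache.get` instead of querying the source directly, which mirrors the second problem above: an application that is already online must be changed to use the grid's interface.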
Each of the four modes above has its own range of application. In general, the closer data processing is to the bottom layer, the higher the efficiency and the lower the flexibility; the higher the layer in which it takes place, the lower the efficiency and the better the flexibility. In fact, no data integration mode is inherently good or bad. The key is the business need: any mode that meets it is suitable.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
