Big Data Management: Techniques, Methodologies and Best Practices for Data Integration (reading notes, part two)


This part again covers the data integration development process, batch data integration, and ETL.

Data Integration life cycle

1 Determining the scope of the project



2 Profile Analysis

The second step of the life cycle, profiling, is often overlooked, because data integration is seen as a purely technical activity. Access to production data is sensitive, so analyzing the data currently stored in candidate source and target systems for the purpose of developing a data interface can be difficult to arrange. Yet analysis of the actual data is often what determines success or failure. Almost every data integration project uncovers problems with the actual data in the source and target systems, and these problems often heavily influence the design of the solution. For example, the data may contain unexpected content, lack expected content, or be of poor quality; sometimes data that is needed does not exist at all. Negotiation between the data owners and the security team continues until an acceptable arrangement is reached for profiling the source and target data involved.


3 Including business knowledge and expertise

In the field of data management, data integration is often seen as highly technical work, staffed by technical experts from beginning to end. At the other extreme, data governance and data quality are treated as almost entirely business-oriented processes. Efficient data integration, however, also requires an understanding of the data being transferred between systems.
Many applications and projects that rely on data integration (data conversion, data warehousing, master data management) have been delayed during implementation not because a technical solution was lacking, but because business knowledge and the involvement of business people were lacking.
Defining data transformations, a core process in data integration development, must be reviewed and validated by people who have a deep business understanding of the data. Many tools can try to infer the relationships between data in different systems, either from the similarity between field names or by analyzing the actual data content. Ultimately, however, these inferences are only guesses that must be verified by the business people who actually use the data.
The real challenge is that in the data integration life cycle shown in Figure 4-1, some steps cannot easily be separated and assigned to either a technical person or a business person. Steps such as requirements analysis and defining mapping and transformation rules must be completed either by someone proficient in both technology and business, or by people from multiple teams working together. Defining mapping and transformation rules, for example, requires knowledge of the physical implementation of the data as well as knowledge of how the data is actually used in the business.

Technical and business knowledge of the source and target data is critical to the success of a data integration project. Every step of the development cycle therefore requires close collaboration among resources from different functional areas, which is one of the challenging aspects of data integration project development.


Batch data integration

Most interfaces between systems take this form: periodically (daily, weekly, or monthly), a large data file is transferred from one system to a different system. The contents of the file are structurally consistent data records, and the sending and receiving systems must understand and agree on the format of the file. This is called batch mode, because the data is organized into "batches" and sent periodically, rather than being sent record by record in real time. Figure 5-1 depicts this standard batch data integration process: data is transferred between two systems, the sending system and the receiving system. This is also known as a "point-to-point" approach.


Batch-processing life cycle

One of the best practices recommended for successfully building a data interface is to profile the actual data in the source and target system structures. Profiling production data helps clarify what the target data should be designed to hold and what the source data should supply. Basic profiling includes understanding aspects of the data structure design beyond what is documented or assumed, such as uniqueness, density (nulls and blanks), format, and valid values.
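As a rough illustration of this kind of basic profiling, the sketch below computes uniqueness, blank density, and coarse format patterns for a few columns using only the Python standard library; the file name and column names are made up for the example.

```python
import csv
import re
from collections import Counter

def profile_column(rows, column):
    """Report uniqueness, blank density, and frequent format patterns for one column."""
    values = [r.get(column, "") for r in rows]
    non_blank = [v for v in values if v.strip() != ""]
    # Reduce each value to a coarse pattern: digits become 9, letters become A.
    patterns = Counter(re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_blank)
    return {
        "rows": len(values),
        "blank_pct": 100.0 * (len(values) - len(non_blank)) / max(len(values), 1),
        "distinct": len(set(non_blank)),
        "unique": len(set(non_blank)) == len(non_blank),
        "top_formats": patterns.most_common(3),
    }

# Hypothetical source extract; replace with a real file pulled from the source system.
with open("customer_extract.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for col in ("customer_id", "birth_date", "postal_code"):
    print(col, profile_column(rows, col))
```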




ETL

The core function of data integration is to obtain data from where it is currently stored, convert it into a format compatible with the target system, and then load it into the target system. These three steps are called extraction, transformation, and loading (Extract, Transform, and Load: ETL).
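The three steps can be pictured as three small functions chained together. The sketch below is only an outline under assumed table and column names (orders, orders_clean, and so on), not the book's implementation.

```python
import sqlite3

def extract(source_db):
    """Extract: read the rows to be copied from where the data currently lives."""
    with sqlite3.connect(source_db) as conn:
        return conn.execute("SELECT id, name, amount FROM orders").fetchall()

def transform(rows):
    """Transform: reshape each row into the format the target structure expects."""
    return [(row_id, name.strip().upper(), round(amount, 2)) for row_id, name, amount in rows]

def load(target_db, rows):
    """Load: write the converted rows into the target data structure."""
    with sqlite3.connect(target_db) as conn:
        conn.executemany("INSERT INTO orders_clean (id, name, amount) VALUES (?, ?, ?)", rows)
        conn.commit()

if __name__ == "__main__":
    load("target.db", transform(extract("source.db")))
```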


1 Profile Analysis

With profiling tools, you can obtain reports about the format and content of the actual source data: the percentage of invalid or empty values, counts of distinct values, high-frequency values, and the formats of the content. Some profiling tools can make inferences from field names or from the actual field content, so that relationships between data located in different data stores can be discovered.
During data profiling it is often found that the source data needs to be corrected and cleaned before ETL is implemented. In some cases corrections can be made automatically, but more commonly the data is corrected manually by a business person who is familiar with it. Whether to correct the issues in the source data structure, or to let the discovered issues flow into the target data store, is a decision that must be made by the business people.
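The field-name inference mentioned above can be imitated in a few lines. The sketch below scores name similarity between two made-up field lists; as the text stresses, such output is only a guess that business users still have to verify.

```python
from difflib import SequenceMatcher

# Hypothetical field lists from two different data stores.
source_fields = ["cust_id", "cust_name", "birth_dt", "postal_cd"]
target_fields = ["customer_id", "customer_name", "date_of_birth", "zip_code"]

def best_match(field, candidates):
    """Guess the most similar target field name; the result still needs business review."""
    scored = [(SequenceMatcher(None, field, c).ratio(), c) for c in candidates]
    return max(scored)

for src in source_fields:
    score, guess = best_match(src, target_fields)
    print(f"{src:12} -> {guess:15} (similarity {score:.2f})")
```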


2 Extraction

To implement the extraction part of ETL, it is necessary to access the system or data store where the data currently resides. A basic understanding of the format and internals of this data is needed in order to choose which data to copy.
There are two basic ways to extract data: the current (source) system copies out a set of data itself, or another system, a dedicated extraction system, reaches in and pulls the data. The benefit of having the source system perform the extraction is that the system (and its support staff) that stores the data understands the meaning of the data as well as the technical details of how it is held. However, having the source system perform the extraction raises several potential problems. The source system from which data must be extracted is usually a production system, so we do not want additional extraction work to adversely affect its production performance. Also, the source system's support staff may be too busy to build the extraction jobs, may not be trained in data extraction techniques and tools, or may not regard data extraction as a priority of their work.
Using a dedicated extraction program to pull data from the system that currently stores it has less effect on the source system. Even so, going through the existing data storage engine, such as a database management system or file management system, can still lead to some resource contention. Developers of data extraction applications also need training in data extraction techniques and tools.
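A minimal sketch of the second approach: a dedicated extraction program that pulls data from the source store, reading in batches to limit the load it places on the production system. The table and file names are hypothetical.

```python
import csv
import sqlite3

def pull_extract(source_db, out_path, batch_size=10_000):
    """Dedicated extraction program: pull rows from the source store in small
    batches and write them to an extract file for later transfer."""
    with sqlite3.connect(source_db) as conn, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        cursor = conn.execute("SELECT id, name, amount FROM orders")  # hypothetical table
        writer.writerow([d[0] for d in cursor.description])  # header row
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)

pull_extract("source.db", "orders_extract.csv")
```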


3 Temporary storage

The result of the data extraction process is usually a "file" containing data that will be copied and transferred somewhere else, usually another server or platform. Saving the extracted file on the source server platform, and copying the extracted data to the target server platform or anywhere in between, can be referred to as "data staging". In general, there are data staging points on both the source and the target server platforms. Staging data allows sends and receives to be tracked and audited, and it decouples the timing of processing, giving loose coupling between the source and target systems, or asynchronous processing: the two systems do not need to work on the data at the same time, although they can, as long as each system processes it independently, because the data is always extracted from the source system first.
However, reading from and writing to disk, also known as I/O (input/output), is always very slow compared with processing data in memory. In an ETL design, bypassing the data staging points can make processing more than ten times faster, which matters greatly when speed comes first, but it also sacrifices the benefits of tracking and auditing the data and of loose coupling between the systems.
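A small sketch of data staging under assumed paths: the extracted file is copied into a timestamped staging directory on the source side and again on the target side, so each send and receive can be tracked and the target can load the file later, asynchronously.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def stage(extract_file, staging_root):
    """Copy an extracted file into a timestamped staging area so every send and
    receive can be tracked and audited, and the file can be loaded later."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    staging_dir = Path(staging_root) / stamp
    staging_dir.mkdir(parents=True, exist_ok=True)
    return shutil.copy2(extract_file, staging_dir / Path(extract_file).name)

# Hypothetical paths: one staging point on the source platform, one on the target.
staged_on_source = stage("orders_extract.csv", "/data/staging/outbound")
staged_on_target = stage(staged_on_source, "/data/staging/inbound")
```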


4 Access levels

To design an extraction scheme, it is usually necessary to understand the different levels of security access protecting the data to be extracted, even though some or all of these access levels may be logically invisible or virtual, and access management may be automated. The access levels involved include at least the organization's network and firewall layer, the server (or physical) layer, the operating system layer, the application layer, and the data structure layer.
When integration needs to span organizations, a security access layer must be crossed in order to enter the domain of interest of an organization. In most cases this is implemented logically with firewalls that protect the organization's network.


5 Conversion

The process of converting data to be compatible with the target data structure can be very simple, or it can be difficult and require gathering additional information. The transformation process needs very detailed business requirements, which are typically written or endorsed by the business owner of the data. The extraction and loading processes are almost entirely technology-oriented, so apart from ensuring that the right data is extracted, little business review is involved.


5.1 Simple Mapping

5.2 Lookup Table

5.3 Aggregation and normalization

5.4 Calculation
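As a combined sketch of the transformation types named in 5.1 through 5.4, the code below applies a simple field mapping, a lookup table, a calculation, and an aggregation to a few made-up order rows; all names and values are illustrative only.

```python
from collections import defaultdict

# Hypothetical lookup table: source system codes mapped to target system codes.
COUNTRY_LOOKUP = {"US": "USA", "CN": "CHN", "DE": "DEU"}

def transform(rows):
    """Apply a simple mapping, a lookup, and a calculation to each source row."""
    out = []
    for row in rows:
        out.append({
            "order_id": row["id"],                               # simple mapping (rename)
            "country": COUNTRY_LOOKUP.get(row["ctry"], "UNK"),   # lookup table
            "total": round(row["qty"] * row["unit_price"], 2),   # calculation
        })
    return out

def aggregate(rows):
    """Aggregation: roll transformed rows up to one total per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["total"]
    return dict(totals)

sample = [{"id": 1, "ctry": "US", "qty": 3, "unit_price": 9.5},
          {"id": 2, "ctry": "US", "qty": 1, "unit_price": 20.0},
          {"id": 3, "ctry": "CN", "qty": 2, "unit_price": 5.0}]
print(aggregate(transform(sample)))
```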

5.5 Loading

The next step in the ETL process is loading, which puts the data into the target data structure. In some discussions you will see extract-transform-load and extract-load-transform compared, asking which order of processing is more efficient. The difference between the two is whether the transformation step is performed on the source server or on the target server (or on a separate server); which is faster depends on the volume of data being processed and on which engine processes it more quickly. In addition, whether or not data is staged at one or more points during extraction can significantly affect speed. In short, the next step in ETL is to load the data into the target data structure, whether physical or virtual.
There are two main ways to load data into the target data store: insert the data directly with code, or use an existing application to insert the data into the target data store. Whenever possible, it is better to use existing application code, because that code embodies knowledge of how data is supposed to be stored in the target data structure. Although the current application code may not have been developed for large data loads and may not fit within the load window, optimizing the load process around existing code is still usually a better choice than writing separate code.
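The two loading paths can be sketched as follows; `create_order` is a placeholder for whatever routine the existing target application actually exposes, and the table and column names are assumptions.

```python
import sqlite3

def load_direct(target_db, rows):
    """Direct load: insert rows straight into the target structure.
    Fast for bulk loads, but bypasses any rules built into the application."""
    with sqlite3.connect(target_db) as conn:
        conn.executemany(
            "INSERT INTO orders_clean (order_id, country, total) VALUES (?, ?, ?)", rows)
        conn.commit()

def load_via_application(rows, create_order):
    """Load through existing application code: create_order stands in for the
    application's own insert routine, reusing its knowledge of how data should
    be stored, at the cost of per-row overhead."""
    for row in rows:
        create_order(*row)
```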

