Application Practice of Extension Methods in InfoSphere DataStage

InfoSphere DataStage is a powerful tool for data extraction, transformation, and loading (ETL), and is widely used in information integration projects. It provides a rich set of data interfaces that connect to mainframes, databases, ERP/CRM systems, other enterprise applications, and external information resources. It also provides dozens of data conversion stages and hundreds of data conversion functions, which cover most data conversion needs. The most commonly used stages include:

Stages for data sources and targets: database connectors, Sequential File, Data Set

Stages for data consolidation: Lookup, Join, Merge, Aggregator

Stages for data conversion: Transformer, Remove Duplicates

Auxiliary stages: Row Generator, Peek, Sort

Typically, these stages provide 80% to 90% of the application logic required by most enterprise data consolidation applications. InfoSphere DataStage also provides C++ and Java programming interfaces that allow us to develop and extend transformation capabilities to meet specific requirements.

In terms of its underlying architecture, InfoSphere DataStage provides a parallel, extensible application framework. As shown in Figure 1, the framework consists of four parts: application components, the application framework, configuration files, and application services. Each application component performs a distinct data processing or analysis function. The four parts are described as follows.

Figure 1. DataStage parallel, extensible application framework

Application component: The basic processing unit that implements the application's processing logic, such as database interfaces, joins, analysis, sorting, and grouped aggregation. Application components are connected together to carry out a data processing job. The framework also allows users to build their own processing components by wrapping existing applications or by developing C/C++ programs. All such components are reusable and support parallel processing.

Application framework: Used to build parallel processing applications. Through a graphical interface or the command line, the user selects processing components and connects them in sequence to implement a processing flow. Internally, the application framework parallelizes the job automatically, handling load balancing, communication between partitions, data schema handling, multithreading, and data buffering. Users get the efficiency of parallel processing without writing any parallel code, because the framework does this work for them.
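The idea of connecting components in sequence to form a processing flow can be illustrated with a minimal Python sketch. This is not DataStage API: the component names (`source`, `transform`, `remove_duplicates`) and the `run_flow` helper are hypothetical, and Python generators merely stand in for the framework's in-memory hand-off between components.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def source(rows: Iterable[Row]) -> Iterator[Row]:
    """Hypothetical source component: emits rows one at a time."""
    for row in rows:
        yield row


def transform(rows: Iterable[Row]) -> Iterator[Row]:
    """Hypothetical transform component: upper-cases the 'name' column."""
    for row in rows:
        yield {**row, "name": str(row["name"]).upper()}


def remove_duplicates(rows: Iterable[Row], key: str) -> Iterator[Row]:
    """Hypothetical de-duplication component: keeps the first row per key."""
    seen = set()
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            yield row


def run_flow(rows: Iterable[Row]) -> List[Row]:
    # The components are connected in sequence. Each row passes through
    # all three in turn; no intermediate copy of the full data set is built.
    return list(remove_duplicates(transform(source(rows)), key="id"))
```

Calling `run_flow` on a list of rows containing `id` and `name` columns returns the de-duplicated rows with upper-cased names. The flow definition itself says nothing about parallelism, which in DataStage is left to the framework.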

Configuration file: Used to specify how a data processing application is parallelized on a given hardware platform. In applications that users develop themselves, application logic and hardware-specific code are often intertwined, which makes it difficult to migrate between hardware platforms. InfoSphere DataStage uses configuration files to isolate the hardware-specific settings, so an application can be migrated across platforms by changing the configuration file alone, without changing the application logic.
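The separation between application logic and platform settings can be sketched as follows. This is an illustration under stated assumptions, not the DataStage configuration file format: the `config` dictionary, its `nodes` key, and the function names are all hypothetical, and a thread pool stands in for the processing nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def process_partition(values: List[int]) -> int:
    """Application logic: knows nothing about the hardware layout."""
    return sum(v * v for v in values)


def run_job(values: List[int], config: Dict[str, List[str]]) -> int:
    """Run the same logic with the parallelism the configuration dictates."""
    nodes = config["nodes"]  # hypothetical configuration key
    n = len(nodes)
    # Round-robin split: one partition per configured node.
    partitions = [values[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return sum(pool.map(process_partition, partitions))


# Two hypothetical platforms; only the configuration differs.
single_node = {"nodes": ["node1"]}
four_nodes = {"nodes": ["node1", "node2", "node3", "node4"]}
```

Running `run_job(data, single_node)` and `run_job(data, four_nodes)` produces the same result. Only the degree of parallelism changes, and `process_partition` is untouched.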

Application services: Provide run-time management for data processing applications, including running a single program or a single operator across multiple CPUs, centralized collection of error information, performance monitoring, and checkpoint restart of jobs.

The stages provided by InfoSphere DataStage are mapped to operators in the parallel, extensible application framework. Some common stages correspond to the following operators:

Table 1. Mapping of DataStage stages to operators

	DataStage stage	Operator
	Sequential File–source	import
	Sequential File–target	export
	Data Set	copy
	Sort	tsort
	Aggregator	group
	Row Generator, Column Generator, Surrogate Key Generator	generator
	Oracle–source	oraread
	Oracle–sparse lookup	oralookup
	Oracle–target load	orawrite
	Join	innerjoin, leftouterjoin, rightouterjoin, fullouterjoin

The InfoSphere DataStage parallel, extensible application framework achieves efficient parallel processing through three techniques: data partitioning, pipelining, and data repartitioning. Together they safeguard the execution efficiency of data integration jobs.

Data partitioning distributes data across partitions using a hash algorithm, round-robin assignment, or random assignment, and improves throughput by processing the partitions in parallel. The number of partitions can be set according to the number of CPUs in a single machine or across several machines. For example, if the system has 4 CPUs, the data can be distributed across 4 partitions, and in theory execution is 4 times as fast as with a single partition.

Pipelining works like a production line: as soon as one stage finishes processing a piece of data, the data is passed through memory to the next stage. Because the data is not landed to disk, no I/O is incurred and efficiency is higher. Combined with data partitioning, pipelining lets every partition process its data without landing it, which yields efficient parallel processing.

Data repartitioning redefines the partitioning according to the needs of the processing step. It keeps data as evenly distributed across partitions as possible and enables co-located operations, in which operations such as joins work on data within the same partition, so that execution is as efficient as possible.

All data processing stages provided in InfoSphere DataStage are built on this parallel, extensible application framework, so all of them support parallel processing.
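The three partitioning methods and the co-location idea can be sketched in a few lines of Python. This is a conceptual illustration, not how DataStage implements its partitioners: all function names are hypothetical, and `zlib.crc32` is used only as a convenient deterministic hash.

```python
import random
import zlib
from typing import Dict, List

Row = Dict[str, object]


def hash_partition(rows: List[Row], key: str, n: int) -> List[List[Row]]:
    """Rows with the same key value always land in the same partition."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        # crc32 is stable across runs, unlike the built-in hash() for str.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % n
        parts[idx].append(row)
    return parts


def round_robin_partition(rows: List[Row], n: int) -> List[List[Row]]:
    """Rows are dealt out in turn, giving near-equal partition sizes."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def random_partition(rows: List[Row], n: int, seed: int = 0) -> List[List[Row]]:
    """Each row goes to a randomly chosen partition."""
    rng = random.Random(seed)
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[rng.randrange(n)].append(row)
    return parts


def colocated_join(left: List[Row], right: List[Row], key: str, n: int) -> List[Row]:
    """Hash both inputs on the join key, then join partition by partition."""
    joined: List[Row] = []
    left_parts = hash_partition(left, key, n)
    right_parts = hash_partition(right, key, n)
    for lpart, rpart in zip(left_parts, right_parts):
        # Matching keys are guaranteed to be in the same partition pair,
        # so each pair could be joined independently (and in parallel).
        index: Dict[object, List[Row]] = {}
        for r in rpart:
            index.setdefault(r[key], []).append(r)
        for l in lpart:
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined
```

The join in this sketch runs the partition pairs one after another for simplicity. The point is that after hashing both inputs on the join key, no partition needs data from any other partition, which is what makes a co-located join efficient.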

In addition, the InfoSphere DataStage parallel, extensible application framework is highly extensible, and we can develop our own specialized processing capabilities in a variety of ways. In InfoSphere DataStage, the main extension methods are the following:

This article is an English version of an article originally published in Chinese on aliyun.com.
