InfoSphere DataStage is a powerful tool for data extraction, transformation and loading (ETL), and is widely used in information integration projects. It provides a rich set of data interfaces that connect to mainframes, databases, ERP/CRM systems, other enterprise applications and external information sources. It also provides dozens of data transformation stages and hundreds of transformation functions, which cover most data transformation needs. The most commonly used stages include:
Stages for data sources and targets: connectors for various databases, Sequential File, Data Set
Stages for data consolidation: Lookup, Join, Merge, Aggregator
Stages for data transformation: Transformer, Remove Duplicates
Auxiliary stages: Row Generator, Peek, Sort
Typically, these stages provide 80% to 90% of the application logic required by most enterprise data integration applications. InfoSphere DataStage also provides C++ and Java programming interfaces that let us develop extensions for transformation requirements the built-in stages do not cover.
In terms of its underlying technical architecture, InfoSphere DataStage provides a parallel, extensible application framework. As shown in Figure 1, the framework consists of four parts: application components, the application framework, configuration files, and application services. Each application component performs a different data processing or analysis function. The four parts are:
Figure 1. DataStage parallel, extensible application framework
Application components: The basic processing units that implement application logic, including database interfaces, joins, analysis, sorting, and grouping and aggregation. These components are connected together to carry out a data processing job. The framework also lets users build custom components by wrapping existing applications or by developing C/C++ applications; all such components are reusable and gain parallel processing capability.
Application framework: Used to build parallel processing applications. Users select processing components through a graphical interface or the command line and connect them in sequence to implement a processing flow. Internally, the framework automatically parallelizes the job, handling load balancing, communication between partitions, data schema handling, multithreading and data buffering. Users get the efficiency of parallel processing without writing any parallel code, because the framework handles it automatically.
Configuration file: Specifies the degree of parallelism a data processing application uses on a given hardware platform. In applications that users develop themselves, application logic and hardware-specific code are often intertwined, which makes them hard to migrate between hardware platforms. InfoSphere DataStage isolates the hardware-specific details in the configuration file, so an application can be moved to a different platform by changing configuration settings, without changing the application logic.
Application services: Provide runtime management for data processing applications, including control of a single program and a single operator running across multiple CPUs, centralized collection of error information, performance monitoring, and job checkpoint/restart control.
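The division of labor between components, framework and configuration file can be illustrated with a small Python sketch. This is not DataStage code or its configuration file format: `keep_adults`, `uppercase_name`, `run_flow` and the `nodes` dictionary are hypothetical names chosen for illustration. The sketch only shows the idea that the processing logic stays the same while a separate configuration decides how many partitions run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical application components: each takes a stream of rows
# and yields a stream of rows, so rows pass between them in memory.
def keep_adults(rows):
    for row in rows:
        if row["age"] >= 18:
            yield row

def uppercase_name(rows):
    for row in rows:
        yield {**row, "name": row["name"].upper()}

def run_flow(rows, components, config):
    """Toy 'framework': splits rows into one partition per configured node
    and runs the same chain of components on every partition in parallel."""
    n = len(config["nodes"])
    partitions = [rows[i::n] for i in range(n)]  # round-robin split

    def run_partition(part):
        stream = iter(part)
        for component in components:  # connect components in sequence
            stream = component(stream)
        return list(stream)

    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(run_partition, partitions)
    return [row for part in results for row in part]

# Hypothetical configurations: only the node list differs.
one_node = {"nodes": ["node1"]}
four_nodes = {"nodes": ["node1", "node2", "node3", "node4"]}

people = [
    {"name": "ann", "age": 31},
    {"name": "bob", "age": 12},
    {"name": "cy", "age": 45},
    {"name": "dee", "age": 17},
    {"name": "eli", "age": 60},
]
flow = [keep_adults, uppercase_name]

small = run_flow(people, flow, one_node)
large = run_flow(people, flow, four_nodes)
```

Both runs produce the same set of rows, although the row order can differ because each partition is processed independently. Changing the degree of parallelism required editing only the configuration, not the components.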
The stages provided in InfoSphere DataStage map to operators in the parallel, extensible application framework. Some common stages correspond to the following operators:
Table 1. DataStage stages and their corresponding operators
| DataStage stage | Operator |
| --- | --- |
| Sequential File (source) | import |
| Sequential File (target) | export |
| Data Set | copy |
| Sort | tsort |
| Aggregator | group |
| Row Generator, Column Generator, Surrogate Key Generator | generator |
| Oracle (source) | oraread |
| Oracle (sparse lookup) | oralookup |
| Oracle (target load) | orawrite |
| Join | innerjoin, leftouterjoin, rightouterjoin, fullouterjoin |
The InfoSphere DataStage parallel, extensible application framework achieves efficient parallel processing through data partitioning, pipelining and repartitioning, which keeps data integration jobs running efficiently.
Data partitioning distributes data across partitions using a hash algorithm, round-robin assignment or random assignment, and improves throughput by processing the partitions in parallel. The number of partitions can be set according to the number of CPUs on a single machine and the number of machines available. For example, on a system with 4 CPUs we can distribute the data across 4 partitions; in theory, execution is then 4 times as fast as with a single partition.
Pipelining works like a production line: as soon as one stage finishes processing a piece of data, the data is passed through memory to the next stage. The data is not landed to disk, so there is no intermediate I/O and efficiency is higher. Combined with data partitioning, pipelining lets every partition process its data without landing it to disk, which gives efficient parallel processing.
Repartitioning redefines the partitioning to suit the processing requirements. The goal is to keep data evenly distributed across partitions while achieving co-location, meaning that operations such as joins find all the rows they need in the same partition, which gives the most efficient execution.
All data processing stages in InfoSphere DataStage are built on this parallel, extensible application framework, so all of them support parallel processing.
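The three partitioning methods and the co-location idea can be sketched in a few lines of Python. This is an illustration only, not how DataStage implements its partitioners: the function names, the use of CRC32 as the hash, and the sample data are all assumptions made for the example. The point is that when both inputs of a join are hash-partitioned on the join key, matching rows land in the same partition, so each partition can be joined locally.

```python
import random
import zlib
from collections import defaultdict

def hash_partition(rows, key, n):
    """Send each row to the partition chosen by a hash of its key value."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # CRC32 is an assumption here; any stable hash would do.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % n
        parts[idx].append(row)
    return parts

def round_robin_partition(rows, n):
    """Deal rows out to partitions in turn, giving an even spread."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def random_partition(rows, n, seed=0):
    """Send each row to a randomly chosen partition."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[rng.randrange(n)].append(row)
    return parts

def local_join(left, right, key):
    """Inner join of two row lists that live in the same partition."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def colocated_join(left_rows, right_rows, key, n):
    """Hash-partition both inputs on the join key, then join partition by partition."""
    left_parts = hash_partition(left_rows, key, n)
    right_parts = hash_partition(right_rows, key, n)
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        joined.extend(local_join(lp, rp, key))
    return joined

# Hypothetical sample data.
customers = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Bob"},
    {"cust_id": 3, "name": "Cy"},
]
orders = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 2},
    {"order_id": 12, "cust_id": 1},
    {"order_id": 13, "cust_id": 4},
]

joined = colocated_join(orders, customers, "cust_id", 4)
even_parts = round_robin_partition(list(range(8)), 4)
```

Round-robin and random partitioning spread rows evenly but ignore key values, so they suit stages that treat each row independently. A join or aggregation on a key needs key-based (hash) partitioning, which is why data may have to be repartitioned between stages.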
In addition, the InfoSphere DataStage parallel, extensible application framework is highly extensible, and we can develop our own processing capabilities in several ways. The main extension methods in InfoSphere DataStage are: