IBM InfoSphere DataStage 8.1 DataStage Job Development Detailed


Brief introduction

DataStage uses a client/server architecture: all projects and metadata are stored on the server, while the DataStage Designer client provides a graphical development environment for the entire ETL process, letting you design the extraction, cleansing, transformation, integration, and loading of data in a WYSIWYG fashion. DataStage's unit of execution is the DataStage job, which users design and develop in Designer. Jobs in DataStage are divided into Server jobs, Parallel jobs, and Mainframe jobs; Mainframe jobs are intended for mainframes, so the job types actually used here are Server jobs and Parallel jobs. This article describes how to use Server jobs and Parallel jobs for ETL development.

Server Job

A job is the unit that DataStage can run. The Server job is the simplest and most common type of job: you drag and drop the basic design units, called stages, into the workspace and connect them with links to represent the flow of data. The following functions can be implemented with a Server job:

    1. Define how data is extracted
    2. Define the flow of data
    3. Define the aggregation of data
    4. Define the transformation of data
    5. Define constraints on data
    6. Define the loading of data
    7. Define the writing of data

Parallel Job

The Server job is simple and powerful and is well suited to rapid development of ETL processes. The Parallel job differs from the Server job in that it provides a parallel execution mechanism that, when multiple nodes are available, can quickly improve data-processing efficiency. The Parallel job also contains more stages for different requirements, although each stage is often more restrictive in use than in the Server job.

Sequence Job

The Sequence job is used for collaborative control between jobs, using a graphical approach to bring together multiple jobs, specifying the order of execution, logical relationships, and error handling between jobs.

Connection to the data source

DataStage can connect directly to a very large number of data sources and is therefore very widely applicable. The data sources that can be connected include:

    • Text file
    • XML file
    • Enterprise applications such as SAP, PeopleSoft, Siebel, Oracle application
    • Almost all database systems, such as DB2, Oracle, SQL Server, Sybase ASE/IQ, Teradata, and Informix, as well as any database that can be connected via ODBC
    • Web Services
    • SAS, WebSphere MQ


Stage Overview in the Server Job

A stage is the basic element of a DataStage job. In the Server job, stages can be divided into the following five types:

    1. General
    2. Database
    3. File
    4. Processing
    5. Real time

This section describes how to develop a Server Job using DataStage, based on the example shown in Figure 1 below:

Figure 1. Server Job

Sequential File Stage

The Sequential File Stage can be used to read source data from a sequential file or to load data into a sequential file. When using the Sequential File Stage, you need to specify the path and name of the file, the file format, the column definitions, and the write mode (overwrite or append).

Figure 2. Sequential File properties

Figure 3. Sequential File column definitions

Figures 2 and 3 show the Sequential File used in the example in this section. On the Input page, the File Name parameter gives the actual path of the file; the file is created automatically if it does not exist. Setting Update Action to "Overwrite existing file" means the file is emptied before the data is loaded. On the Format page, you define the format of the file, such as the delimiter, the NULL value representation, and whether the first row contains the column names. On the Columns page, you enter the file's column definitions.

Hash File Stage

A Hash File groups records into one or more partitions according to a primary key and is usually used as the reference source for lookups in DataStage. During a reference lookup the Hash File is loaded into memory, so lookups are highly efficient.

Similar to the Sequential File, when using a Hash File you need to enter the actual path of the file, set the write options, and provide the column definitions of the data. Note that the Hash File requires a primary key; if none is specified, the first column is used as the primary key by default. During a reference lookup, the primary key value is used to search the Hash File: if a match is found, the data is returned; if not, NULL values are returned.

Figure 4. Hash File properties

Transformer Stage

The Transformer Stage is an important and powerful stage. It is responsible for the data conversion operations in the ETL process. In the Transformer Stage you can specify the source and destination of the data, map the corresponding input fields to output fields, and specify transformation rules and constraints.

Figure 5. Transformer Stage Column Mappings

The Transformer Stage is divided into 5 regions:

The upper-left area shows the input data, described in tabular form. If there are multiple input data streams, there are multiple tables. In this example there is one input and one reference lookup, so there are two tables in the upper-left area.

The upper-right area is the output information described in tabular form.

The lower-left area is the input metadata column definition, including the column name, type, and length properties.

The lower right area is the metadata column definition for the output, including properties such as column name, type, and length.

The upper-left and upper-right tables are connected by arrows that indicate the flow direction and represent the mapping between fields. In this example the input data has only one field, employee_id, and this field is used as the reference key into the Hash File to obtain the Employee_Name field. If employee_name is found in the Hash File, the data is sent to the output; this condition is implemented as a constraint in the Transformer Stage, which we define as Not(IsNull(lkp_name.employee_id)). In addition, whether or not matching data is found in the Hash File, we write every record to a CSV file, which corresponds to the save_all output.
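For readers who think in SQL, the logic of this example can be sketched roughly as follows. This is only an analogy: the table and column names (input_emp, lkp_name, employee_id, employee_name) are illustrative, and the real job performs an in-memory Hash File lookup rather than a database join.

    -- save_all output: every input record, with the looked-up name when one exists
    SELECT i.employee_id, l.employee_name
    FROM input_emp i
    LEFT JOIN lkp_name l ON l.employee_id = i.employee_id;

    -- Constrained output: only records found in the lookup, the analogue of the
    -- constraint Not(IsNull(lkp_name.employee_id))
    SELECT i.employee_id, l.employee_name
    FROM input_emp i
    INNER JOIN lkp_name l ON l.employee_id = i.employee_id;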


Stage Overview in the Parallel Job

The Parallel job provides a richer set of stages than the Server job, adding stages in the Development/Debug, Restructure, and Transactional categories. At the same time, for some functions that are handled inside the Transformer in a Server job, the Parallel job provides dedicated stages, such as Lookup, Join, and Compare, to improve run-time performance and development efficiency. Another notable difference is the built-in support for parallel execution in the Parallel job: parallel execution means the data must be partitioned and then collected again at each stage of the job, so when developing a job we need to set the partitioning and collecting strategies.

Lookup DataSet and Lookup Stage

The Parallel job adjusts the way lookups are implemented. In the Server job we generally use the Transformer Stage together with a lookup data source (usually a Hash File) to implement a lookup, and the same Transformer can perform several lookups at once, much like a multi-table natural join in SQL. If the lookup data source is a database stage rather than a Hash File and more than one lookup row is returned for a record, the job produces a warning (the unique-key property of the Hash File avoids this problem, because a duplicate inserted later simply overwrites the earlier record with the same primary key).

In the Parallel job, a lookup is implemented with a separate stage; the Transformer no longer moonlights as a lookup. A Lookup Stage can have one primary data link and multiple lookup links. At the same time, the lookup in the Parallel job has the following new features:

    • Multiple rows are supported: in a Lookup Stage, a lookup link may return more than one row of lookup data for a single row of primary input, and the result then also becomes multiple rows.
    • The Parallel job does not support the Hash File; it uses the more encapsulated Data Set stage instead. A data set is essentially a hashed data structure, but it hides the implementation details from the job developer, so we do not have to set the detailed tuning parameters manually as we do when developing a Server job.
    • In addition to equality lookups, the Parallel job directly supports range lookups and caseless (case-insensitive) lookups. This makes designs such as mapping a month or date to its quarter very convenient and natural (see the SQL sketch after this list).
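As a rough illustration of what a range lookup achieves, the following SQL sketch maps each transaction date to its quarter by joining on a date range. The table and column names (trans, quarter_lkp, and so on) are hypothetical and serve only to show the idea.

    -- Range lookup analogy: find the quarter whose date range contains each transaction date
    SELECT t.trans_id,
           t.trans_date,
           q.quarter_name
    FROM trans t
    JOIN quarter_lkp q
      ON t.trans_date BETWEEN q.quarter_start AND q.quarter_end;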

Similar to the Hash File in the Server job, in the Parallel job we use a data set file to cache the lookup data, which is loaded into memory. In the Data Set stage we only need to record the primary key and the name of the file in which the data is stored; the Parallel engine handles the other operations for us. To optimize performance, however, we sometimes need to set the data set's cache policy and cache size. The default cache size is 3 MB; if our lookup data is larger than that, we need to set an appropriate cache size, otherwise lookup performance will be severely affected.

Figure 6. Data Set cache settings

Sort Stage

The Sort Stage in the Parallel job behaves like ORDER BY in SQL, but provides more options (a rough SQL sketch follows the list below). In a job, the Sort Stage receives one input link and produces one output link. Using the Sort Stage should be simple for developers who have written SQL ORDER BY clauses or sort programs, but to take full advantage of the Parallel Sort Stage we still need to pay attention to the following points:

    • Parallel or serial execution. If serial execution is selected, the Sort Stage behaves like the Sort Stage in the Server job: the entire input is sorted according to the chosen sort options. If partition/parallel sorting is chosen, only the output within each partition is ordered. This is acceptable in some cases but leads to defects in others, so the choice must be made according to the operations that follow the sort.
    • If possible, sort in the database on the data source side. This not only improves sorting efficiency but also greatly reduces the job's memory and I/O pressure. The Sort Stage can only start producing output after it has received all of its input, so the job's subsequent stages are left waiting.
    • As with the column list following ORDER BY, we can specify the sort direction, ascending or descending, and the Sort Stage can sort on more than one field, the leading column being the primary sort key. If one or more of the sort fields are already ordered, we can also mark them as sorted, which improves sorting efficiency.
    • Stable sort / allow duplicates. Stable sort defaults to yes, so if two records have the same sort key, their output order is the same as their input order. If the allow-duplicates option is not set, only one of two or more records with the same sort key is kept.
    • Limiting memory usage. Sorting is very memory-intensive; if it is unrestricted and all of the data is sorted in memory, the job's other operations, or other jobs, will be severely affected. In the Sort Stage we can therefore set the maximum amount of memory (in MB) that this sort may use, striking a balance between acceptable sorting efficiency and memory consumption.
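As a point of reference for the ORDER BY analogy, the following sketch shows a two-key sort with the primary key ascending and the secondary key descending. The table and column names are hypothetical, and options such as stable sort and the memory limit have no direct SQL equivalent.

    -- Multi-key sort analogy: dept_id is the primary sort key, salary the secondary key
    SELECT employee_id, dept_id, salary
    FROM employee
    ORDER BY dept_id ASC, salary DESC;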

Compare / Difference / Change Capture Stage

The Compare, Difference, and Change Capture stages are the three stages in the Parallel job used to compare data sets for similarities and differences. None of the three is difficult to use by itself, and their basic parameters and settings are concise and intuitive, so this introduction focuses on what the three stages have in common and how they differ. Once you understand each stage's characteristics, you can not only choose the right stage for a requirement but also know from those characteristics which parameters need to be set.

What they have in common:

    • Each takes two inputs and produces one output.
    • The primary key fields of the two inputs have the same names, and you need to specify which fields to compare.
    • An integer result field is added to the output data to indicate the result of comparing the two rows.

How they differ:

    • The Change Capture Stage's output is based on the after input stream, plus a change_code field; it is designed to be used with the Change Apply stage to synchronize the before stream so that it matches the after stream.
    • The Difference Stage's output is based on the before input stream, plus a change code field.
    • The Compare Stage's output includes both the before and after records, plus the result code field.

Here is an example of the Change Capture Stage:

Figure 7. Example of the Change Capture Stage

Before source SQL: SELECT k,v FROM (VALUES (1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9), (10,10)) AS temp (k,v) ORDER BY k ASC

After source SQL: SELECT k,v FROM (VALUES (2,2), (11,11), (4,5), (5,5), (6,6), (7,7), (8,8), (9,9), (10,10)) AS temp (k,v) ORDER BY k ASC

Figure 8. Capture Change Stage parameter settings

As the settings above show, we explicitly specify the primary key and treat the remaining columns as values. Rows whose comparison result is "same" are removed from the output, so we see only the differences, that is, the deletes, edits, and inserts needed to turn the before data into the after data. Figure 9 shows the result of the job run:

Figure 9. Comparison results

As the results above show, there are three differences between the before and after data, with change_code values of 2, 3, and 1 respectively, representing the delete, update, and insert that would have to be applied to the before data. To synchronize the two data sets, we only need to perform the corresponding delete, update, and insert on the before data.
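To make the change codes concrete, the following SQL sketch computes roughly what Change Capture produces for the data in Figure 7 (1 = insert, 2 = delete, 3 = edit, matching the values described above). It is only an analogy: the table names before_data and after_data are hypothetical, and the real stage performs this comparison internally on sorted, partitioned data.

    -- Keep only rows that differ; unchanged rows are dropped
    SELECT COALESCE(a.k, b.k) AS k,
           COALESCE(a.v, b.v) AS v,
           CASE
             WHEN b.k IS NULL THEN 1   -- key only in after:  insert
             WHEN a.k IS NULL THEN 2   -- key only in before: delete
             ELSE 3                    -- key in both, value changed: edit
           END AS change_code
    FROM before_data b
    FULL OUTER JOIN after_data a ON a.k = b.k
    WHERE b.k IS NULL OR a.k IS NULL OR a.v <> b.v;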

Filter Stage

The Filter Stage, as its name implies, is a stage for filtering. It plays the same role as a WHERE clause in SQL, and the logical expressions and operators it supports are similar to those of a SQL WHERE clause. For example, in the Filter Stage we can use the following common expressions and operators, and combinations of them:

    • true and false
    • The six comparison operators: =, <>, <, >, <=, >=
    • IS NULL and IS NOT NULL
    • LIKE and BETWEEN

In terms of its grammatical role, the Filter Stage resembles a switch/case statement in Java or C: the value of the "Output Row Only Once" option determines whether a break follows each case (where) clause, and adding or removing a Reject Link determines whether there is a default clause. The following is a simple example that shows how we route employee records to different result files based on a combination of employee number and salary.

Figure 10. Filter Stage example

Figure 11. Filter Stage settings

For each where condition we need to set the corresponding output link, which is specified as an integer number; the mapping between these numbers and the output link names can be found on the "Link Ordering" tab.
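The where conditions in an example like this might look roughly as follows. The concrete predicates and column names (employee_id, salary, name) are hypothetical and are chosen only to illustrate the operators listed above and how each condition maps to a numbered output link.

    -- Output link 0: low employee numbers with a high salary
    SELECT * FROM employee WHERE employee_id <= 1000 AND salary > 5000;

    -- Output link 1: a name pattern or a salary range
    SELECT * FROM employee WHERE name LIKE 'A%' OR salary BETWEEN 2000 AND 5000;

    -- Rows matching no condition go to the Reject Link (the "default" branch)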

It is also important to note that the Filter Stage does not change the input records in any way; it only distributes them. You can, however, manually set the output columns so that each output has a different column list, as long as it is a subset of the input column list. For the Reject Link, the column list is exactly the same as the input and cannot be changed.

Stages for debugging

We know that the DataStage Server job provides a debugging facility: when we run into problems during development, we can run the job in debug mode and closely watch the flow and transformation of each row of data between the stages in the job. The Parallel job does not offer this debugger, but it provides debugging capability in another way: it has several built-in stages intended for debugging, and with these stages we can dump the intermediate data we suspect is problematic wherever we need to and then drill down to the root cause of the problem. The following stages are available for debugging in the Parallel job:

    • Head Stage
    • Tail Stage
    • Sample Stage
    • Peek Stage
    • Row Generator Stage
    • Column Generator Stage

We use the Peek stage as an example of how to use the Development/Debug stages; the other stages are used in a similar way, and details can be found in the table below and in the documentation. The source SQL of the DB2 stage in our job is as follows:

SELECT k,v FROM (VALUES (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9), (10,10), (11,11), (12,12), (13,13), (14,14), (15,15), (16,16), (17,17), (18,18), (19,19), (20,20)) AS temp (k,v)
Figure 12. Example of the Peek job

Figure 13. Peek Stage settings

Figure 14. Results of the Peek

The following table outlines the features and usage of these stages:

Table 1. Features and usage of the debugging stages

Head Stage
    Type: one input, one output
    Use: fetches rows from the beginning of the input stream
    Settable items: number of rows to fetch; which partitions to fetch from; starting position in each partition; interval between fetched rows

Tail Stage
    Type: one input, one output
    Use: fetches the last N rows of the input stream
    Settable items: number of rows to fetch; which partitions to fetch from

Sample Stage
    Type: one input, multiple outputs
    Use: fetches data from each partition of the input stream according to a set policy; each output stream has its own percentage setting
    Settable items: percentage; seed of the random number generator; maximum number of rows fetched per partition

Peek Stage
    Type: one input, two outputs
    Use: selectively "peeks" at data passing through a data stream; one output carries the input data onward, the other produces text containing the peeked data
    Settable items: interval between peeked rows; number of rows peeked per partition; which columns to peek at; whether to send the peek results to the job log or to the output

Row Generator Stage
    Type: no input, one output
    Use: generates mock data according to a defined schema
    Settable items: schema (column list or schema file); number of rows to generate

Column Generator Stage
    Type: one input, one output
    Use: adds new columns to an input stream and generates mock data for the newly added or existing columns
    Settable items: the columns for which to generate mock data


Sequence Job

If each Server job or Parallel job implements one sub-process of the ETL flow, extracting, transforming, or loading data, then the role of the Sequence job is to string these sub-processes together into a complete, end-to-end ETL process. Structurally, a Sequence job resembles a function in C or Java, but it is more powerful.

  • You can use the User Variables Activity stage to define local variables. Variables are assigned when they are defined; the assignment expression can be a system variable, a job parameter, a DataStage macro, a constant, the return value of a routine, and so on, or a conditional, mathematical, or string operation on these values. Almost everything you can do with local variable definitions in a function can be done here. The example below defines six variables.
  • Other functional modules can be called: Server jobs and Parallel jobs are called through the Job Activity stage, UNIX/Windows commands are invoked through the Execute Command stage, and DataStage routines are called through the Routine Activity stage.
  • Loops are supported: the Sequence job provides looping through the StartLoop Activity stage and the EndLoop Activity stage. The loop variable can be a numeric loop based on a start value, an end value, and a step, or a loop over a given list, and the loop variable can be passed to each individual iteration. Between the StartLoop Activity stage and the EndLoop Activity stage you can add any number of other activity stages.
  • Logical operations are supported: the Nested Condition stage supports switch-like logic, and the Sequencer stage supports AND and OR logic. By combining these stages, arbitrarily complex logical control can be implemented.
  • Email notification is supported: using the Notification stage, the Sequence job can send one or more notification emails when a job runs successfully, fails, or meets some other configured condition, which makes it easier to monitor the running status of the jobs. The message content can include the job's run status, its current parameters, and so on; any variable that can be referenced in the User Variables stage can be included in the message, as can a file that we specify and the run status of the Sequence job itself.
  • Error handling and cleanup are supported: using the Terminator Activity stage and the Exception Handler stage, we can define which errors need to be handled and use the defined policy to stop unnecessary job runs when an error occurs.
  • Waiting is supported through the Wait For File Activity stage: we can specify that the stages following the Wait For File Activity stage start executing only after a given signal file appears or disappears.

The figure below shows a simple Sequence job. At the beginning of the job we define a set of variables that are referenced when looping and when sending notification emails. We then use the job parameter yearlist to start the loop; in each iteration we call the job Extract_trans. If the job call fails, we send an email notifying that the job failed; otherwise we move on to the next iteration. After the loop ends, an email notifies that the Sequence job ran successfully.

Figure 15. Job overview

Figure 16. Variable definitions

Figure 17. StartLoop Stage definition

Figure 18. Job Activity trigger definitions

