Optimization Principles of DataStage


DataStage Job Optimization Guideline I: Optimize the algorithm.
The first step in optimizing any program is optimizing its algorithm. This is not limited to computer programs; it shows up everywhere in daily life. All roads lead to Rome, and there are many ways to get anything done, some better and some worse, some efficient and some not. So the first step in making anything more efficient is improving the way it is done. In a computer program, that means optimizing the algorithm. Only algorithmic optimization can make a program ten, a hundred, or even tens of thousands of times faster.
Yet in real job development, most people overlook this. The reason is simple: most people regard job development as low-level work. The stages in common use number fewer than ten, and it is tempting to think that mastering those ten stages is all it takes to develop jobs well. It is true that an actual job may use no more than ten stages, the most important being the Oracle stage, Lookup stage, Join stage, Transformer stage, and so on. But how to choose the right stage for a given scenario, how to balance the load between DataStage and the database, how to decide which tables to join and in what order, and so on, are the questions that really need consideration. Developing a job that meets the functional requirement is not hard; what is hard is meeting that requirement with less resource consumption and higher efficiency.
DataStage Job Optimization Guideline II: Minimize the amount of data the DS needs to process.
In simple terms, this refers mainly to two points. The first is to minimize the amount of data extracted from the database into the DS temporary buffer (both the number of records and the number of bytes); the second is to avoid unnecessary data processing inside DS itself.
This is easy to say but hard to do. Open a job at random and, perhaps 80% of the time, you will find one or both of the problems mentioned above.
First, regarding the first point: it is common to find a job that extracts hundreds of thousands or even millions of rows from the data source into DS, then inner-joins them with a small table (say, 200,000 rows); after the join, the result may be only a third or even a tenth of the rows extracted from the source. So why not do the inner join of the two tables in the database with SQL? That significantly reduces the amount of data extracted from the source table, cuts the time spent extracting data into DS, and reduces the temporary buffer space used on the DS server.
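As a rough sketch of what this looks like (the table and column names here are hypothetical, not taken from any real job), the join is pushed into the source SQL of the database stage instead of being done inside DS:

    -- Instead of extracting the full fact table into DS and joining it there,
    -- push the inner join into the source database so only matching rows are extracted.
    -- Table and column names are illustrative only.
    SELECT f.order_id,
           f.customer_id,
           f.order_amount,
           c.customer_name
    FROM   orders f
           INNER JOIN customer c               -- the small (~200,000 row) table
                   ON c.customer_id = f.customer_id
    WHERE  f.order_date >= DATE '2012-01-01';  -- any source-side filter further cuts extracted rows

Only the rows that survive the join (and the filter) ever leave the database, so the DS buffer holds a fraction of the original volume.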
Second, regarding the second point: a typical case is the use of the Remove Duplicates stage. In my view, every job that uses this stage should be examined carefully to see whether it is really necessary to deduplicate the data there. To begin with, the stage is rather inefficient; but more importantly, where does the duplicated data come from? Was it already duplicated in the source table when it was extracted? Or did a join during job processing create the duplicates? Either way, you should avoid extracting duplicate data from the source and then processing it in DS.
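For instance (again with hypothetical table and column names), duplicates that already exist in the source can usually be removed in the extraction SQL itself rather than with a Remove Duplicates stage:

    -- Deduplicate in the source database instead of in DS.
    -- SELECT DISTINCT is enough when entire rows are duplicated.
    SELECT DISTINCT customer_id, customer_name, city
    FROM   customer;

    -- When only one row per key should survive, keep (for example) the latest row per key:
    SELECT customer_id, customer_name, city
    FROM (
        SELECT customer_id, customer_name, city,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY update_time DESC) AS rn
        FROM   customer
    )
    WHERE rn = 1;

If the duplication is produced by a join inside the job, the fix is to correct the join keys or join order rather than to clean up afterwards.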
DataStage Job Optimization Guideline III: Minimize the number of stages used.
In DS 8.5, the job generates a thread or process at run time for each stage it contains. When a job with many stages runs at a high degree of parallelism, the system has to manage far too many threads or processes.
DataStage Job Optimization Guideline IV: Balance the processing load between the DS server and the database server as far as possible.
When two or more tables are joined, should the join be done on the DS server or on the database server? That requires weighing the performance of the DS server against that of the database server. In addition, an overly complex multi-table join can be split up, with part of the data extracted into DS and the final association performed there, as sketched below.
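As one hedged illustration (all names are made up), the database can handle the part of the association it is good at, while the final association is left to DS:

    -- Extraction link 1: let the database pre-join and pre-aggregate the order side.
    SELECT o.customer_id,
           SUM(i.quantity * i.unit_price) AS total_amount
    FROM   orders o
           INNER JOIN order_items i ON i.order_id = o.order_id
    GROUP  BY o.customer_id;

    -- Extraction link 2: the customer side, extracted separately.
    SELECT customer_id, customer_name, region
    FROM   customer;

    -- The two result sets are then associated inside DS (for example with a Join stage),
    -- so neither server carries the whole multi-table join on its own.

Which split is best depends entirely on how much spare capacity each server has at run time.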
DataStage Job Optimization Guideline V: Make full use of each stage's strengths.
Every stage exists for a reason; otherwise, why would IBM's engineers have gone to so much trouble to build it into DS?
But is every stage always used in the right place? In practice, probably not. For example, how many people use the Lookup stage with a small table to do an inner join? What, the Lookup stage can do inner joins too? Yes, it really can! And when the Lookup stage is used instead of a Join stage, how many rows come out of the association when the right table contains duplicate keys? Think it through; the Lookup stage really can do it!
DataStage Job Optimization Guideline VI: Use the more efficient stages whenever possible and minimize the use of inefficient ones.
Of course, this depends on the specific function being implemented. For example, which should you use, the Lookup stage or the Join stage? Because the Lookup stage loads the right table into memory, it is much faster than the Join stage. However, when the right table is large, loading the entire table into memory takes a great deal of memory and may even exhaust it and cause the job to fail. What counts as a large table and what counts as a small table is an empirical value, not something fixed. When the server has enough memory, putting 10 million rows into memory is only a drop in the bucket, and a 10-million-row table is still a small table. The threshold our project team used was 1 million rows: when the right table has more than 1 million rows, use the Join stage whenever possible.
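A minimal way to apply that rule of thumb (the table name is hypothetical) is simply to check the right table's row count before choosing the stage:

    -- Check the right table's size before deciding between Lookup and Join.
    SELECT COUNT(*) AS right_table_rows
    FROM   customer;
    -- If right_table_rows exceeds about 1,000,000, prefer the Join stage;
    -- otherwise the Lookup stage can comfortably hold the table in memory.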
Another example is the Remove Duplicates stage mentioned above: it is a fairly inefficient stage, and the use of similarly inefficient stages should be reduced.
These are the points that come to mind for now. They look simple, but applying every one of them to the fullest is the hard part!
