Optimization of Traditional Data Warehouse Projects (Oracle + DataStage)


Common techniques such as partitioning, hash joins, analytic functions, materialized views, and bitmap indexes are standard in data warehouse technology. The tips listed below are the optimization tools and techniques most commonly used in projects. A few of them are unconventional, apply only to fairly extreme scenarios, and need careful consideration; they are flagged as such below.

Oracle parallel scenarios:
  • Parallel options of SQL*Loader (e.g. the PARALLEL=TRUE direct-path option)
  • The fast_start_parallel_rollback parameter, which controls parallel rollback of failed transactions
  • EXPDP: set the PARALLEL parameter and specify multiple Data Pump dump files
  • Set-based bulk processing plus parallelism (the PARALLEL hint): reduce the number of SELECT passes and keep the logic clear; do the SELECT ... JOIN as early as possible, then handle everything in one unified pass of aggregate and analytic functions, along these lines:

      SELECT /*+ PARALLEL(table_name, <degree>) */
             -- aggregates: SUM, AVG, with CASE ... WHEN ... ELSE ... END
             -- analytic functions: ... OVER (PARTITION BY ... ORDER BY ...),
             -- LEAD/LAG, RANK, RATIO_TO_REPORT, period-over-period comparisons
      FROM   table_name
      GROUP BY ROLLUP(...), CUBE(...), ...
  • Create and rebuild indexes with parallel settings (e.g. for a high-volume full ETL load: drop the indexes, run the ETL, then recreate them)
  • The degree parameter when gathering optimizer statistics
  • Also ALTER SESSION ENABLE PARALLEL DML, followed by a parallel direct-path insert:

      ALTER SESSION ENABLE PARALLEL DML;
      -- the target table should be set NOLOGGING beforehand (ALTER TABLE table_i NOLOGGING)
      INSERT /*+ APPEND PARALLEL(table_i, <degree>) */ INTO table_i
      SELECT /*+ PARALLEL(a, <degree>) PARALLEL(b, <degree>) PARALLEL(c, <degree>) */
      ......
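As a concrete illustration of several points from the list above (gathering statistics with a parallel degree, parallel DML, and pushing aggregates and analytics into a single parallel pass), here is a minimal sketch; the schema DWH, the tables src_ledger and fact_cost, and the degree of 8 are all hypothetical:

    -- Gather statistics in parallel (the degree parameter mentioned above)
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH',
                                    tabname => 'SRC_LEDGER',
                                    degree  => 8);
    END;
    /

    -- Parallel direct-path insert fed by a parallel SELECT with analytics
    ALTER SESSION ENABLE PARALLEL DML;

    INSERT /*+ APPEND PARALLEL(f, 8) */ INTO fact_cost f
    SELECT /*+ PARALLEL(s, 8) */
           s.account_id,
           s.period,
           SUM(s.amount)                                AS amount,
           LAG(SUM(s.amount)) OVER (PARTITION BY s.account_id
                                    ORDER BY s.period)  AS prev_amount
    FROM   src_ledger s
    GROUP BY s.account_id, s.period;

    COMMIT;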
NOLOGGING is often useful in DML-heavy scenarios (INSERT, UPDATE, DELETE):
1. Of course, the fastest approach is still CREATE TABLE new_tab AS SELECT * FROM old_tab.

2. When the volume of a DELETE is too large, consider a full rebuild instead:

      CREATE TABLE new_tab NOLOGGING PARALLEL AS
        SELECT * FROM old_tab WHERE <condition excluding the rows to delete>;

   then TRUNCATE the original table and RENAME the new table into place. An UPDATE can be handled the same way: rewrite the UPDATE as a SELECT into a new table, then truncate the original table and rename the new one.

MERGE (which can also use PARALLEL and NOLOGGING) reduces the number of table scans by replacing an INSERT followed by an UPDATE. For example: each month the ledger-account costs are calculated in the production library and loaded into the warehouse fact table, but a small number of reversal vouchers can affect recent months. Instead of the original full DELETE + INSERT, or an INSERT of new data plus a time-range UPDATE, a single MERGE whose time filter covers the range in which reversals can occur handles both cases in one pass (a sketch follows this section).

Session parameters:

1. For parallel DML and parallel SELECT ETL jobs, the following can be added (on the 10g version they apparently have to be issued twice before taking effect):

      ALTER SESSION SET workarea_size_policy = manual;
      ALTER SESSION SET sort_area_size = 107341824;
      ALTER SESSION SET sort_area_retained_size = 107341824;
      ALTER SESSION SET db_file_multiblock_read_count = 128;

   db_file_multiblock_read_count combined with 32K/16K large-block tablespaces is effective on traditional SATA and FC disks, increasing the payoff of each single IO. Similarly, denormalizing the dimensions and the fact table into one long table, combined with db_file_multiblock_read_count and 32K/16K big blocks, also performs very well, but the applicable scenarios shrink; it is mostly used for single-topic temporal analysis and data marts.

2. Modifying the hidden parameters _smm_auto_min_io_size and _smm_auto_max_io_size increases the memory allocated to each hash join and improves GROUP BY performance; use together with a large PGA.

3. In extreme environments (or a test environment, or a non-critical staging library holding synchronized data), open the parameter ALTER SYSTEM SET commit_write = 'batch,nowait' (a feature introduced in 10gR2). This lets the database return from COMMIT without waiting for the log buffer to be written to the redo logfile. The risk of data loss on a power failure must be assessed; it can be considered if a UPS is in place. Note: "extreme environment" here means that frequent commits have made "log file sync" waits the bottleneck; only then consider opening this parameter. In most cases a data warehouse does not have this problem.
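Returning to the MERGE pattern above: a minimal sketch, with hypothetical table and column names (fact_cost as the warehouse fact table, stg_cost as the monthly extract) and a hypothetical reversal window of three months:

    MERGE /*+ PARALLEL(f, 8) */ INTO fact_cost f
    USING (SELECT account_id, period, amount
           FROM   stg_cost
           WHERE  period >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -3)) s
    ON (f.account_id = s.account_id AND f.period = s.period)
    WHEN MATCHED THEN
      UPDATE SET f.amount = s.amount   -- reversals overwrite recent months
    WHEN NOT MATCHED THEN
      INSERT (f.account_id, f.period, f.amount)
      VALUES (s.account_id, s.period, s.amount);

Rows older than the reversal window are never touched, so this scans far less data than a full DELETE + INSERT.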
Even more extreme than the commit_write setting above: increase the online redo log files to 1~2 GB or even larger and turn off archiving, reducing waits on log switches. This has to be weighed against the scenario; do not use it in a production environment.
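A hedged sketch of what that looks like; the paths, group numbers, and sizes are hypothetical, and switching to NOARCHIVELOG requires a restart into MOUNT state:

    -- Add large redo groups, then drop the small ones once they become INACTIVE
    ALTER DATABASE ADD LOGFILE GROUP 4 ('/u01/oradata/dwh/redo04.log') SIZE 2G;
    ALTER DATABASE ADD LOGFILE GROUP 5 ('/u01/oradata/dwh/redo05.log') SIZE 2G;

    -- Turning archiving off (test/staging environments only!)
    SHUTDOWN IMMEDIATE
    STARTUP MOUNT
    ALTER DATABASE NOARCHIVELOG;
    ALTER DATABASE OPEN;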
External tables
  • No DML against them, no indexes, and no partitioning support
  • Suited to data that is used once and never modified; convenient for loading, can be queried in parallel, and can participate in both NESTED LOOPS and HASH joins
  • A useful scenario: MERGE directly from an external table (see the sketch below)
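A minimal sketch of that scenario; the directory object, file name, and table and column names are purely illustrative:

    CREATE DIRECTORY etl_dir AS '/data/etl/in';

    CREATE TABLE ext_cost (
      account_id NUMBER,
      period_str VARCHAR2(10),
      amount     NUMBER
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY etl_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
      )
      LOCATION ('cost_202401.csv')
    )
    REJECT LIMIT UNLIMITED
    PARALLEL;

    -- MERGE straight from the flat file, no intermediate staging table
    MERGE INTO fact_cost f
    USING (SELECT account_id,
                  TO_DATE(period_str, 'YYYY-MM-DD') AS period,
                  amount
           FROM   ext_cost) s
    ON (f.account_id = s.account_id AND f.period = s.period)
    WHEN MATCHED THEN UPDATE SET f.amount = s.amount
    WHEN NOT MATCHED THEN INSERT (f.account_id, f.period, f.amount)
                          VALUES (s.account_id, s.period, s.amount);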
Global temporary tables (no DML lock contention, little redo): transaction-level and session-level variants (see the sketch below).

Direct-path insert.

Materialized views: trading space for time.

Tablespace migration: transportable tablespaces can also transport individual partitions of a partitioned table. The transfer happens at the physical file level rather than the SQL level, which makes it the highest-performance option; it suits scenarios where regional sub-libraries are rolled up across geographies into a central library.
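A minimal sketch of the two global temporary table variants; the table and column names are hypothetical:

    -- Transaction-level: rows vanish at COMMIT
    CREATE GLOBAL TEMPORARY TABLE tmp_txn (id NUMBER, amount NUMBER)
      ON COMMIT DELETE ROWS;

    -- Session-level: rows survive until the session ends
    CREATE GLOBAL TEMPORARY TABLE tmp_sess (id NUMBER, amount NUMBER)
      ON COMMIT PRESERVE ROWS;

    -- Direct-path insert into the session-level table
    INSERT /*+ APPEND */ INTO tmp_sess
    SELECT id, amount FROM src_table;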
Handling on the DataStage side

1. Bulk load mode
Read side: set Enable partitioned reads, with modulus-mode partitioned reads keyed on an integer column (Zeile)
Write side: in the Oracle Connector, select bulk load mode
Drop all indexes, primary keys, and so on before the bulk-load write, and rebuild them after it finishes (a sketch of suitable Before/After SQL follows). When the DataStage host has multiple CPUs, it is recommended to configure multiple parallel nodes for the ETL job, which makes it easy to push the IO to its limit.
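A hedged sketch of the kind of statements that can go into the job's Before SQL and After SQL properties; the index, constraint, and table names are hypothetical:

    -- Before SQL: remove the write-path overhead
    ALTER TABLE fact_sales DISABLE CONSTRAINT pk_fact_sales;
    DROP INDEX fact_sales_ix1;

    -- After SQL: rebuild in parallel once the load has finished
    CREATE INDEX fact_sales_ix1 ON fact_sales (sale_date)
      PARALLEL 8 NOLOGGING;
    ALTER TABLE fact_sales ENABLE CONSTRAINT pk_fact_sales;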
The relevant job properties are Before SQL Statement, After SQL Statement, and the node configuration. On the effect of multi-node parallelism: if the bottleneck is on the DataStage node, you can test basing the node's file storage on Linux tmpfs, i.e. /dev/shm/tmp, to speed things up; this avoids ETL data passing through the DataStage host's disks and adding an IO bottleneck, provided the host's memory is large enough, e.g. 64 GB. This needs testing!

    # mkdir /dev/shm/tmp
    # chmod -R 777 /dev/shm/tmp
    # mount --bind /dev/shm/tmp /tmp

With this, /tmp can be used directly for node file storage. To cap how much memory it can consume:

    # mount tmpfs /tmp -t tmpfs -o size=512m

which limits the tmpfs mounted on /tmp to 512 MB. Similarly, on a large-memory host the Oracle temp tablespace can also be placed there, provided its usage is stable, the DBA can estimate the range of fluctuation, and autoextend is turned off. Rigorous testing is required!
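A minimal sketch of that last idea, with a hypothetical tmpfs-backed path and size; autoextend stays off so the tempfile can never outgrow the memory-backed filesystem:

    CREATE TEMPORARY TABLESPACE temp_mem
      TEMPFILE '/dev/shm/oratmp/temp01.dbf' SIZE 8G AUTOEXTEND OFF
      EXTENT MANAGEMENT LOCAL UNIFORM SIZE 1M;

    ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp_mem;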
