Massive data processing is a common requirement for many system developers, and sometimes even for O&M personnel. Loading and storing huge data files, updating data in batches, and archiving and deleting data in stages are all common application requirements. Depending on the actual situation, including the hardware and software, the O&M environment, and the SLA window requirements, we need to adopt different strategies and methods to solve the problem.
In my previous article, "How to update a large table record in Oracle", I introduced some methods and considerations for large-table data processing centered on the Oracle database. Simply put, the difficulty of massive data processing lies not at the statement level, but in how to balance various requirements. Common factors include:
- Impact on normal production. The vast majority of big-data operations run in production environments. With availability demands rising constantly, an SQL statement that slows down core business operations, or even brings down the system, is definitely not something O&M personnel want to see;
- Duration of the operation window. For the same amount of business work, smoothing out the load inevitably means extending the run time. Whether the extended run can still finish within the maintenance window must also be considered;
- Impact on data consistency. Some "folk remedy" methods (such as nologging) can reduce the operation load, but they may have disastrous consequences for the continuity of system backups;
In addition, the SQL statement itself can be optimized and the operation strategy improved, but some problems still involve a very large volume of data. When other conventional means are exhausted, parallel and concurrent execution is often a good option, provided the hardware conditions allow it.
In 11gR2, Oracle added convenient support for massive data processing: the dbms_parallel_execute package can split a massive data set into independent chunk tasks and execute them as parallel jobs. This article describes how to use this new feature.
1. Prepare the environment
The lab environment is 11.2.0.3.
SQL> select * from v$version;
BANNER
------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
Next, construct a large table. Note: due to limited resources, the author's environment is fairly modest, so some of the performance advantages are hard to demonstrate. Create a dedicated tablespace first.
SQL> create tablespace test datafile size 2G autoextend on
2 extent management local uniform size 1M
3 segment space management auto;
Tablespace created
SQL> create table t as select * from dba_objects;
Table created
SQL> insert into t select * from t;
75586 rows inserted
(a series of repeated insert ... select operations ......)
SQL> commit;
Commit complete
The data table T contains about 19 million records and occupies about 2 GB of space.
SQL> select count(*) from t;
COUNT(*)
----------
19350016
SQL> select bytes/1024/1024/1024, tablespace_name from dba_segments where owner='SYS' and segment_name='T';
BYTES/1024/1024/1024 TABLESPACE_NAME
--------------------------------------------------
2.0986328125 TEST
dbms_parallel_execute is not a traditional multi-process parallel operation. In essence, it completes the work through a series of Scheduler jobs (this is demonstrated in detail later). Therefore, the job_queue_processes parameter must not be set to 0.
SQL> show parameter job
NAME                  TYPE     VALUE
--------------------- -------- ------
job_queue_processes   integer  1000
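If job_queue_processes has been turned off (set to 0) on a system, it can be changed online; a minimal sketch, assuming an spfile is in use:

-- requires the alter system privilege; 1000 is simply the default value
alter system set job_queue_processes = 1000 scope = both;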
2. Introduction to the dbms_parallel_execute package
dbms_parallel_execute is a parallel operation interface introduced in Oracle 11g Release 2. The principle: when Oracle needs to process a large amount of data, especially updates, the work can be split into several chunks that multiple Scheduler jobs process piece by piece. This reduces the undo consumed at any one time and makes it easier to resume the task from where it stopped.
Using the dbms_parallel_execute package requires two prerequisites:
- The executing user must have the create job system privilege;
- dbms_parallel_execute calls some methods of the dbms_sql package internally, so execute permission on that package is also required;
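A minimal sketch of the corresponding grants, assuming a hypothetical user SCOTT will run the package:

-- SCOTT is an example user; run as a DBA
grant create job to scott;
grant execute on sys.dbms_sql to scott;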
The number of concurrent job processes and the chunk-splitting method are determined by the caller.
Under the traditional single-threaded execution strategy, no matter how large the task, a single server process handles it. If traditional parallel execution is invoked, there are corresponding coordinator and worker processes (visible in v$px_process).
With parallel execution enabled, a key question is how to divide the update of a data table into many small data sets. The dbms_parallel_execute package supports three chunking methods.
- By rowid: divides the data into chunks based on rowid ranges;
- By number column: takes the name of a numeric column and divides the data based on its values;
- By SQL statement: you provide an SQL statement that defines the start and end IDs of each chunk;
Among the three methods, the author recommends the rowid method: it has the fewest prerequisites and is the fastest. Unless something about the table argues otherwise, this strategy should be preferred. The concrete comparison can be seen in the experiments below.
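For reference, here is a sketch of the three chunking calls; the task name demo_task, table T, and column OBJECT_ID are examples, and only one chunking call may be applied to a task, so the alternatives are commented out:

begin
  dbms_parallel_execute.create_task(task_name => 'demo_task');
  -- 1) By rowid: chunk_size counts rows when by_row => true, blocks when false
  dbms_parallel_execute.create_chunks_by_rowid(task_name   => 'demo_task',
                                               table_owner => 'SYS',
                                               table_name  => 'T',
                                               by_row      => true,
                                               chunk_size  => 1000);
  -- 2) By numeric column: chunk_size is the width of each value range
  -- dbms_parallel_execute.create_chunks_by_number_col(task_name    => 'demo_task',
  --                                                   table_owner  => 'SYS',
  --                                                   table_name   => 'T',
  --                                                   table_column => 'OBJECT_ID',
  --                                                   chunk_size   => 1000);
  -- 3) By SQL: the statement must return the start and end id of each chunk
  -- dbms_parallel_execute.create_chunks_by_sql(task_name => 'demo_task',
  --   sql_stmt => 'select min(object_id), max(object_id) from t group by object_type',
  --   by_rowid => false);
  dbms_parallel_execute.drop_task(task_name => 'demo_task');
end;
/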
Once the chunking method is chosen, the size of each chunk must also be set. Note: the chunk size is not necessarily the number of rows each chunk operates on; different chunking types interpret it differently. The experiments below make this explicit.
The number of parallel processes determines how many workers are put onto the pile of independent sub-tasks the original task has been split into. This is the key tuning knob of the package: like a degree of parallelism, it must be chosen based on the actual hardware and software resource load.
With any long-running job, the caller wants to know the execution status at any time. Oracle provides two views for this: user_parallel_execute_tasks shows the status of the task, and user_parallel_execute_chunks shows the completion status of each chunk.
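From another session, a quick progress check can be written against these two views; a sketch (the task name column ties the views together):

-- fraction of chunks already processed, per task
select t.task_name,
       t.status as task_status,
       sum(case when c.status = 'PROCESSED' then 1 else 0 end) as chunks_done,
       count(*) as chunks_total
  from user_parallel_execute_tasks t
  join user_parallel_execute_chunks c on c.task_name = t.task_name
 group by t.task_name, t.status;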
The official Oracle documentation provides a template workflow for calling the dbms_parallel_execute package. This article uses a variant of that script, explained as we go. Let us first look at the by-rowid method.
3. Partitioning Chunks by Rowid
In Oracle, a rowid represents the actual physical location of a row, and direct rowid access is the fastest way to fetch data from Oracle. For this reason, under the rule-based optimizer (RBO), the rowid access path is ranked first among execution plans.
Based on the PL/SQL anonymous block provided in the Oracle documentation, we carry out the modification by rowid ranges.
declare
  vc_task  varchar2(100);
  vc_sql   varchar2(1000);
  n_try    number;
  n_status number;
begin
  -- Define the Task
  vc_task := 'Task 1: By Rowid';                            -- task name
  dbms_parallel_execute.create_task(task_name => vc_task);  -- create the task manually

  -- Define the Split
  dbms_parallel_execute.create_chunks_by_rowid(task_name   => vc_task,
                                               table_owner => 'SYS',
                                               table_name  => 'T',
                                               by_row      => true,
                                               chunk_size  => 1000);  -- define the chunks

  vc_sql := 'update /*+ ROWID(dda) */ t set data_object_id = object_id + 1 where rowid between :start_id and :end_id';

  -- Run the task
  dbms_parallel_execute.run_task(task_name      => vc_task,
                                 sql_stmt       => vc_sql,
                                 language_flag  => dbms_sql.native,
                                 parallel_level => 2);  -- run with the chosen number of jobs

  -- Controller: retry unfinished chunks up to two times
  n_try    := 0;
  n_status := dbms_parallel_execute.task_status(task_name => vc_task);
  while (n_try < 2 and n_status != dbms_parallel_execute.FINISHED) loop
    n_try := n_try + 1;
    dbms_parallel_execute.resume_task(task_name => vc_task);
    n_status := dbms_parallel_execute.task_status(task_name => vc_task);
  end loop;

  -- Deal with Result
  dbms_parallel_execute.drop_task(task_name => vc_task);
end;
/
From the call process, this parallel operation includes the following steps:
- Define a task;
- Choose the chunking method and generate the range information for each chunk;
- Run the task, specifying the number of parallel job processes;
This calling sequence is quite different from the usual parallel techniques and closely resembles Oracle's Job Scheduler mechanism. Because the run takes a while, we can comfortably observe the package at work from another session.
user_parallel_execute_tasks displays the key information of the current task. Note: chunk_type indicates the chunking method, and JOB_PREFIX corresponds to the job names in the Scheduler.
SQL> select task_name, chunk_type, JOB_PREFIX from user_parallel_execute_tasks;
TASK_NAME            CHUNK_TYPE    JOB_PREFIX
-------------------- ------------- ----------
Task 1: By Rowid     ROWID_RANGE   TASK$_4
user_parallel_execute_chunks lists every chunk of the task, one row per chunk, including its start and end rowid. Because by_row => true was specified, the chunk_size value corresponds to the number of rows in each chunk.
SQL> select chunk_id, task_name, status, start_rowid, end_rowid from user_parallel_execute_chunks where rownum <10;
CHUNK_ID TASK_NAME          STATUS     START_ROWID        END_ROWID
-------- ------------------ ---------- ------------------ ------------------
       1 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAACAAAA AAATLKAAHAAAACxCcP
       2 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAACyAAA AAATLKAAHAAAADjCcP
       3 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAADkAAA AAATLKAAHAAAAD/CcP
       4 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAAEAAAA AAATLKAAHAAAAExCcP
       5 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAAEyAAA AAATLKAAHAAAAFjCcP
       6 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAAFkAAA AAATLKAAHAAAAF/CcP
       7 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAAGAAAA AAATLKAAHAAAAGxCcP
       8 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAAGyAAA AAATLKAAHAAAAHjCcP
       9 Task 1: By Rowid   PROCESSED  AAATLKAAHAAAAHkAAA AAATLKAAHAAAAH/CcP
9 rows selected
In user_parallel_execute_chunks, the most important field is the status column, which marks the processing state of each chunk. We can judge how far the task has progressed from this field.
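Besides counting states, failed chunks can also be listed directly, together with the error captured for each; a sketch:

-- chunks that hit an error while being processed
select chunk_id, status, error_code, error_message
  from user_parallel_execute_chunks
 where task_name = 'Task 1: By Rowid'
   and status = 'PROCESSED_WITH_ERROR';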
SQL> select status, count(*) from user_parallel_execute_chunks group by status;
STATUS                 COUNT(*)
-------------------- ----------
ASSIGNED                      2
UNASSIGNED                 5507
PROCESSED                   938
(After a while .......)
SQL> select status, count(*) from user_parallel_execute_chunks group by status;
STATUS                 COUNT(*)
-------------------- ----------
ASSIGNED                      2
UNASSIGNED                 5441
PROCESSED                  1004
From the status field we can see how the parallel jobs work. When a chunk is generated, it is set to the UNASSIGNED state together with its start and end id information (a rowid or column range). A chunk currently being processed is in the ASSIGNED state; since the program set parallel_level to 2, exactly two chunks are ASSIGNED at any moment. Once a chunk has been processed, its status is set to PROCESSED.
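Incidentally, the same state machine can be driven by hand instead of calling run_task: the package provides get_rowid_chunk and set_chunk_status for this. A minimal sketch, assuming the task's chunks have been created but not yet run:

declare
  l_chunk_id    number;
  l_start_rowid rowid;
  l_end_rowid   rowid;
  l_any_rows    boolean;
begin
  loop
    -- take the next UNASSIGNED chunk; it is flipped to ASSIGNED for us
    dbms_parallel_execute.get_rowid_chunk(task_name   => 'Task 1: By Rowid',
                                          chunk_id    => l_chunk_id,
                                          start_rowid => l_start_rowid,
                                          end_rowid   => l_end_rowid,
                                          any_rows    => l_any_rows);
    exit when not l_any_rows;
    update t set data_object_id = object_id + 1
     where rowid between l_start_rowid and l_end_rowid;
    commit;
    -- mark the chunk done so the task can reach FINISHED
    dbms_parallel_execute.set_chunk_status(task_name => 'Task 1: By Rowid',
                                           chunk_id  => l_chunk_id,
                                           status    => dbms_parallel_execute.PROCESSED);
  end loop;
end;
/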
The biggest problem with massive data updates is undo growth. Let us check the amount of active undo during execution.
SQL> select sum(bytes)/1024/1024 from dba_undo_extents where status='ACTIVE';
SUM(BYTES)/1024/1024
--------------------

SQL> select sum(bytes)/1024/1024 from dba_undo_extents where status='ACTIVE';
SUM(BYTES)/1024/1024
--------------------
                  16
SQL> select sum(bytes)/1024/1024 from dba_undo_extents where status='ACTIVE';
SUM(BYTES)/1024/1024
--------------------
                  10
The amount of active undo stays small, which means each operation touches only one small chunk; it also confirms that this parallel execution proceeds step by step with small, separate commits. The Scheduler job view shows the job information clearly as well.
SQL> select owner, job_name, job_action, schedule_type, state, last_start_date from dba_scheduler_jobs where job_name like 'TASK$_4%';
OWNER JOB_NAME  JOB_ACTION                                SCHEDULE_TYPE STATE   LAST_START_DATE
----- --------- ----------------------------------------- ------------- ------- --------------------------------
SYS   TASK$_4_2 DBMS_PARALLEL_EXECUTE.RUN_INTERNAL_WORKER IMMEDIATE     RUNNING 10-2-14 01.48.34.947417 PM PRC
SYS   TASK$_4_1 DBMS_PARALLEL_EXECUTE.RUN_INTERNAL_WORKER IMMEDIATE     RUNNING 10-2-14 01.48.34.730487 PM PRC
Note: no data appears in v$px_process, the view for traditional parallel slave processes, which shows that this package is not Oracle's traditional database parallelism.
SQL> select * from v$px_process;
SERVER_NAME STATUS PID SPID SID SERIAL#
----------- ------ --- ---- --- -------
Output at the end of the execution:
PL/SQL procedure successfully completed
Executed in 944.453 seconds
Updating the 2 GB of data took 945 seconds in total, about 16 minutes.
From the above data view and calling process, we can draw the following conclusions:
For the dbms_parallel_execute package, choosing the chunking method and chunk size divides a large data set into many small chunk sets that are worked through step by step. The parallel_level set in the code is reflected in the number of jobs: when the task runs, Oracle does not start the traditional parallel query mechanism, but creates parallel_level Scheduler jobs of the immediate type, and these jobs work through the small chunks. One advantage of using Scheduler jobs is that the task can easily be stopped and resumed.
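For instance, a running task can be stopped and later picked up again with two calls; a sketch using this article's task name:

-- stop the task and its job slaves; unfinished chunks remain visible in the views
exec dbms_parallel_execute.stop_task('Task 1: By Rowid');
-- resume processing of the remaining chunks
exec dbms_parallel_execute.resume_task('Task 1: By Rowid');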
Next we will discuss the other two methods: by number column and by SQL.