INSERT Process Overview
First, let's take a quick look at the processing steps when inserting a row. Each step in these steps has the potential for optimization, which will be discussed later.
1. prepare statements on the client. For dynamic SQL statements, this step is required before the statement is executed. The performance here is very important. For static SQL statements, the performance of this step is not actually relevant, the statement preparation is completed in advance.
2. on the client, the column values of the row to be inserted are assembled and sent to the DB2 server.
3. the DB2 server determines which page this row is inserted.
4. DB2 reserves a location in the buffer pool used for this page. If DB2 selects an existing page, it needs to read the disk; if a new page is used, it should be in the tablespace (if it is SMS, that is, the tablespace of the system management storage) physically allocate space for the page. Each page with a new row inserted is written from the buffer pool to the disk.
5. format the row on the target page and obtain an X (exclusive) Row lock on the row.
6. Write a record that reflects the insert into the log buffer.
7. Commit the transaction that contains the insert statement. If the records in the log buffer are not written into the log file, write these records to the log file.
In addition, many types of additional processing may occur, depending on the database configuration, such as the existence of indexes or triggers. This extra processing is also significant for performance and will be discussed later.
INSERT alternative
Before discussing in detail the optimization of insert, let's first consider the two insert alternatives: load and import. The import utility is actually a front-end of SQLINSERT, but some of its functions are also useful to you. Load also has some useful additional functions, but the main reason we use load instead of insert is to improve performance.
Load directly formats the data page, avoiding most of the overhead for processing each row due to insertion (for example, log records are actually eliminated here ). In addition, load can make better use of parallelism on multi-processor machines. There are two new features in V8load, which have a special effect on load as an alternative to insert. These two features are: loading from the cursor and loading from the call layer interface (CLI) application.
Load from cursor
This method can be used for application code (through db2LoadAPI) or for DB2 scripts. The following is an example of the latter case:
Declarestaffcursorcursorforselect * fromstaff;
Loadfromstaffcursorofcursorinsertintomyschema. new_staff;
The two rows can be replaced by the following line:
Insertintomyschema. new_staffselect * fromstaff
Compared with the equivalent INSERT... SELECT statement, the performance of loading from the cursor can be improved by almost 20%.
Load from CLI
This method is obviously limited to calling layer interface (CLI) applications, but it is very fast. This technique is very similar to array insertion. DB2 comes with such an example. The speed of using load is twice the speed of using fully optimized array insertion, it is almost 10 times faster than the unoptimized array insert.
All INSERT improvements
Let's take a look at some of the necessary insert processing steps and the skills we can use to optimize these steps.
1. prepare statements
As an SQL statement, the INSERT statement must be compiled by DB2 before execution. This step can automatically occur (for example, in CLP or in a CLISQLExecDirect call) or explicitly (for example, through a SQLPrepare, CLISQLPrepare, or JDBCprepareStatement Statement ). This compilation process involves authorization check, optimization, and other activities required to convert statements into executable formats. During statement compilation, the statement access plan is stored in the package cache.
If you execute the same INSERT statement repeatedly, the access plan of the statement (usually) enters the package cache, eliminating the compilation overhead. However, if the insert statement has different values for each row, each statement is regarded as unique and must be compiled separately. Therefore, the following statements are repeated:
Insertintomytablevalues (1, 'abc ')
Insertintomytablevalues (2, 'def ')
Replace the statement with the parameter mark, prepare it at a time, and execute it repeatedly. This is very desirable:
Insertintomytablevalues (?,?)
Using Parameter markers can speed up the operation of a series of insert statements several times. (Using host variables in static SQL programs also offers similar benefits .)
2. Send the column value to the server
There are several optimization techniques that can be classified into this type. The most important technique is to include multiple rows in each insert statement, which avoids client-server communication for each row and reduces DB2 overhead. Tips for using multiline inserts:
The VALUES clause contains multiple rows. For example, the following statement inserts three rows: INSERTINTOmytableVALUES (1, 'abc'), (2, 'def '), (3, 'ghi ')
Use array insert in CLI ). You need to prepare an INSERT statement with parameter tags to define an array for storing the values to be inserted and bind the array to the parameter tag, and execute insert once for a group of content in each array. In addition, the sample program sqllib/samples/cli/tbload. c provides the basic framework for Array insertion (but CLILOAD is executed ). Changing from an array to an array containing 100 rows can reduce the time by about 2.5 times. Therefore, try to use an array containing at least 100 rows.
Use batch processing in JDBC. This is the same as array insertion in CLI. It is based on the same concept, but the implementation details are different. After the insert statement is prepared through the prepareStatement method, the remaining steps are to call the appropriate setXXXX method (for example, setString or setInt) for each column, and then addBatch. Repeat these steps for each row to be inserted, and then call executeBatch to execute the insert. For more information, see JDBCTutorial in the references section.
Load is used to quickly load data into a staging table, and then INSERT... SELECT is used to fill the main table. (The cost saved by this method is that the load speed is very fast, and the INSERT... SELECT transfers data within DB2 (on the server), eliminating the cost of communication. In general, we will not use this method unless we need to perform additional load processing in INSERT... SELECT.
If it is not possible to pass multiple rows in an insert statement, it is best to combine multiple insert statements and pass them together from the client to the server. (However, this means that each insert contains different values and needs to be prepared. Therefore, the performance is actually worse than the performance when parameter tags are used .) You can combine multiple statements into one statement using CompoundSQL:
In SQL, composite statements are created through BEGINATOMIC or BEGINCOMPOUND statements.
In CLI, composite statements can be created through SQLExecDirect and SQLExecute calls. For DB2V8FixPak4, another method to generate a composite statement is to set the statement attribute SQL _ATTR_CHAINING_BEGIN before multiple SQLExecute calls (for a preprocessing statement), and set the statement attribute SQL _ATTR_CHAINING_END after the call.
The following are some suggestions for this topic:
If possible, allow the client to use the same code page as the database to be accessed to avoid switching costs on the server. The database code page can be determined by running "getdbcfgfor.
In some cases, the CLI will automatically perform data type conversion, but this will also cause invisible (small) performance loss. Therefore, try to make the inserted value directly in the format corresponding to the corresponding column.
Minimize the insert-related overhead in the application. For example, when array insertion is used in CLI, for the whole group of inserts, we should try to ensure that each column is executed only once SQLBindParameter, instead of once for each group of array content. For individuals, these calls are not costly, but these costs are cumulative.