一張6億條資料表引發的事故,一張6億引發事故

來源:互聯網
上載者:User

一張6億條資料表引發的事故,一張6億引發事故
業務人員告訴我某系統磁碟IO持續高達300MB/s,系統平台為AIX,遂 topas 查看果然如此。

用下面指令碼到Oracle資料庫中看了一下:

SELECT Disk_Reads DiskReads, Executions, SQL_ID, SQL_Text SQLText,    SQL_FullText SQLFullText FROM(   SELECT Disk_Reads, Executions, SQL_ID, LTRIM(SQL_Text) SQL_Text,       SQL_FullText, Operation, Options,       Row_Number() OVER          (Partition By sql_text ORDER BY Disk_Reads * Executions DESC)          KeepHighSQL   FROM   (       SELECT Avg(Disk_Reads) OVER (Partition By sql_text) Disk_Reads,           Max(Executions) OVER (Partition By sql_text) Executions,           t.SQL_ID, sql_text, sql_fulltext, p.operation,p.options       FROM v$sql t, v$sql_plan p       WHERE t.hash_value=p.hash_value AND p.operation='TABLE ACCESS'        AND p.options='FULL' AND p.object_owner NOT IN ('SYS','SYSTEM')       AND t.Executions > 1   )    ORDER BY DISK_READS * EXECUTIONS DESC)WHERE KeepHighSQL = 1AND rownum <=5;

這裡當時沒有儲存記錄,總之跟後來AWR收集的是一樣的sql,如下所示。


可以看到第一個sql的物理讀非常高。遂看了一下執行計畫

SYS@zbdba>select * from table(dbms_xplan.display_cursor('54043712',null,'advanced'));PLAN_TABLE_OUTPUT--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------HASH_VALUE  54043712, child number 0------------------------------------SELECT D.MODEL_ID, D.OBJ_ID, D.OBJ_TYPE, D.DATA_TYPE, D.DATA_DATE, D.DATA_FROM_DATE, D.DATA_TO_DATE,D.DATA_FLAG  FROM zbdba1 D, zbdba2 C  WHERE D.MODEL_ID = C.MODEL_ID AND C.COLLECT_ID =:COLLECT_ID        AND D.DATA_DATE = :DATA_DATE AND D.DATA_TYPE = :DATA_TYPE AND D.VALUE_FLAG = 0Plan hash value: 1780662521-------------------------------------------------------------------------------------------------------------------------| Id  | Operation                    | Name                     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |-------------------------------------------------------------------------------------------------------------------------|   0 | SELECT STATEMENT             |                          |       |       |     9 (100)|          |       |       ||   1 |  NESTED LOOPS                |                          |     1 |    61 |     9   (0)| 00:00:01 |       |       ||   2 |   PARTITION RANGE SINGLE     |                          |     1 |    51 |     8   (0)| 00:00:01 |   KEY |   KEY ||*  3 |    TABLE ACCESS FULL         | zbdba1 |     1 |    51 |     8   (0)| 00:00:01 |   KEY |   KEY ||*  4 |   TABLE ACCESS BY INDEX ROWID| zbdba2             |     1 |    10 |     1   (0)| 00:00:01 |       |       ||*  5 |    INDEX UNIQUE SCAN         | PK_zbdba2          |     1 |       |     0   (0)|          |       |       |-------------------------------------------------------------------------------------------------------------------------Query Block Name / Object Alias (identified by operation id):-------------------------------------------------------------   1 - SEL$1   3 - SEL$1 / D@SEL$1   4 - SEL$1 / C@SEL$1   5 - SEL$1 / C@SEL$1Outline Data-------------  /*+      BEGIN_OUTLINE_DATA      IGNORE_OPTIM_EMBEDDED_HINTS      OPTIMIZER_FEATURES_ENABLE('10.2.0.5')      ALL_ROWS      OUTLINE_LEAF(@"SEL$1")      FULL(@"SEL$1" "D"@"SEL$1")      INDEX_RS_ASC(@"SEL$1" "C"@"SEL$1" ("zbdba2"."MODEL_ID"))      LEADING(@"SEL$1" "D"@"SEL$1" "C"@"SEL$1")      USE_NL(@"SEL$1" "C"@"SEL$1")      END_OUTLINE_DATA  */Peeked Binds (identified by position):--------------------------------------   1 - :COLLECT_ID (VARCHAR2(30), CSID=852): '70350'   2 - :DATA_DATE (DATE): 06/01/15 00:00:00   3 - :DATA_TYPE (VARCHAR2(30), CSID=852): '02'Predicate Information (identified by operation id):---------------------------------------------------   3 - filter(("D"."DATA_TYPE"=:DATA_TYPE AND "D"."DATA_DATE"=:DATA_DATE AND "D"."VALUE_FLAG"=0))   4 - filter("C"."COLLECT_ID"=TO_NUMBER(:COLLECT_ID))   5 - access("D"."MODEL_ID"="C"."MODEL_ID")Column Projection Information (identified by operation id):-----------------------------------------------------------   1 - "D"."MODEL_ID"[NUMBER,22], "D"."OBJ_ID"[NUMBER,22], "D"."OBJ_TYPE"[VARCHAR2,2],       "D"."DATA_TYPE"[VARCHAR2,2], "D"."DATA_DATE"[DATE,7], "D"."DATA_FROM_DATE"[DATE,7], "D"."DATA_TO_DATE"[DATE,7],       "D"."DATA_FLAG"[NUMBER,22]   2 - "D"."MODEL_ID"[NUMBER,22], "D"."OBJ_ID"[NUMBER,22], "D"."OBJ_TYPE"[VARCHAR2,2],       "D"."DATA_TYPE"[VARCHAR2,2], "D"."DATA_DATE"[DATE,7], "D"."DATA_FROM_DATE"[DATE,7], "D"."DATA_TO_DATE"[DATE,7],       "D"."DATA_FLAG"[NUMBER,22]   3 - "D"."MODEL_ID"[NUMBER,22], "D"."OBJ_ID"[NUMBER,22], "D"."OBJ_TYPE"[VARCHAR2,2],       "D"."DATA_TYPE"[VARCHAR2,2], "D"."DATA_DATE"[DATE,7], "D"."DATA_FROM_DATE"[DATE,7], "D"."DATA_TO_DATE"[DATE,7],       "D"."DATA_FLAG"[NUMBER,22]   5 - "C".ROWID[ROWID,10]HASH_VALUE  54043712, child number 1SELECT D.MODEL_ID, D.OBJ_ID, D.OBJ_TYPE, D.DATA_TYPE, D.DATA_DATE, D.DATA_FROM_DATE, D.DATA_TO_DATE,D.DATA_FLAG  FROM zbdba1 D, zbdba2 C  WHERE D.MODEL_ID = C.MODEL_ID AND C.COLLECT_ID =:COLLECT_ID        AND D.DATA_DATE = :DATA_DATE AND D.DATA_TYPE = :DATA_TYPE AND D.VALUE_FLAG = 0NOTE: cannot fetch plan for HASH_VALUE: 54043712, CHILD_NUMBER: 1      Please verify value of HASH_VALUE and CHILD_NUMBER;      It could also be that the plan is no longer in cursor cache (check v$sql_plan)81 rows selected.
查看錶行數:
SYS@zbdba>select num_rows,last_analyzed from dba_tables where table_name='zbdba1';  NUM_ROWS LAST_ANALYZE---------- ------------659764063 31-JUL-15SYS@zbdba>select num_rows,last_analyzed from dba_tables where table_name='zbdba2';  NUM_ROWS LAST_ANALYZE---------- ------------     76513 14-JUL-15
explan plan for 該sql查看執行計畫:SYS@zbdba>select * from table(dbms_xplan.display());PLAN_TABLE_OUTPUT--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Plan hash value: 2057366878----------------------------------------------------------------------------------------------------------------------------------| Id  | Operation                          | Name                        | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |----------------------------------------------------------------------------------------------------------------------------------|   0 | SELECT STATEMENT                   |                             |     3 |   183 |    11   (0)| 00:00:01 |       |       ||   1 |  TABLE ACCESS BY GLOBAL INDEX ROWID| zbdba1    |     1 |    51 |     4   (0)| 00:00:01 | ROWID | ROWID ||   2 |   NESTED LOOPS                     |                             |     3 |   183 |    11   (0)| 00:00:01 |       |       ||   3 |    TABLE ACCESS BY INDEX ROWID     | zbdba2                |     2 |    20 |     4   (0)| 00:00:01 |       |       ||*  4 |     INDEX RANGE SCAN               | IDX_zbdba2COLLECTFLAG |     2 |       |     2   (0)| 00:00:01 |       |       ||*  5 |    INDEX RANGE SCAN                | IDX_C_COLLECT_MD_001        |     1 |       |     3   (0)| 00:00:01 |       |       |----------------------------------------------------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   4 - access("C"."COLLECT_ID"=TO_NUMBER(:COLLECT_ID))   5 - access("D"."MODEL_ID"="C"."MODEL_ID" AND "D"."DATA_DATE"=:DATA_DATE AND "D"."DATA_TYPE"=:DATA_TYPE AND              "D"."VALUE_FLAG"=0)19 rows selected.

發現走了索引,但是當時為什麼沒有走索引呢。並且在二節點走了全表掃描,一節點走的是索引範圍掃描。


該oracle資料庫的版本為10.2.0.5,在11g之前沒有引入ACS(Adaptive Cursor Sharing),所以這裡CBO在第一次進行硬解析的時候才會窺視變數的值,並且產生執行計畫,之後一直使用相同的執行計畫。
這裡我猜想,在2節點。第一次使用綁定變數的時候,CBO認為應該使用全表掃描效率更高,所以在以後一直使用該執行計畫。然而在一節點,第一次使用綁定變數的時候,CBO認為走範圍掃描效率更高。所以這裡導致1節點和2節點的執行計畫不一樣。




找到原因了,就好辦了。業務人員怕資料庫負載過大導致宕機,遂叫我把該sql的相關進程(發現30個進程)全部kill掉。kill完進程後磁碟IO瞬間降到50MB。


還沒有完,改sql以後還會有這種選擇。我們怎麼去避免?
既然瞭解了CBO的做法,那就觸發它再一次去執行一次硬解析獲得正確的執行計畫。有如下4種方法:
1、alter system flush shared_pool(想跪就跑這個)
2、對相關表做DDL操作
3、重新收集統計資訊
4、dbms_shared_pool.purge


前面三種方案對生產系統都影響比較大,所以利用第四種方法。

SQL> select address,hash_value,executionsfrom v$sql where hash_value=54043712 ADDRESS          HASH_VALUE EXECUTIONS ---------------- ---------- ---------- -----------0000040229F039E0 54043712          1           SQL> alter session set events '5614566 trace name context forever';SQL> exec dbms_shared_pool.purge('0000040229F039E0,54043712','C');
重新利用合適的綁定變數跑出正確的執行計畫即可。


如何永久保持不變呢?
加hint,強制走索引。

explain plan for SELECT /*+ index(zbdba1,IDX_C_COLLECT_MD_001)* / D.MODEL_ID,       D.OBJ_ID,       D.OBJ_TYPE,       D.DATA_TYPE,       D.DATA_DATE,       D.DATA_FROM_DATE,       D.DATA_TO_DATE,       D.DATA_FLAG  FROM EIC2.zbdba1 D, EIC2.zbdba2 CWHERE D.MODEL_ID = C.MODEL_ID   AND C.COLLECT_ID = '70350'   AND D.DATA_DATE = to_date('06/01/15','MM/DD/YY')   AND D.DATA_TYPE = '02'   AND D.VALUE_FLAG = 0;

如果以後該索引有變化,也將會失效。


當然最佳化沒有一勞永逸的事情,針對大小超過50GB,資料量高達6億條的表還是要定期檢查它相關sql的執行計畫。




著作權聲明:本文為博主原創文章,未經博主允許不得轉載。

相關文章

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.