全文檢索索引 (不包含、不等於) 索引優化 - 阿裡雲RDS PostgreSQL最佳實踐

來源:互聯網
上載者:User

摘要: 幕後PostgreSQL內建了GIN索引,支援全文檢索索引,支援陣列取出等多值資料類型的取出。在全文檢索索引中,不包含某個關鍵字能用到索引嗎?實際上GIN是倒排索引,不包含某個關鍵字的查詢,實際上是跳過主tree上面的TOKEN的掃描。

幕後

PostgreSQL內建了GIN索引,支援全文檢索索引,支援陣列取出等多值資料類型的取出。

在全文檢索索引中,不包含某個關鍵字能用到索引嗎?

實際上GIN是倒排索引,不包含某個關鍵字的查詢,實際上是跳過主tree上面的TOKEN的掃描。

只要被跳過的TOKEN包含了大量資料,那麼就是划算的。PostgreSQL是基於CBO的執行計畫優化器,所以會自動選擇最優的索引。

例子1,全文檢索索引不包含查詢

1、建立測試表

postgres=# create table notcontain (id int, infotsvector);

CREATE TABLE

2、建立生成推測字元字串的函數

CREATE OR REPLACE FUNCTION

gen_rand_str(integer)

RETURNS text

LANGUAGE sql

STRICT

AS $function$

selectstring_agg(a[(random()*6)::int+1],'') from generate_series(1,$1), (selectarray['a','b','c','d','e','f',' ']) t(a);

$function$;

3、插入100萬測試資料

postgres=# insert into notcontain selectgenerate_series(1,1000000), to_tsvector(gen_rand_str(256));

4、建立全文索引(GIN索引)

create index idx_notcontain_info on notcontainusing gin (info);

5、查詢某一條記錄

postgres=# select * from notcontain limit 1;

-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

id | 1

info | 'afbbeeccbf':3 'b':16 'bdcdfd':2'bdcfbcecdeeaed':8 'bfedfecbfab':7 'cd':9'cdcaefaccdccadeafadededddcbdecdaefbcfbdaefcec':14 'ceafecff':6 'd':17,18'dbc':12 'dceabcdcbdca':10 'dddfdbffffeaca':13 'deafcccfbcdebdaecda':11'dfbadcdebdedbfa':19 'eb':15 'ebe':1'febdcbdaeaeabbdadacabdbbedfafcaeabbdcedaeca':5 'fedeecbcdfcdceabbabbfcdd':4

6、測試不包含某個關鍵字

資料庫自動選擇了全資料表掃描,沒有使用GIN索引。

為什麼沒有使用索引呢,我前面解釋了,因為這個關鍵字的資料記錄太少了,不包含它時使用索引遮罩不划算。

(當包含它時,使用GIN索引就非常划算。包含和不包含是相反的程序,成本也是反的)

select * from notcontain t1 where info @@to_tsquery ('!eb');

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info@@ to_tsquery ('!eb');

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------

Seq Scan onpostgres.notcontain t1(cost=0.00..318054.51 rows=950820 width=412) (actualtime=0.016..1087.463 rows=947911 loops=1)

Output:id, info

Filter:(t1.info @@ to_tsquery('!eb'::text))

RowsRemoved by Filter: 52089

Buffers: shared hit=55549

Planningtime: 0.131 ms

Executiontime: 1134.571 ms

(7 rows)

7、強制關閉全資料表掃描,讓資料庫選擇索引。

可以看到,使用索引確實是慢的,我們大多數時候應該相信資料庫的成本規劃是準確的。(只要成本因數和環境效能配對足夠的准,這些都是可以校準的,有興趣的同學可以參考我寫的因數校準方法。)

postgres=# set enable_seqscan=off;

SET

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info@@ to_tsquery ('!eb');

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------

Bitmap HeapScan on postgres.notcontain t1(cost=82294981.00..82600120.25 rows=950820 width=412) (actualtime=1325.587..1540.145 rows=947911 loops=1)

Output:id, info

RecheckCond: (t1.info @@ to_tsquery('!eb'::text))

HeapBlocks: exact=55549

Buffers:shared hit=171948

-> Bitmap Index Scan on idx_notcontain_info (cost=0.00..82294743.30 rows=950820 width=0)(actual time=1315.663..1315.663 rows=947911 loops=1)

Index Cond: (t1.info @@ to_tsquery('!eb'::text))

Buffers: shared hit=116399

Planningtime: 0.135 ms

Executiontime: 1584.670 ms

(10 rows)

例子2,全文檢索索引不包含查詢

這個例子造一份傾斜資料,這個TOKEN包含了大量的重複記錄,通過不包含遮罩它。看看能否使用索引。

1、生成測試資料

postgres=# truncate notcontain ;

TRUNCATE TABLE

postgres=# insert into notcontain selectgenerate_series(1,1000000), to_tsvector('abc');

INSERT 0 1000000

2、測試不包含ABC的取出

資料庫自動選擇了索引掃描,跳過了不需要取出的資料區塊。

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info@@ to_tsquery ('!abc');

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------

Bitmap HeapScan on postgres.notcontain t1(cost=220432.15..220433.71 rows=1 width=21) (actualtime=107.936..107.936 rows=0 loops=1)

Output:id, info

RecheckCond: (t1.info @@ to_tsquery('!abc'::text))

Buffers:shared hit=268

-> Bitmap Index Scan on idx_notcontain_info (cost=0.00..220432.15 rows=1 width=0) (actualtime=107.933..107.933 rows=0 loops=1)

Index Cond: (t1.info @@ to_tsquery('!abc'::text))

Buffers: shared hit=268

Planningtime: 0.183 ms

Executiontime: 107.962 ms

(9 rows)

3、強制使用全資料表掃描,發現效能確實不如索引掃描,也驗證了我們說的PostgreSQL是基於成本的優化器,自動選擇最優的執行計畫。

postgres=# set enable_bitmapscan =off;

SET

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info@@ to_tsquery ('!abc');

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------

Seq Scan onpostgres.notcontain t1(cost=0.00..268870.00 rows=1 width=21) (actual time=1065.436..1065.436rows=0 loops=1)

Output:id, info

Filter:(t1.info @@ to_tsquery('!abc'::text))

RowsRemoved by Filter: 1000000

Buffers:shared hit=6370

Planningtime: 0.059 ms

Executiontime: 1065.449 ms

(7 rows)

例子3,普通類型BTREE索引,不等於取出

這個例子是普通類型,使用BTREE索引,看看是否支援不等於的索引取出。

測試方法與GIN測試類別似,使用傾斜和非傾斜兩種測試資料。

1、非傾斜資料的不包含查詢,使用索引遮罩的記錄非常少。

目前核心層面沒有實現BTREE索引的不包含取出。(雖然技術上是可以通過INDEX SKIP SCAN來實現的,跳過不需要掃描的BRANCH節點)

postgres=# truncate notcontain ;

TRUNCATE TABLE

postgres=# insert into notcontain selectgenerate_series(1,1000000);

INSERT 0 1000000

postgres=# create index idx1 on notcontain (id);

CREATE INDEX

postgres=# set enable_bitmapscan =on;

SET

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 whereid<>1;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------

Seq Scan onpostgres.notcontain t1(cost=0.00..16925.00 rows=999999 width=36) (actual time=0.011..110.592rows=999999 loops=1)

Output:id, info

Filter:(t1.id <> 1)

RowsRemoved by Filter: 1

Buffers:shared hit=4425

Planningtime: 0.195 ms

Executiontime: 156.013 ms

(7 rows)

postgres=# set enable_seqscan=off;

SET

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 whereid<>1;

QUERYPLAN

-------------------------------------------------------------------------------------------------------------------------------------------------

Seq Scan onpostgres.notcontain t1(cost=10000000000.00..10000016925.00 rows=999999 width=36) (actualtime=0.011..110.964 rows=999999 loops=1)

Output:id, info

Filter:(t1.id <> 1)

RowsRemoved by Filter: 1

Buffers:shared hit=4425

Planningtime: 0.062 ms

Executiontime: 156.461 ms

(7 rows)

2、更換SQL寫法,可以實現索引取出。但實際上由於不是使用的INDEX SKIP SCAN,所以需要一個JOIN程序,實際上效果並不佳。

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 where notexists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------

Merge AntiJoin (cost=0.85..25497.28 rows=999999width=36) (actual time=0.023..277.639 rows=999999 loops=1)

Output: t1.id,t1.info

Merge Cond: (t1.id = t2.id)

Buffers: shared hit=7164

-> Index Scan using idx1 on postgres.notcontaint1 (cost=0.42..22994.22 rows=1000000width=36) (actual time=0.009..148.520 rows=1000000 loops=1)

Output: t1.id, t1.info

Buffers: shared hit=7160

-> Index Only Scan using idx1 onpostgres.notcontain t2 (cost=0.42..3.04rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=1)

Output: t2.id

Index Cond: (t2.id = 1)

Heap Fetches: 1

Buffers: shared hit=4

Planningtime: 0.223 ms

Executiontime: 322.798 ms

(14 rows)

postgres=# set enable_mergejoin=off;

SET

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 where notexists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------

Hash AntiJoin (cost=3.05..27053.05 rows=999999width=36) (actual time=0.060..251.232 rows=999999 loops=1)

Output: t1.id,t1.info

Hash Cond: (t1.id = t2.id)

Buffers: shared hit=4432

-> Seq Scan on postgres.notcontain t1 (cost=0.00..14425.00 rows=1000000 width=36)(actual time=0.011..84.659 rows=1000000 loops=1)

Output: t1.id, t1.info

Buffers: shared hit=4425

-> Hash(cost=3.04..3.04 rows=1 width=4) (actual time=0.014..0.014 rows=1loops=1)

Output: t2.id

Buckets: 1024 Batches: 1 Memory Usage: 9kB

Buffers: shared hit=4

-> Index Only Scan using idx1on postgres.notcontain t2(cost=0.42..3.04 rows=1 width=4) (actual time=0.010..0.011 rows=1loops=1)

Output: t2.id

Index Cond: (t2.id = 1)

Heap Fetches: 1

Buffers: shared hit=4

Planningtime: 0.147 ms

Executiontime: 297.127 ms

(18 rows)

postgres=# set enable_seqscan=off;

SET

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from notcontain t1 where notexists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------

Hash AntiJoin (cost=3.48..35622.27 rows=999999width=36) (actual time=0.036..324.401 rows=999999 loops=1)

Output:t1.id, t1.info

Hash Cond:(t1.id = t2.id)

Buffers:shared hit=7164

-> Index Scan using idx1 on postgres.notcontaint1 (cost=0.42..22994.22 rows=1000000width=36) (actual time=0.017..149.383 rows=1000000 loops=1)

Output: t1.id, t1.info

Buffers: shared hit=7160

-> Hash(cost=3.04..3.04 rows=1 width=4) (actual time=0.011..0.011 rows=1loops=1)

Output: t2.id

Buckets: 1024 Batches: 1 Memory Usage: 9kB

Buffers: shared hit=4

-> Index Only Scan using idx1on postgres.notcontain t2(cost=0.42..3.04 rows=1 width=4) (actual time=0.008..0.009 rows=1loops=1)

Output: t2.id

Index Cond: (t2.id = 1)

Heap Fetches: 1

Buffers: shared hit=4

Planningtime: 0.141 ms

Executiontime: 369.749 ms

(18 rows)

3、PostgreSQL還支援多核並行,所以全資料表掃描還可以暴力升階效能。

如果記錄數非常多,使用並行掃描,效能升階非常明顯。

postgres=# createunlogged table ptbl(id int);

CREATE TABLE

postgres=# insert into ptbl select generate_series(1,100000000);

postgres=# alter table ptbl set (parallel_workers=32);

iming

非並行查詢:

postgres=# set max_parallel_workers_per_gather =0;

postgres=# select count(*) from ptbl whereid<>1;

count

----------

99999999

(1 row)

Time: 11863.151 ms (00:11.863)

並行查詢:

postgres=# set max_parallel_workers_per_gather =32;

postgres=# select count(*) from ptbl whereid<>1;

count

----------

99999999

(1 row)

Time: 610.017 ms

使用並行查詢後,效能升階非常明顯。

例子4,普通類型partial BTREE索引,不等於取出

對於固定的不等於查詢,我們可以使用PostgreSQL的partialindex功能。

create table tbl (id int, info text, crt_timetimestamp, c1 int);


select * from tbl where c1<>1;


insert into tbl select generate_series(1,10000000),'test', now(), 1;

insert into tbl values (1,'abc',now(),2);


create index idx_tbl_1 on tbl(id) wherec1<>1;

cool,使用PARTIALINDEX,0.03毫秒,在1000萬資料中進行不等於取出。

postgres=# explain(analyze,verbose,timing,costs,buffers) select * from tbl where c1<>1;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------

Index Scan usingidx_tbl_1 on postgres.tbl(cost=0.12..1.44 rows=1 width=21) (actual time=0.015..0.015 rows=1loops=1)

Output:id, info, crt_time, c1

Buffers:shared hit=1 read=1

Planningtime: 0.194 ms

Executiontime: 0.030 ms

(5 rows)

小結

1、PostgreSQL內建了GIN索引,支援全文檢索索引、支援陣列等多實值型別的搜尋。

2、PostgreSQL使用基於成本的執行計畫優化器,會自動選擇最優的執行計畫,在進行不包含取出時,PostgreSQL會自動選擇是否使用索引掃描。

3、對於BTREE索引,理論上也能實現不等於的搜尋(INDEX SKIP SCAN),目前核心層面還沒有實現它,目前可以通過調整SQL的寫法來使用索引掃描。

4、PostgreSQL還支援多核並行,所以全資料表掃描還可以暴力升階效能。 如果記錄數非常多,使用並行掃描,效能升階非常明顯。

5、PostgreSQL支援partialindex,可以用於分區索引,或者部份索引。對於固定有條件的不等於查詢,效果非常顯著。

相關產品:

1. 安全管家

2. 雲資料庫RDS

3. DDoS高防IP

4. 雲端服務器ECS

相關文章

Beyond APAC's No.1 Cloud

19.6% IaaS Market Share in Asia Pacific - Gartner IT Service report, 2018

Learn more >

Apsara Conference 2019

The Rise of Data Intelligence, September 25th - 27th, Hangzhou, China

Learn more >

Alibaba Cloud Free Trial

Learn and experience the power of Alibaba Cloud with a free trial worth $300-1200 USD

Learn more >

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。