摘要: 標籤PostgreSQL , 時序資料庫, 時序, 滑動視窗, 遞迴查詢, subquery , 視窗查詢, 求最新值幕後在很多場景中,都會有資料合併、清洗的需求。例如:1、記錄了表的變更明細(insert,update,delete),需要合併明細,從明細中快速取到每個PK的最新值。
幕後
在很多場景中,都會有資料合併、清洗的需求。
例如:
1、記錄了表的變更明細(insert,update,delete),需要合併明細,從明細中快速取到每個PK的最新值。
2、有很多感應器,不斷的在上報資料,要快速的取出每個感應器的最新狀態。
對於這類需求,可以使用視窗查詢,但是如何加速,如何快速的取出批量資料呢?
PostgreSQL是最進階的開來源資料庫,優化方法之多,超乎你的想像。
時序資料取值優化
1、唯一值較少時,並且唯一值範圍未知時,使用遞迴。
方法如下:
《PostgrSQL遞迴SQL的幾個套用 - 極客與正常人的思維》
《時序資料合併場景加速剖析和實現 - 複合索引,視窗分組查詢加速,變態遞迴加速》
2、唯一值較少時,並且唯一值範圍確定時,使用subquery。
方法如下:
《PostgreSQL海量時序資料(任意滑動視窗即時統計剖析) - 感應器、人群、物體等物件追蹤》
3、唯一值較多時,1 使用視窗查詢,比前面的方法更加適合。
4、唯一值較多時,2 使用流式計算,比方法3更加優秀。
方法如下:
《(流式、lambda、觸發程序)即時處理大比拼 - 物聯網(IoT)金融,時序處理最佳實踐》
本文將對前三種方法做一個比較。
方法4 流計算就不用比了,因為它什麼時候都是最強大的。通殺所有場景。等pipelineDB 外掛程式化吧,阿裡雲RDS PG 10會整合pipelineDB的功能。
遞迴 vs subquery vs 視窗
以500萬資料為例,對比這幾種方法的調整場景。
1 遞迴
場景一、唯一值較多(100萬唯一值)
1、建表
iming
drop table test;
create unlogged table test(id int , info text, crt_time timestamp);
2、建構資料
insert into test select ceil(random()*1000000), md5(random()::text), clock_timestamp() from generate_series(1,5000000);
3、建立索引
create index idx_test_1 on test (id, crt_time desc);
4、遞迴查詢效率
explain (analyze,verbose,timing,costs,buffers) with recursive skip as (
(
select test as v from test where id in (select id from test where id is not null order by id,crt_time desc limit 1) limit 1
)
union all
(
select (
select t as v from test t where t.id>(s.v).id and t.id is not null order by id,crt_time desc limit 1
) from skip s where (s.v).id is not null
)--這裡的where (s.v).id is not null一定要加,否則就閉環了.
)
select (t.v).id, (t.v).info, (t.v).crt_time from skip t where t.* is not null;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on skip t(cost=54.35..56.37 rows=100 width=44) (actual time=0.042..6626.084 rows=993288 loops=1)
Output: (t.v).id, (t.v).info, (t.v).crt_time
Filter: (t.* IS NOT NULL)
Rows Removed by Filter: 1
Buffers: shared hit=3976934
CTE skip
->Recursive Union(cost=0.91..54.35 rows=101 width=69) (actual time=0.034..6006.615 rows=993289 loops=1)
Buffers: shared hit=3976934
->Limit(cost=0.91..0.93 rows=1 width=69) (actual time=0.033..0.033 rows=1 loops=1)
Output: test.*
Buffers: shared hit=8
->Nested Loop(cost=0.91..10.19 rows=500 width=69) (actual time=0.032..0.032 rows=1 loops=1)
Output: test.*
Buffers: shared hit=8
->HashAggregate(cost=0.48..0.49 rows=1 width=4) (actual time=0.021..0.021 rows=1 loops=1)
Output: test_1.id
Group Key: test_1.id
Buffers: shared hit=4
->Limit(cost=0.43..0.47 rows=1 width=12) (actual time=0.016..0.016 rows=1 loops=1)
Output: test_1.id, test_1.crt_time
Buffers: shared hit=4
->Index Only Scan using idx_test_1 on public.test test_1(cost=0.43..173279.36 rows=5000002 width=12) (actual time=0.015..0.015 rows=1 loops=1)
Output: test_1.id, test_1.crt_time
Index Cond: (test_1.id IS NOT NULL)
Heap Fetches: 1
Buffers: shared hit=4
->Index Scan using idx_test_1 on public.test(cost=0.43..9.64 rows=6 width=73) (actual time=0.009..0.009 rows=1 loops=1)
Output: test.*, test.id
Index Cond: (test.id = test_1.id)
Buffers: shared hit=4
->WorkTable Scan on skip s(cost=0.00..5.14 rows=10 width=32) (actual time=0.006..0.006 rows=1 loops=993289)
Output: (SubPlan 1)
Filter: ((s.v).id IS NOT NULL)
Rows Removed by Filter: 0
Buffers: shared hit=3976926
SubPlan 1
->Limit(cost=0.43..0.49 rows=1 width=81) (actual time=0.005..0.005 rows=1 loops=993288)
Output: t_1.*, t_1.id, t_1.crt_time
Buffers: shared hit=3976926
->Index Scan using idx_test_1 on public.test t_1(cost=0.43..102425.17 rows=1666667 width=81) (actual time=0.005..0.005 rows=1 loops=993288)
Output: t_1.*, t_1.id, t_1.crt_time
Index Cond: ((t_1.id > (s.v).id) AND (t_1.id IS NOT NULL))
Buffers: shared hit=3976926
Planning time: 0.354 ms
Execution time: 6706.105 ms
(45 rows)
場景二、唯一值較少(1000唯一值)
1、建表
iming
drop table test;
create unlogged table test(id int , info text, crt_time timestamp);
2、建構資料
insert into test select ceil(random()*1000), md5(random()::text), clock_timestamp() from generate_series(1,5000000);
3、建立索引
create index idx_test_1 on test (id, crt_time desc);
4、遞迴查詢效率
查詢語句不變
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on skip t(cost=55.09..57.11 rows=100 width=44) (actual time=0.046..8.859 rows=1000 loops=1)
Output: (t.v).id, (t.v).info, (t.v).crt_time
Filter: (t.* IS NOT NULL)
Rows Removed by Filter: 1
Buffers: shared hit=4007
CTE skip
->Recursive Union(cost=0.91..55.09 rows=101 width=69) (actual time=0.039..8.203 rows=1001 loops=1)
Buffers: shared hit=4007
->Limit(cost=0.91..1.67 rows=1 width=69) (actual time=0.038..0.038 rows=1 loops=1)
Output: test.*
Buffers: shared hit=8
->Nested Loop(cost=0.91..6335.47 rows=8333 width=69) (actual time=0.038..0.038 rows=1 loops=1)
Output: test.*
Buffers: shared hit=8
->HashAggregate(cost=0.48..0.49 rows=1 width=4) (actual time=0.021..0.021 rows=1 loops=1)
Output: test_1.id
Group Key: test_1.id
Buffers: shared hit=4
->Limit(cost=0.43..0.47 rows=1 width=12) (actual time=0.016..0.017 rows=1 loops=1)
Output: test_1.id, test_1.crt_time
Buffers: shared hit=4
->Index Only Scan using idx_test_1 on public.test test_1(cost=0.43..173279.55 rows=5000002 width=12) (actual time=0.015..0.015 rows=1 loops=1)
Output: test_1.id, test_1.crt_time
Index Cond: (test_1.id IS NOT NULL)
Heap Fetches: 1
Buffers: shared hit=4
->Index Scan using idx_test_1 on public.test(cost=0.43..6284.98 rows=5000 width=73) (actual time=0.015..0.015 rows=1 loops=1)
Output: test.*, test.id
Index Cond: (test.id = test_1.id)
Buffers: shared hit=4
->WorkTable Scan on skip s(cost=0.00..5.14 rows=10 width=32) (actual time=0.008..0.008 rows=1 loops=1001)
Output: (SubPlan 1)
Filter: ((s.v).id IS NOT NULL)
Rows Removed by Filter: 0
Buffers: shared hit=3999
SubPlan 1
->Limit(cost=0.43..0.49 rows=1 width=81) (actual time=0.007..0.007 rows=1 loops=1000)
Output: t_1.*, t_1.id, t_1.crt_time
Buffers: shared hit=3999
->Index Scan using idx_test_1 on public.test t_1(cost=0.43..102425.80 rows=1666667 width=81) (actual time=0.007..0.007 rows=1 loops=1000)
Output: t_1.*, t_1.id, t_1.crt_time
Index Cond: ((t_1.id > (s.v).id) AND (t_1.id IS NOT NULL))
Buffers: shared hit=3999
Planning time: 0.353 ms
Execution time: 8.980 ms
(45 rows)
2 subquery
場景一、唯一值較多(100萬唯一值)
1、subquery查詢效率
如果ID的取值範圍特別廣,SUBQUERY就很不划算。
需要維修一張唯一ID表,這裡使用generate_series來代替這張表,用於測試。
explain (analyze,verbose,timing,costs,buffers) select (select test from test where id=t.id order by crt_time desc limit 1) from generate_series(1,1000000) t(id);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series t(cost=0.00..1976.65 rows=1000 width=32) (actual time=70.682..2835.109 rows=1000000 loops=1)
Output: (SubPlan 1)
Function Call: generate_series(1, 1000000)
Buffers: shared hit=3997082
SubPlan 1
->Limit(cost=0.43..1.97 rows=1 width=77) (actual time=0.002..0.002 rows=1 loops=1000000)
Output: test.*, test.crt_time
Buffers: shared hit=3997082
->Index Scan using idx_test_1 on public.test(cost=0.43..9.64 rows=6 width=77) (actual time=0.002..0.002 rows=1 loops=1000000)
Output: test.*, test.crt_time
Index Cond: (test.id = t.id)
Buffers: shared hit=3997082
Planning time: 0.119 ms
Execution time: 2892.712 ms
(14 rows)
場景二、唯一值較少(1000唯一值)
1、subquery查詢效率
查詢語句有變
explain (analyze,verbose,timing,costs,buffers) select (select test from test where id=t.id order by crt_time desc limit 1) from generate_series(1,1000) t(id);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series t(cost=0.00..1699.41 rows=1000 width=32) (actual time=0.107..7.041 rows=1000 loops=1)
Output: (SubPlan 1)
Function Call: generate_series(1, 1000)
Buffers: shared hit=4000
SubPlan 1
->Limit(cost=0.43..1.69 rows=1 width=77) (actual time=0.006..0.007 rows=1 loops=1000)
Output: test.*, test.crt_time
Buffers: shared hit=4000
->Index Scan using idx_test_1 on public.test(cost=0.43..6284.98 rows=5000 width=77) (actual time=0.006..0.006 rows=1 loops=1000)
Output: test.*, test.crt_time
Index Cond: (test.id = t.id)
Buffers: shared hit=4000
Planning time: 0.131 ms
Execution time: 7.126 ms
(14 rows)
3 視窗查詢
場景一、唯一值較多(100萬唯一值)
1、視窗查詢效率
explain (analyze,verbose,timing,costs,buffers) select id,info,crt_time from (select row_number() over (partition by id order by crt_time desc) as rn, * from test) t where rn=1;
postgres=# explain (analyze,verbose,timing,costs,buffers) select id,info,crt_time from (select row_number() over (partition by id order by crt_time desc) as rn, * from test) t where rn=1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t(cost=0.43..310779.41 rows=25000 width=45) (actual time=0.027..6398.308 rows=993288 loops=1)
Output: t.id, t.info, t.crt_time
Filter: (t.rn = 1)
Rows Removed by Filter: 4006712
Buffers: shared hit=5018864
->WindowAgg(cost=0.43..248279.39 rows=5000002 width=53) (actual time=0.026..5973.497 rows=5000000 loops=1)
Output: row_number() OVER (?), test.id, test.info, test.crt_time
Buffers: shared hit=5018864
->Index Scan using idx_test_1 on public.test(cost=0.43..160779.35 rows=5000002 width=45) (actual time=0.019..4058.476 rows=5000000 loops=1)
Output: test.id, test.info, test.crt_time
Buffers: shared hit=5018864
Planning time: 0.121 ms
Execution time: 6446.901 ms
(13 rows)
場景二、唯一值較少(1000唯一值)
1、視窗查詢效率
查詢語句不變
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t(cost=0.43..310779.61 rows=25000 width=45) (actual time=0.027..6176.801 rows=1000 loops=1)
Output: t.id, t.info, t.crt_time
Filter: (t.rn = 1)
Rows Removed by Filter: 4999000
Buffers: shared hit=4744850 read=18157
->WindowAgg(cost=0.43..248279.58 rows=5000002 width=53) (actual time=0.026..5822.576 rows=5000000 loops=1)
Output: row_number() OVER (?), test.id, test.info, test.crt_time
Buffers: shared hit=4744850 read=18157
->Index Scan using idx_test_1 on public.test(cost=0.43..160779.55 rows=5000002 width=45) (actual time=0.020..4175.082 rows=5000000 loops=1)
Output: test.id, test.info, test.crt_time
Buffers: shared hit=4744850 read=18157
Planning time: 0.108 ms
Execution time: 6176.924 ms
(13 rows)
橫向效率對比圖
資料量 |
唯一值個數 |
視窗查詢(ms) |
subquery(ms) |
遞迴查詢(ms) |
500萬 |
100萬 |
6446 |
2892 |
6706 |
500萬 |
100萬 |
6176 |
7 |
9 |
雲端產品
阿裡雲 RDSPostgreSQL
阿裡雲 HybridDB for PostgreSQL
類似案例
《PostgrSQL遞迴SQL的幾個套用 - 極客與正常人的思維》
《時序資料合併場景加速剖析和實現 - 複合索引,視窗分組查詢加速,變態遞迴加速》
《(流式、lambda、觸發程序)即時處理大比拼 - 物聯網(IoT)金融,時序處理最佳實踐》
《PostgreSQL海量時序資料(任意滑動視窗即時統計剖析) - 感應器、人群、物體等物件追蹤》
《車連網案例,曲目清洗 - 阿裡雲RDSPostgreSQL最佳實踐 - 視窗函數》
《PostgreSQL時序最佳實踐 - 證券交易系統資料庫設計 - 阿裡雲RDSPostgreSQL最佳實踐》
http://www.timescale.com/
小結
隨著物聯網的發展,越來越多的時序資料產生,求時序資料的最新值,滑動視窗內的最新值,已經成為時序商務裡面非常場景的需求。
PostgreSQL是最先進的開來源資料庫,它不像很多資料庫解決問題往往只有一種方法。PostgreSQL解決一個問題有多種方法,什麼方法最優,就看你對它的瞭解了。
1、唯一值較少時,並且唯一值範圍未知時,使用遞迴。
2、唯一值較少時,並且唯一值範圍確定時(例如範圍是100萬,但是此批資料只出現了50萬個,那麼如果你有這50萬個的ID,效能是最好的,否則需要掃100萬。例如用戶數是1億,一個區間的活躍用戶可能只有幾萬。),使用subquery。
3、唯一值較多時,1 使用視窗查詢,比前面的方法更加適合。
4、唯一值較多時,2 使用流式計算,比方法3更加優秀。
相關產品:
1. 安全管家
2. 雲資料庫RDS
3. 資料管理
4. 雲端服務器ECS