2014-03-03 10:10 · Anonymous · OSChina translation
At Heap, we rely on PostgreSQL for most of the heavy lifting on the backend [1]. We store each event as an hstore blob, and for each tracked user we maintain a PostgreSQL array of the events they have completed, sorted by time. Hstore lets us attach properties to events in a flexible way, and the array of events gives us great performance, especially for funnel queries, in which we compute the drop-off between steps of a conversion funnel [2].
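To illustrate why a per-user, time-ordered event array is convenient for funnels, here is a small Python sketch. It is only a model for illustration (the event shape and field names here are hypothetical, not Heap's actual schema): a funnel can be computed in a single left-to-right scan over the array, with no join.

```python
# Hypothetical model: each event is a dict of properties (like an hstore blob),
# and each user has a time-ordered list of completed events.
events = [
    {"event_id": "1", "type": "signup"},
    {"event_id": "2", "type": "view_pricing"},
    {"event_id": "3", "type": "purchase"},
]

def funnel_counts(events, steps):
    """Count how many funnel steps a user reaches, in order,
    scanning the event list once from left to right."""
    reached = 0
    for event in events:
        if reached < len(steps) and event["type"] == steps[reached]:
            reached += 1
    return reached

# This user completes all three steps of the funnel.
print(funnel_counts(events, ["signup", "view_pricing", "purchase"]))  # 3
```

Because the events are already grouped per user and sorted by time, the funnel computation never has to join or re-sort anything.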
In this post, we'll look at a PostgreSQL function that accidentally took quadratic time on its input, and how to rewrite it in an efficient, idiomatic way.
Your first instinct might be to treat a PostgreSQL array like its C equivalent. You might be used to manipulating data via array positions or slices. Be careful not to carry that intuition over to PostgreSQL, though, especially if the element type is variable-length, such as json, text, or hstore. If you access a PostgreSQL array by position, you're in for unexpected performance problems.
We needed a function to dedupe an array of hstores: if two events have the same event_id, we should keep the one that appears latest in the array. A first attempt at the function looked like this:
-- This is slow, and you don't want to use it!
--
-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_1(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event)
  FROM (
    -- Filter for rank = 1, i.e. select the latest event for any collisions on event_id.
    SELECT event
    FROM (
      -- Rank elements with the same event_id by position in the array, descending.
      SELECT
        events[sub] AS event,
        sub,
        rank() OVER (PARTITION BY (events[sub] -> 'event_id')::BIGINT ORDER BY sub DESC)
      FROM generate_subscripts(events, 1) AS sub
    ) deduped_events
    WHERE rank = 1
    ORDER BY sub ASC
  ) to_agg;
$$ LANGUAGE SQL IMMUTABLE;
This works, but performance craters on big inputs. It's quadratic: it takes about 40 seconds for an input array of 100K elements!
Times were measured on a MacBook Pro with a 2.4GHz i7 CPU and 16GB of RAM, running this script: https://gist.github.com/drob/9180760.
What on earth is going on here? The problem is that PostgreSQL stores an array of hstores as an array of values, not as an array of pointers to values. An array of three hstores looks something like
{"event_id=>1,data=>foo", "event_id=>2,data=>bar", "event_id=>3,data=>baz"}
as opposed to
{[pointer], [pointer], [pointer]}
For variable-length elements, e.g. hstores, json blobs, varchars, or text fields, PostgreSQL has to find the length of each element in order to skip past it to the next one. To evaluate events[2], PostgreSQL parses the events from the left until it has read the second one. Then, for events[3], it starts scanning again from the first index until it has read the third! So evaluating events[sub] is O(sub), and evaluating events[sub] for every index in the array is O(N^2), where N is the length of the array.
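To make the cost concrete, here is a small Python sketch. It is an assumption-laden model of the scanning behavior described above, not PostgreSQL's actual storage code: it counts how many element parses happen when every position of a variable-length element array is accessed positionally, versus parsing each element exactly once.

```python
def parses_for_positional_access(i):
    """Model of evaluating events[i] on an array of variable-length elements:
    PostgreSQL must parse every element from the left to find where
    element i starts, so it parses i + 1 elements."""
    parsed = 0
    for _ in range(i + 1):
        parsed += 1  # parse one element to learn its length and skip past it
    return parsed

n = 1000
# Accessing every index by position re-scans from the start each time: O(n^2) parses.
positional_work = sum(parses_for_positional_access(i) for i in range(n))
# A single left-to-right pass (what unnest does) parses each element once: O(n).
single_pass_work = n

print(positional_work)   # n * (n + 1) / 2 = 500500
print(single_pass_work)  # 1000
```

For n = 100,000 the positional approach does roughly five billion element parses, which is why the first function took 40 seconds.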
PostgreSQL could get better behavior here, for example by caching parse results so that an array is only parsed once in a case like this. The real fix, though, would be to implement arrays of variable-length elements with pointers to the values, so that evaluating events[i] always takes constant time.
Even so, we shouldn't be making PostgreSQL do this, because it isn't an idiomatic query. Instead of generate_subscripts we can use unnest, which parses an array and returns its elements as a set of rows. That way, we never have to explicitly index into the array.
-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_2(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event)
  FROM (
    -- Filter for rank = 1, i.e. select the latest event for any collisions on event_id.
    SELECT event
    FROM (
      -- Rank elements with the same event_id by position in the array, descending.
      SELECT
        event,
        row_number AS index,
        rank() OVER (PARTITION BY (event -> 'event_id')::BIGINT ORDER BY row_number DESC)
      FROM (
        -- Use unnest instead of generate_subscripts to turn an array into a set.
        SELECT event, row_number() OVER (ORDER BY event -> 'time')
        FROM unnest(events) AS event
      ) unnested_data
    ) deduped_events
    WHERE rank = 1
    ORDER BY index ASC
  ) to_agg;
$$ LANGUAGE SQL IMMUTABLE;
This works, and the time it takes is linear in the size of the input array: it takes about half a second for an input of 100K elements, versus 40 seconds for the previous implementation.
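For reference, the semantics that both SQL functions compute can be sketched in a few lines of Python. This is only an illustrative model, not the production code: for each event_id, keep the occurrence that appears latest in the array, and emit the kept events in the order those latest occurrences appear.

```python
def dedupe_events(events):
    """Keep only the latest occurrence of each event_id, preserving the
    array order of the occurrences that are kept."""
    last_index = {}  # event_id -> index of its latest occurrence
    for i, event in enumerate(events):
        last_index[event["event_id"]] = i
    return [e for i, e in enumerate(events) if last_index[e["event_id"]] == i]

events = [
    {"event_id": "1", "data": "foo"},
    {"event_id": "2", "data": "bar"},
    {"event_id": "1", "data": "baz"},  # later duplicate of event_id 1 wins
]
print(dedupe_events(events))
# [{'event_id': '2', 'data': 'bar'}, {'event_id': '1', 'data': 'baz'}]
```

Like the unnest-based SQL, this makes a single pass over the events rather than repeatedly indexing into the array.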
This does exactly what we want.
Lesson: if you find yourself needing to access particular positions of a PostgreSQL array, consider using unnest instead.
We'd like to help you avoid our mistakes. Have comments, or other PostgreSQL tips? Ping us at @heap.
[1] In particular, we use a nifty tool called Citus Data. More on that in another blog post!
[2] See: https://heapanalytics.com/features/funnels. In particular, computing a funnel requires a single scan over the array of a user's completed events, but no join.
Original link: http://blog.heapanalytics.com/dont-iterate-over-a-postgres-array-with-a-loop/
Translation: http://www.oschina.net/translate/dont-iterate-over-a-postgres-array-with-a-loop
Use PostgreSQL array types correctly