Things to watch out for when using arrays in PostgreSQL

At Heap, we rely on PostgreSQL for most of the back-end heavy lifting. We store each event as an hstore blob, and for each tracked user we maintain a PostgreSQL array of that user's events, sorted by time. Hstore lets us attach attributes to events in a flexible way, and event arrays give us strong performance, especially for funnel queries, where we calculate the drop-off between the steps of a conversion funnel.

In this article, we look at a PostgreSQL function that unexpectedly hung on large inputs, and then rewrite it in an efficient, idiomatic way.

Your first instinct may be to treat arrays in PostgreSQL like their equivalents in C-like languages. You may be used to manipulating data through array positions or slices. Be careful not to carry this habit over to PostgreSQL, especially if the array's element type is variable-length, such as json, text, or hstore. If you access a PostgreSQL array by position, you can run into an unexpected performance collapse.


This happened at Heap a few weeks ago. We maintain an array of events for each user tracked by Heap, in which we represent each event as an hstore datum. We have an import pipeline that appends new events to the corresponding arrays. To make this import pipeline idempotent, we give each event an event_id, and we run our event arrays through a utility function that removes duplicates. If we want to update the attributes attached to an event, we simply dump a new event with the same event_id into the pipeline.
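As a small illustration of the situation described above, the following sketch appends a re-imported event to an existing array. The sample events and the CTE name stored are hypothetical and only serve the example; the hstore extension is assumed to be available.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Hypothetical sample data: two events already stored for one user.
WITH stored AS (
  SELECT ARRAY[
    'event_id=>1, data=>foo',
    'event_id=>2, data=>bar'
  ]::hstore[] AS events
)
-- Re-importing event 1 with new attributes simply appends it again.
SELECT array_append(events, 'event_id=>1, data=>baz'::hstore) AS events
FROM stored;
```

The result holds three elements, two of them with event_id 1. The deduplication step is what has to keep only the later of the two.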

So we need a utility function that takes an array of hstores and, if two events have the same event_id, keeps the one that appears later in the array. Our first attempt at this function looked like this:

-- This is slow, and you don't want to use it!
--
-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_1(events hstore[]) RETURNS hstore[] AS $$
  SELECT array_agg(event)
  FROM (
    -- Filter for rank = 1, i.e. select the latest event for any collisions on event_id.
    SELECT event
    FROM (
      -- Rank elements with the same event_id by position in the array, descending.
      SELECT events[sub] AS event, sub, rank() OVER
        (PARTITION BY (events[sub] -> 'event_id')::BIGINT ORDER BY sub DESC)
      FROM generate_subscripts(events, 1) AS sub
    ) deduped_events
    WHERE rank = 1
    ORDER BY sub ASC
  ) to_agg;
$$ LANGUAGE SQL IMMUTABLE;

This works, but performance collapses on large inputs. The running time is quadratic, and it takes about 40 seconds for an input array of 100K elements!

These timings were measured on a MacBook Pro with a 2.4GHz i7 CPU and 16GB of RAM, using this script: https://gist.github.com/drob/9180760.
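The gist linked above holds the actual benchmark script. The following is only a rough sketch of how a similar comparison could be set up without the dedupe functions: it builds a synthetic hstore array and runs the same lookup once by subscript and once with unnest. The CTE name input, the array size of 20000, and the synthetic event_id and time values are all assumptions made for this illustration, and timings will vary by machine and PostgreSQL version.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Positional access: events[sub] is evaluated once per subscript.
EXPLAIN ANALYZE
WITH input AS (
  -- Hypothetical test data: 20000 events with repeating event_ids.
  SELECT array_agg(
           hstore(ARRAY['event_id', (i % 1000)::text, 'time', i::text])
           ORDER BY i
         ) AS events
  FROM generate_series(1, 20000) AS i
)
SELECT count(events[sub] -> 'event_id')
FROM input, generate_subscripts(events, 1) AS sub;

-- Set-based access: unnest expands the array into rows in one pass.
EXPLAIN ANALYZE
WITH input AS (
  SELECT array_agg(
           hstore(ARRAY['event_id', (i % 1000)::text, 'time', i::text])
           ORDER BY i
         ) AS events
  FROM generate_series(1, 20000) AS i
)
SELECT count(event -> 'event_id')
FROM input, unnest(events) AS event;
```

Comparing the execution times reported by the two EXPLAIN ANALYZE runs, and repeating them with a larger series, gives a feel for how each access pattern scales.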


What is going on here? The key is that PostgreSQL stores an array of hstores as an array of values, not as an array of pointers to values. That is, an array containing three hstores looks like

{"event_id=>1,data=>foo", "event_id=>2,data=>bar", "event_id=>3,data=>baz"}

as opposed to

{[pointer], [pointer], [pointer]}

For element types of variable length, e.g. hstores, json blobs, varchars, or text fields, PostgreSQL must scan the array to find the Nth element. To evaluate events[2], PostgreSQL parses events from the left until it reaches the second entry. Then, for events[3], it starts scanning again from the first index until it reaches the third entry! So evaluating events[sub] is O(sub), and evaluating events[sub] for every index in the array is O(N^2), where N is the length of the array.
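To make the quadratic cost concrete under the model described above (reaching position sub means walking past sub entries), the total number of entry visits for a 100K-element array can be counted directly. This is back-of-the-envelope arithmetic, not a measurement of PostgreSQL internals.

```sql
-- Sum of 1 + 2 + ... + N for N = 100000, i.e. N * (N + 1) / 2.
SELECT sum(sub) AS entries_visited
FROM generate_series(1, 100000) AS sub;
-- → 5000050000
```

That is roughly five billion entry visits, compared with 100,000 for a single pass over the array.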

PostgreSQL could be smarter about this, for example by parsing the array only once in a case like this. The real fix would be for arrays of variable-length elements to be implemented as arrays of pointers to values, so that events[i] could always be evaluated in constant time.


Even so, we should not rely on PostgreSQL to handle this well, because this is not an idiomatic query. Instead of generate_subscripts we can use unnest, which parses the array and returns a set of entries. This way, we never need to index into the array explicitly.
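The difference between the two access styles can be seen on a tiny array. This sketch uses a text array, since text is also a variable-length type; the sample values and the column alias elem are made up for the example.

```sql
-- Positional access: one subscript lookup per generated index.
SELECT sub, (ARRAY['a', 'bb', 'ccc'])[sub] AS elem
FROM generate_subscripts(ARRAY['a', 'bb', 'ccc'], 1) AS sub;

-- Set-returning access: unnest turns the array into rows directly.
SELECT elem
FROM unnest(ARRAY['a', 'bb', 'ccc']) AS elem;
```

Both queries return the elements a, bb, and ccc, but only the first one ever evaluates a subscript expression.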

-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_2(events hstore[]) RETURNS hstore[] AS $$
  SELECT array_agg(event)
  FROM (
    -- Filter for rank = 1, i.e. select the latest event for all collisions on event_id.
    SELECT event
    FROM (
      -- Rank elements with the same event_id by position in the array, descending.
      SELECT event, row_number AS index, rank() OVER
        (PARTITION BY (event -> 'event_id')::BIGINT ORDER BY row_number DESC)
      FROM (
        -- Use unnest instead of generate_subscripts to turn an array into a set.
        -- The position is rebuilt from each event's 'time' key.
        SELECT event, row_number() OVER (ORDER BY event -> 'time')
        FROM unnest(events) AS event
      ) unnested_data
    ) deduped_events
    WHERE rank = 1
    ORDER BY index ASC
  ) to_agg;
$$ LANGUAGE SQL IMMUTABLE;

This is efficient: the time it takes is linear in the size of the input array. It takes about half a second for an input of 100K elements, compared with 40 seconds for the previous implementation.

This does what we need:

    • Parse the array once, with unnest.
    • Partition by event_id.
    • Take the latest occurrence of each event_id.
    • Sort by input index.
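The function above recovers the input position by ordering on each event's time key. As a variation that is not from the original article, PostgreSQL 9.4 and later can return the position directly with unnest ... WITH ORDINALITY. The sketch below assumes such a version and the hstore extension; the function name dedupe_events_ordinality and the sample events are hypothetical.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Hypothetical variant: same dedupe rule, but the position comes from
-- WITH ORDINALITY instead of a 'time' key inside each event.
CREATE OR REPLACE FUNCTION dedupe_events_ordinality(events hstore[]) RETURNS hstore[] AS $$
  SELECT array_agg(event ORDER BY idx)
  FROM (
    -- Number the events within each event_id, latest position first.
    SELECT event, idx,
           row_number() OVER (
             PARTITION BY (event -> 'event_id')::BIGINT
             ORDER BY idx DESC
           ) AS rn
    FROM unnest(events) WITH ORDINALITY AS u(event, idx)
  ) ranked
  -- Keep only the latest occurrence of each event_id.
  WHERE rn = 1;
$$ LANGUAGE SQL IMMUTABLE;

SELECT dedupe_events_ordinality(ARRAY[
  'event_id=>1, data=>foo',
  'event_id=>2, data=>bar',
  'event_id=>1, data=>baz'
]::hstore[]);
```

For this input the result should contain two events: event 2 (data bar) followed by the later copy of event 1 (data baz), in their original relative order.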

Lesson: if you find yourself accessing specific positions of a PostgreSQL array, consider using unnest instead.
