Things to watch out for when using arrays in PostgreSQL

At Heap, we rely on PostgreSQL for most of the backend heavy lifting. We store each event as an hstore blob, and for each tracked user we maintain a PostgreSQL array of the events that user has completed, sorted by time. Hstore lets us attach attributes to events in a flexible way, and event arrays give us strong performance, especially for funnel queries, in which we compute the drop-off between different steps of a conversion funnel.

In this article, we look at a PostgreSQL function that unexpectedly struggled with large inputs, and then rewrite it in an efficient, idiomatic way.

Your first instinct may be to treat arrays in PostgreSQL like their counterparts in C. You may be used to manipulating data through array positions or slices. Be careful not to think this way in PostgreSQL, especially when the array's element type is variable-length, such as JSON, text, or hstore. If you access a PostgreSQL array by position, you are in for an unexpected performance blowup.
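For readers who have not used them, this is what positional access and slicing look like in PostgreSQL. It is a minimal sketch on a literal text array chosen for illustration, not data from the article. Subscripts start at 1 by default.

```sql
-- Positional access and slicing on a small text[] literal (illustrative data).
SELECT
  (ARRAY['a', 'bb', 'ccc'])[2]             AS second_element,  -- 'bb'
  (ARRAY['a', 'bb', 'ccc'])[2:3]           AS slice_2_to_3,    -- {bb,ccc}
  array_length(ARRAY['a', 'bb', 'ccc'], 1) AS len;             -- 3
```

The syntax resembles C-style indexing, which is exactly why it is tempting to assume that each lookup is cheap.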


This came up a few weeks ago at Heap. We maintain an array of events for each tracked user, and each event in that array is represented by an hstore datum. An import pipeline appends new events to the corresponding arrays. To make this pipeline idempotent, we give each event an event_id, and we pass each event array through a function that removes duplicates. If we want to update the attributes attached to an event, we only need to send a new event with the same event_id through the pipeline.
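To make the setup concrete, here is a minimal sketch of such an append step. The table name user_events and its columns are hypothetical, invented for illustration; the article does not show Heap's actual schema or pipeline.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Hypothetical table: one row per tracked user, events kept in an hstore array.
CREATE TEMP TABLE user_events (
  user_id BIGINT PRIMARY KEY,
  events  HSTORE[] NOT NULL DEFAULT '{}'
);

INSERT INTO user_events (user_id) VALUES (1);

-- Import an event, then import it again with an updated attribute.
UPDATE user_events
SET events = array_append(events, 'event_id=>1, data=>foo'::hstore)
WHERE user_id = 1;

UPDATE user_events
SET events = array_append(events, 'event_id=>1, data=>bar'::hstore)
WHERE user_id = 1;

-- Both copies are stored until a deduplication step keeps only the later one.
SELECT array_length(events, 1) AS stored_events
FROM user_events
WHERE user_id = 1;
```

After the two updates the array holds two entries with the same event_id, which is the situation the deduplication function has to resolve.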

Therefore, we need a function that takes an array of hstores and, when two events share the same event_id, keeps the one that appears later in the array. Our first attempt at this function looked like this:
 

-- This is slow, and you don't want to use it!
--
-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_1(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event)
  FROM (
    -- Filter for rank = 1, i.e. select the latest event for any collisions on event_id.
    SELECT event
    FROM (
      -- Rank elements with the same event_id by position in the array, descending.
      SELECT events[sub] AS event, sub, rank()
      OVER (PARTITION BY (events[sub] -> 'event_id')::BIGINT ORDER BY sub DESC)
      FROM generate_subscripts(events, 1) AS sub
    ) deduped_events
    WHERE rank = 1
    ORDER BY sub ASC
  ) to_agg;
$$ LANGUAGE SQL IMMUTABLE;

This works, but it performs poorly on large inputs. Its running time is quadratic: it takes about 40 seconds on an input array with k elements!

These queries were timed on a MacBook Pro with an i7 CPU and 16 GB of RAM. The script used to run them is at https://gist.github.com/drob/9180760.


What happened here? The key is that PostgreSQL stores an array of hstores as an array of values, not as an array of pointers to values. An array containing three hstores looks like

{"event_id=>1,data=>foo", "event_id=>2,data=>bar", "event_id=>3,data=>baz"}

rather than

{[pointer], [pointer], [pointer]}

 

For variable-length types such as hstore, json blobs, varchar, or text, PostgreSQL has to scan the array to find the Nth element, because it cannot know where an element starts without walking over every element before it. To evaluate events[2], PostgreSQL parses events from the left until it reaches the second entry. Then, for events[3], it scans again from the first index until it reaches the third entry! So evaluating events[sub] is O(sub), and evaluating events[sub] for every index in the array is O(N^2), where N is the length of the array.
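The effect can be reproduced without hstore. The following is a minimal sketch that assumes a generated text[] of 20,000 variable-length strings; the data, the array size, and the column names are illustrative and are not the benchmark from the linked gist.

```sql
-- Read every element by subscript: each arr[sub] walks the array from the
-- first element, so the total work grows quadratically with the array length.
WITH data AS (
  SELECT array_agg(repeat('x', 10 + (i % 10)) ORDER BY i) AS arr
  FROM generate_series(1, 20000) AS i
)
SELECT count(arr[sub]) AS elements_by_subscript
FROM data, generate_subscripts(arr, 1) AS sub;

-- Read every element with unnest: the array is walked once.
WITH data AS (
  SELECT array_agg(repeat('x', 10 + (i % 10)) ORDER BY i) AS arr
  FROM generate_series(1, 20000) AS i
)
SELECT count(elem) AS elements_by_unnest
FROM data, unnest(arr) AS elem;
```

Both queries count the same 20,000 elements. Running them in psql with \timing on, and then increasing the array size, is one way to see how differently the two access patterns scale.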

PostgreSQL could choose a smarter plan for this query and parse the array only once. The real fix would be for arrays of variable-length elements to be implemented with pointers to the values, so that events[i] could always be evaluated in constant time.


Even so, we should not leave this to PostgreSQL, because this is not an idiomatic query. Instead of generate_subscripts, we can use unnest, which parses the array and returns a set of entries. This way, we never have to index into the array explicitly.
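When the position of each element is still needed, PostgreSQL 9.4 and later can return it alongside the element through unnest ... WITH ORDINALITY. This is a different technique from the row_number() approach used in this article, shown here as a small sketch on an illustrative text array; the aliases t, elem, and idx are arbitrary.

```sql
-- Turn an array into rows, each carrying its 1-based position,
-- without subscripting into the array.
SELECT elem, idx
FROM unnest(ARRAY['a', 'bb', 'ccc']) WITH ORDINALITY AS t(elem, idx)
ORDER BY idx;
-- Rows: ('a', 1), ('bb', 2), ('ccc', 3)
```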
 

-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_2(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event)
  FROM (
    -- Filter for rank = 1, i.e. select the latest event for any collisions on event_id.
    SELECT event
    FROM (
      -- Rank elements with the same event_id by position in the array, descending.
      SELECT event, row_number AS index, rank()
      OVER (PARTITION BY (event -> 'event_id')::BIGINT ORDER BY row_number DESC)
      FROM (
        -- Use unnest instead of generate_subscripts to turn an array into a set.
        SELECT event, row_number()
        OVER (ORDER BY event -> 'time')
        FROM unnest(events) AS event
      ) unnested_data
    ) deduped_events
    WHERE rank = 1
    ORDER BY index ASC
  ) to_agg;
$$ LANGUAGE SQL IMMUTABLE;

This is efficient: the time it takes is linear in the size of the input array. It takes about half a second on an input of k elements, compared with 40 seconds for the previous implementation.

This does what we need:

  • Parse the array only once, with unnest.
  • Partition by event_id.
  • Take the latest occurrence for each event_id.
  • Sort by input index.
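The same four steps can also be written with unnest ... WITH ORDINALITY and DISTINCT ON. This is one possible variation for PostgreSQL 9.4 and later, not the function from the original post: the name dedupe_events_ordinality and the aliases are hypothetical, and it ranks events by their position in the array instead of by a time key.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Hypothetical variant: keep the last event for each event_id,
-- and return the survivors in their original array order.
CREATE OR REPLACE FUNCTION dedupe_events_ordinality(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event ORDER BY idx)
  FROM (
    -- For each event_id, keep the row with the highest array position.
    SELECT DISTINCT ON ((event -> 'event_id')::BIGINT) event, idx
    FROM unnest(events) WITH ORDINALITY AS t(event, idx)
    ORDER BY (event -> 'event_id')::BIGINT, idx DESC
  ) latest;
$$ LANGUAGE SQL IMMUTABLE;

-- Usage: event_id 1 appears twice, so only its later copy (data => baz) survives.
SELECT dedupe_events_ordinality(ARRAY[
  'event_id=>1, data=>foo',
  'event_id=>2, data=>bar',
  'event_id=>1, data=>baz'
]::hstore[]);
```

The call returns two events, the one with data bar followed by the one with data baz, because the surviving entries keep their relative order from the input array.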

Lesson: if you find yourself accessing a PostgreSQL array by position, consider using unnest instead.
