Use partial indexes to improve PostgreSQL performance


Did you know that PostgreSQL can index only part of a table's data? The advantage of a partial index is that it speeds up reads of the indexed subset without adding much overhead, and it works best when the rows you read repeatedly are selected by a fixed WHERE clause. That makes it a good fit for analytics workflows that pre-aggregate a specific kind of event. In this post I will use an example to show how a partial index can speed up such queries.

Assume that there is an event table with the following structure:

Each event is associated with a user ID, a timestamp, and a JSON payload describing the event. The JSON payload contains the page path, the event type (click, page view, form submission), and other event-specific properties.
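
The source does not reproduce the table definition; the following is a minimal sketch consistent with the description above, where the column names user_id, time, and data are assumptions:

CREATE TABLE event (
    user_id bigint,        -- ID of the user the event belongs to
    time    timestamptz,   -- when the event occurred
    data    jsonb          -- event payload: page path, event type, other properties
);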

We use this table to store all kinds of event logs. Assume we have an automatic event tracker that records every click, page view, and form submission for later analysis. Now suppose we want to build an internal dashboard that displays a few high-value metrics, such as the number of signups per week and the revenue collected per day. A problem arises: the events this dashboard cares about make up only a small fraction of the event table. Even on a high-traffic site, only a tiny share of visits end in a completed transaction, so the interesting rows are buried among all the others; the signal-to-noise ratio is very low.

We want to make the dashboard queries fast. [1] First, we define a signup event: a form submission on the registration page (/signup/). To count the users who signed up during the first week of March, we can run a query along these lines:
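
The original query is not shown in the source; this is a sketch assuming the hypothetical schema above, with illustrative dates:

SELECT COUNT(*)
FROM event
WHERE (data->>'type') = 'submit'
  AND (data->>'path') = '/signup/'
  AND time BETWEEN '2014-03-01' AND '2014-03-08';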

On a dataset of 10 million events, of which only 3,000 are signups, and with no indexes, this query takes 45 seconds.

Full Indexes: A Hodgepodge

The naive way to speed up this query is to create a single-column index for each of the event attributes we filter on: (data->>'type'), (data->>'path'), and time; a sketch of these indexes follows below. With a bitmap scan, PostgreSQL can combine the results of the three index scans. If we query only a small, selective slice of the data and the relevant index pages are in memory, this is quite fast: about 200 milliseconds for the first (cold) query, dropping to around 20 milliseconds on subsequent runs. Compared with the 45-second sequential scan, this is a big improvement.
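
The index definitions are not shown in the source; under the schema sketched above they would look roughly like this (index names are illustrative, and expression indexes need the extra parentheses):

CREATE INDEX event_type_idx ON event ((data->>'type'));
CREATE INDEX event_path_idx ON event ((data->>'path'));
CREATE INDEX event_time_idx ON event (time);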

This indexing method has several drawbacks:

  • Write overhead. All three indexes have to be updated on every INSERT, UPDATE, and DELETE. [2] For a table with a heavy write load, that cost adds up quickly.

  • Query restrictions. This approach also limits how we can define high-value event types: with plain B-tree indexes on the JSON fields we cannot use anything more complex than equality or range comparisons, so we cannot, for example, match by regular expression or find every page whose path starts with /signup/.

  • Disk usage. The table in this example occupies 6,660 MB of disk space, and the three indexes add another 1,026 MB. Over time, both numbers will keep growing. [3]

Partial Indexes

The signup events we analyze make up only 0.03% of the rows in the table, so full indexes, which index every row, are a poor fit. The better way to speed up the query is a partial index.

We create an index on an otherwise unrelated column, time, using our signup definition as the index predicate. Through this index PostgreSQL can find the signup rows directly, which is naturally much faster than combining three full indexes on the relevant fields. Concretely, we build a partial index on the time field like this:

CREATE INDEX event_signups ON event (time)
WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'
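
Note that the planner uses a partial index only when it can prove that the query's WHERE clause implies the index predicate, so the report query has to repeat the same conditions; for example (dates illustrative):

SELECT COUNT(*)
FROM event
WHERE (data->>'type') = 'submit'
  AND (data->>'path') = '/signup/'
  AND time BETWEEN '2014-03-01' AND '2014-03-08';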

With this index in place, the example query drops from the earlier 200 ms to about 2 ms, and it stays fast as the index remains warm in memory. More importantly, the partial index addresses the shortcomings of the full indexes listed above:

  • The index occupies only 96 KB of disk space, roughly 1/10000 of the 1,026 MB used by the three full indexes.

  • The index is updated only when a newly written row satisfies the signup predicate. Since only 0.03% of events qualify, write performance improves dramatically: maintaining this index costs essentially nothing.

  • The partial index lets us use any expression PostgreSQL supports as the predicate. The WHERE clause in the index definition works like the one in a query, so we can write arbitrarily complex conditions: regular expressions, function results, or the prefix match mentioned above (see the sketch after this list).
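
As a sketch of that flexibility (the index name and pattern are illustrative, not from the source), the predicate could use a prefix match instead of exact equality:

CREATE INDEX event_signup_prefix ON event (time)
WHERE (data->>'type') = 'submit' AND (data->>'path') LIKE '/signup%';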

Don't index the predicate's boolean result

I have seen people index the boolean expression directly:

(data->>'type') = 'submit' AND (data->>'path') = '/signup/'

with time as the second column of the index, like this:

CREATE INDEX event_signup_time ON event
(((data->>'type') = 'submit' AND (data->>'path') = '/signup/'), time)

This turns out worse than either of the approaches above, because PostgreSQL's query planner does not treat the indexed boolean expression as a restriction on the rows. That is, the planner does not understand that a query whose WHERE clause is:

WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

restricts the results to rows where the indexed expression:

((data->>'type') = 'submit' AND (data->>'path') = '/signup/')

is true. So when we filter events with this index, rows are read and loaded no matter whether the expression evaluates to true or false, and are only filtered afterwards. [4]

As a result, this index reads a lot of unnecessary data from disk and has to evaluate the condition for every row. On our example dataset the query takes 25 seconds cold and still 8 seconds warm, which is actually worse than a full index on the time field alone.

Partial indexes can greatly speed up queries that filter a table down to a small subset through a predicate, yet judging by the traffic in the #postgresql IRC channel they seem underused. Compared with full indexes they allow a greater range of predicates, and when the predicate is highly selective they also reduce write overhead and disk usage. If you repeatedly query a small slice of a table, a partial index should be your first choice.

Are you starting to love PostgreSQL and want to hear more about its features? Follow @danlovesproofs.

Interested in making powerful technology easier to use? Drop us a line at jobs@heapanalytics.com.

[1] This problem could also be addressed with table partitioning: split the high-value events from the rest of the data into separate child tables. That approach does not scale if there are many kinds of high-value events, though, because every new kind of high-value event would require repartitioning the table.
[2] The heap-only tuple (HOT) optimization can greatly reduce the cost of UPDATEs, but all three indexes still have to be updated on every INSERT and DELETE.
[3] We could also index the three fields together with a single multi-column index, e.g. on ((data->>'type'), (data->>'path'), time). That index takes 755 MB of disk space, 26% less than the three separate indexes, but the other problems remain, and such an index may be useless for other queries over the same data. So if we have several different kinds of high-value events, even the disk-space advantage disappears.
[4] The relevant query plan:
