Optimization of high-concurrency, low-cardinality, arbitrary-combination multi-field queries

Source: Internet
Author: User
1. The problem. First, let me explain what the "low-cardinality multi-field combined query" in the title refers to: queries that meet the following conditions:
1. the search condition involves a combination of multiple fields;
2. the combination of fields is not fixed;
3. each individual field has poor selectivity (low cardinality).

This type of query has many application scenarios, for example the product listing page of an e-commerce site. Users enter all kinds of query condition combinations: category, supplier, brand, promotion, price, and so on, and finally the results have to be sorted and paginated.

What makes this type of problem a headache is:
1. there are a large number of records, so a full table scan is too slow to meet the high-concurrency access requirements;
2. every single field involved in the query condition has very low selectivity, so a single-column index cannot solve the query efficiency problem;
3. if ordinary multi-column B-tree indexes are created, hundreds or even thousands of indexes may be needed because there are so many possible combinations of query conditions, which is neither realistic nor easy to maintain.
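To get a feel for the numbers (my own arithmetic, not from the original text): with 10 optional equality fields there are 2^10 - 1 = 1023 possible non-empty field combinations, which is the same order of magnitude as the "more than 1,000 different SQL statements" mentioned in the case study below; nobody wants to maintain anywhere near that many B-tree indexes on one table.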

2. Solutions. There are two solutions I have come up with for this type of problem.
2.1 Bitmap indexes. A bitmap index stores, for each key, a bitmap of all rows whose value equals that key. For a query combining several keys, you only need to perform bitwise operations on the corresponding bitmaps. Because bitmaps are small and bitwise AND/OR operations are very fast, bitmap indexes are well suited to this kind of query.
Bitmap indexes also have disadvantages: updating a single record can lock a large number of rows, so they are not suitable for workloads with many concurrent writes. Another problem is that among the common relational databases only Oracle seems to support bitmap indexes, while we would rather use an open-source database.
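For illustration only (Oracle-specific syntax; the table and column names here are hypothetical): one bitmap index is declared per low-cardinality column, and the optimizer combines the bitmaps when several of these columns appear in the WHERE clause.

  -- Oracle only: one bitmap index per low-cardinality column (hypothetical schema)
  CREATE BITMAP INDEX idx_prod_brand    ON products (brand_id);
  CREATE BITMAP INDEX idx_prod_category ON products (category_id);

  -- a query combining both conditions can be answered by ANDing the two bitmaps
  SELECT COUNT(*) FROM products WHERE brand_id = 3 AND category_id = 15;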

2.2 Inverted indexes. Inverted indexes are similar to bitmap indexes: for each key they store the set of rows whose value equals that key, and this row set may be kept as a list, a tree, or some other structure. For a query combining several keys, set operations are performed on the row sets of those keys.
Inverted indexes are usually associated with full-text search, but many systems also use them for structured data, Elasticsearch being a typical example. Elasticsearch supports fast search over JSON documents, composite queries, sorting, aggregation, distributed deployment, and many other nice features. But considering the following factors, we would rather find a solution inside the relational database:
- We do not need the advanced fuzzy-matching features of a full-text search engine; what we actually need is exact matching or, at most, simple fuzzy matching.
- We do not need a distributed search cluster, because the data volume is not large.
- The raw data is already in the relational database, so staying there means no data synchronization to worry about.
- The application has already been developed against the relational database interfaces.
- We already know how to operate and maintain relational databases, whereas a brand-new system would bring an unknown number of pitfalls.
- Considering the performance difference between Java and C, a solution built into the relational database may well be no slower than a dedicated search engine.

3. The PostgreSQL solution. If we limit the scope to open-source relational databases, there may be only one answer, i.e. the GIN index of PostgreSQL.
PostgreSQL's GIN index is an inverted index; it can be used not only for full-text retrieval but also for ordinary data types such as int and varchar.
For multi-dimensional queries, we can create indexes as follows (a short SQL sketch follows this list):
1. create a single multi-column GIN index covering all low-cardinality fields involved in equality conditions;
2. for selective fields involved in equality or range conditions, create B-tree indexes.
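A minimal sketch of this strategy (table and column names are hypothetical; the btree_gin extension is what allows a GIN index to be built on plain scalar columns):

  CREATE EXTENSION IF NOT EXISTS btree_gin;

  -- one GIN index covering all low-cardinality equality fields
  CREATE INDEX idx_goods_gin ON goods USING gin (category_id, supplier_id, brand_id, promo_flag);

  -- a separate B-tree index for the selective / range field
  CREATE INDEX idx_goods_price ON goods (price);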

You may wonder about the multi-column index: why is a single multi-column GIN index enough, while with B-tree you have to create several multi-column indexes to cover the various query combinations? The reason is that every column in a multi-column GIN index has equal status; there is no such thing as a leading column to argue about. Therefore one multi-column GIN index is enough to cover all query combinations. A multi-column B-tree index is different: if the query condition does not include the index's leading column, the index cannot be used.
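For example (reusing the hypothetical index above), both of the following queries can be served by the same GIN index even though they touch different column subsets, whereas a B-tree index on (category_id, supplier_id, brand_id, promo_flag) would be of little help to the second query because its leading column is absent:

  -- served by idx_goods_gin via the (category_id, brand_id) keys
  SELECT * FROM goods WHERE category_id = 12 AND brand_id = 7;

  -- also served by idx_goods_gin, even though category_id (the would-be leading column) is missing
  SELECT * FROM goods WHERE supplier_id = 3 AND promo_flag = 1;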

Each key stored in a multi-column GIN index is the combination of a column number and the key datum, so keys from different columns are kept apart without confusion. The value stored for a key is the set of ctids of all records matching that key; when the set is large it is stored as a B-tree and compressed. As a result a GIN index occupies very little space, only about 1/20 of the equivalent B-tree indexes, which also helps performance.

For the fields involved in a multi-dimensional query, those covered by the multi-column GIN index have their ctid sets merged (union or intersection) inside the GIN index itself; the resulting ctid set is then combined with the ctid sets obtained from other indexes through BitmapAnd or BitmapOr. Merging ctid sets inside the GIN index is much more efficient than merging bitmaps between indexes, and GIN is well optimized for low-cardinality fields, so making full use of a GIN index is far more efficient than creating a B-tree index on every field and then merging the result sets with BitmapAnd or BitmapOr.
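One way to see this behavior (my own suggestion, not part of the original test) is to look at the plan: in PostgreSQL a GIN index is always consumed through a bitmap scan, so the plan should show a Bitmap Index Scan on the GIN index feeding a Bitmap Heap Scan, possibly combined with other indexes through BitmapAnd/BitmapOr nodes.

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM goods
  WHERE category_id = 12 AND brand_id = 7 AND price BETWEEN 100 AND 200;
  -- expected plan shape (roughly):
  --   Bitmap Heap Scan on goods
  --     -> BitmapAnd
  --          -> Bitmap Index Scan on idx_goods_gin
  --          -> Bitmap Index Scan on idx_goods_price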


4. A real case
4.1 The original query. The following SQL statement is a simplified version of a real SQL statement from a production system.

  SELECT CASE WHEN gpppur.GB_BEGINDATE <= '2017-02-29 14:36:00' AND gpppur.GB_ENDDATE > '2017-02-29 14:36:00' THEN 1
              WHEN gpppur.PREVIEW_BEGINDT <= '2017-02-29 14:36:00' AND gpppur.PREVIEW_ENDDT > '2017-02-29 14:36:00' THEN 2
              ELSE 3 END AS flag,
         gpppur.*
  FROM T_MPS_INFO gpppur
  WHERE gpppur.ATTRACT_TP = 0
    AND gpppur.COLUMN_ID = 1
    AND gpppur.FIELD2 = 1
    AND gpppur.STATUS = 1
  ORDER BY flag ASC, gpppur.PC_SORT_NUM ASC, gpppur.GB_BEGINDATE DESC
  LIMIT 0, 45
The database is MySQL, the table holds about 600,000 rows in total, and a multi-column index on FIELD2 + STATUS has been created.
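That existing index would have been created with something like the following (the index name is my own guess, not from the original system):

  ALTER TABLE T_MPS_INFO ADD INDEX idx_field2_status (FIELD2, STATUS);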
The value distribution of the four fields involved in the query condition is as follows:

  postgres=# select ATTRACT_TP, count(*) from T_MPS_INFO group by ATTRACT_TP;
   attract_tp | count
  ------------+--------
              |  16196
            6 |    251
            2 |     50
            1 |   3692
            3 |    143
           10 |    314
            4 |    214
            5 | 194333
            9 | 326485
            7 |   1029
            0 |   6458
  (11 rows)

  postgres=# select COLUMN_ID, count(*) from T_MPS_INFO group by COLUMN_ID;
   column_id  | count
  ------------+--------
              |   2557
          285 |     20
          120 |    194
          351 |      2
          337 |     79
          227 |     26
          311 |      9
          347 |      2
          228 |     21
          318 |      1
          314 |      9
           54 |     10
          133 |     27
   2147483647 |      1
          336 |   1056
          364 |      1
          131 |     10
          243 |      5
          115 |    393
           61 |     73
          226 |     40
          196 |     16
          350 |      5
          373 |     72
          377 |      2
          260 |      4
          184 |    181
          363 |      1
          341 |    392
           64 |      1
          344 | 199271
          235 |     17
          294 |    755
          352 |      3
          368 |      1
          225 |      1
          199 |      8
          374 |      2
          248 |      8
           84 |      1
          362 |      1
          361 | 331979
          319 |      7
          244 |     65
          125 |      2
          130 |      1
          272 |     65
           66 |      2
          240 |      2
          775 |      1
          253 |     49
           60 |     45
          121 |      5
          257 |      3
          365 |      1
            0 |      1
          217 |      5
          270 |      1
          122 |     39
           56 |     49
          355 |      5
          161 |      1
          329 |      1
          222 |      9
          261 |    275
            2 |   3816
           57 |     19
          307 |      4
          310 |      8
           97 |     37
          202 |     20
          203 |      3
           85 |      1
          375 |    641
           58 |     98
            1 |   6479
           59 |    114
          185 |      7
          338 |     10
          379 |     17
  (80 rows)

  s=# select FIELD2, count(*) from T_MPS_INFO group by FIELD2;
   field2 | count
  --------+--------
          |   2297
        6 |    469
        2 |    320
        1 |  11452
        3 |    286
       10 |    394
        4 |    291
        5 | 200497
        9 | 331979
        0 |      2
        7 |   1178
  (11 rows)

  postgres=# select STATUS, count(*) from T_MPS_INFO group by STATUS;
   status | count
  --------+--------
          |   2297
        0 |  15002
        3 |      5
        4 |      1
        1 | 531829
        2 |     31
  (6 rows)

Because the values of these fields are distributed very unevenly, we construct the following Lua script, which generates SELECT statements with varying conditions to simulate the load.
qx.lua:

  pathtest = string.match(test, "(.*/)") or ""

  dofile(pathtest .. "common.lua")

  function thread_init(thread_id)
     set_vars()
  end

  function event(thread_id)
     local ATTRACT_TP, COLUMN_ID, FIELD2, STATUS
     ATTRACT_TP = sb_rand_uniform(0, 10)
     COLUMN_ID = sb_rand_uniform(1, 100)
     FIELD2 = sb_rand_uniform(0, 10)
     STATUS = sb_rand_uniform(0, 4)

     -- build the query text; the random values are spliced directly into the SQL
     rs = db_query("select case when gpppur.GB_BEGINDATE <= '2017-02-29 14:36:00' AND gpppur.GB_ENDDATE > '2017-02-29 14:36:00' THEN 1 " ..
        "WHEN gpppur.PREVIEW_BEGINDT <= '2017-02-29 14:36:00' AND gpppur.PREVIEW_ENDDT > '2017-02-29 14:36:00' THEN 2 " ..
        "ELSE 3 end as flag, gpppur.* " ..
        "FROM T_MPS_INFO gpppur " ..
        "WHERE gpppur.ATTRACT_TP = " .. ATTRACT_TP ..
        " AND gpppur.COLUMN_ID = " .. COLUMN_ID ..
        " AND gpppur.FIELD2 = " .. FIELD2 ..
        " AND gpppur.STATUS = " .. STATUS ..
        " ORDER BY flag ASC, gpppur.PC_SORT_NUM ASC, gpppur.GB_BEGINDATE DESC " ..
        "LIMIT 45")
  end

Then we run a stress test with sysbench. The QPS measured at 32 concurrent threads is 64.

  [root@rh6375Gt20150507 ~]# sysbench --db-driver=mysql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --mysql-db=test --mysql-user=mysql --mysql-password=mysql --mysql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
  sysbench 0.5:  multi-threaded system evaluation benchmark

  Running the test with following options:
  Number of threads: 32
  Random number generator seed is 0 and will be ignored

  Threads started!

  OLTP test statistics:
      queries performed:
          read:                            825
          write:                           0
          other:                           0
          total:                           825
      transactions:                        0      (0.00 per sec.)
      read/write requests:                 825    (64.20 per sec.)
      other operations:                    0      (0.00 per sec.)
      ignored errors:                      0      (0.00 per sec.)
      reconnects:                          0      (0.00 per sec.)

  General statistics:
      total time:                          12.8496s
      total number of events:              825
      total time taken by event execution: 399.6003s
      response time:
           min:                                  1.01ms
           avg:                                484.36ms
           max:                              12602.74ms
           approx.  95 percentile:             222.79ms

  Threads fairness:
      events (avg/stddev):           25.7812/24.12
      execution time (avg/stddev):   12.4875/0.23

4.2 After optimization
For the specific SQL above, we could of course create a multi-column index on the four fields involved in the equality conditions (ATTRACT_TP, COLUMN_ID, FIELD2, STATUS). But this SQL statement is only one of more than 1,000 different statements produced by the various query combinations, and the set of query fields differs from statement to statement, so we cannot create a multi-column index for every combination.
So we thought of the PostgreSQL GIN index. To use it, we first migrated the MySQL table definition, indexes, and data to PostgreSQL.
Before adding the GIN index we ran a baseline test. A pleasant surprise: before any optimization, the throughput measured on PostgreSQL was already about 5 times that of MySQL (335/64 ≈ 5).

  [root@rh6375Gt20150507 ~]# sysbench --db-driver=pgsql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --pgsql-db=S --pgsql-user=postgres --pgsql-password=S --pgsql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
  sysbench 0.5:  multi-threaded system evaluation benchmark

  Running the test with following options:
  Number of threads: 32
  Random number generator seed is 0 and will be ignored

  Threads started!

  OLTP test statistics:
      queries performed:
          read:                            1948
          write:                           0
          other:                           0
          total:                           1948
      transactions:                        0      (0.00 per sec.)
      read/write requests:                 1948   (335.52 per sec.)
      other operations:                    0      (0.00 per sec.)
      ignored errors:                      0      (0.00 per sec.)
      reconnects:                          0      (0.00 per sec.)

  General statistics:
      total time:                          5.8059s
      total number of events:              1948
      total time taken by event execution: 172.0538s
      response time:
           min:                                  0.90ms
           avg:                                 88.32ms
           max:                               2885.69ms
           approx.  95 percentile:              80.01ms

  Threads fairness:
      events (avg/stddev):           60.8750/27.85
      execution time (avg/stddev):   5.3767/0.29

Next, add the gin index.

  S=# create extension btree_gin;
  CREATE EXTENSION
  S=# create index idx3 on t_mps_info using gin (attract_tp, column_id, field2, status);
  CREATE INDEX
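As a side check (my own addition, not part of the original benchmark), the size claim from section 3 can be verified by comparing the on-disk size of the GIN index against a comparable B-tree index, e.g.:

  S=# select pg_size_pretty(pg_relation_size('idx3'));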

After another stress test, the measured QPS is 5412, about 85 times that of MySQL (5412/64 ≈ 85).

  [root@rh6375Gt20150507 ~]# sysbench --db-driver=pgsql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --pgsql-db=S --pgsql-user=postgres --pgsql-password=S --pgsql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
  sysbench 0.5:  multi-threaded system evaluation benchmark

  Running the test with following options:
  Number of threads: 32
  Random number generator seed is 0 and will be ignored

  Threads started!

  OLTP test statistics:
      queries performed:
          read:                            10000
          write:                           0
          other:                           0
          total:                           10000
      transactions:                        0      (0.00 per sec.)
      read/write requests:                 10000  (5412.80 per sec.)
      other operations:                    0      (0.00 per sec.)
      ignored errors:                      0      (0.00 per sec.)
      reconnects:                          0      (0.00 per sec.)

  General statistics:
      total time:                          1.8475s
      total number of events:              10000
      total time taken by event execution: 58.2706s
      response time:
           min:                                  0.95ms
           avg:                                  5.83ms
           max:                                 68.36ms
           approx.  95 percentile:               9.42ms

  Threads fairness:
      events (avg/stddev):           312.5000/47.80
      execution time (avg/stddev):   1.8210/0.02

4.3 Supplement
As a comparison, we added a multi-column index containing attract_tp, column_id, field2, and status on MySQL; the measured QPS was a little over 4,000, still behind PostgreSQL. So the widely circulated claim that MySQL outperforms PostgreSQL on simple queries does not necessarily hold! (For complex queries PostgreSQL performs far better than MySQL, and the SQL in this example can hardly be called a complex query, right?)
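The comparison index on MySQL would look something like this (the index name is my own):

  ALTER TABLE T_MPS_INFO ADD INDEX idx_all4 (ATTRACT_TP, COLUMN_ID, FIELD2, STATUS);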


5. Summary. GIN indexes (together with the similar GiST and SP-GiST indexes) are a major feature of PostgreSQL, and many interesting uses can be built on them. If you are interested in the scenario described in this article, you can compare this approach with Oracle's bitmap index and with search-engine-based solutions (Elasticsearch, Solr, etc.). My knowledge is limited; if there are better solutions, I would be glad to hear about them.



