Multidimensional Data Query Efficiency Analysis (2)


Last time, we analyzed PostgreSQL's query efficiency under the extra-attribute-table design. Since PostgreSQL is often said to have "performance problems", it seemed only fair to try again with MySQL and the MyISAM engine. After these two common open source databases have been analyzed in detail, the topic is then extended to row-based storage in general and to the impact of indexes on query efficiency. The MySQL used in the experiments below is actually the MariaDB release. As before, let's start with the table definition:

MariaDB [veil]> SHOW CREATE TABLE CC2;
+-------+------------------------------------------------------------------+
| Table | Create Table                                                     |
+-------+------------------------------------------------------------------+
| cc2   | CREATE TABLE `cc2` (
  `contact_id` int(11) NOT NULL,
  `cad_id` int(11) NOT NULL,
  `value` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [veil]> SELECT COUNT(*) FROM CC2;
+----------+
| COUNT(*) |
+----------+
|  5904385 |
+----------+
1 row in set (0.00 sec)

MyISAM keeps the exact row count of the table in its metadata, so a plain COUNT(*) with no WHERE clause costs nothing.
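
As an aside, the same number can be read straight from the table's metadata without touching the data at all:

-- for MyISAM the Rows field here is exact (for InnoDB it would only be an estimate)
SHOW TABLE STATUS LIKE 'CC2'\G

Next, create the necessary indexes: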

MariaDB [veil]> CREATE INDEX contact_id_idx ON CC2 (contact_id);
Query OK, 5904385 rows affected (3.11 sec)
Records: 5904385  Duplicates: 0  Warnings: 0

MariaDB [veil]> CREATE INDEX cad_id_value_idx ON CC2 (cad_id, value);
Query OK, 5904385 rows affected (8.17 sec)
Records: 5904385  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM CC2 WHERE cad_id = 101 AND value = 5;
+----------+
| COUNT(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.18 sec)

We can see the speed is quite decent: about 4 times faster than the same query in PostgreSQL. However:

MariaDB [veil]> SELECT COUNT(contact_id) FROM CC2 WHERE cad_id = 101 AND value = 5;
+-------------------+
| COUNT(contact_id) |
+-------------------+
|            998839 |
+-------------------+
1 row in set (0.41 sec)

This query is slower than the previous one: COUNT(*) only needs to count the row ids returned by the index scan, whereas COUNT(contact_id) has to read the actual contact_id column, and that column is not part of cad_id_value_idx.
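
A quick way to see the difference (a sketch of the check, not one of the timed runs) is to EXPLAIN both forms and compare the Extra column:

-- hypothetical check; exact output depends on the version and statistics
EXPLAIN SELECT COUNT(*)          FROM CC2 WHERE cad_id = 101 AND value = 5;
EXPLAIN SELECT COUNT(contact_id) FROM CC2 WHERE cad_id = 101 AND value = 5;
-- "Using index" in Extra means the query is satisfied from the index alone (a covering index);
-- without it, the engine has to fetch the column from the data file for every matching row.

If we create a wider index that does include contact_id, the speed comes back: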

MariaDB [veil]> CREATE INDEX cad_id_value_contact_id_idx ON CC2 (cad_id, value, contact_id);
Query OK, 5904385 rows affected (13.37 sec)
Records: 5904385  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(contact_id) FROM CC2 WHERE cad_id = 101 AND value = 5;
+-------------------+
| COUNT(contact_id) |
+-------------------+
|            998839 |
+-------------------+
1 row in set (0.21 sec)

MySQL calls this a covering index. PostgreSQL only gets the equivalent, index-only scans, in version 9.2, which apparently has not been released yet. At least that is my guess.
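
For reference, a minimal sketch of what the PostgreSQL 9.2+ equivalent would look like (not tested here; the table and index names are simply assumed to match the ones above):

-- PostgreSQL 9.2+ sketch: index-only scans
CREATE INDEX cad_id_value_contact_id_idx ON cc2 (cad_id, value, contact_id);
VACUUM cc2;  -- keeps the visibility map current so the planner can choose an index-only scan
EXPLAIN SELECT count(contact_id) FROM cc2 WHERE cad_id = 101 AND value = 5;
-- a plan node "Index Only Scan using cad_id_value_contact_id_idx" is PostgreSQL's
-- counterpart of MySQL's "Using index" (covering index)
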
Now let's see whether MySQL does better than PostgreSQL with two conditions. First, try INTERSECT:

MariaDB [veil]> SELECT COUNT(*) FROM (SELECT contact_id FROM CC AS a1 WHERE a1.cad_id = 101 AND a1.value = 5 INTERSECT SELECT contact_id FROM CC AS a2 WHERE a2.cad_id = 102 AND a2.value = 7) AS temp;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'INTERSECT SELECT contact_id FROM CC AS a2 WHERE a2.cad_id = 102 AND a2.value = 7' at line 1

MySQL does not support INTERSECT, and we already know from last time that IN + subquery is bound to be slower.
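
Just for the record, the IN + subquery form being ruled out would look roughly like this (a sketch, not timed here; MySQL of this era tends to re-execute such a subquery once per outer row, which is exactly why it is expected to be slow):

SELECT COUNT(*)
FROM CC AS a1
WHERE a1.cad_id = 101 AND a1.value = 5
  AND a1.contact_id IN (SELECT a2.contact_id
                        FROM CC AS a2
                        WHERE a2.cad_id = 102 AND a2.value = 7);

So the only option left is an inner join: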

MariaDB [veil]> SELECT COUNT(*) FROM CC AS a1, CC AS a2 WHERE a1.contact_id = a2.contact_id AND a1.cad_id = 101 AND a1.value = 5 AND a2.cad_id = 102 AND a2.value = 7;
+----------+
| COUNT(*) |
+----------+
|   164788 |
+----------+
1 row in set (6.56 sec)

MariaDB [veil]> SELECT COUNT(a1.contact_id) FROM CC AS a1, CC AS a2 WHERE a1.contact_id = a2.contact_id AND a1.cad_id = 101 AND a1.value = 5 AND a2.cad_id = 102 AND a2.value = 7;
+----------------------+
| COUNT(a1.contact_id) |
+----------------------+
|               164788 |
+----------------------+
1 row in set (6.67 sec)

You read that right. The result is miserable: it is even slower than IN + subquery in PostgreSQL. Let's see what is going on:

MariaDB [veil]> EXPLAIN EXTENDED SELECT COUNT(a1.contact_id) FROM CC AS a1, CC AS a2 WHERE a1.contact_id = a2.contact_id AND a1.cad_id = 101 AND a1.value = 5 AND a2.cad_id = 102 AND a2.value = 7;
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys                      | key                     | key_len | ref                            | rows   | filtered | Extra                    |
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
|  1 | SIMPLE      | a1    | ref  | contact_id,cad_id_value_contact_id | cad_id_value_contact_id | 10      | const,const                    | 808542 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | a2    | ref  | contact_id,cad_id_value_contact_id | cad_id_value_contact_id | 15      | const,const,veil.a1.contact_id |      1 |   100.00 | Using index              |
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)

The output is not as detailed as PostgreSQL's analysis, but at least we can see that the indexes really are being used. So it does look as if MySQL is simply not good at this kind of join: the join processing is the problem, and it does not hold up once multiple conditions are involved.
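
For what it is worth, one knob that could still be tried is pinning the join order with STRAIGHT_JOIN, which at least rules out a bad plan choice (not timed here, and judging from the EXPLAIN above the optimizer already picked the obvious order):

SELECT STRAIGHT_JOIN COUNT(a1.contact_id)
FROM CC AS a1, CC AS a2
WHERE a1.contact_id = a2.contact_id
  AND a1.cad_id = 101 AND a1.value = 5
  AND a2.cad_id = 102 AND a2.value = 7;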

For our first attempt, we can summarize the following points:

    1. The disk-based approach does not work; the data has to be held in memory.
    2. No condition: PostgreSQL is slow, and so is MySQL.
    3. One condition: PostgreSQL is fast, and MySQL is even faster (thanks to the covering index).
    4. Two conditions: PostgreSQL is slow, and MySQL is even slower; inner join is a little slower than INTERSECT.
    5. Three conditions: PostgreSQL is faster than MySQL, and inner join is faster than INTERSECT.

Overall, keeping the response time under one second is quite difficult, so the modeling approach has to change. Let's see whether pivoting the attribute rows into columns works better:

MariaDB [veil]> SHOW CREATE TABLE spike;
+-------+------------------------------------------------------------------+
| Table | Create Table                                                     |
+-------+------------------------------------------------------------------+
| spike | CREATE TABLE `spike` (
  `contact_id` int(11) NOT NULL,
  `a1` int(11) DEFAULT NULL,
  `a2` int(11) DEFAULT NULL,
  `a3` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [veil]> SELECT * FROM spike LIMIT 10;
+------------+------+------+------+
| contact_id | a1   | a2   | a3   |
+------------+------+------+------+
|     800001 |    4 | NULL | NULL |
|     800003 |    5 |    7 |    1 |
|     800004 |    3 | NULL | NULL |
|     800005 |    3 | NULL | NULL |
|     800006 |    5 | NULL | NULL |
|     800007 | NULL |    6 | NULL |
|     800008 |    5 | NULL | NULL |
|     800009 |    4 |    8 | NULL |
|     800011 | NULL |    7 | NULL |
|     800012 |    3 |    6 |    1 |
+------------+------+------+------+
10 rows in set (0.01 sec)

Column A1 holds the value for cad_id = 101, A2 the value for cad_id = 102, and A3 the value for cad_id = 6.
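
A minimal sketch of how such a pivot table could be produced from the attribute table, assuming the data comes straight from CC2 (the actual load step is not shown here):

-- hypothetical load step: one row per contact, one column per cad_id of interest
CREATE TABLE spike ENGINE=MyISAM AS
SELECT contact_id,
       MAX(CASE WHEN cad_id = 101 THEN `value` END) AS a1,
       MAX(CASE WHEN cad_id = 102 THEN `value` END) AS a2,
       MAX(CASE WHEN cad_id = 6   THEN `value` END) AS a3
FROM CC2
GROUP BY contact_id;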

First, let's try the performance with a single condition:

 
MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5;
+----------+
| COUNT(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.26 sec)

The speed is decent even though we have not created an index on A1 yet. That is because for a row-based database such as PostgreSQL or MySQL, sequential scan performance depends largely on the row size. If the table had 100 columns, the performance certainly would not be this good. If you don't believe it, try this:

MariaDB [veil]> ALTER TABLE spike ADD COLUMN dummy1 TEXT;
Query OK, 4074980 rows affected (1.28 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5;
+----------+
| COUNT(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.64 sec)

Sure enough, just adding one text column slows the query down. To counter the fact that queries against row-based storage get slower and slower as the number of columns grows, an index becomes essential:

 
MariaDB [veil]> CREATE INDEX a1_idx ON spike (a1);
Query OK, 4074980 rows affected (4.97 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5;
+----------+
| COUNT(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.22 sec)

The query time is back to normal, because a1_idx acts as a covering index here: COUNT does not need to read the rows at all; scanning the index is enough to know how many rows match. And MyISAM, unlike PostgreSQL, has no MVCC hint bits to update on the original rows, so the whole query becomes independent of the row size.

 
MariaDB [veil]> ALTER TABLE spike ADD COLUMN dummy2 TEXT;
Query OK, 4074980 rows affected (5.00 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> ALTER TABLE spike ADD COLUMN dummy3 TEXT;
Query OK, 4074980 rows affected (5.30 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5;
+----------+
| COUNT(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.23 sec)

See, I'm not making this up.
What about two conditions? First, put the table structure back to normal and drop the index:

MariaDB [veil]> ALTER TABLE spike DROP COLUMN dummy1;
Query OK, 4074980 rows affected (5.19 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> ALTER TABLE spike DROP COLUMN dummy2;
Query OK, 4074980 rows affected (4.98 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> ALTER TABLE spike DROP COLUMN dummy3;
Query OK, 4074980 rows affected (3.41 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> ALTER TABLE spike DROP INDEX a1_idx;
Query OK, 4074980 rows affected (0.46 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SHOW CREATE TABLE spike;
+-------+------------------------------------------------------------------+
| Table | Create Table                                                     |
+-------+------------------------------------------------------------------+
| spike | CREATE TABLE `spike` (
  `contact_id` int(11) NOT NULL,
  `a1` int(11) DEFAULT NULL,
  `a2` int(11) DEFAULT NULL,
  `a3` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+------------------------------------------------------------------+
1 row in set (0.00 sec)

Now query with two conditions:

 
MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7;
+----------+
| COUNT(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.30 sec)

Not bad! Let's create an index on A1; it should be even faster:

MariaDB [veil]> CREATE INDEX a1_idx ON spike (a1);
Query OK, 4074980 rows affected (3.06 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7;
+----------+
| COUNT(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.45 sec)

Damn, it got slower... The reason is not that the index isn't used; it is slower precisely because the index is used.

MariaDB [veil]> EXPLAIN SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7;
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows    | Extra       |
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
|  1 | SIMPLE      | spike | ref  | a1_idx        | a1_idx | 5       | const | 1420338 | Using where |
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
1 row in set (0.01 sec)

The a1_idx index is used, but A2 is not part of it, so it is not a covering index: the query cannot be answered from the index alone and has to go back to the row store to check a2 = 7 for every matching row.
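
One way to confirm that the index itself is the culprit (a quick check, not timed here) is to make the optimizer ignore it and fall back to the sequential scan we had before a1_idx existed:

-- hypothetical check: force the table scan and compare the timing
SELECT COUNT(*) FROM spike IGNORE INDEX (a1_idx) WHERE a1 = 5 AND a2 = 7;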

MariaDB [veil]> CREATE INDEX a2_idx ON spike (a2);
Query OK, 4074980 rows affected (5.71 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7;
+----------+
| COUNT(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.40 sec)

Creating a separate a2_idx does not help either: while scanning a1_idx there is still no A2 value at hand, so the query still has to go back to the rows to fetch A2.

MariaDB [veil]> CREATE INDEX a1_a2_idx ON spike (a1, a2);
Query OK, 4074980 rows affected (9.47 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7;
+----------+
| COUNT(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.06 sec)

Now that's speed! The query is answered entirely from the index.

MariaDB [veil]> ALTER TABLE spike ADD COLUMN dummy1 TEXT;
Query OK, 4074980 rows affected (12.33 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> ALTER TABLE spike ADD COLUMN dummy2 TEXT;
Query OK, 4074980 rows affected (12.87 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> ALTER TABLE spike ADD COLUMN dummy3 TEXT;
Query OK, 4074980 rows affected (13.15 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7;
+----------+
| COUNT(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.06 sec)

It no longer matters how big a row gets, because the real rows are never read at all. But once the covering index is removed, we are right back where we started:

MariaDB [veil]> ALTER TABLE spike DROP INDEX a1_a2_idx;
Query OK, 4074980 rows affected (8.74 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7;
+----------+
| COUNT(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.62 sec)

MariaDB [veil]> ALTER TABLE spike DROP INDEX a1_idx;
Query OK, 4074980 rows affected (5.31 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> ALTER TABLE spike DROP INDEX a2_idx;
Query OK, 4074980 rows affected (1.93 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7;
+----------+
| COUNT(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.72 sec)

So the wide-table design really is a problem: once a query cannot be covered by an index, the numbers get ugly.

If covering indexes are so good, why not create a covering index for every query and solve the problem once and for all?

That sounds like a brilliant idea. Let's try it:

MariaDB [veil]> CREATE INDEX a1_a2_a3_idx ON spike (a1, a2, a3);
Query OK, 4074980 rows affected (7.03 sec)
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5 AND a2 = 7 AND a3 = 1;
+----------+
| COUNT(*) |
+----------+
|     6748 |
+----------+
1 row in set (0.01 sec)

It works! A covering index again. It seems that once this index is created, every query is taken care of.

 
MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a1 = 5;
+----------+
| COUNT(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.28 sec)

Yes, it does!

MariaDB [veil]> SELECT COUNT(*) FROM spike WHERE a3 = 1;
+----------+
| COUNT(*) |
+----------+
|   207228 |
+----------+
1 row in set (1.06 sec)

I was happy too early; this one fell flat. The reason is that the index column order is (A1, A2, A3). A query on A1 can use the index because A1 is the leading prefix, and the same goes for a query on A1 and A2. But A3 alone does not match a prefix of the index, so the index cannot be used. To truly cover every case we would need an index for every combination of columns, and for a table with a dozen or more columns that number is far more than the system can bear.
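
A rough back-of-the-envelope count shows how fast this blows up: a composite index on n columns serves, as a leading prefix, only n of the possible column combinations, while a query may equality-filter on any of the 2^n - 1 non-empty combinations, so

    indexes needed  >=  (2^n - 1) / n
    for n = 12:         (2^12 - 1) / 12  =  4095 / 12  ≈  342 indexes

and that is before worrying about which column order each individual query actually needs.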

To sum up: whether the data sits on disk or in memory, and whether the attributes are modeled as an extra attribute table or pivoted into columns, neither PostgreSQL nor MySQL can deliver responsive queries. Traditional row-based databases simply cannot meet the need for fast multidimensional queries.

 

 
