Multidimensional Data Query Efficiency Analysis (3)

Previous parts in this series:

Multidimensional Data Query Efficiency Analysis (1)
Multidimensional Data Query Efficiency Analysis (2)

In the previous parts we analyzed the challenges of multi-dimensional data queries in PostgreSQL and MySQL. The root cause is that in a row-based database, once a table grows wide (a Wide Table) and the index alone cannot answer the whole query, entire rows have to be read, so the row width directly drags down performance. Column-based databases were invented to avoid this defect of row-based storage. There are many column-store databases, but most of them are expensive commercial products; MonetDB is open source. With the same data volume and the same Wide Table design, MonetDB can be much faster:

 
SQL> SELECT COUNT(contact_id) FROM spike2 WHERE a1 = 7;
+---------+
| L24     |
+=========+
| 1830569 |
+---------+
1 tuple (24.576 ms)
SQL> SELECT COUNT(contact_id) FROM spike2 WHERE a1 = 7 AND a2 = 5;
+---------+
| L25     |
+=========+
| 1164788 |
+---------+
1 tuple (105.352 ms)
SQL> SELECT COUNT(contact_id) FROM spike2 WHERE a1 = 7 AND a2 = 5 AND a3 = 1;
+-------+
| L26   |
+=======+
| 14040 |
+-------+
1 tuple (24.963 ms)

Judging from the results, MonetDB is generally about 10 times faster, and in extreme cases up to 100 times faster. Besides column-based storage, MonetDB has another secret: SQL is compiled into a series of small relational algebra operations. Each relational algebra operation is simple, like addition; it just combines two columns. Every such operation is hand-optimized in C or even assembly. A typical row-based database, by contrast, runs an interpreter inside its outer loop, feeding it the parsed SQL statement and one row record at a time. MonetDB's compiled approach keeps the interpreter out of the loop, which helps the CPU with branch prediction: interpreter code is far more complicated than a hard-coded relational algebra operation, so its branch-prediction failure rate is correspondingly much higher.
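To make the contrast concrete, here is a minimal sketch in C (purely illustrative, not MonetDB source code) of the two execution styles: an interpreter that re-dispatches on the operator for every single row, versus a hard-coded column primitive whose tight loop the CPU can predict almost perfectly.

/* Sketch only, not MonetDB source: row-at-a-time interpretation vs. a
 * hard-coded column primitive. */
#include <stdio.h>
#include <stddef.h>

enum op { OP_ADD, OP_MUL };

/* Row-store style: the interpreter sits inside the per-row loop, so every
 * row pays for operator dispatch and branch prediction keeps failing. */
static void interpret(enum op o, const long *a, const long *b, long *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        switch (o) {                     /* re-decoded for every single row */
        case OP_ADD: out[i] = a[i] + b[i]; break;
        case OP_MUL: out[i] = a[i] * b[i]; break;
        }
    }
}

/* Column-store style: SQL is compiled into a plan of simple relational-
 * algebra primitives; each primitive is a dedicated, predictable loop. */
static void col_add(const long *a, const long *b, long *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

int main(void) {
    long a[] = {1, 2, 3}, b[] = {10, 20, 30}, out[3];
    interpret(OP_ADD, a, b, out, 3);
    col_add(a, b, out, 3);
    printf("%ld %ld %ld\n", out[0], out[1], out[2]);   /* prints: 11 22 33 */
    return 0;
}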

MonetDB also has an obvious defect: every relational algebra operation must compute its result completely (full materialization), which costs memory. For example, to evaluate SELECT A + B + C, it first computes A + B for every row and only then adds C to the materialized intermediate. The more complex the computation, the more memory the intermediate results consume. The professors behind MonetDB clearly anticipated this problem and had a graduate student research how to avoid full materialization. The result was a project named X100. The PhD student behind it took the work to industry as VectorWise, which was later acquired by Actian. X100 is very clever: to compute A + B + C, it does not evaluate A + B over all rows and then add C; instead it splits the data into segments of a few KB and runs A + B + C segment by segment. The benefit is that the CPU cache is used effectively: the A + B intermediate for the current segment is still in the cache when C is added.
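The same idea can be sketched in a few lines of C (an illustration of the principle, not the actual X100 code): the materialized version writes the entire A + B intermediate to memory before touching C, while the vectorized version walks the columns in small chunks so the intermediate never leaves the CPU cache. The chunk size of 1024 values is an assumption chosen to stay within a few KB.

/* Sketch of the idea behind X100/VectorWise, not its actual code. */
#include <stdio.h>
#include <stddef.h>

/* MonetDB style: fully materialize A + B for all n rows, then add C.
 * The intermediate array tmp costs memory bandwidth and cache misses. */
static void add3_materialized(const int *A, const int *B, const int *C,
                              int *tmp, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) tmp[i] = A[i] + B[i];
    for (size_t i = 0; i < n; i++) out[i] = tmp[i] + C[i];
}

/* X100 style: process a few-KB vector at a time; the A + B intermediate
 * for the current chunk is still in the CPU cache when C is added. */
#define VECTOR_SIZE 1024                 /* assumed chunk size, about 4 KB of ints */
static void add3_vectorized(const int *A, const int *B, const int *C,
                            int *out, size_t n) {
    int tmp[VECTOR_SIZE];
    for (size_t base = 0; base < n; base += VECTOR_SIZE) {
        size_t len = (n - base < VECTOR_SIZE) ? n - base : VECTOR_SIZE;
        for (size_t i = 0; i < len; i++) tmp[i] = A[base + i] + B[base + i];
        for (size_t i = 0; i < len; i++) out[base + i] = tmp[i] + C[base + i];
    }
}

int main(void) {
    enum { N = 5000 };
    static int A[N], B[N], C[N], tmp[N], out1[N], out2[N];
    for (size_t i = 0; i < N; i++) { A[i] = (int)i; B[i] = (int)(2 * i); C[i] = 3; }
    add3_materialized(A, B, C, tmp, out1, N);
    add3_vectorized(A, B, C, out2, N);
    printf("%d %d\n", out1[N - 1], out2[N - 1]);   /* both paths give the same result */
    return 0;
}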

Advanced column-store databases can cope when memory is insufficient and data has to be read from disk. MonetDB cannot: it is built entirely on memory-mapped files, and once the operating system has to swap pages to disk, performance becomes terrible.
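For reference, mapping a column file into memory looks roughly like this generic POSIX sketch (not MonetDB's internal code; the file name a1.col is hypothetical). The operating system pages the mapped data in and out transparently, which is exactly why performance collapses once the working set no longer fits in RAM and the pager starts thrashing.

/* Generic POSIX mmap sketch of scanning a memory-mapped column file;
 * illustrative only, not MonetDB internals. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void) {
    int fd = open("a1.col", O_RDONLY);           /* hypothetical column file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* The whole column is mapped; the OS pages it in lazily and, when RAM
     * runs short, pages it back out. That swapping is what kills performance. */
    const int *col = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (col == MAP_FAILED) { perror("mmap"); return 1; }

    long count = 0;
    for (off_t i = 0; i < st.st_size / (off_t)sizeof(int); i++)
        if (col[i] == 7) count++;                /* e.g. WHERE a1 = 7 */
    printf("%ld\n", count);

    munmap((void *)col, st.st_size);
    close(fd);
    return 0;
}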

How much data can be processed when everything has to fit in memory? We can estimate the size of a single column:

5904385 integer values => 23 MB raw data
10518813 integer values => 41 MB raw data
10518813 datetime/bigint values => 71 MB raw data
9531627 integer values => 37 MB raw data
9531627 varchar(128) values => 37 MB + 320 MB raw data (the 320 MB is the dictionary; it varies)

It is worth noting that NULL values also occupy space.
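These figures are roughly consistent with a naive rows × bytes-per-value estimate for the integer columns; the datetime/bigint and varchar figures depend on the concrete type widths and dictionaries, so they deviate from the naive estimate. A back-of-the-envelope estimator, assuming 4-byte integers and 8-byte bigints and ignoring dictionaries and NULL masks, might look like this:

/* Naive column-footprint estimate: rows x bytes per value. The per-type byte
 * widths are assumptions; dictionaries, NULL masks and other per-column
 * overhead are not included, so measured sizes will deviate somewhat. */
#include <stdio.h>

static double col_mb(long rows, int bytes_per_value) {
    return rows * (double)bytes_per_value / (1024.0 * 1024.0);
}

int main(void) {
    printf("5904385 x int (4 B):     %.1f MB\n", col_mb(5904385, 4));  /* ~23 MB, as reported above */
    printf("10518813 x int (4 B):    %.1f MB\n", col_mb(10518813, 4)); /* ~40 MB vs. 41 MB above */
    printf("10518813 x bigint (8 B): %.1f MB\n", col_mb(10518813, 8)); /* naive; 71 MB was measured above */
    printf("9531627 x int (4 B):     %.1f MB\n", col_mb(9531627, 4));  /* ~36 MB vs. 37 MB above */
    return 0;
}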

Based on these numbers, a server with 100+ GB of memory can still handle tens of millions of rows; the last three months of transaction data for a not-too-large business can still be processed.

Therefore, if the data volume to be processed does not exceed 100 GB, MonetDB is worth a try. If the data volume is larger, either choose a commercial product such as Vertica or VectorWise, or do the sharding yourself. Another trend is that traditional row-based databases are absorbing the characteristics of column-based storage, and hybrid layouts such as PAX exist. In general there is no perfect, let alone mature, open-source solution in this field, perhaps because the companies with such needs tend to have deep pockets. On the other hand, if you are not too picky about performance, plain PostgreSQL and MySQL are actually quite usable (the 1-second limit set in this series is admittedly not very realistic; is it really unacceptable for an analysis task to run for a few minutes?).

 

 
