Reading Notes: Dremel

Author: Liu Xuhui (Raymond). Please credit the source when reprinting.

Email: colorant at 163.com

Blog: http://blog.csdn.net/colorant/

More paper reading notes: http://blog.csdn.net/colorant/article/details/8256145


Reading Notes:
Dremel: Interactive Analysis of Web-Scale Datasets

 

Keywords

Column Storage

 

= Target Question =

Fast, interactive ad-hoc queries over large-scale, sparse, structured data

 

= Core Idea =

 

First, Dremel gives users a query language with SQL-like syntax. More importantly, that language operates on nested structured data.

 

  • It uses a column-based storage scheme in which each data unit is represented by a triple: the value of the data unit, plus its nesting level (the definition level, in the paper's terms) and its repetition level within the nested data structure. Combined with the schema definition associated with the table, this efficiently encodes where each data unit sits in the overall data structure. For details, see the relevant sections of the paper.

 

  • With this storage scheme, combined with a clever algorithm that strips out part of the empty data, sparse data benefits in several ways:

    • Simpler data compression algorithms can be used.
    • The amount of data actually stored is reduced.
    • Lookups are optimized for the columnar structure, so fast lookups are possible without reconstructing the records.
    • When only some fields are retrieved, on-demand reading (plus a clever algorithm that builds a scan process simpler than a full table scan) greatly reduces the amount of data to be read.
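
To make the triple representation above concrete, here is a minimal sketch in Python. It assumes a hypothetical schema, invented for illustration, with a repeated group `name` that holds an optional field `code`. It follows the paper's terminology, in which the nesting level is called the definition level. The function names `stripe` and `assemble` are also assumptions, and this is a simplified single-column illustration rather than the paper's general algorithm.

```python
from typing import Optional

# Hypothetical schema for illustration (not taken from the paper):
#   record { repeated group name { optional string code } }
# The column "name.code" has max repetition level 1 and max definition level 2.
Triple = tuple[Optional[str], int, int]  # (value, repetition level, definition level)


def stripe(records: list[list[dict]]) -> list[Triple]:
    """Flatten the name.code column of each record into triples."""
    column: list[Triple] = []
    for names in records:
        if not names:
            # Record has no `name` at all: a single NULL with definition level 0.
            column.append((None, 0, 0))
            continue
        for i, name in enumerate(names):
            r = 0 if i == 0 else 1  # 0 starts a new record, 1 repeats `name`
            if "code" in name:
                column.append((name["code"], r, 2))
            else:
                column.append((None, r, 1))  # `name` exists, `code` is missing
    return column


def assemble(column: list[Triple]) -> list[list[dict]]:
    """Rebuild the records from the triples alone."""
    records: list[list[dict]] = []
    for value, r, d in column:
        if r == 0:
            records.append([])  # repetition level 0 marks a record boundary
        if d == 0:
            continue  # this record has no `name`
        records[-1].append({"code": value} if d == 2 else {})
    return records


records = [
    [{"code": "en-us"}, {}, {"code": "en-gb"}],  # three names, one without a code
    [],                                          # no names at all
    [{}],                                        # one name without a code
]
column = stripe(records)
```

The two levels are enough to rebuild the nesting without storing the empty structure itself. The sketch keeps `None` values in the list for readability; a real columnar store would not need to store them explicitly, since the definition level already says the value is missing.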

 

Second, there is the execution model of the cluster nodes.

 

  • Serving tree: a layered processing model

    • Each leaf node processes only a small amount of data, and the results are aggregated layer by layer up to the higher-level nodes. A topology with few levels handles queries with large input and small output effectively, while a topology with more levels is better suited to queries that produce a large amount of output.
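
The layered aggregation described above can be sketched as follows. This is a minimal single-process illustration, assuming a GROUP BY count query. The names `leaf_scan`, `merge`, `serve`, and the `fanout` parameter are hypothetical and do not come from the paper.

```python
from collections import Counter


def leaf_scan(tablet: list[str]) -> Counter:
    """Leaf server: partial GROUP BY count over its own small slice of data."""
    return Counter(tablet)


def merge(partials: list[Counter]) -> Counter:
    """Intermediate or root server: combine the partial results of its children."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def serve(tablets: list[list[str]], fanout: int) -> Counter:
    """Aggregate bottom-up; a smaller fanout gives a deeper tree."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_scan(t) for t in tablets]
    if not level:
        return Counter()
    while len(level) > 1:
        # Each parent merges the results of up to `fanout` children.
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


tablets = [["a", "b"], ["a"], ["c", "a"], ["b"]]
result = serve(tablets, fanout=2)
```

The final result is the same for any tree depth. What changes is how much merging work each node does, which is why deeper trees help when the per-node output (for example, the number of groups) is large.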

 

Overall, my personal understanding is that Dremel is faster than row-based MapReduce applications for two reasons: it reduces the volume of data I/O, and it uses a hierarchical aggregation structure over a large number of leaf nodes to maximize concurrent data processing.

 

= Implementation =

 

A query plan is not converted into MapReduce tasks for execution. Instead, it is processed by the serving tree itself.

 

In addition to the clever algorithms for fast conversion between the record-based and column-based storage models, there are various other optimizations. One example is a sampling-style option that lets a query end early and return partial results, trading some precision for speed.
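
The early-termination idea above can be sketched as a count query that stops once a chosen fraction of the tablets has been scanned. This is a sequential simulation under stated assumptions: the function name `count_with_quorum` and the parameter `min_fraction` are hypothetical, and a real system would skip whichever tablets respond slowest rather than the last ones in a list.

```python
import math
from typing import Callable


def count_with_quorum(
    tablets: list[list[int]],
    predicate: Callable[[int], bool],
    min_fraction: float,
) -> tuple[int, float]:
    """Count matching rows, stopping after `min_fraction` of tablets are scanned.

    Returns the (possibly partial) count and the fraction of tablets scanned.
    """
    if not tablets:
        return 0, 1.0
    needed = math.ceil(min_fraction * len(tablets))
    matched = 0
    scanned = 0
    for tablet in tablets:
        if scanned >= needed:
            break  # enough tablets answered; give up on the rest
        matched += sum(1 for row in tablet if predicate(row))
        scanned += 1
    return matched, scanned / len(tablets)


tablets = [[1, 2, 3], [4, 5], [6], [7, 8]]
partial = count_with_quorum(tablets, lambda x: x % 2 == 0, min_fraction=0.75)
full = count_with_quorum(tablets, lambda x: x % 2 == 0, min_fraction=1.0)
```

Returning the scanned fraction alongside the count lets the caller judge how much precision was given up for the faster answer.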

 

Dremel is suited to ad-hoc analysis over a subset of the fields of sparse data.

 

= Data =

 

The typical data in the paper's tests is between 20-TB, and a row (that is, a record) contains about 30-fields. When a query touches a single-digit number of fields, the speedup is between 3 and 10 times, which is roughly the reduction ratio in the amount of data read divided by 2 (for example, reading 1/10 of the fields takes about 1/5 of the time needed with a complete row-oriented storage structure).
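
The rule of thumb above can be written out as simple arithmetic. This is only a sketch of the note's own estimate, not a formula from the paper, and the function name `estimated_speedup` is hypothetical.

```python
def estimated_speedup(fields_read: int, total_fields: int) -> float:
    """Estimate the speedup over row storage as (data reduction ratio) / 2."""
    if fields_read <= 0 or total_fields < fields_read:
        raise ValueError("need 0 < fields_read <= total_fields")
    reduction = total_fields / fields_read  # e.g. 1/10 of the fields -> 10x less data
    return reduction / 2


speedup = estimated_speedup(fields_read=1, total_fields=10)  # 5.0, i.e. 1/5 of the time
```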

 

When the number of fields to be processed grows beyond a certain point, the speed comparison flips and the row-based MR application becomes faster (according to the paper, the crossover point depends on the structure and sparsity of the specific table, and usually falls at dozens of fields).

 

= Related Research, Projects, etc. =

 

Apache Drill:
http://incubator.apache.org/drill/ An open-source implementation of Dremel. It seems to be still at an early stage: many features are planned, but not much has been implemented.
Currently, the core code is about 2k lines of Java.

 

Open Dremel:
Merged into the Drill project.
