[Reproduced] An Anatomy of Google's Big Data Engine Dremel (1)

Source: Internet
Author: User
Tags: FSM, repetition

Original: https://mp.weixin.qq.com/s?__biz=MjM5NzAyNTE0Ng==&mid=207895956&idx=1&sn=58e8af26fd3c6025acfa5bc679d2ab01

[Translator's note] Reading a classic foreign technical paper from beginning to end is something many technology enthusiasts have long wanted to do. The goal of this series is to meet that need by translating the text in full, giving readers a close look at the original paper. The paper contains some obscure academic passages, and some sections may not interest every reader, so the translator adds "pre-reading" and "translator's summary" notes to help you read selectively, or to summarize on your behalf. The paper also inevitably omits details of its derivations (Google's geniuses always assume we are as smart as they are), so a "translator's speculation" note is added where needed, in which the translator interprets and analyzes the more complex content according to his own understanding; this part is highly subjective and inevitably contains mistakes, and corrections from readers are welcome. All non-original content is displayed in blue text.

Without further ado, let's get straight to it!

"Translator Pre-reading" This is accompanied by the Dremel myth turned out of the original paper (do not know Dremel readers can immediately search to feel the power of Dremel). This paper deeply analyzes how Dremel uses ingenious data storage structure + distributed parallel computing to realize the myth of 1PB 3 seconds query.

The first parts of the paper are the Abstract, Introduction, and Background. They are mostly introductory text, and their core message is: for massive-scale data analysis and processing, the advantages of MR (MapReduce) need no elaboration, but its weakness is poor latency, which cannot satisfy interactive querying, such as completing a query over a trillion rows within 3 seconds. Dremel was born of this demand and has become an effective complement to MR.

Abstract

Dremel is a scalable, interactive ad-hoc query system for analyzing read-only nested data. By combining multi-level execution trees and a columnar data layout, it can run aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel and explain why it is a powerful complement to MapReduce-based computing. We present a columnar storage representation for nested records and discuss experiments on a system with thousands of nodes.

1. Introduction

Large-scale analytical data processing has become widespread in Internet companies and across the industry, especially now that inexpensive storage makes it possible to collect and keep massive amounts of business-critical data. Putting this data at the fingertips of analysts and engineers is becoming increasingly important; in data exploration, monitoring, online customer support, rapid prototyping, debugging of data pipelines, and many other tasks, interactive response times often make a qualitative difference.
Performing interactive data analysis at scale demands a high degree of parallelism. For example, reading 1 TB of compressed data in 1 second using ordinary hard disks would require tens of thousands of disks. Similarly, CPU-intensive queries may need to run on thousands of cores. At Google, a lot of parallel computing is done on shared clusters of commodity machines [5]. A cluster typically hosts a multitude of distributed applications that share resources, produce different workloads, and run on machines with different hardware configurations. An individual worker task of a distributed application may take much longer than other tasks, or may never complete due to failures or preemption by the cluster management system. Hence, dealing with stragglers and failures is essential for achieving fast execution and fault tolerance [10].
The data used on the Internet and in scientific computing is often non-relational. Hence, a flexible data model is essential in these domains. Data structures used in programming languages, messages exchanged between distributed systems, structured documents, and so on, lend themselves naturally to a nested representation. Normalizing and recombining such data at Internet scale is usually prohibitively expensive. A nested data model underlies most structured-data processing at Google [21] and is reportedly used at other Internet companies as well.
This paper describes a system called Dremel that supports interactive querying of very large datasets over shared clusters of commodity machines. Unlike traditional databases, it is capable of operating on in-situ nested data. "In situ" refers to the ability to access data in place, e.g., in a distributed file system (such as GFS [14]) or another storage layer (such as Bigtable [8]). Querying such data would ordinarily require a sequence of MapReduce (MR [12]) jobs, whereas Dremel can execute many such queries directly, at a fraction of the execution time. Dremel is not intended as a replacement for MR; rather, it is frequently used in conjunction with it to analyze the outputs of MR pipelines or to rapidly prototype larger computations.
Dremel has been in production since 2006 and has thousands of users within Google. Many instances of Dremel are deployed in the company, ranging from tens to thousands of nodes each. Examples of using the system include:

    • Analysis of crawled web documents

    • Tracking install data for applications on Android Market

    • Crash reporting for Google products

    • OCR results from Google Books

    • Spam analysis

    • Debugging of map tiles on Google Maps

    • Tablet migrations in managed Bigtable instances

    • Results of tests run on Google's distributed build system

    • Disk I/O statistics for hundreds of thousands of disks

    • Resource monitoring for jobs run in Google's data centers

    • Symbols and dependencies in Google's codebase

Dremel builds on ideas from web search and parallel DBMSs. First, its architecture borrows the concept of a serving tree used in distributed search engines [11]. Just like a web search request, a query gets pushed down the tree and is rewritten at each step. The result of the query is assembled by aggregating the replies received from the lower levels of the tree. Second, Dremel provides a high-level, SQL-like language to express ad-hoc queries. In contrast to layers such as Pig [18] and Hive [16], it executes queries natively without translating them into MR jobs.

Finally, and most importantly, Dremel uses a column-striped storage representation, which allows it to read less data from secondary storage and reduce CPU cost thanks to cheaper compression. Column stores have been adopted for analyzing relational data [1] but, to the best of our knowledge, have not been extended to nested data models. The columnar storage format we present is supported by many data processing tools at Google, including MR, Sawzall [20], and FlumeJava [7].

In this paper, we make the following contributions:

    • We describe a novel columnar storage format for nested data, and present algorithms for dissecting nested records into column structures and reassembling them at query time (Section 4).

    • We outline Dremel's query language and execution model. Both are custom-designed to operate efficiently on column-striped nested data and do not require loading the original nested records (Section 5).

    • We show how the tree-structured execution used in web search systems can be applied to database processing, and explain its benefits for answering aggregation queries efficiently (Section 6).

    • We present experiments on trillion-record, multi-terabyte datasets, conducted on system instances with 1000-4000 nodes (Section 7).

This paper is structured as follows. Section 2 explains how Dremel is used for data analysis in combination with other data management tools. Its data model is presented in Section 3. The main contributions listed above are covered in Sections 4-8. Related work is discussed in Section 9. Section 10 concludes.

2. Background

As a starting point, let's look at a scenario that illustrates the need for interactive query processing and where it sits in the data management ecosystem. Suppose a Google engineer, Alice, comes up with a novel idea for extracting new kinds of signals from web pages. She runs an MR job that analyzes the input data and produces a dataset containing the new signals, stored as billions of records in a distributed file system. To analyze the results of her experiment, she launches Dremel and executes a few interactive commands:

DEFINE TABLE T AS /path/to/data/*
SELECT TOP(Signal1, 100), COUNT(*) FROM T

Her commands execute within seconds. She runs a few more queries to convince herself that her algorithm works. She finds an irregularity in Signal1 and writes a FlumeJava [7] program that performs a more complex analytical computation. Once the issue is fixed, she sets up a pipeline that processes the incoming data continuously. Then she writes a few SQL queries that aggregate the pipeline's output across various dimensions and adds them to an interactive dashboard, where other engineers can locate and query them very quickly.
The above scenario requires interoperation between the query processor and other data management tools. The first ingredient is a common storage layer. The Google File System (GFS [14]) is one such distributed storage layer widely used in the company. GFS uses redundant replication to protect data against hard-disk failures and to achieve fast response times even in the presence of stragglers. A high-performance storage layer is very important for in-situ data management: it allows accessing data without consuming too much time in a loading phase, a requirement that is one reason databases are seldom used for analytical data processing [13]. Another benefit is that data in a file system can be conveniently manipulated with standard tools, e.g., to migrate it to another cluster, change access permissions, or identify a subset of the data by file names.

The second ingredient for building interoperable data management components is a shared storage format. Columnar storage has proven itself for flat relational data, but making it work for Google required adapting it to a nested data model. Figure 1 illustrates the main idea: all values of a nested field such as A.B.C are stored contiguously. Hence, when A.B.C is read, there is no need to read A.E, A.B.D, and so on. The challenge is to preserve all structural information and to be able to reconstruct records from an arbitrary subset of the fields. Next we discuss the data model, and then turn to algorithms and query processing.

[Translator's summary] The first few sections really revolve around the nested data and columnar storage format (columnar representation of nested data) of Figure 1. This is the core theory behind Dremel's performance. The author does not especially emphasize the figure, but the columnar layout on the right of Figure 1 is in fact the key to conquering this paper.

3. Data Model

"Translator pre-reading" experienced programmers know that the first step in understanding a system is to understand its data model, so this chapter can be called one of the core parts of the paper. Its mathematical formula for the general coder is not very intuitive, but it is not complex, as described in 2, the structure, in essence, and JSON, XML description of the data structure is not different, is a nested, customized data structure. It is essential to understand several nouns and basics that will be used frequently in the following chapters. such as record, field, column, and so on. A record is a complete set of nested data, if a record is a row of data in db. Fields and columns refer to the same concept in most cases, more than 2 in name, language, and so on, which are a field in the structure, and are stored in the future as Columns (column). For example, a crawler in Google's Web page (document) data is a record, and the structure of its forward link, url is a field (or column). The so-called columnar storage is the fact that the original records are segmented by fields, and the data for each field is stored independently (such as storing the values of the Name.url column in all records). It is also important to note that the type of the field, each of which is of a certain type, such as required, indicates that there is only one value; optional, which represents an optional, 0 to 1 value, repeated (*), represents a repetition, 0 to n values, and so on. The repeated and optional types are very important, and the authors will abstract some important concepts from them so that the original data can be described without compromising at the least cost. Finally, we need to add two terms, one is column-stripe, which represents a list of column values (column "Bars", one row of data stored sequentially in a column) on the right side of Figure 1, and the other is a widely used path expression in the paper. XXX.XXX.XXX, which acts like an XPath in XML, such as Name.Language.Code, represents the Code field in Figure 2, because in a tree structure, such a path can accurately describe its location.

In this section we describe Dremel's data model and introduce some terminology used later. This data model originated in the context of distributed systems ("Protocol Buffers" [21]); it is widely used at Google and has an open-source implementation. The model is based on strongly typed nested records. Its abstract syntax is:

τ = dom | <A1 : τ[*|?], ..., An : τ[*|?]>

where τ is either an atomic type in dom or a record type. Atomic types (an int, a string, ..., such as DocId) comprise integers, floating-point numbers, strings, etc.; record types (such as Name) point to a nested sub-structure. Records consist of one or more fields. Field i in a record has name Ai and an optional multiplicity label, (?) or (*). A repeated field (*) may occur multiple times in a record; it is interpreted as a list of values, so the order in which the field occurs in a record is significant. An optional field (?) may be missing from the record. A field that is neither repeated nor optional is required: it must appear in the record, and exactly once.
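[Translator's note] As a worked instance of this abstract syntax (the rendering below is the translator's own, based on the Document schema of Figure 2 that is described next):

Document = <DocId : int,
            Links : <Backward : int*, Forward : int*>?,
            Name  : <Language : <Code : string, Country : string?>*,
                     Url : string?>*>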

Figure 2 gives an example. It depicts a schema named Document, which represents a web page. The schema definition uses the concrete syntax of [21]. A Document has a required integer DocId and optional Links, containing a list of Forward and Backward entries holding the DocIds of other pages. A document can have multiple Names, which are the different URLs by which the page can be referenced. A Name contains a sequence of Code and (optional) Country pairs, grouped as Language. Figure 2 also shows two sample records, r1 and r2, conforming to the schema. We will use these sample records to explain the algorithms in the next sections. The fields defined in the schema form a tree hierarchy. The full path of a nested field is denoted using the usual dotted notation, e.g., Name.Language.Code.
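[Translator's note] Since Figure 2 itself is not reproduced on this page, here are the schema and the two sample records it shows, copied from the original paper (the schema uses the concrete syntax of [21]):

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; }
    optional string Url; }}

r1:
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

r2:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
Name
  Url: 'http://C'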
The nested data model backs a platform-neutral, extensible mechanism for serializing structured data at Google. Code generation tools produce bindings for programming languages such as C++ and Java. Cross-language interoperability is achieved via a standard binary on-the-wire representation of records, in which field values are laid out sequentially, in the order they appear in the record. This way, an MR program written in Java can consume records produced by a data source exposed through a C++ library. Thus, if records are stored in a columnar representation, assembling them fast is important for interoperation with MR and other data processing tools.

4. Nested Columnar Storage

As illustrated in Figure 1, our goal is to store all values of a given field consecutively in order to improve retrieval efficiency. In this chapter, we address the following challenges: lossless representation of record structure in a columnar format (Section 4.1), fast encoding (Section 4.2), and efficient record assembly (Section 4.3).

4.1 Repetition Depth and Definition Depth

Field values alone do not convey the structure of a record. Given two values of a repeated field, we do not know at what "depth" the repetition occurred (e.g., whether the two values came from two different records, or are two repeated values within the same record). Likewise, given a missing optional field, we do not know which enclosing fields along its path were explicitly defined. We therefore introduce the concepts of repetition depth and definition depth (the "repetition levels" and "definition levels" of the original paper). Figure 3 summarizes the repetition and definition depths for all atomic fields in our sample records, for reference.

"Translator note" Readers please re-examine the 11 right of the columnar storage structure, which is the goal of Dremel, it is to change the nested structure of document in Figure 2 into a columnar storage structure. There are a variety of ways to achieve this, and in this section Dremel confidently introduces its most optimized, cost-effective, and efficient approach to design, and introduces two new concepts, repeating depth and defining depth. Because Dremel will fragment the records, and then the column of their own centralized storage, this will inevitably lead to data distortion, than 2, we put R1 and R2 URL column values together to get ["Http://A", "Http://B", "Http://C"], How do you know which record they belong to, which name belongs to the record ... The two depth concepts presented here are actually designed to solve this distortion problem and to achieve lossless expression.

"Translator yy" in the translation of the above paragraph of the translator is very abrupt, the original author tried to pose a problem to trigger the reader's thinking (only a naked field value how to figure out how it belongs to the record and structure), but people like me, read here in the mind is a few more superficial questions. Although it is mentioned in the above paper that "columnar storage has proved that it is suitable for flat relational data", "Dremel want to store all values sequentially in a field to improve retrieval efficiency", but it is a piece of tape, without detailing why this can improve retrieval efficiency? There are a variety of ways to improve the efficiency of search, is it not the only, best way to do so? Dremel How did the author come to think of this approach (don't tell me it's a one go?) The reason why the author omits it should be that there are other papers that have proved and deduced the advantages and the birth process of columnar storage. But in this article directly facing the details of the issue on the table, leading to two unfamiliar depth concept, could not help but a little abrupt, confusing. Here will first retain these confusion, literal translation, the end of this section of the translator yy link, the translator will try to have the same confused with the reader, yy a bit of the mystery.

Repetition depth. Consider the Code field in Figure 2. It occurs three times in r1: 'en-us' and 'en' appear in the first Name, and 'en-gb' in the third Name. Looking at Figure 2 you can certainly make sense of that sentence and know exactly where 'en-us', 'en', and 'en-gb' occur within r1. But without the figure, how do we express their positions in words, with a definition, an attribute, a value? That is what the repetition depth does: it tells us at which repeated field in the path the value has repeated, and that determines the value's position (note that "repetition" here specifically means another occurrence under a field of repeated type). We use depth 0 to denote the beginning of a record (a virtual root node), and the depth computation ignores non-repeated fields (fields not labeled repeated do not count toward the depth). So the path Name.Language.Code contains two repeated fields, Name and Language: if the repetition happens at Name, the repetition depth is 1 (the virtual root is 0, the next repeated level is 1); if at Language, it is 2. It can never repeat at Code, which is of required type and thus occurs exactly once. Similarly, in the path Links.Forward, Links is optional and does not participate in the depth computation (it cannot repeat), while Forward is repeated, so when Forward repeats the repetition depth is 1. Now let us scan record r1 from the top down. When we encounter 'en-us', we have seen no repeated field on its path yet, so the repetition depth is 0. When we encounter 'en', the field Language has repeated (a Language already occurred on the path of 'en-us'), so the repetition depth is 2. Finally, when we encounter 'en-gb', Name has repeated (Name already occurred on the paths of 'en-us' and 'en', while the Language preceding this value occurs only once in this Name, with no repetition), so the repetition depth is 1. Thus, the repetition depths of the Code values in r1 are 0, 2, and 1.

[Translator's note] The depth of a tree is easy to understand: the root is 0, the next level is 1, the level below that is 2, and so on. Repetition depth is different: it skips all fields of non-repeated type, i.e., only repeated fields count as a level of depth. The reason is that, with the schema known, repetition depths computed over only the repeated fields carry enough information (enough for the splitting and assembly algorithms below); there is no need to compute depth values along the full schema tree.

Note that the second Name in r1 does not contain any Code value. To determine that 'en-gb' occurs in the third Name and not the second, we add a NULL value between 'en' and 'en-gb' (Figure 3). Code is a required field inside Language, so its absence implies that Language is not defined either. In general, determining which fields along a path are explicitly defined requires some extra information, namely the definition depth described next.

Definition depth. Each value of a field with path p, and in particular every NULL, has a definition depth that says how many fields in p that could be undefined (because they are optional or repeated) are actually present in the record. For example, we see that r1 has no Backward links, yet the Links field itself is defined (at depth 1). To preserve this information, we add a NULL value with definition depth 1 to the Links.Backward column. Similarly, the missing Name.Language.Country in r2 carries definition depth 1, while its missing occurrences in r1 carry depths 2 (inside Name.Language, next to 'en') and 1 (inside the second Name, next to 'http://B'), respectively.

Using integers for definition depths instead of simple is-null bits ensures that the data of a leaf field (for example, Name.Language.Country) contains enough information about the occurrences of its parent fields; Section 4.3 gives concrete examples of how this information is used.
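[Translator's note] Since Figure 3 itself is not reproduced on this page, here are its column-stripes, copied from the original paper; each atomic field of r1 and r2 becomes a list of (value, repetition depth r, definition depth d) entries:

DocId            Links.Backward      Links.Forward
value  r  d      value  r  d         value  r  d
10     0  0      NULL   0  1         20     0  2
20     0  0      10     0  2         40     1  2
                 30     1  2         60     1  2
                                     80     0  2

Name.Url          Name.Language.Code    Name.Language.Country
value     r  d    value  r  d           value  r  d
http://A  0  2    en-us  0  2           us     0  3
http://B  1  2    en     2  2           NULL   2  2
NULL      1  1    NULL   1  1           NULL   1  1
http://C  0  2    en-gb  1  2           gb     1  3
                  NULL   0  1           NULL   0  1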

[Translator's note] Definition depth is, in a sense, in service of repetition depth. In fact there is an important principle in the paper that is not stated prominently, only hinted at with words like "sequentially" and "contiguously": in the columnar layout on the right of Figure 1, every column stores r1 first and then r2; that is, the order in which records are stored is the same for all columns. This order acts like a shared primary key implicit in every column value; it logically strings the dismembered column values back together so that we know they belong to the same record, and it is an essential means of keeping records lossless after splitting. Since the order is indispensable and must not be distorted, a record whose value for some column is empty cannot simply be skipped: an explicit NULL must be stored to keep the record order valid. A NULL by itself, however, does not carry enough information. For example, a record where the Name.Language.Country column is empty might mean that Country has no value (as next to 'en'), or that Language itself has no value (as in the Name holding 'http://B'). The assembly algorithm must distinguish these cases without distortion, which is why the definition depth is drawn out: it describes this information precisely.

The encoding described above expresses the record structure losslessly. This is fairly easy to accept intuitively, and the paper does not go into the proof, so we will not expand on it here.

Encoding. Each column is stored as a set of blocks. Each block contains the repetition and definition depths (hereafter simply "depths") and the field values. NULLs are not stored explicitly, since they can be recognized from the definition depths: any definition depth smaller than the number of repeated and optional fields in the field's path implies a NULL. Values of required fields do not need a stored definition depth. Similarly, repetition depths are stored only where necessary; for example, a definition depth of 0 implies a repetition depth of 0, so the latter can be omitted. In fact, in Figure 3, no depths at all are stored for DocId. Depths are packed as bit sequences, using only as many bits as necessary; for example, if the maximum definition depth is 3, we only need 2 bits.
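[Translator's note] A minimal sketch of the bit-width calculation just mentioned (Python; the function name and framing are the translator's own, not the paper's):

import math

# Bits needed to store depth values 0..max_depth; 0 means the depth is never stored.
def level_bits(max_depth: int) -> int:
    return math.ceil(math.log2(max_depth + 1)) if max_depth > 0 else 0

print(level_bits(3))  # Name.Language.Country: max d = 3 -> 2 bits per value
print(level_bits(0))  # DocId: max r = max d = 0 -> nothing stored at all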

"Translator Summary" here also mentions the block and other concepts, in fact, the paper should be in short--in Figure 3, the many similar to the "table" structure (looks like a table, temporarily called "table", harmless), a "table" is a column-stripe, is the Block collection, each row in the table is a block, which is the columnar storage shown on the right side of Figure 1. Physically like a separate "table", and logically can be shown in Figure 1, the right side of the tree-like structure. In the later section of the Assembly state machine algorithm, the reader can have a deeper understanding of this

"Translator yy" after reading the original chapter, here yy the above mentioned all the confusion

We all know that no technical solution is utopian; it must derive from optimizations driven by concrete pain points. The translator attempts to reconstruct Dremel's derivation process:

Step 1. First, disregarding performance, optimization, and distributed concerns, if we just want the functionality, the most direct way is record-oriented storage: for example, a document grabbed by the crawler (r1 or r2 of Figure 2) is stored directly in a GFS file; at query time the necessary files are read, parsed into structured data, and queried. This certainly works, but we would not do it, because it has a crippling disadvantage: if I only need the Name.Url information of r1, I still have to read the entire record, and the useless data far exceeds the useful data, which is a huge waste of performance. (This is, in fact, exactly the disadvantage of record-oriented storage that the paper emphasizes.)

Step 2. The failure of the first step is that the data is unstructured at storage time (unstructured in the sense that the whole blob must be read and parsed into structured data in memory). The optimization goal now is to make the data structured on the storage medium itself (so that only the needed data can be read, by structure). Without thinking too far ahead, the most classic structured storage is the familiar relational database: its tables, columns, rows, and associations suffice to store data according to its structure, and losslessly at that. For nested data, relational databases even have a standard recipe for table design (essentially a series of one-to-many table structures). Taking the path Name.Language.Country as an example: three tables, Name, Language, and Country; each contains its own required fields, plus a foreign key to its parent table expressing the one-to-many association (the Country table contains language_id, the Language table contains name_id). This old-fashioned design actually achieves one of Dremel's key goals: reading only the necessary columns. To compute statistics over Country, one only traverses the Country table; to also aggregate by the Language field, one runs a Language+Country two-table join; no time is wasted (note that Dremel's query processing cannot escape traversing a column-stripe either, which corresponds to traversing one table here).

This second step of speculation sounds a bit unreliable, but it is not irrelevant: if we ignore generality, ignore how unpleasant it is to build nested structures this way (in fact, translating a nested structure into relational tables with a dynamic schema is not hard), and ignore coolness, why not do it? The answer is still no, for two reasons. First, the Language+Country example is too simple. For a Name+Country aggregation (say, counting the Names whose Country is "xxx"), the problem is obvious: besides traversing the Name and Country tables, we must also consult the Language table (Country only links to Language, so a three-table join Name+Language+Country is needed to get the result), which violates Dremel's goal of touching only the required tables. Changing the table design can fix this particular case, by adding a foreign key to Name in the Country table. Now push toward the extreme: what if there is a layer above Name? Another layer below Country? Might any pair of layers be joined? Eventually you find that, in this direction, every table needs foreign keys to all of its ancestors' tables. Moreover, as mentioned above, Dremel avoids distortion by relying on storage order, where the order serves as the record's primary key that every column value implicitly carries; in this relational scheme, every table would instead have to store an explicit record_id foreign key. That alone is hard to look at (a great deal of space is wasted on redundant foreign-key storage). The second reason is simpler: even with a dynamic schema and dynamically managed table structures, the approach is not general enough, is hard to extend, and is unsuitable for a general-purpose data analysis platform.

Step 3. Faced with the dilemmas of the second step, we find there are really two puzzles: first, how to eliminate the potentially endless foreign keys on each table; second, how to make the solution sufficiently general. Looking back at Dremel's final design, you may find that it makes exactly two genius improvements over the second step. First, the "three musketeers" of repetition depth + definition depth + storage order replace all the foreign keys. Second, the table design no longer distinguishes required, repeated, and other types; indiscriminately, every field is designed as three columns: field value + repetition depth + definition depth. For the first improvement, I can only say that the three musketeers are a true artifact: they suffice to establish the association between any two "tables" (see the state machine algorithm in Section 4.3 below), replacing complex foreign keys. The second improvement is a compromise in support of a common structure and common algorithms: as the paper notes more than once, not every field needs both depths, e.g., the r and d of DocId are always 0 (if you designed tables in a relational database, DocId would just be a column of some table rather than become a table of its own), and for any field that is never NULL the definition depths are all equal. Some of this waste is the price of generality, but the problem is small: as long as it is not wasted in storage, one can avoid wasting effort at computation time (the Encoding passage above mentions how this is economized).

The three-step derivation above may seem loose, but it is logically tight, representing how an ordinary coder like the translator would keep reflecting and optimizing toward a goal, with no leaps of thought, except that the two improvements in Step 3 are beyond the translator's level of speculation. I would guess that Google's geniuses arrived at the solution along two routes. One is to abstract the data analysis and computation process at a high level and build a mathematical model to assist the derivation (the advantage of a math problem is that, in most cases, it is solvable). The other is the idea of avoiding the storage of record_id and complex foreign-key associations by pushing "order" to the extreme: within each column-stripe the records lie in a fixed order, and within each record the fields can also be kept in a fixed top-down order; from sequential storage and sequential traversal one can deduce the rough shape of the state machine algorithm of Section 4.3, leaving one last problem: given only a field value, with no foreign key (association information), knowing only that it and the other field values are stored in strict order, how do we know which record it belongs to and where exactly it sits within that record? In answer to this question, the concepts of repetition depth and definition depth were derived (at the beginning of Section 4.1, the author poses precisely this last difficulty to draw the reader into the play: "Field values alone do not convey the structure of a record... do these values come from two different records, or are they two repeated values within the same record?"). But for a step-by-step reader like the translator, who cannot accept leaps of thought, one still wishes the paper had described the derivation leading up to that final problem: why split by columns, why dismember the records so thoroughly, and why choose this particular lossless representation among the many available. Hence the speculation above, offered only for the reader's reference and correction. Incidentally, the paper's remark that "columnar storage has proven itself for flat relational data" is why the translator's speculation starts from relational databases.

4.2 Splitting Records into Columns

Above we showed how record structure can be expressed and encoded in a columnar format. The next challenge is how to produce column-stripes with repetition and definition depths efficiently. The base algorithm for computing these depths is given in Appendix A of the paper. It recurses over the record structure and computes the depths of each column value, including NULLs. At Google, it is common to have schemas with thousands of fields of which only a hundred or so are used by a given record, so missing fields must be handled as cheaply as possible. To produce column-stripes, we create a tree of field writers whose structure matches the field hierarchy in the schema. The basic idea is to update a field writer only when it has data of its own, without trying to propagate parent state down the tree unless absolutely necessary. Child writers inherit the depth values of their parents; whenever a value is added, a child writer synchronizes its depth values with its parent's.
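[Translator's note] The paper's actual procedure is in Appendix A; below is a simplified, runnable sketch of the same idea in Python. The dict-based schema/record encoding is the translator's own, and the field-writer laziness for missing fields is omitted: a missing subtree here simply emits one NULL per atomic field beneath it.

from collections import defaultdict

# Schema encoding: field -> (label, children); children is a dict of subfields,
# or None for an atomic field. This mirrors the Document schema of Figure 2.
DOCUMENT = {
    "DocId": ("required", None),
    "Links": ("optional", {"Backward": ("repeated", None),
                           "Forward":  ("repeated", None)}),
    "Name": ("repeated", {"Language": ("repeated", {"Code":    ("required", None),
                                                    "Country": ("optional", None)}),
                          "Url": ("optional", None)}),
}

r1 = {"DocId": 10,
      "Links": {"Forward": [20, 40, 60]},
      "Name": [{"Language": [{"Code": "en-us", "Country": "us"},
                             {"Code": "en"}],
                "Url": "http://A"},
               {"Url": "http://B"},
               {"Language": [{"Code": "en-gb", "Country": "gb"}]}]}

r2 = {"DocId": 20,
      "Links": {"Backward": [10, 30], "Forward": [80]},
      "Name": [{"Url": "http://C"}]}

# A missing subtree yields one NULL (None) per atomic field below it.
def emit_nulls(path, children, out, r, d):
    if children is None:
        out[path].append((None, r, d))
    else:
        for name, (_, sub) in children.items():
            emit_nulls(path + "." + name, sub, out, r, d)

# Emit (value, repetition depth, definition depth) triples for every column.
def dissect(record, schema, out, prefix="", r=0, d=0, reps=0):
    for name, (label, children) in schema.items():
        path = prefix + name
        raw = record.get(name)
        occurrences = list(raw or []) if label == "repeated" else \
                      ([] if raw is None else [raw])
        field_reps = reps + (label == "repeated")  # repeated fields on this path
        field_d = d + (label != "required")        # non-required fields holding a value
        if not occurrences:
            emit_nulls(path, children, out, r, d)
            continue
        for i, value in enumerate(occurrences):
            occ_r = r if i == 0 else field_reps    # later occurrences repeat at this level
            if children is None:
                out[path].append((value, occ_r, field_d))
            else:
                dissect(value, children, out, path + ".", occ_r, field_d, field_reps)

columns = defaultdict(list)
for record in (r1, r2):
    dissect(record, DOCUMENT, columns)  # every record starts over at r = 0
print(columns["Name.Language.Code"])
# -> [('en-us', 0, 2), ('en', 2, 2), (None, 1, 1), ('en-gb', 1, 2), (None, 0, 1)]

Run over r1 and r2, this reproduces exactly the column-stripes of Figure 3 shown above.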

4.3 Record Assembly

"Translator pre-reading" When traversing Column-stripe, is in front of the naked field values (such as ' en ') and two int (repeat depth, definition depth), without any associated information, how to know which record it belongs to? Where is it in the record? This is the problem that this section of the state machine algorithm solves. The translator believes that the core of this algorithm is the word "order", in the absence of any associated information, the record is stored in the order is the primary key, in the record by the upper and lower field order is the position, and two int is the only clue of the Order of judgment.

Assembling records efficiently from columnar data is critical. Given a subset of the fields, our goal is to reconstruct the original records as if they contained only the selected fields, with all other fields stripped away. The key idea: we create a finite state machine (FSM) that reads field values together with their depths and appends the values sequentially to the output record. Each FSM state corresponds to the reader of one selected field. State transitions are driven by repetition depths: once a reader fetches a value, we look at the repetition depth of the next value to decide how the state changes and which reader to jump to. The FSM is traversed from start to finish once for each assembled record.

Figure 4 shows, using Document as an example, the FSM that reassembles complete records. The start state is DocId. Once a DocId value is read, the FSM transitions to Links.Backward. After all repeated Backward values have been drained, the FSM jumps to Links.Forward, and so on. The details of the record assembly algorithm are in Appendix B of the paper.

"Translator note" due to the existence of Appendix B (the original paper on the core algorithm is attached to the source code and interpretation, can be consulted in the original text), the introduction of the state jump is too simple, so a little to add. First of all, to determine the 3 ideas: first, all the data are stored in the form of a "table" similar to the one in Figure 3, the second, the algorithm will be combined with the schema, in a certain order to read some "table" (not all, such as only statistical forward that will only read this "table"), The order is not fixed, which is the process of state change in the state machine; Thirdly, regardless of the order, it is continuously circulating in the order of record (for example, the current data is stored in order r1,r2,r3 ...). That will go into the first loop to read and assemble the R1, and the second loop to assemble the R2 ... ), a loop is the life cycle of a state machine from start to finish.

With these three ideas in mind, you can see that the scanning process must constantly do one crucial thing: upon scanning a row of some "table", decide whether this row belongs to the next record. If it does, then in order to keep filling the current record we must jump to the next "table" and continue scanning another field's values; otherwise we assemble this row's value into the current record, and so on, until we need to jump out of the last "table", at which point one pass of the loop ends (one state machine run ends, one record has been assembled, and the next pass begins). Understanding this makes it clear why a state machine is used to implement the algorithm: the loop is a process of continual state decisions. Thinking a little further, the decision is not simply "does it belong to the next record"; for descendants of repeated fields, one must also decide whether the value belongs to the next occurrence of some repeated ancestor within the same record, and at which level. As an example:

Suppose we are currently assembling some Language of some Name in r1, and we scan a row of Name.Language.Country. If this row's repetition depth is 0, it belongs to the next record, which also tells us that Language no longer repeats under the current Name (all Languages of the current Name have been assembled), so we jump to Name.Url to continue assembling the current record's other properties. If it is 1, the row belongs to the next Name of r1, which again tells us that Language no longer repeats under the current Name, so we likewise jump to Name.Url. If it is 2, the row belongs to the next Language of the current Name (the current Name's Languages are not yet exhausted), so we take a small inner loop and jump back to Name.Language.Code to assemble the current Name's next Language.

More examples could be given, but what matters is abstracting the essence of the state transitions from them; the following paragraph is the paper's brief description of that essence.

The FSM's transition logic can be expressed as follows: let r be the next repetition depth returned by the current field reader for field f. In the schema tree, we find f's ancestor at repetition depth r and select the first leaf field n of that ancestor. This gives us an FSM transition (f, r) -> n. For example, let r = 1 be the next repetition depth read by f = Name.Language.Country. Its ancestor with repetition depth 1 is Name, whose first leaf field is n = Name.Url. The details of the FSM construction algorithm are in Appendix C of the paper.

If only a subset of the fields needs to be retrieved, the FSM is simpler. Figure 5 depicts an FSM for reading the fields DocId and Name.Language.Country; the figure also shows the output records s1 and s2 produced by the automaton. Note that our encoding and assembly algorithms preserve the enclosing structure of the field Country. This is important for applications that need to access, say, the Country appearing in the first Language of the second Name; in XPath this corresponds to the expression /Name[2]/Language[1]/Country.
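[Translator's note] Below is a minimal, runnable Python sketch of the two-field automaton of Figure 5, built with the transition rule quoted above (all names are the translator's own). It only traces values in record order; the paper's Appendix B additionally uses the definition depths to materialize the enclosing, possibly empty, groups (such as the second Name in s1).

# Column-stripes for DocId and Name.Language.Country: the (value, r, d) triples
# of Figure 3 (also produced by the dissect() sketch above).
DOCID   = [(10, 0, 0), (20, 0, 0)]
COUNTRY = [("us", 0, 3), (None, 2, 2), (None, 1, 1), ("gb", 1, 3), (None, 0, 1)]

# Sequential reader over one column-stripe.
class Reader:
    def __init__(self, stripe):
        self.stripe, self.pos = stripe, 0
    def read(self):
        triple = self.stripe[self.pos]
        self.pos += 1
        return triple
    def next_r(self):
        # Repetition depth of the next value; 0 at end of stripe (record boundary).
        return self.stripe[self.pos][1] if self.pos < len(self.stripe) else 0

readers = {"DocId": Reader(DOCID), "Name.Language.Country": Reader(COUNTRY)}

# Transition table (field, next r) -> next field, per the rule above: jump to the
# first selected leaf under the field's ancestor at repetition depth r.
FSM = {
    ("DocId", 0): "Name.Language.Country",                  # advance to the next field
    ("Name.Language.Country", 2): "Name.Language.Country",  # next Language, same Name
    ("Name.Language.Country", 1): "Name.Language.Country",  # next Name, same record
    ("Name.Language.Country", 0): None,                     # record complete
}

record_no = 0
while readers["DocId"].pos < len(DOCID):   # DocId is required: one entry per record
    record_no += 1
    print("record %d:" % record_no)
    state = "DocId"
    while state is not None:
        value, r, d = readers[state].read()
        print("  %s = %r  (r=%d, d=%d)" % (state, value, r, d))
        state = FSM[(state, readers[state].next_r())]

Running this prints DocId 10 followed by the four Country entries of r1 ('us', two NULLs, 'gb'), then DocId 20 with one NULL Country: precisely the content of the output records s1 and s2 of Figure 5.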

(To be continued...)
