Tenth: Spark SQL Source Analysis: In-Memory Columnar Storage Query


/** Spark SQL Source Analysis series of articles */

As described in the previous article, Spark SQL's in-memory columnar storage lays data out column by column.

Building on that storage structure, this article examines how the cached data inside the JVM is read back, i.e., how in-memory columnar data is queried.

I. Primer

This example uses the Hive console to query the src table after it has been cached:

select value from src
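For reference, caching and re-querying in the Hive console looks roughly like the following. This is a sketch, assuming the Spark 1.x TestHive console where cacheTable and sql are in scope; the exact setup depends on your Spark version.

scala> cacheTable("src")                        // materialize src as in-memory columnar storage
scala> sql("select value from src").collect()   // this scan now reads the cached column buffers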

Once the src table is cached in memory and we query it again, we can observe the internal calls through the analyzed execution plan.

That is, parsing and analysis produce an InMemoryRelation node, and when the physical plan is finally executed, InMemoryColumnarTableScan is invoked against that node.

As follows:

scala> val exe = executePlan(sql("select value from src").queryExecution.analyzed)
14/09/... INFO parse.ParseDriver: Parsing command: select value from src
14/09/... INFO parse.ParseDriver: Parse Completed
exe: org.apache.spark.sql.hive.test.TestHive.QueryExecution =
== Parsed Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Analyzed Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Optimized Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Physical Plan ==
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None))  // the entry point for querying the in-memory table

Code Generation: false
== RDD ==

II. InMemoryColumnarTableScan

InMemoryColumnarTableScan is a leaf node in Catalyst. It carries the attributes to be queried and an InMemoryRelation, which encapsulates our cached in-memory columnar data structure. Executing this leaf node runs its execute() method against the in-memory data (source below):

1. At query time, it operates on each partition of the in-memory data structure encapsulated by the InMemoryRelation.
2. It obtains the attributes to request; as above, only the value attribute of the src table is requested.
3. From the query expressions, it computes the indexes of the requested columns within the storage structure.
4. Through a ColumnAccessor it reads each ByteBuffer, extracts the corresponding data, and wraps it into a Row object to return.

private[sql] case class InMemoryColumnarTableScan(
    attributes: Seq[Attribute],
    relation: InMemoryRelation)
  extends LeafNode {

  override def output: Seq[Attribute] = attributes

  override def execute() = {
    relation.cachedColumnBuffers.mapPartitions { iterator =>
      // Find the ordinals of the requested columns. If none are requested, use the first.
      val requestedColumns = if (attributes.isEmpty) {
        Seq(0)
      } else {
        // Find each requested column's index in relation.output by matching exprIds
        attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
      }

      iterator
        // Pick the ByteBuffer of each requested column by index and wrap it in a ColumnAccessor
        .map(batch => requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
        .flatMap { columnAccessors =>
          val nextRow = new GenericMutableRow(columnAccessors.length) // row length = number of requested columns
          new Iterator[Row] {
            override def next() = {
              var i = 0
              while (i < nextRow.length) {
                columnAccessors(i).extractTo(nextRow, i) // decode the value at the buffer's current position into the row
                i += 1
              }
              nextRow
            }

            override def hasNext = columnAccessors.head.hasNext
          }
        }
    }
  }
}

The requested columns can be inspected as follows:

scala> exe.optimizedPlan
res93: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

scala> val relation = exe.optimizedPlan(1)
relation: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

scala> val request_relation = exe.executedPlan
request_relation: org.apache.spark.sql.execution.SparkPlan =
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None))

scala> request_relation.output // the requested columns; we requested only the value column
res95: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)

scala> relation.output // all columns stored in the relation by default
res96: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(key#4, value#5)

scala> val attributes = request_relation.output
attributes: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)



The process is concise, and the key is step 3: the index of each requested column is located by its exprId:

attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))

// Find the corresponding index according to the exprId
scala> val attr_index = attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
attr_index: Seq[Int] = ArrayBuffer(1) // the requested column value has index 1, so its data is read from the ByteBuffer at index 1

scala> relation.output.foreach(e => println(e.exprId))
ExprId(4) // corresponds to [key#4,value#5]
ExprId(5)

scala> request_relation.output.foreach(e => println(e.exprId))
ExprId(5)

III. ColumnAccessor

ColumnAccessor has one concrete implementation per column type. (The class diagram from the original post is not reproduced here.)
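In place of the diagram, here is a simplified sketch of the hierarchy, based on the Spark 1.x columnar sources (abbreviated; the member lists are not complete):

import java.nio.ByteBuffer
import org.apache.spark.sql.catalyst.expressions.MutableRow

// The root trait: an accessor decodes one column out of its ByteBuffer.
private[sql] trait ColumnAccessor {
  def hasNext: Boolean
  def extractTo(row: MutableRow, ordinal: Int)
  protected def underlyingBuffer: ByteBuffer
}

// BasicColumnAccessor implements the trait for one ColumnType.
// NativeColumnAccessor extends it and mixes in NullableColumnAccessor and
// CompressibleColumnAccessor, so null handling and compression are uniform.
// One concrete accessor exists per column type, e.g.:
//   IntColumnAccessor, LongColumnAccessor, ShortColumnAccessor,
//   ByteColumnAccessor, DoubleColumnAccessor, FloatColumnAccessor,
//   BooleanColumnAccessor, StringColumnAccessor, BinaryColumnAccessor,
//   GenericColumnAccessor
// The ColumnAccessor companion object reads the type id at the head of the
// buffer and instantiates the matching concrete accessor.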

Finally, a new iterator is returned:

new Iterator[Row] {
  override def next() = {
    var i = 0
    while (i < nextRow.length) { // number of requested columns
      // calls columnType.setField(row, ordinal, extractSingle(buffer)) to decode the buffer
      columnAccessors(i).extractTo(nextRow, i)
      i += 1
    }
    nextRow // return the decoded row
  }

  override def hasNext = columnAccessors.head.hasNext
}

IV. Summary

Querying Spark SQL's in-memory columnar storage is relatively straightforward, and the query path follows directly from the stored data structure.

That is, at storage time each column is packed into its own ByteBuffer, and together these buffers form a ByteBuffer array.

At query time, the index into that array is found from the exprId of each requested column; the fields in the matching buffer are then decoded by a ColumnAccessor and finally wrapped into a Row object, which is returned. A toy end-to-end sketch of this idea follows.
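The sketch below is minimal, self-contained plain Scala, not Spark code: the column names, the pack helper, and the Int-only buffers are all invented for illustration, and a name lookup stands in for the exprId match.

import java.nio.ByteBuffer

object ColumnarQuerySketch {

  // Storage side: pack one column of Ints into its own ByteBuffer.
  private def pack(values: Seq[Int]): ByteBuffer = {
    val buf = ByteBuffer.allocate(4 * values.length)
    values.foreach(v => buf.putInt(v))
    buf.flip()
    buf
  }

  def main(args: Array[String]): Unit = {
    val columnNames = Seq("key", "value")                    // plays the role of relation.output
    val columnBuffers = Array(pack(Seq(1, 2, 3)),            // key column
                              pack(Seq(10, 20, 30)))         // value column

    // Step 3: resolve each requested column to its index in the buffer array.
    val requested = Seq("value")
    val requestedIndexes = requested.map(columnNames.indexOf(_))

    // Step 4: one "accessor" per requested buffer, then decode row by row.
    val accessors = requestedIndexes.map(i => columnBuffers(i).duplicate()).toArray
    val row = new Array[Int](accessors.length)               // mutable row, reused per next()
    while (accessors.head.hasRemaining) {
      var i = 0
      while (i < row.length) {
        row(i) = accessors(i).getInt()                       // stands in for extractTo(nextRow, i)
        i += 1
      }
      println(row.mkString("Row(", ", ", ")"))               // prints Row(10), Row(20), Row(30)
    }
  }
}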

--eof--

Original content; when reproducing, please attribute:

Reprinted from: OopsOutOfMemory, Shengli's blog

Link to this article: http://blog.csdn.net/oopsoom/article/details/39577419

Note: This article is licensed under the Attribution-NonCommercial-NoDerivatives 2.5 China (CC BY-NC-ND 2.5 CN) license. Reprinting, forwarding, and commenting are welcome, but please retain the author's attribution and the link to the article. Please contact me to negotiate commercial use or other licensing.
