Tenth: Spark SQL Source Analysis: In-Memory Columnar Storage Query


/** Spark SQL Source Analysis series of articles */

As described in the previous article, Spark SQL's in-memory columnar storage lays data out column by column.

Building on that storage structure, this article examines how the cached data inside the JVM is read back, i.e., how in-memory columnar data is queried.

I. Primer

This example uses the Hive console to query the src table after it has been cached:

select value from src
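For reference, caching and re-querying in the Hive console looks roughly like the following. This is a sketch, assuming the Spark 1.x TestHive console where cacheTable and sql are in scope; the exact setup depends on your Spark version.

scala> cacheTable("src")                        // materialize src as in-memory columnar storage
scala> sql("select value from src").collect()   // this scan now reads the cached column buffers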

Once the src table is cached in memory and we query it again, we can observe the internal calls through the analyzed execution plan.

That is, parsing and analysis produce an InMemoryRelation node, and when the physical plan is finally executed, InMemoryColumnarTableScan is invoked against that node.

As follows:

scala> val exe = executePlan(sql("select value from src").queryExecution.analyzed)
14/09/... INFO parse.ParseDriver: Parsing command: select value from src
14/09/... INFO parse.ParseDriver: Parse Completed
exe: org.apache.spark.sql.hive.test.TestHive.QueryExecution =
== Parsed Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Analyzed Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Optimized Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Physical Plan ==
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None))  // the entry point for querying the in-memory table

Code Generation: false
== RDD ==

II. InMemoryColumnarTableScan

InMemoryColumnarTableScan is a leaf node in Catalyst. It carries the attributes to be queried and an InMemoryRelation, which encapsulates our cached in-memory columnar data structure. Executing this leaf node runs its execute() method against the in-memory data (source below):

1. At query time, it operates on each partition of the in-memory data structure encapsulated by the InMemoryRelation.
2. It obtains the attributes to request; as above, only the value attribute of the src table is requested.
3. From the query expressions, it computes the indexes of the requested columns within the storage structure.
4. Through a ColumnAccessor it reads each ByteBuffer, extracts the corresponding data, and wraps it into a Row object to return.

private[sql] case class InMemoryColumnarTableScan(
    attributes: Seq[Attribute],
    relation: InMemoryRelation)
  extends LeafNode {

  override def output: Seq[Attribute] = attributes

  override def execute() = {
    relation.cachedColumnBuffers.mapPartitions { iterator =>
      // Find the ordinals of the requested columns. If none are requested, use the first.
      val requestedColumns = if (attributes.isEmpty) {
        Seq(0)
      } else {
        // Find each requested column's index in relation.output by matching exprIds
        attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
      }

      iterator
        // Pick the ByteBuffer of each requested column by index and wrap it in a ColumnAccessor
        .map(batch => requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
        .flatMap { columnAccessors =>
          val nextRow = new GenericMutableRow(columnAccessors.length) // row length = number of requested columns
          new Iterator[Row] {
            override def next() = {
              var i = 0
              while (i < nextRow.length) {
                columnAccessors(i).extractTo(nextRow, i) // decode the value at the buffer's current position into the row
                i += 1
              }
              nextRow
            }

            override def hasNext = columnAccessors.head.hasNext
          }
        }
    }
  }
}

The requested columns can be inspected as follows:

scala> exe.optimizedPlan
res93: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

scala> val relation = exe.optimizedPlan(1)
relation: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

scala> val request_relation = exe.executedPlan
request_relation: org.apache.spark.sql.execution.SparkPlan =
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None))

scala> request_relation.output // the requested columns; we requested only the value column
res95: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)

scala> relation.output // all columns stored in the relation by default
res96: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(key#4, value#5)

scala> val attributes = request_relation.output
attributes: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)



The process is concise, and the key is step 3: the index of each requested column is located by its exprId:

attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))

// Find the corresponding index according to the exprId
scala> val attr_index = attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
attr_index: Seq[Int] = ArrayBuffer(1) // the requested column value has index 1, so its data is read from the ByteBuffer at index 1

scala> relation.output.foreach(e => println(e.exprId))
ExprId(4) // corresponds to [key#4,value#5]
ExprId(5)

scala> request_relation.output.foreach(e => println(e.exprId))
ExprId(5)

III. ColumnAccessor

ColumnAccessor has one concrete implementation per column type. (The class diagram from the original post is not reproduced here.)
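In place of the diagram, here is a simplified sketch of the hierarchy, based on the Spark 1.x columnar sources (abbreviated; the member lists are not complete):

import java.nio.ByteBuffer
import org.apache.spark.sql.catalyst.expressions.MutableRow

// The root trait: an accessor decodes one column out of its ByteBuffer.
private[sql] trait ColumnAccessor {
  def hasNext: Boolean
  def extractTo(row: MutableRow, ordinal: Int)
  protected def underlyingBuffer: ByteBuffer
}

// BasicColumnAccessor implements the trait for one ColumnType.
// NativeColumnAccessor extends it and mixes in NullableColumnAccessor and
// CompressibleColumnAccessor, so null handling and compression are uniform.
// One concrete accessor exists per column type, e.g.:
//   IntColumnAccessor, LongColumnAccessor, ShortColumnAccessor,
//   ByteColumnAccessor, DoubleColumnAccessor, FloatColumnAccessor,
//   BooleanColumnAccessor, StringColumnAccessor, BinaryColumnAccessor,
//   GenericColumnAccessor
// The ColumnAccessor companion object reads the type id at the head of the
// buffer and instantiates the matching concrete accessor.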

Finally, a new iterator is returned:

new Iterator[Row] {
  override def next() = {
    var i = 0
    while (i < nextRow.length) { // number of requested columns
      // calls columnType.setField(row, ordinal, extractSingle(buffer)) to decode the buffer
      columnAccessors(i).extractTo(nextRow, i)
      i += 1
    }
    nextRow // return the decoded row
  }

  override def hasNext = columnAccessors.head.hasNext
}

IV. Summary

Querying Spark SQL's in-memory columnar storage is relatively straightforward, and the query path follows directly from the stored data structure.

That is, at storage time each column is packed into its own ByteBuffer, and together these buffers form a ByteBuffer array.

At query time, the index into that array is found from the exprId of each requested column; the fields in the matching buffer are then decoded by a ColumnAccessor and finally wrapped into a Row object, which is returned. A toy end-to-end sketch of this idea follows.
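The sketch below is minimal, self-contained plain Scala, not Spark code: the column names, the pack helper, and the Int-only buffers are all invented for illustration, and a name lookup stands in for the exprId match.

import java.nio.ByteBuffer

object ColumnarQuerySketch {

  // Storage side: pack one column of Ints into its own ByteBuffer.
  private def pack(values: Seq[Int]): ByteBuffer = {
    val buf = ByteBuffer.allocate(4 * values.length)
    values.foreach(v => buf.putInt(v))
    buf.flip()
    buf
  }

  def main(args: Array[String]): Unit = {
    val columnNames = Seq("key", "value")                    // plays the role of relation.output
    val columnBuffers = Array(pack(Seq(1, 2, 3)),            // key column
                              pack(Seq(10, 20, 30)))         // value column

    // Step 3: resolve each requested column to its index in the buffer array.
    val requested = Seq("value")
    val requestedIndexes = requested.map(columnNames.indexOf(_))

    // Step 4: one "accessor" per requested buffer, then decode row by row.
    val accessors = requestedIndexes.map(i => columnBuffers(i).duplicate()).toArray
    val row = new Array[Int](accessors.length)               // mutable row, reused per next()
    while (accessors.head.hasRemaining) {
      var i = 0
      while (i < row.length) {
        row(i) = accessors(i).getInt()                       // stands in for extractTo(nextRow, i)
        i += 1
      }
      println(row.mkString("Row(", ", ", ")"))               // prints Row(10), Row(20), Row(30)
    }
  }
}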

--eof--

Original content; when reproducing, please attribute:

Reprinted from: OopsOutOfMemory, Shengli's blog

Link to this article: http://blog.csdn.net/oopsoom/article/details/39577419

Note: This article is licensed under the Attribution-NonCommercial-NoDerivatives 2.5 China (CC BY-NC-ND 2.5 CN) license. Reprinting, forwarding, and commenting are welcome, but please retain the author's attribution and the link to the article. Please contact me to negotiate commercial use or other licensing.
