Day63 - Parquet Internals under Spark SQL: A Deep Dive

Tags: object model

DT Big Data Dream Factory contact information:

Sina Weibo: www.weibo.com/ilovepains/
WeChat public account: Dt_spark

Blog: http://blog.sina.com.cn/ilovepains

One: Rethinking the Significance of Parquet under Spark SQL

Big data systems have two dimensions: storage, which spans memory and disk, and computation.

If HDFS is the de facto standard for distributed file storage in the big data era, then Parquet is the de facto standard for file storage formats across the big data ecosystem.

1. Faster: comparing Spark SQL operating on ordinary files such as CSV against Parquet files, using Parquet is about 10x faster most of the time. Moreover, in some cases a Spark program that cannot run successfully against an ordinary file format will run successfully once the data is stored as Parquet. A sketch of such a comparison follows.
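A minimal Spark 1.6-style Scala sketch of this comparison; the paths are placeholders, and the CSV reader assumes the external spark-csv package that Spark 1.x needed for CSV input:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("CsvVsParquet"))
    val sqlContext = new SQLContext(sc)

    // Read the CSV source once (in Spark 1.x this needs the spark-csv package)
    // and persist a Parquet copy of it.
    val csvDF = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/logs.csv")
    csvDF.write.parquet("/data/logs.parquet")

    // Later jobs scan the Parquet copy, which is usually dramatically faster.
    val parquetDF = sqlContext.read.parquet("/data/logs.parquet")
    parquetDF.registerTempTable("logs")
    sqlContext.sql("SELECT count(*) FROM logs").show()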

2. Parquet's compression is very stable and mature. A Spark SQL job whose compression handling fails (for example with lost tasks or lost executors) can often complete normally once Parquet is used instead.

3. It greatly reduces disk I/O: storage space typically shrinks by about 75%, which greatly reduces the amount of data Spark SQL has to read when processing. In particular, with the filter push-down emphasized in Spark 1.6.x, disk I/O and memory consumption can drop dramatically in some cases, as the sketch below shows.
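A minimal sketch of exercising that push-down, assuming a hypothetical people.parquet file with an age column:

    // Filter push-down for Parquet is governed by this flag
    // (on by default since Spark 1.5).
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // The predicate is pushed into the Parquet scan, so row groups and pages
    // whose min/max statistics rule out age > 21 are skipped without being read.
    val adults = sqlContext.read.parquet("/data/people.parquet").filter("age > 21")
    adults.show()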

4. Spark 1.6.x combined with Parquet greatly improves data-scan throughput, which in turn greatly improves query speed. Spark 1.6.x is roughly twice as fast as Spark 1.5.x here, and CPU usage when operating on Parquet is also heavily optimized, effectively reducing CPU consumption.

5. It can greatly optimize Spark's scheduling and execution. Testing shows that if Spark uses Parquet, it can effectively reduce the execution cost of stages while optimizing the execution path, as the sketch below illustrates.
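One way to observe this is to print the physical plan; with a Parquet source, Spark shows the filters it has pushed down (same illustrative file as above):

    // explain() prints the physical plan; for a Parquet scan it includes the
    // PushedFilters that shorten the execution path.
    sqlContext.read.parquet("/data/people.parquet")
      .filter("age > 21")
      .select("name")
      .explain()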

Two: Parquet Internals under Spark SQL

1. In what underlying format does columnar storage keep the data? It is represented as a tree-shaped data structure, with the metadata embedded inside it.

2. Concretely, Parquet file storage has three core components (a short sketch follows this list):

a) Storage format: Parquet defines the specific data types and the on-disk storage format used inside the file.

b) Object model converters: these are responsible for mapping the data types of a computing framework's in-memory objects to and from the data types in Parquet files.

c) Object models: in-memory representations of the data. Avro, Thrift, and Protocol Buffers, for example, each have their own object model; Parquet stores the data on disk regardless of which object model produced it.
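A minimal Scala sketch of how the three layers line up in Spark SQL, reusing the sqlContext from the earlier sketch (class and path names are illustrative): the case classes are the object model, Spark SQL's built-in converter does the mapping, and the Parquet file on disk is the storage format.

    // Object model: the in-memory representation of the data.
    case class Contact(name: String, phoneNumber: Option[String])
    case class AddressBook(owner: String,
                           ownerPhoneNumbers: Seq[String],
                           contacts: Seq[Contact])

    import sqlContext.implicits._

    val books = Seq(
      AddressBook("Julien", Seq("555-1234"),
        Seq(Contact("Dmitriy", Some("555-9876")), Contact("Chris", None))))

    // Spark SQL's object-model converter maps these Scala objects into
    // Parquet's storage format when the DataFrame is written out.
    sc.parallelize(books).toDF().write.parquet("/data/addressbook.parquet")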

3. Examples and explanations

message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}

required (occurs exactly 1 time), optional (occurs 0 or 1 time), repeated (occurs 0 or more times)


Each record in this schema represents a person's address book. There is exactly one owner; the owner can have 0 or more ownerPhoneNumbers and 0 or more contacts. Each contact has exactly one name, while a contact's phoneNumber is optional.
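For instance, one record conforming to this schema might look like this (the values are illustrative):

    AddressBook {
      owner: "Julien",
      ownerPhoneNumbers: ["555-1234", "555-5678"],
      contacts: [
        { name: "Dmitriy", phoneNumber: "555-9876" },
        { name: "Chris" }    // phoneNumber omitted, which optional allows
      ]
    }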

1st: As far as the stored data itself is concerned, only the leaf nodes matter; the leaf nodes here are owner, ownerPhoneNumbers, name, and phoneNumber.

2nd: The schema is effectively a table: AddressBook is the table, and the leaf fields are its columns.

AddressBook
owner | ownerPhoneNumbers | contacts.name | contacts.phoneNumber

3rd: In a Parquet file, the data is divided into row groups; within each row group the data is stored column by column, and each value carries a repetition level and a definition level.

4th: Each column in Parquet is further divided into pages; a page stores the column's values together with their repetition levels and definition levels.

5th: The row group is Parquet's basic I/O unit, so the row group setting strongly affects the speed and efficiency of using Parquet. When analyzing logs we generally recommend a row group size of 256 MB (many people configure it to 1 GB or larger); it is highly recommended to keep the HDFS block size aligned with the row group size, as in the sketch below.
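A sketch of applying those recommendations when writing from Spark; parquet.block.size, parquet.page.size, and dfs.blocksize are the standard parquet-mr/HDFS Hadoop properties, and df stands for any DataFrame being saved:

    // Row group ("block") size: 256 MB, per the recommendation above.
    sc.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)
    // Page size inside each column chunk, e.g. 1 MB.
    sc.hadoopConfiguration.setInt("parquet.page.size", 1024 * 1024)
    // Align the HDFS block size with the row group size so that one row group
    // never straddles two HDFS blocks.
    sc.hadoopConfiguration.set("dfs.blocksize", (256 * 1024 * 1024).toString)

    df.write.parquet("/data/aligned.parquet")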

6th: Although the data is logically a tree-like structure, a clever encoding algorithm (Parquet follows the column-striping model of Google's Dremel paper) converts it into a flat, two-dimensional columnar structure for actual storage.
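To make the encoding concrete, consider the contacts.phoneNumber column of the sample record above. With one repeated field (contacts) on its path, the maximum repetition level R is 1; with contacts (repeated) and phoneNumber (optional) both non-required, the maximum definition level D is 2. The striped column then looks like this:

    value        R    D    meaning
    "555-9876"   0    2    first contact of a new record, phoneNumber present
    null         1    1    further contact in the same record, phoneNumber absent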

Spark SQL uses Parquet as its default storage (data source) format.
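A small sketch of that default in action: because spark.sql.sources.default is parquet, a save with no explicit format writes Parquet (the path and df are placeholders).

    // No .format(...) specified, so Spark SQL falls back to
    // spark.sql.sources.default, which is "parquet".
    df.write.save("/data/out")
    val back = sqlContext.read.load("/data/out")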

In columnar storage, each column is stored independently of the other columns.

Query engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL
Computation frameworks: MapReduce, Spark, Cascading, Crunch, Scalding, Kite
Data models: Avro, Thrift, Protocol Buffers, POJOs

