One: Best practices for using Parquet with Spark SQL
1. In the past, the technology-stack pipeline for big data analysis was generally split into two approaches:
A) Data Source -> MR/Hive/Spark (the ETL step) -> HDFS -> Parquet on HDFS -> Spark SQL/Impala -> Result Service (the results can be placed in a DB, and can also be exposed as a data service via JDBC/ODBC); a minimal batch sketch of this approach follows below;
B) Data Source -> real-time updates into HBase/DB -> export to Parquet -> Spark SQL/Impala -> Result Service (which can likewise be exposed as a data service via JDBC/ODBC);
The second approach can be replaced by Kafka + Spark Streaming + Spark SQL (which internally also strongly recommends using Parquet to store the data).
2. The way forward: Data Source -> Kafka -> Spark Streaming -> Parquet -> Spark SQL (plus ML, GraphX, etc.) -> Parquet -> further data mining, and so on.
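As a minimal sketch of approach A (the paths, column names, and application name below are illustrative assumptions, not part of the original text), an ETL job lands the raw source as Parquet on HDFS, and Spark SQL then reads that Parquet data to produce the result set; the resulting table could also be exposed through Spark's Thrift JDBC/ODBC server.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object EtlToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("EtlToParquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // ETL step: parse the raw text source and persist it as Parquet on HDFS.
    val raw = sc.textFile("hdfs:///data/source/events.log")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toLong))
      .toDF("userId", "clicks")
    raw.write.parquet("hdfs:///data/warehouse/events.parquet")

    // Serving step: Spark SQL reads the Parquet data and computes the result set.
    val events = sqlContext.read.parquet("hdfs:///data/warehouse/events.parquet")
    events.registerTempTable("events")
    sqlContext.sql("SELECT userId, SUM(clicks) AS total FROM events GROUP BY userId").show()

    sc.stop()
  }
}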
Two: The essentials of a Parquet introduction
From the official website:
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
1. Parquet is a file format based on columnar storage. The core advantages of column storage are:
A. Data that does not match a query can be skipped, so only the required data is read, reducing the amount of I/O.
B. Compression encoding can reduce disk storage space. Because all values in a column share the same data type, storage can be reduced further with more efficient encodings such as Run Length Encoding and Delta Encoding.
C. Only the required columns are read, vectorized operations are supported, and scan performance is better (a small sketch follows below).
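A small sketch of points A and C, assuming a Parquet file of people records already exists at a hypothetical path and that sqlContext is available as in the spark-shell: selecting only the needed columns lets Spark read just those column chunks, and the filter condition can be pushed down into the Parquet scan.

val people = sqlContext.read.parquet("hdfs:///data/people.parquet")

// Only the "name" and "age" column chunks are read; row groups whose min/max
// statistics rule out age > 30 can be skipped entirely.
val adults = people.select("name", "age").filter("age > 30")
adults.explain()   // the physical plan shows the pruned columns and the pushed filter
adults.show()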
Three: The significance of Parquet for Spark SQL
1. If HDFS is the de facto standard file system of the big data era, then Parquet is the de facto standard storage format of the big data era;
2. Faster: comparing Spark SQL over ordinary files such as CSV with Spark SQL over Parquet files, Parquet is in most cases up to about 10 times faster (and in some cases where a job over ordinary files cannot even run successfully on Spark, the same job over Parquet completes successfully);
3. Parquet's compression is very stable: Spark SQL's handling of compression on ordinary files may not work properly (for example, it can lead to lost tasks or lost executors), whereas the same processing over Parquet runs normally;
4. Disk I/O is greatly reduced, with storage space typically shrinking by about 75%, which greatly reduces the input data that Spark SQL has to process; in particular, with pushed-down filters in Spark 1.6.x, disk I/O and memory usage can in some cases be reduced even further;
5. Spark 1.6.x plus Parquet greatly improves data-scanning throughput, which greatly increases the speed at which data is found; Spark 1.6 is about 1x faster than Spark 1.5 here, and Spark 1.6.x also heavily optimizes CPU usage when operating on Parquet, effectively reducing CPU consumption;
6. Using Parquet can also greatly optimize Spark's scheduling and execution: our tests show that with Parquet, Spark can effectively reduce the execution cost of stages and optimize the execution path (a comparison sketch follows below).
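As a hedged illustration of points 2 and 4, the sketch below runs the same filtered count over a plain CSV text file and over an equivalent Parquet file (both paths and the schema are assumptions, and sc/sqlContext are taken from the spark-shell); the actual speedup depends on the data and the cluster. spark.sql.parquet.filterPushdown is the standard Spark SQL setting that controls pushed-down filters.

// Make sure filter pushdown into Parquet is enabled (it is on by default in recent versions).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// Plain CSV text: every line is read and parsed, regardless of the filter.
import sqlContext.implicits._
val csvEvents = sc.textFile("hdfs:///data/events.csv")
  .map(_.split(","))
  .map(f => (f(0), f(1).toLong))
  .toDF("eventType", "ts")
println(csvEvents.filter("eventType = 'click'").count())

// Parquet: only the eventType column is scanned, and row groups whose statistics
// exclude 'click' are skipped before decompression.
val parquetEvents = sqlContext.read.parquet("hdfs:///data/events.parquet")
println(parquetEvents.filter("eventType = 'click'").count())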
Four: Parquet internals under Spark SQL
1. In what basic form does a column store keep data? It is represented as a tree-like structure, with metadata held inside the table;
2. A concrete Parquet file involves three components when data is stored:
A) Storage Format: Parquet defines the specific data types and storage format used inside the file;
B) Object Model Converters: Parquet is responsible for mapping the data objects of the computing framework to and from the specific data types in the Parquet file;
C) Object Models: storage formats such as Avro have their own object models; when Parquet processes data in such formats, however, it uses its own object model to store the data.
After the mapping is complete, Parquet performs its own column encoding and then stores the file in Parquet format (a short write-path sketch follows below).
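A short sketch of this write path, assuming an existing DataFrame named df and the spark-shell's sqlContext: Spark SQL's rows go through the object-model converter, are column-encoded and compressed, and are stored in Parquet format. spark.sql.parquet.compression.codec is the standard Spark SQL option for the codec; snappy is a common choice.

// Choose the compression codec applied after column encoding
// ("uncompressed", "snappy", "gzip" or "lzo").
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// Writing the DataFrame triggers the conversion from Spark SQL's object model
// to Parquet's storage format, followed by column encoding and compression.
df.write.parquet("hdfs:///data/output.parquet")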
3. Modules:
The parquet-format project contains format specifications and Thrift definitions of the metadata required to properly read Parquet files.
The parquet-mr project contains multiple sub-modules which implement the core components for reading and writing a nested, column-oriented data stream, map this core onto the Parquet format, and provide Hadoop Input/Output Formats, Pig loaders, and other Java-based utilities for interacting with Parquet.
The parquet-compatibility project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files.
4. For example, consider the following AddressBook schema:
message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}
required means the field occurs exactly once, optional means it occurs 0 or 1 times, and repeated means it occurs 0 or more times.
Each record in this schema represents a person's address book. There is exactly one owner; the owner can have 0 or more ownerPhoneNumbers, and the owner can have 0 or more contacts. Each contact has exactly one name, and a contact's phoneNumber is optional.
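As a hedged sketch of how this schema can be expressed from Spark SQL (the sample data and output path are made up for illustration, and sqlContext is the spark-shell's): nested case classes capture the repeated contacts and the optional phoneNumber, and writing the DataFrame produces Parquet files with the corresponding nested schema.

case class Contact(name: String, phoneNumber: Option[String])
case class AddressBook(owner: String,
                       ownerPhoneNumbers: Seq[String],
                       contacts: Seq[Contact])

import sqlContext.implicits._
val books = Seq(
  AddressBook("Julien", Seq("555-1234", "555-5678"),
    Seq(Contact("Dmitriy", Some("555-9999")), Contact("Chris", None))),
  AddressBook("A. Nonymous", Seq.empty, Seq.empty)   // no phone numbers, no contacts
).toDF()

// The printed schema shows the nested and repeated structure corresponding
// to the Parquet schema above.
books.write.parquet("hdfs:///data/addressbook.parquet")
sqlContext.read.parquet("hdfs:///data/addressbook.parquet").printSchema()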
Point 1: As far as the stored data itself is concerned, only the leaf nodes matter; here the leaf nodes are owner, ownerPhoneNumbers, name, and phoneNumber;
Point 2: Logically, the schema is essentially a table whose columns are:
AddressBook
  - owner
  - ownerPhoneNumbers
  - contacts
    - name
    - phoneNumber
Point 3: In a Parquet file, the data is divided into row groups (each containing many columns, and each column carries several very important attributes such as the repetition level and the definition level);
Point 4: Columns in Parquet are stored as pages, and a page holds the repetition levels, definition levels, and other content;
Point 5: The row group is Parquet's unit of reading, writing, and caching, so its size greatly affects the speed and efficiency of using Parquet. For log analysis we generally recommend configuring the row group size to about 256 MB (many people configure about 1 GB); to maximize efficiency it is strongly recommended to keep the HDFS block size and the row group size consistent (see the configuration sketch after the table below);
Point 6: When the tree-like structure is actually stored, a clever encoding algorithm converts it into a two-dimensional table structure:
Repetition Level | Definition Level | Value
1                | 2                | 132990600
0                | 1                | "Spark"
0                | 0                | NULL
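Below is a configuration sketch for the row-group sizing advice in point 5; the 256 MB figure comes from the text, parquet.block.size and dfs.blocksize are the standard parquet-hadoop and HDFS properties, and it assumes (as in the earlier sketches) that sc and df exist and that the Parquet writer picks up parquet.block.size from the job's Hadoop configuration.

// Align the Parquet row-group size with the HDFS block size (256 MB here).
val targetSize = 256 * 1024 * 1024
sc.hadoopConfiguration.setInt("parquet.block.size", targetSize)
sc.hadoopConfiguration.setLong("dfs.blocksize", targetSize.toLong)

// Subsequent Parquet writes pick up the Hadoop configuration set above.
df.write.parquet("hdfs:///data/aligned.parquet")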
Parquet best practices and code in action under Spark SQL