The lifetime of a SparkSQL job


Spark is a very popular computing framework developed by the UC Berkeley AMP Lab, and Databricks, founded by the original team, is responsible for its commercialization. SparkSQL is the SQL solution built on Spark, focused on interactive query scenarios.

Everyone says Spark/SparkSQL is fast, and benchmarks are everywhere. However, few people seem to be clear about exactly what makes Spark/SparkSQL fast, or how fast it really is. Is it because Spark is a memory-based computing framework? Because SparkSQL has a powerful optimizer? This article walks through how a SparkSQL job is executed, and then discusses how SparkSQL compares with Hive on MapReduce.

There is already a wide range of SQL-on-Hadoop solutions, including Hive, Cloudera's Impala, MapR's Drill, Presto, SparkSQL, Apache Tajo, and IBM BigSQL. All of these projects are trying to solve the performance problem of interactive SQL, because the original Hive on MapReduce is simply too slow.

So how fast is Hive on MapReduce compared with SparkSQL or the other interactive engines? Let's first look at how an SQL-on-Hadoop engine works.

Today's SQL-on-Hadoop engines all work in a similar way in the first half of the pipeline, much like a compiler.

Xiaohong is a data analyst. One day she wrote an SQL statement to calculate the weighted average score of each department:

SELECT dept, avg(math_score * 1.2) + avg(eng_score * 0.8) FROM students GROUP BY dept;

The students table holds student scores (please don't mind that it doesn't look normalized; a lot of data on Hadoop isn't, because joins are expensive, and also because writing more tables here would be a hassle for me). She submitted the query to an SQL-on-Hadoop platform through Netease's big data platform, then put her work aside and switched over to a video site to watch a TV series for a while.

While she was watching, our SQL platform was working hard.

The first step is query parsing.

Like many compilers, an SQL engine needs a Parser (a famous wheel that programmers love to reinvent). The Parser (more precisely, a Lexer plus a Parser) converts the character stream into tokens and then builds an abstract syntax tree (AST) according to the grammar definition; the details are standard compiler material and won't be covered here. Many projects choose ANTLR (Hive, Presto, and so on), which lets you write parser rules in a BNF-like notation; others, such as SparkSQL, use hand-written parsers. The AST is then wrapped into a simpler object that captures the basic query information: what is being SELECTed, what the INSERT target is, what the WHERE condition is, what the GROUP BY columns are, and so on; if there is a subquery, this happens recursively. The result is generally called a logical plan:

TableScan(students)
-> Project(dept, avg(math_score * 1.2) + avg(eng_score * 0.8))
-> TableSink

The diagram above is only a rough sketch. Each SQL engine differs slightly in the details, but this is basically what happens. If you want an SQL engine with clear, easy-to-understand code, take a look at Presto (the most beautiful open-source code I have read).

At this point the string has been converted into a so-called LogicalPlan, but this plan is still hard to evaluate: we don't yet know what dept refers to, what type math_score is, or what kind of function AVG even is. Such a LogicalPlan is called an Unresolved Logical Plan.
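To make the "unresolved" part concrete, here is a minimal Scala sketch of what such a plan might look like as a data structure. The case class names are simplified illustrations, not Catalyst's actual classes:

sealed trait LogicalPlan
case class UnresolvedRelation(tableName: String) extends LogicalPlan
case class UnresolvedProject(expressions: Seq[String], child: LogicalPlan) extends LogicalPlan

// Roughly what the parser hands over for Xiaohong's query (GROUP BY omitted for brevity):
val unresolvedPlan: LogicalPlan =
  UnresolvedProject(
    expressions = Seq("dept", "avg(math_score * 1.2) + avg(eng_score * 0.8)"),
    child = UnresolvedRelation("students"))
// "dept", "math_score" and "avg" are still just strings here: no types, no bound functions.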

What is missing is metadata, which mainly consists of the table schema and function information. A table's schema includes its column definitions (names and types), its physical location, its storage format, and how it should be read; function information includes the function signature and the location of the implementing class.

With this metadata, the SQL engine traverses the unresolved plan again for a deeper analysis. The most important steps are column reference binding and function binding. Column reference binding determines the type of each expression, and once the types are known, functions can be bound. Function binding is almost the most critical step here, because ordinary functions such as CAST, aggregate functions such as AVG, analytic (window) functions such as RANK, and table functions such as explode are evaluated in completely different ways; they are rewritten into dedicated plan nodes rather than ordinary expression nodes. Deeper semantic checks are also needed: whether the GROUP BY covers all non-aggregated columns, whether aggregate functions are nested inside other aggregate functions, and the most basic type-compatibility checks. In a strongly typed system, a type mismatch such as date = '1970-01-01' must raise an error; in a weakly typed system, a CAST can be inserted to perform type coercion.
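As a rough illustration of column reference binding, the sketch below resolves column names against a toy catalog and fails analysis when a reference cannot be found. The schema entries, HDFS path, and format are made-up examples, not the real metastore contents:

case class ColumnSchema(name: String, dataType: String)
case class TableSchema(columns: Seq[ColumnSchema], location: String, format: String)

// A toy catalog; real engines read this from a metastore.
val catalog: Map[String, TableSchema] = Map(
  "students" -> TableSchema(
    columns = Seq(
      ColumnSchema("dept", "string"),
      ColumnSchema("eng_score", "double"),
      ColumnSchema("math_score", "double")),
    location = "hdfs:///warehouse/students",  // hypothetical location
    format = "parquet"))                      // hypothetical format

// Column reference binding: resolve a name to a typed column, or report an analysis error.
def resolveColumn(table: String, column: String): ColumnSchema =
  catalog.get(table).flatMap(_.columns.find(_.name == column))
    .getOrElse(sys.error(s"cannot resolve column '$column' in table '$table'"))

resolveColumn("students", "math_score")  // ColumnSchema(math_score, double)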

Then we get a logical plan that has not been optimized:

TableScan(students => dept: String, eng_score: double, math_score: double)
-> Project(dept, math_score * 1.2: expr1, eng_score * 0.8: expr2)
-> Aggregate(avg(expr1): expr3, avg(expr2): expr4, GROUP: dept)
-> Project(dept, expr3 + expr4: avg_result)
-> TableSink(dept, avg_result -> Client)

So can we get down to the real work now? Not yet; the plan above is still far from ready. What kind of SQL engine would ship without an optimizer? Both SparkSQL and Hive have one. Most SQL-on-Hadoop engines do rule-based optimization, and a few more complex ones, such as Hive, also do cost-based optimization. Rule-based optimizations are easy to implement: for example, pushing the filter conditions of a join down into the subqueries so they are applied early, which reduces the amount of data the join has to process (JOIN is one of the heaviest operations, and the less data it sees, the faster it runs); or constant folding, i.e. replacing expressions that evaluate to constants with their values. Cost-based optimization is much more complicated, the most typical example being reordering joins according to their estimated cost. SparkSQL's cost-based optimization is of the simplest kind: choosing a join strategy based on table size (small tables can be broadcast), without any join reordering, and even that choice is actually made later, during physical plan generation.
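To give a feel for how simple a rule-based optimization can be, here is a constant-folding sketch over a toy expression tree. The case classes are illustrative only; a real optimizer applies many such rules repeatedly until the plan stops changing:

sealed trait Expr
case class Literal(value: Double) extends Expr
case class ColumnRef(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Multiply(left: Expr, right: Expr) extends Expr

// Replace any sub-expression whose operands are all literals with its value,
// e.g. Multiply(Literal(0.4), Literal(3.0)) becomes Literal(1.2).
def foldConstants(e: Expr): Expr = e match {
  case Add(l, r) => (foldConstants(l), foldConstants(r)) match {
    case (Literal(a), Literal(b)) => Literal(a + b)
    case (fl, fr) => Add(fl, fr)
  }
  case Multiply(l, r) => (foldConstants(l), foldConstants(r)) match {
    case (Literal(a), Literal(b)) => Literal(a * b)
    case (fl, fr) => Multiply(fl, fr)
  }
  case other => other
}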

If nothing has errored out by now, congratulations: you have a Resolved Logical Plan. Together with an expression evaluator, this plan would already let you scan the table and answer the query on a single machine. But aren't we building distributed systems? Our data analyst has already finished the opening credits of her show; what are we waiting for?

For our analyst to finish crunching hundreds of GB of data before her episode ends, we have to bring in distributed computing; on a single node, she could finish the entire series before the query did. The plan produced so far is called a logical plan because it only describes what should logically happen; we still don't know how it maps onto a Spark or MapReduce job.

The logical plan has to be converted into a physical plan that can run in a distributed environment. Two questions remain: how to hook into the execution engine, and how to evaluate the expressions.

There are two basic strategies for expression evaluation. One is interpretation: directly interpreting the expression tree built earlier, which is what Hive currently does. The other is code generation, which the engines billed as the new generation, including SparkSQL, Impala, and Drill, use (paired with a fast compiler). Whichever strategy is used, the evaluation logic of an expression ultimately gets wrapped into a class. The code looks roughly like this:

// math_score * 1.2
val leftOp = row.get(1 /* math_score column index */)
val result = if (leftOp == null) null else leftOp * 1.2

Each item in the SELECT list generates such a piece of evaluation code, or a wrapped evaluator. But what about AVG? As anyone who has written WordCount remembers, aggregate computation has to be spread across the Map and Reduce stages. This is where physical plan conversion and integration with the distributed engine come in.

An aggregation such as AVG, together with the GROUP BY clause, tells the underlying distributed engine how the aggregation has to be carried out. In essence, AVG must be split: the Map stage computes a partial sum and a count for each group, and the Reduce stage adds up the partial sums and counts of each group and then divides to produce the average.

So the AVG we want to compute is split into two plan nodes: Aggregate(Partial) and Aggregate(Final). The Partial node computes the partial accumulation and runs on every Mapper; the underlying engine then shuffles so that rows with the same key (dept, in our case) land on the same Reduce node, where the Final node combines the partial results into the final answer.
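The same split is easy to write by hand at the RDD level. The sketch below approximates what the physical plan does for our query, assuming rows is an RDD of (dept, math_score, eng_score) triples; it is not the code SparkSQL actually generates:

import org.apache.spark.rdd.RDD

def weightedAvgByDept(rows: RDD[(String, Double, Double)]): RDD[(String, Double)] =
  rows
    // Partial aggregation (Mapper side): per record, emit (dept, (sum1, sum2, count)).
    .map { case (dept, math, eng) => (dept, (math * 1.2, eng * 0.8, 1L)) }
    // Shuffle: records with the same dept meet on the same reducer,
    // where the partial sums and counts are merged.
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
    // Final aggregation (Reducer side): turn sums and counts into the two averages.
    .mapValues { case (sum1, sum2, count) => sum1 / count + sum2 / count }

Calling toDebugString on the result would show the lineage broken at reduceByKey, which is exactly where the shuffle and the stage boundary fall.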

After the aggregate function has been split, the single SQL statement in our example is still fairly simple, because it needs only one shuffle. With multiple subqueries you may face several shuffles. On MapReduce, every shuffle requires its own MapReduce job, because in the MapReduce model a shuffle can only happen between the Map and Reduce stages. On Spark, shuffles can be placed anywhere, but the job has to be cut into stages at each shuffle boundary. After this splitting you end up with either a chain of MR jobs or a DAG (directed acyclic graph) of Spark stages.

Remember the execution plan from before? It finally becomes a physical execution plan like this:

TableScan -> Project(dept, math_score * 1.2: expr1, eng_score * 0.8: expr2)
-> AggregatePartial(avg(expr1): avg1, avg(expr2): avg2, GROUP: dept)
-> ShuffleExchange(Row, KEY: dept)
-> AggregateFinal(avg1, avg2, GROUP: dept)
-> Project(dept, avg1 + avg2)
-> TableSink

How does this run on MR or Spark? The work before and after the shuffle physically executes on different sets of compute nodes; whether the engine is MapReduce or Spark, these correspond to the Mapper side and the Reducer side, separated by the shuffle. The plan above is cut at the ShuffleExchange node, and the two halves are sent to the Mappers and the Reducers respectively; along with these plan fragments, the evaluator classes mentioned earlier are serialized and shipped as well.

In the MapReduce model, what ultimately runs is a special Mapper and a special Reducer: each loads the serialized plan and evaluator information in its initialization phase and then evaluates every input row in its map or reduce function. In Spark, the plan is turned into a sequence of RDD transformations.

For example, for a Project operation on MapReduce, the pseudocode looks like this:

void configure() {
    context = loadContext()
}

void map(inputRow) {
    outputRow = context.projectEvaluator(inputRow)
    write(outputRow)
}

On Spark, it looks roughly like this:

currentPlan.mapPartitions { iter =>
    val projection = loadContext()
    iter.map { row => projection(row) }
}

At this point the engine has happily submitted the job for you, and your cluster starts grinding through the computation.

So far it looks as if there is no difference between SparkSQL and Hive on MapReduce, right? In fact, SparkSQL's speed does not come from its SQL engine: its optimizer is not as sophisticated as Hive's (more than a decade of accumulated work on Hive is nothing to sneer at). What is fast is Spark itself.

Spark advertises itself as dozens of times faster than MapReduce. Many people attribute this to Spark being a "memory-based computing engine", but that is not really true: Spark still spills to disk, and the shuffle still writes intermediate data to local disk. So, manually cached data aside, the claim that Spark "computes in memory" doesn't hold up.
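Data only stays in memory across queries when you cache it explicitly. A minimal sketch, assuming a SparkSession with a students table already registered:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()

// Explicitly cache the table; nothing is kept in memory until an action materializes it.
spark.catalog.cacheTable("students")
spark.table("students").count()   // first scan fills the in-memory cache

// Later queries on students can now read from memory instead of going back to HDFS.
spark.sql("SELECT dept, count(*) FROM students GROUP BY dept").show()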

SparkSQL is indeed faster, but not simply because the Spark engine is faster than Hive on MR.

In fact, the second-generation SQL-on-Hadoop engines, whether SparkSQL, Impala, or Presto, all make at least three improvements: they eliminate redundant HDFS reads and writes, eliminate redundant MapReduce stages, and save JVM startup time.

In the MapReduce model, every shuffle needs a complete MapReduce job, and to chain one MR job after another, the results of the previous job must be written to HDFS and then read back in the next job's Map stage. This is the root of all evil.

In fact, for the SQL query above there may not be a dramatic difference between MapReduce and Spark, because it goes through only one shuffle.

The difference shows up with a query like the following:

SELECT g1.name, g1.avg, g2.cnt
FROM (SELECT name, avg(id) AS avg FROM students GROUP BY name) g1
JOIN (SELECT name, count(id) AS cnt FROM students GROUP BY name) g2
ON (g1.name = g2.name)
ORDER BY avg;

The corresponding MR job chain and Spark stage DAG look quite different: each GROUP BY, the JOIN, and the final ORDER BY all involve shuffles, so on MapReduce the query becomes a chain of several MR jobs with an HDFS write and read between every pair of them, while on Spark it is a single job cut into stages that exchange data through local disk.
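If you want to see those shuffle boundaries for yourself, SparkSQL can print its physical plan. A minimal sketch, again assuming a SparkSession with the students table registered:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("explain-sketch").getOrCreate()

spark.sql("""
  SELECT g1.name, g1.avg, g2.cnt
  FROM (SELECT name, avg(id) AS avg FROM students GROUP BY name) g1
  JOIN (SELECT name, count(id) AS cnt FROM students GROUP BY name) g2
  ON (g1.name = g2.name)
  ORDER BY avg
""").explain()
// The Exchange nodes in the printed plan mark where data is exchanged (shuffled or broadcast);
// on MapReduce each shuffle would mean another MR job and an HDFS round trip.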

Every HDFS write is amplified by the replication factor (three by default), so a great deal of time goes into just reading and writing disks between jobs; eliminating these round trips is the main source of Spark's speed advantage. Another speedup comes from JVM reuse. Consider a Hive job with tens of thousands of tasks: run on MapReduce, every task starts its own JVM, and JVM startup can take anywhere from a few seconds to tens of seconds, while the actual computation of a short task may itself be only a few to a few dozen seconds. By the time the MR-based Hive tasks have finished starting up, the Spark tasks are already done. With many short tasks, this saves a great deal of time.
