NetEase Video Cloud Technology share: A lifetime of Sparksql's work

Source: Internet
Author: User
Tags shuffle

NetEase Video Cloud is a cloud-based distributed multimedia processing cluster and professional audio and video technology designed by NetEase to provide stable, smooth, low-latency, high-concurrency video streaming, recording, storage, transcoding and VOD, such as the PAAs service, online education, telemedicine, entertainment show, Online finance and other industries and enterprise users only through simple development can create online audio and video platform. Now, NetEase video cloud Technical experts to share a technical article: a sparksql of the work of life.

Spark is a very hot computing framework, developed by UC Berkeley AMP Lab and Databricks by the original squad, responsible for commercial related matters. Sparksql is the SQL solution built on Spark, which focuses on interactive query scenarios.

Everyone said Spark/sparksql fast, all kinds of benchmark flying, but in the end spark/sparksql fast, or quickly where, it seems that few people say clearly. Because Spark is a memory-based computing framework? Because Sparksql has a powerful optimizer? This article will show you how a sparksql job works, and by the way, how does sparksql and hive on MapReduce compare to each other at all?

The solution to the SQL on Hadoop has been everywhere, whether it's the Ganso-level Hive,cloudera impala,mapr drill,presto,sparksql or even Apache TAJO,IBM Bigsql, Companies are trying to solve the performance problem with SQL interaction scenarios because the original hive on MapReduce is too slow.

So where does hive on MapReduce compare to Sparksql or other interactive engines? Let's start with a look at how a SQL on Hadoop engine works.

Now the SQL on Hadoop job, the first half of the work of the same principle, similar to a compiler, divided to the point is the base.

Little Red is the data analysis, she wrote a SQL one day to count a sorting system of weighted mean score summary.

SELECT Dept, AVG (Math_score * 1.2) + AVG (Eng_score * 0.8) from Studentsgroup by dept;

The students table is the Student score table (please do not mind that this table does not seem to conform to the paradigm, many of the data on Hadoop does not conform to the paradigm, because the join cost is high, and I write the table will be very troublesome).

She submitted this query to a SQL on Hadoop platform via NetEase Big Data mammoth System, then she put down her job and cut to the video web page to see the Langya.

While she was watching the video, our SQL platform worked very hard.

The first is query parsing.

Here, like many compiler, you need a Parser (which is the famous Programmer's Special project), and Parser (specifically lexer plus Parser) is the function of turning a string stream into a token, and then creating an abstract syntax tree AST based on the syntax definition. This does not unfold in detail, children's shoes can refer to the compilation principle. More projects will choose ANTLR (Hive, Presto, etc.), you can write parser rules in a pattern similar to BNF, and of course there are handwritten examples such as sparksql. The AST is further packaged into a simple basic query information object that contains the basic information of a query, such as whether the type of the base statement is select or Insert,where, what GROUP by IS, and if there is a subquery, it needs to be recursively in. This thing is roughly the so-called logical plan.

Tablescan (students)

Project (Dept, AVG (Math_score * 1.2) + AVG (Eng_score * 0.8))


The above is a no-obligation signal, specific to a SQL engine will be slightly different, but basically do this. If you're looking for a code-readable SQL engine, you can refer to Presto (the most beautiful source code I've ever read).

Up to this point, you have converted the string into a so-called Logicalplan, the plan distance can be evaluated to be more disabled. Basically, I don't know what dept is, Math_score is the type of God horse, and Avg is a function, which is not clear. Such logicalplan can be called the unresolved (disabled) Logical Plan.

What is missing is the so-called metadata information, which consists mainly of two parts: the schema of the table and the function information. The schema information of the table mainly contains the table's column definition (name, type), the physical location of the table, the format, how to read, function information is the function signature, the location of the class, and so on.

With this in place, the SQL engine needs to traverse the crippled plan once more to make an in-depth analysis. The most important processing is the column reference binding and the function binding. A column-reference binding determines the type of an expression. And with the type you can do function binding. function binding is almost the most critical step here, because ordinary functions such as cast, and aggregate functions such as AVG, analytic functions such as rank and table function, such as explode, are evaluated in a completely different way, and they are rewritten as separate planning nodes. It is no longer a normal expression node. In addition, there is a need for in-depth semantic testing. For example, whether the group by includes all non-aggregated columns, whether the aggregate function has an aggregate function embedded, and the most basic type compatibility check, for strongly typed systems, type inconsistency such as Date = ' 2015-01-01 ' requires an error, for weakly typed systems, you can add cast to do type coerce (type) (fornication).

Then we have a logical plan that has not been optimized:

Tablescan (students=>dept:string, eng_score:double, math_score:double)

->project (Dept, Math_score * 1.2:expr1, Eng_score * 0.8:expr2)

->aggregate (AVG (EXPR1): EXPR3, Avg (EXPR2): EXPR4, Group:dept)

->project (Dept, Expr3+expr4:avg_result)

->tablesink (Dept, Avg_result->client)

So we can start the meat scene? It's early.

Just the plan, is still very far, as a SQL engine, no optimization how good to see people? Whether it's sparksql or hive, there's a set of optimizer. Most SQL on Hadoop engines have rule-based optimizations, and a handful of complex engines such as hive have cost-based optimizations. Rule optimization is easy to implement, such as the classic predicate push-down, you can push the filter of the join query to the subquery pre-calculation, so that the data needed to calculate the join is reduced (join is one of the heaviest operations, the less data can be used to do the join will be faster), such as some evaluation optimization, Like removing an expression that evaluates to a constant, and so on. The cost-based optimization is much more complex, such as adjusting the join order (the most classic scenario) based on the join cost, and for sparksql, the cost optimization is the simplest to choose the join strategy based on the table size (the small table can be distributed by broadcast), without the join order to exchange these. The join policy selection is then the physical execution plan generation phase to be explained.

Here, if you haven't made an error, you're lucky to get a resolved (not crippled) Logical plan. This plan, coupled with the expression of the evaluation, you can toss and toss in the stand-alone table query evaluation. But aren't we doing a distributed system? Data analysis sister has read the "Langya" the title, you are still leisurely what?

In order to let sister before watching TV show hundreds of g of data, we must rely on the power of distribution, after all, single-node count is enough for her sister to read the whole Langya list of episodes. The logical plan that you just generated is called a logical plan because it just seems logical to execute (mistakenly), and we don't actually know how this works for spark or mapreduce tasks.

The logical execution plan next needs to be converted into a physical plan that can be executed in a distributed situation, and you also lack: How to interface with the engine, how to do the expression evaluation of two parts.

There are two basic strategies for evaluating an expression, one interpreting execution, and interpreting the previously brought expression, which is the current pattern of hive, and the other is code generation, including Sparksql,impala, Drill and so on called the new generation of engines are code generation mode (and with the high-speed compiler). Regardless of the pattern, you finally encapsulate the expression evaluation part as a class. The code may look similar to the following:

Math_score * 1.2val leftop = row.get (1/* math_score column index */);

Val result = if (Leftop = = null) then NULL else leftop * 1.2;

Each stand-alone select project generates such an expression-evaluation code or a encapsulated-value-finding device. But what about AVG? When I wrote WordCount, I remember that aggregation calculations need to be assigned to the map and reduce two stages? This involves the physical execution of transformations, involving the docking of distributed engines.

The aggregation calculations such as AVG, plus the group by instructions, tell the underlying distributed engine how you need to do aggregations. In essence, Avg aggregation needs to be split into a map phase to calculate the sum of entries, as well as the number of items, and the reduce phase two increments and finally each group divides.

So the avg we're going to calculate is actually going to split up into two program nodes: Aggregates (Partial) and aggregates (Final). The partial part is the part where we calculate the local accumulation, each mapper node executes, and the underlying engine makes a shuffle that distributes the same key (here is dept) to the same reduce node. You can get the final result by eventually aggregating it.

After the aggregation function, if only the above case to a step of SQL, it is relatively simple, if there are many sub-query, then you may face multiple shuffle, for MapReduce, each time you need a mapreduce job to support, Because of the MapReduce model, the shuffle operation can only be done through the reduce phase, and for Spark, the shuffle may be placed randomly, but you have to split the stage according to Shuffle. After this is done, you get a DAG with multiple Mr Jobs or a DAG with multiple stages of spark (directed acyclic graph).

Do you remember the execution plan you just made? It finally becomes a physical execution plan like this:

Tablescan->project (Dept, Math_score * 1.2:expr1, Eng_score * 0.8:expr2)

Aggretatepartial (AVG (EXPR1): AVG1, Avg (EXPR2): Avg2, Group:dept)

Shuffleexchange (Row, key:dept)

Aggregatefinal (AVG1, AVG2, group:dept)

Project (Dept, AVG1 + avg2)


How exactly does this thing execute in Mr or spark? Before and after the corresponding shuffle, they will be physically executed on different batches of compute nodes. Whether they correspond to the MapReduce engine or spark, they are mapper and reducer, separated by shuffle. The above plan will be broken by the middle of the Shuffleexchange, sent to mapper and reducer, respectively, and, of course, in addition to the above section and the previously mentioned evaluation class, will also be serialized together sent.

In the MapReduce model, you end up with a special mapper and special reducer that load the serialized plan and evaluation information in the initialization phase, then evaluate each input sequentially in the map and reduce functions, and in Spark, What you generate is an RDD transform operation.

For example, a project operation, for MapReduce, pseudo-code is probably the case:

void configuration () {

context = Loadcontext ()

}void Map (Inputrow) {

Outputrow = Context.projectevaluator (Inputrow);

Write (Outputrow);


For Spark, this is probably the case:

currentplan.mappartitions {iter = projection = Loadcontext () {row = projection (row)}}

So far, the engine has helped you to submit the job happily, and your cluster has begun to calculate in a leisurely way.

So far, it seems that there is no difference between sparksql and hive on MapReduce? In fact sparksql fast, and unpleasant in the engine.

Sparksql's engine optimization, and no hive complex, after all, human hive accumulation for many years, more than 10 years down is not vegetarian. But spark itself is fast.

Spark advertised itself dozens of times times faster than MapReduce, and many people thought it was because spark was a "memory-based Computing engine", which is not true. Spark still has to drop the disk, and the shuffle process needs to spit the intermediate data on the local disk as well. So spark is based on memory calculations, and it's not correct to consider the manual cache scenario.

Sparksql fast, not at all just said that a lump of things than hive on Mr Faster, but the spark engine itself is fast.

In fact, the second generation of SQL on Hadoop engines, both Sparksql,impala and Presto, have made at least three improvements, eliminating redundant HDFS read-write, redundant mapreduce phases, and saving JVM startup time.

In the MapReduce model, a shuffle operation is required to access a complete mapreduce operation, and access to an Mr operation requires that the previous phase of the Mr Result be written to HDFs and re-read in the map phase, which is the root of all evils.

In fact, if it's just the SQL query above, either MapReduce or spark, it's not necessarily a significant difference, because it only passes through a shuffle phase.

The true manifestation of the difference is such a query:

Select, G1.avg, G2.cntfrom (select name, AVG (ID) as AVG from students GROUP by name) g1join (select Name, COUNT (ID As CNT from students GROUP by name) g2on ( = ORDER by AVG;

And their corresponding Mr Tasks and spark tasks are as follows:

One HDFs intermediate data write, in fact, because the replication constant expansion to three times times the write, and disk read and write is very time-consuming. This is the main source of spark speed.

Another acceleration, from the JVM reuse. Consider a task of a tens of thousands of tasks, if executed with MapReduce, each task will start the JVM once, and each time the JVM startup time may be a few seconds to more than 10 seconds, and a short task of the calculation itself may be a few seconds to more than 10 seconds, when the Mr Hive task started to complete, The task of Spark has already been counted over. For short task-many scenarios, this is a great savings.

NetEase Video Cloud Technology share: A lifetime of Sparksql's work

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.