"Reprint" Apache Spark Jobs Performance Tuning (i)

Source: Internet
Author: User
Tags: new, set, shuffle

When you start writing Apache Spark code or browsing its public APIs, you encounter a variety of terminology such as transformation, action, and RDD. Understanding these concepts is the basis for writing Spark code. Similarly, when your tasks start to fail, or when you dig into the web UI to understand why your application takes so long, you need to learn some new terms: job, stage, and task. Understanding them is essential for writing good Spark code, where "good" mainly means fast. An understanding of Spark's underlying execution model is what makes it possible to write more efficient Spark programs.


How Spark Executes Your Program


A Spark application consists of a driver process and several executor processes that are distributed across the nodes of the cluster.


The driver is primarily responsible for scheduling the high-level flow of work. The executors are responsible for carrying out that work, in the form of tasks, as well as for storing any data the user asks to cache. Both the driver and the executors typically live for the entire lifetime of the application (this may not be the case if dynamic resource allocation is used). How these processes are scheduled is handled by the cluster manager (such as YARN, Mesos, or Spark Standalone), but every Spark program contains one driver and multiple executor processes.



At the top of the execution hierarchy are jobs. Invoking an action inside a Spark application triggers the launch of a Spark job to fulfill it. To decide what this job looks like, Spark examines the graph of RDDs (the DAG) that the action depends on and formulates an execution plan. The plan starts with the farthest-back RDDs (those with external dependencies only, or whose data is already cached) and culminates in the final RDD required to produce the action's result.

The execution plan consists of assembling the job's transformations into stages. A stage corresponds to a collection of tasks that all execute the same code, each on a different subset of the data. Each stage contains a sequence of transformations that can be completed without shuffling data.


What determines whether data needs to be shuffled? An RDD comprises a fixed number of partitions, each of which contains a number of records. For RDDs returned by so-called narrow transformations such as map and filter, the records needed to compute a single partition come from a single partition in the parent RDD: each object depends on only one object in the parent. Some operations, such as coalesce, may cause a task to process multiple input partitions, but the transformation is still considered narrow because the input records used to compute any single output partition always come from a limited set of parent partitions.
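As an illustration (not from the original article), here is a minimal sketch of a chain of narrow transformations; it assumes a SparkContext named sc and uses made-up data:

// All of these transformations are narrow: each output partition is computed
// from a bounded set of parent partitions, so no shuffle is needed.
val numbers = sc.parallelize(1 to 1000000, 100)   // 100 input partitions
val result = numbers
  .map(_ * 2)                                     // one-to-one with its parent partition
  .filter(_ % 3 == 0)                             // still one-to-one
  .coalesce(10)                                   // narrow: merges partitions without a shuffle
result.count()                                    // the whole chain runs as a single stage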


However, Spark also supports transformations with wide dependencies, such as groupByKey and reduceByKey. In these dependencies, the data required to compute the records in a single partition may reside in many partitions of the parent RDD. All of the tuples with the same key must end up in the same partition and be processed by the same task. To satisfy this, Spark must perform a shuffle, which transfers data around the cluster and results in a new stage with a new set of partitions.


For example, consider the following code. It executes a single action, which depends on a sequence of RDDs derived from a text file. The code runs in a single stage, because none of its operations needs to read data from partitions other than the ones its inputs live in.

 
sc.textFile("someFile.txt").map(mapFunc).flatMap(flatMapFunc).filter(filterFunc).count()

  

Unlike the code above, the following code, which counts how many times each character appears in all of the words that occur more than 1,000 times in a text file, cannot run in a single stage.

val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))
val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
val filtered = wordCounts.filter(_._2 >= 1000)
val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
charCounts.collect()

  


This code breaks into three stages. The reduceByKey operations mark the stage boundaries, because computing their output requires repartitioning the data by key.


There are also more complex transformation graphs, such as one containing a join transformation with multiple dependencies (shown as a diagram in the original article).

The pink boxes in the original article's diagram show the stage graph used at runtime. At the boundary of each stage, the tasks in the parent stage write data to disk, and the tasks in the child stage then read that data over the network. Because they incur heavy disk and network I/O, stage boundaries are expensive and should be avoided where possible when writing Spark programs. The number of partitions in the parent stage may differ from the number in the child stage, which is why the transformations that trigger a stage boundary typically accept a numPartitions argument that determines how many partitions the data in the child stage will be split into.
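As a hedged illustration of that numPartitions argument (the input path and the value 200 below are made up, not from the article), a shuffle-producing transformation such as reduceByKey can be told how many partitions its child stage should have:

// Hypothetical word count; the path and the partition count are illustrative.
val pairs = sc.textFile("hdfs:///some/input")   // hypothetical path
  .flatMap(_.split(" "))
  .map((_, 1))

// The second argument to reduceByKey sets the number of partitions (and
// therefore tasks) on the child side of the stage boundary.
val counts = pairs.reduceByKey(_ + _, 200)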


Just as the number of reducers is an important parameter to tune in MapReduce, adjusting the number of partitions at stage boundaries can often make or break an application's performance. We will dig into how to tune these numbers in a later section.

Choose the Right Operator
When you need to accomplish something with Spark, you can usually choose from many different arrangements of actions and transformations that produce the same result. However, not all of these arrangements perform equally well: the efficiency of different approaches can differ enormously. Avoiding common pitfalls and picking the right arrangement can make a huge difference in an application's performance. A few rules and a bit of insight will help you make better choices.


The recently stabilized SchemaRDD (SPARK-5097), which becomes DataFrame starting in Spark 1.3, opens up Spark's Catalyst optimizer to programmers using Spark's core APIs, allowing Spark to make more advanced choices about which operators to use. Once SchemaRDD is stable, some of these decisions will no longer need to be made by the user.


The primary goal when choosing an arrangement of operators is to reduce the number of shuffles and the amount of data shuffled. Shuffles are the most resource-intensive operations: shuffled data must be written to disk and transferred over the network. repartition, join, cogroup, and any of the *By or *ByKey transformations all result in shuffles. Not all of these operators are equal, however, and a few common performance pitfalls are worth watching out for.

    • Avoid groupByKey when performing an associative reduction. For example, rdd.groupByKey().mapValues(_.sum) produces the same result as rdd.reduceByKey(_ + _), but the former transfers the entire dataset across the network, while the latter computes local sums for each key in each partition and combines those partial sums after the shuffle.
    • Avoid reduceByKey when the input and output value types are different. For example, consider finding all of the unique strings corresponding to each key. One approach is to use map to turn each element into a Set and then merge the Sets with reduceByKey:
rdd.map(kv => (kv._1, Set(kv._2)))
   .reduceByKey(_ ++ _)

  


This code results in a huge amount of unnecessary object creation, because a new Set must be allocated for every record. aggregateByKey is better suited here, because it performs the aggregation on the map side:
val zero = new collection.mutable.HashSet[String]()
rdd.aggregateByKey(zero)(
  (set, v) => set += v,
  (set1, set2) => set1 ++= set2)

  

    • Avoid the flatMap-join-groupBy pattern. When two datasets are already grouped by key and you want to join them while keeping them grouped, use cogroup instead. This avoids the overhead of unpacking and repacking the groups; a sketch follows below.
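A minimal sketch of the cogroup alternative mentioned in the last bullet (the RDD names and data are made up):

// Two keyed datasets, both conceptually grouped by the same key.
val ordersByUser = sc.parallelize(Seq(("alice", "order-1"), ("bob", "order-2"), ("alice", "order-3")))
val clicksByUser = sc.parallelize(Seq(("alice", "page-a"), ("bob", "page-b")))

// cogroup yields, for each key, the grouped values from both RDDs at once,
// keeping the grouping intact instead of flattening, joining, and regrouping.
val grouped = ordersByUser.cogroup(clicksByUser)
// grouped: RDD[(String, (Iterable[String], Iterable[String]))]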
When Shuffles Don't Happen
It is also useful to know the cases in which the transformations above will not result in a shuffle. Spark knows to avoid a shuffle when a previous transformation has already partitioned the data with the same partitioner. Consider the following code:
rdd1 = someRdd.reduceByKey(...)
rdd2 = someOtherRdd.reduceByKey(...)
rdd3 = rdd1.join(rdd2)

  



Because no partitioner is passed to reduceByKey, the default partitioner is used, so rdd1 and rdd2 are both hash-partitioned. The two reduceByKey calls result in two shuffles. If the RDDs have the same number of partitions, the join requires no additional shuffle. Because the RDDs are partitioned identically, the set of keys found in any single partition of rdd1 can only appear in the corresponding partition of rdd2. Therefore, the contents of any single output partition of rdd3 depend only on a single partition of rdd1 and a single partition of rdd2, and a third shuffle is not required.


For example, if someRdd has four partitions, someOtherRdd has two partitions, and both reduceByKey calls use three partitions, the set of tasks that execute looks like the diagram in the original article. If rdd1 and rdd2 use different partitioners, or use the same partitioner with a different number of partitions, then only one of the RDDs (the one with the smaller number of partitions) needs to be reshuffled for the join.
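One way to make the co-partitioning explicit, sketched here under the assumption that someRdd and someOtherRdd are keyed RDDs with numeric values as in the snippet above, is to pass the same partitioner to both reduceByKey calls:

import org.apache.spark.HashPartitioner

// Using one shared partitioner keeps both RDDs partitioned identically.
val partitioner = new HashPartitioner(3)

val rdd1 = someRdd.reduceByKey(partitioner, _ + _)        // assumes numeric values
val rdd2 = someOtherRdd.reduceByKey(partitioner, _ + _)

// rdd1 and rdd2 now share a partitioner and a partition count,
// so this join needs no additional shuffle.
val rdd3 = rdd1.join(rdd2)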


(Figure from the original article: same transformations, same inputs, different number of partitions.)

When joining two datasets, a shuffle can also be avoided entirely by using broadcast variables. If one of the datasets is small enough to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor. A map transformation can then reference the hash table to do lookups.
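Here is a minimal sketch of that broadcast-variable technique (smallRdd and largeRdd are hypothetical keyed RDDs, not names from the article):

// Pull the small dataset to the driver as a hash map ...
val smallLookup = smallRdd.collectAsMap()
// ... and ship it once to every executor.
val broadcastLookup = sc.broadcast(smallLookup)

// A map-side "join": look each key up in the broadcast table instead of shuffling.
val joined = largeRdd.flatMap { case (key, value) =>
  broadcastLookup.value.get(key).map(small => (key, (value, small)))
}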
When More Shuffles Are Better
There is an occasional exception to the rule of minimizing shuffles: an extra shuffle can improve performance when it increases parallelism. For example, if your data arrives in a few large, unsplittable files, the partitioning dictated by the InputFormat might place a huge number of records in each partition while producing too few partitions to keep all the available cores busy. In this case, invoking repartition after loading the data (which triggers a shuffle) increases the number of partitions so that the cluster's CPUs can be fully utilized.
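A short sketch of that fix (the path and the target of 500 partitions are made-up values):

// A single large gzipped file is unsplittable, so it loads as very few partitions.
val raw = sc.textFile("hdfs:///data/huge-file.gz")   // hypothetical path
println(raw.partitions.length)                       // likely far fewer than the cluster's cores

// repartition triggers a shuffle, but afterwards there is enough parallelism
// to keep every core busy.
val spread = raw.repartition(500)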


Another exception arises when using reduce or aggregate actions to aggregate data into the driver. When aggregating over a large number of partitions, the computation can quickly become bottlenecked on the single driver thread that merges all the partition results together. To lighten the load on the driver, first use reduceByKey or aggregateByKey to carry out a round of distributed aggregation that divides the data into a smaller number of partitions. The values within each partition are merged in parallel, and the merged results are then sent to the driver for a final round of aggregation. See treeReduce and treeAggregate for examples of how to do this.
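For instance, here is a hedged sketch of treeAggregate performing a multi-level merge so the driver only has to combine a handful of partial results (the data is illustrative):

val nums = sc.parallelize(1L to 10000000L, 1000)

// Partition results are first combined in a tree of intermediate aggregations
// (depth = 2 here); the driver only merges the final, much smaller set.
val total = nums.treeAggregate(0L)(
  (acc, x) => acc + x,   // fold a value into a partition-local accumulator
  (a, b) => a + b,       // merge two accumulators
  depth = 2)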


This technique is especially useful when the aggregation is already grouped by a key, for example when an application needs to count the occurrences of each word in a corpus and pull the results into the driver as a map. One approach, using the aggregate action, is to compute a local map at each partition and then merge the maps at the driver. The alternative, using aggregateByKey, is to perform the count in a fully distributed way and then simply collectAsMap the results to the driver.
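A sketch of the second, fully distributed approach (the corpus path is hypothetical):

val words = sc.textFile("hdfs:///corpus")   // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1L))

// The counting happens across the cluster ...
val counts = words.aggregateByKey(0L)(_ + _, _ + _)

// ... and only the compact per-word totals travel to the driver.
val countsOnDriver = counts.collectAsMap()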


Secondary Sort
Another important capability to be aware of is the repartitionAndSortWithinPartitions transformation. It sounds obscure, but it covers all sorts of odd sorting scenarios: it pushes the sort down into the shuffle machinery, where large amounts of data can be spilled efficiently and sorting can be combined with other operations.


For example, Apache Hive on Spark uses this transformation in its join implementation. It also acts as a vital building block in the secondary sort pattern, in which you want to group records by key and then, when iterating over the values that correspond to a key, have them appear in a particular order. repartitionAndSortWithinPartitions, plus a bit of extra work on the user's side, makes secondary sort possible.
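As a hedged sketch of that extra work (none of this code is from the article; the key layout and partitioner are illustrative), the usual recipe is to build a composite key, partition by its first component only, and let the shuffle sort by the whole key:

import org.apache.spark.Partitioner

// Partition by the user only, so all of a user's records land in one partition.
class UserPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (user: String, _) => (user.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

// Hypothetical records keyed by (user, timestamp).
val events = sc.parallelize(Seq(
  (("alice", 3L), "click"), (("alice", 1L), "login"),
  (("bob",   2L), "click"), (("alice", 2L), "view")))

// The sort happens inside the shuffle; the default tuple ordering sorts by
// user first and timestamp second, so each user's events come out time-ordered.
val sorted = events.repartitionAndSortWithinPartitions(new UserPartitioner(4))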

Conclusion
You should now have a good understanding of the basic elements involved in putting together an efficient Spark program. In Part 2, we will cover tuning resource requests, parallelism, and data structures. (Reprinted from: http://blog.csdn.net/u012102306/article/details/51700491)

"Reprint" Apache Spark Jobs Performance Tuning (i)
