"Spark" 9. Spark Application Performance Optimization |12 optimization method __spark

1. Optimization? Why? How? When? What?

"Spark applications also need to be optimized. "Many people may have this question," not already have code generators, executive optimizer, pipeline or something. ”。 Yes, Spark does have some powerful built-in tools to make your code faster when it executes. But if everything depends on the tools, framework to do, I think that can only illustrate two questions: you are only aware of the framework, but not the reason why; it seems you are only divert, without you, others can easily write such a spark application, so you are replaceable;

When optimizing a Spark application, you can start from the following points.

2. repartition and coalesce

Original

Spark provides the 'repartition()' function, which shuffles the data across the network to create a new set of partitions. Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of 'repartition()' called 'coalesce()' that allows avoiding data movement, but only if you are decreasing the number of RDD partitions. To know whether you can safely call coalesce(), you can check the size of the RDD using 'rdd.partitions.size()' in Java/Scala and 'rdd.getNumPartitions()' in Python and make sure that you are coalescing it to fewer partitions than it currently has.
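As a quick illustration of the point above, here is a minimal PySpark sketch; it assumes an existing SparkContext 'sc', and the input path and partition counts are made up for the example.

rdd = sc.textFile("hdfs:///some/large/input")    # hypothetical input path

print(rdd.getNumPartitions())    # check the current partition count first

smaller = rdd.coalesce(10)       # shrinking the count: coalesce avoids a full shuffle
larger = rdd.repartition(200)    # growing or rebalancing: repartition shuffles all the data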

Summary: when you want to change the number of partitions of an RDD and the target partition count is smaller than the current one, use coalesce rather than repartition. For more optimization details on partitioning, refer to Chapter 4 of Learning Spark.

3. Passing functions to Spark

In Python, we have three options for passing functions into Spark.

Lambda expressions

word = rdd.filter(lambda s: "error" in s)
Top-level functions
import my_personal_lib
word = rdd.filter(my_personal_lib.containsError)
Locally defined functions
def containsError(s):
    return "error" in s
word = rdd.filter(containsError)

One issue to watch out for when passing functions is inadvertently serializing the object containing the function. When you pass a function that is the member of an object, or that contains references to fields of an object (e.g., self.field), Spark sends the entire object to the worker nodes, which can be much larger than the bit of information you actually need. Sometimes this can also cause your program to fail, if your class contains objects that Python can't figure out how to pickle.

### wrong way
class SearchFunctions(object):
  def __init__(self, query):
      self.query = query
  def isMatch(self, s):
      return self.query in s
  def getMatchesFunctionReference(self, rdd):
      # Problem: references all of "self" in "self.isMatch"
      return rdd.filter(self.isMatch)
  def getMatchesMemberReference(self, rdd):
      # Problem: references all of "self" in "self.query"
      return rdd.filter(lambda x: self.query in x)
### the right way
class WordFunctions(object):
  ...
  def getMatchesNoReference(self, rdd):
      # Safe: extract only the field we need into a local variable
      query = self.query
      return rdd.filter(lambda x: query in x)
4. Worker resource allocation: CPU, memory, executors

This topic is quite deep, and it works differently under different deployment modes [standalone, YARN, Mesos]. The one guiding principle is not to assume that all of a machine's resources are available exclusively to Spark: you also have to take into account the machine's own processes, the processes Spark depends on, network conditions, the nature of the tasks [compute-intensive, IO-intensive, long-lived tasks], and so on.
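As a purely illustrative sketch of where these knobs live (the values are made up and depend entirely on your cluster; 'spark.executor.instances' applies when running on YARN), executor resources can be set through SparkConf before the context is created:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("resource-allocation-sketch")
        .set("spark.executor.memory", "4g")       # memory per executor; illustrative value
        .set("spark.executor.cores", "2")         # cores per executor; illustrative value
        .set("spark.executor.instances", "10"))   # number of executors on YARN; illustrative value
sc = SparkContext(conf=conf)

The same settings can also be supplied on the spark-submit command line instead of in code.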

Here I can only recommend some videos, slides, and blogs; each situation needs its own analysis. When I run into resource-tuning cases later, I will write up the actual examples.

Top 5 Mistakes when writing Spark applications

5. Shuffle block size limitation

No Spark shuffle block can be greater than 2 GB.

Spark uses a data structure called ByteBuffer as a buffer for shuffle data, and a ByteBuffer can hold at most 2 GB, so the shuffle can fail once the shuffle data exceeds 2 GB. The factors that commonly affect the size of the shuffle data are the following:

The number of partitions: the more partitions there are, the less data each partition holds, and the less likely the shuffle data is to become too large;

Uneven data distribution: typically after a groupByKey, some keys hold far more data than others, so the partitions for those keys become very large and may later trigger shuffle blocks greater than 2 GB;

The general solution is to increase the number of partitions. The Top 5 Mistakes when writing Spark applications talk suggests aiming for roughly 128 MB of data per partition; that is only a reference point, and each case still needs its own analysis; there is a guiding principle here rather than a precise rule. The partition count can be raised by specifying a larger number of partitions in sc.textFile, raising spark.sql.shuffle.partitions, or calling rdd.repartition / rdd.coalesce.
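A small sketch of those knobs (it assumes an existing SparkContext 'sc' and SQLContext 'sqlContext'; the counts and the path are only examples):

# Ask for more partitions when reading the file (minPartitions is a lower bound).
rdd = sc.textFile("hdfs:///some/large/input", minPartitions=1000)   # hypothetical path

# Raise the shuffle partition count used by Spark SQL.
sqlContext.setConf("spark.sql.shuffle.partitions", "1000")

# Repartition an existing RDD (full shuffle), or coalesce it to shrink without a full shuffle.
rdd = rdd.repartition(2000)
rdd = rdd.coalesce(500)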

TIPS:

When the partition count is below 2000 versus above 2000, Spark uses different data structures to record the shuffle metadata; when the partition count is greater than 2000, a more efficient [compressed] data structure is used. So if your partition count is not quite 2000 but very close to it, you can safely set it to slightly more than 2000.

def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
6. Level of parallelism: partitions

Let's take a look at the performance metrics for all the tasks that ran in a stage; some of them are described below:

Scheduler Delay: the time Spark spends assigning the task

Executor Computing Time: the time the executor spends executing the task

Getting Result Time: the time spent fetching the task's execution result

Result Serialization Time: the time spent serializing the task's result

Task Deserialization Time: the time spent deserializing the task

Shuffle Write Time: the time spent writing shuffle data

Shuffle Read Time: the time spent reading shuffle data

The level of parallelism, in most cases, refers to the number of partitions. Changing the partition count changes the metrics above, so when tuning we often watch how those metrics move. When the partition count changes, the metrics behave roughly as follows:

Partition count too small [easily introduces data skew problems]

Scheduler Delay: no significant change
Executor Computing Time: unstable, some large and some small, but relatively large on average
Getting Result Time: unstable, some large and some small, but relatively large on average
Result Serialization Time: unstable, some large and some small, but relatively large on average
Task Deserialization Time: unstable, some large and some small, but relatively large on average
Shuffle Write Time: unstable, some large and some small, but relatively large on average
Shuffle Read Time: unstable, some large and some small, but relatively large on average

Partition count too large

Scheduler Delay: no significant change
Executor Computing Time: relatively stable, relatively small on average
Getting Result Time: relatively stable, relatively small on average
Result Serialization Time: relatively stable, relatively small on average
Task Deserialization Time: relatively stable, relatively small on average
Shuffle Write Time: relatively stable, relatively small on average
Shuffle Read Time: relatively stable, relatively small on average

So how should you set the number of partitions? There is no precise formula or specification here either; you usually arrive at a reasonably good value after a few attempts. The goal is: try not to introduce data skew, and try to keep each task's execution time within a narrow range of variation.

7. Data skew

Most of the time, the benefit we expect from distributed computing should look like the following:

Sometimes, however, we get the following effect instead, which is called data skew: the data is not evenly distributed across the cluster, so the execution time of the whole job depends on the time it takes to process the largest block of data. Data skew is a big problem in many distributed systems. For example, in a distributed cache with 10 machines, if 50% of the data lands on one of them, then when that machine goes down all of its cached data is lost and the cache hit rate drops by at least [certainly more than] 50%. This is also why many distributed caches introduce consistent hashing and virtual nodes (vnodes).

Consistent hash diagram:

Back to the point: how do we solve the data skew problem in Spark? First, be clear about the scenario and the root cause. Generally the data consists of (key, value) pairs and the keys are unevenly distributed. A common remedy in this scenario is to "salt" the keys. For example, suppose there are 2 keys (key1, key2), key1's data set is very large, and key2's data set is relatively small. You can expand the keys into multiple keys (key1-1, key1-2, ..., key1-n, key2-1), making sure that the data under the key1-* keys is the original key1 data spread across partitions, and the data under the key2-* keys is the original key2 data. After this we have m+n keys, each with a relatively small data set, so the degree of parallelism increases, the data handled by each parallel task is similar in size, and parallel processing speeds up considerably. This approach is mentioned in both of the two shares listed below; a short sketch comes first.
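A minimal PySpark sketch of the salting idea (the salt count, the pair RDD 'kv_rdd', and the sum-style aggregation are all assumptions made for this example, and it assumes the original keys contain no '-'):

import random

N_SALTS = 10   # how many sub-keys to spread a hot key over; illustrative

def add_salt(kv):
    key, value = kv
    # Turn "key1" into "key1-0" .. "key1-9" so its records spread across more partitions.
    return ("%s-%d" % (key, random.randint(0, N_SALTS - 1)), value)

# Aggregate on the salted keys first, then strip the salt and aggregate again.
salted = kv_rdd.map(add_salt).reduceByKey(lambda a, b: a + b)
result = (salted
          .map(lambda kv: (kv[0].rsplit("-", 1)[0], kv[1]))
          .reduceByKey(lambda a, b: a + b))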

Top 5 Mistakes when writing Spark applications

Sparkling: Speculative Partition of Data for Spark Applications - Peilong Li

8. Avoid Cartesian operations

The rdd.cartesian operation is time-consuming, especially when the data sets are large: the size of the Cartesian product grows quadratically, so it is costly in both time and space.

>>> rdd = sc.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]
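When one of the two data sets is small enough to fit in memory, a common way to avoid the full cartesian is to broadcast the small side and pair it up inside a map; this is only a sketch under that assumption:

# Broadcast the small collection and build the pairs locally on each executor.
small_bc = sc.broadcast([1, 2])
pairs = rdd.flatMap(lambda x: [(x, y) for y in small_bc.value])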
9. Avoid shuffle when possible

The shuffle in Spark by default writes the output of the previous stage to disk, and the next stage then reads that data back from disk. The disk IO here can have a big impact on performance, especially when the data is large.

10. Use reduceByKey instead of groupByKey when possible
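A small word-count-style sketch of the difference (it assumes an existing pair RDD 'pairs' of (word, 1) tuples):

# groupByKey ships every single value across the network before anything is combined.
counts_slow = pairs.groupByKey().mapValues(lambda vs: sum(vs))

# reduceByKey combines values on the map side first, so far less data is shuffled.
counts_fast = pairs.reduceByKey(lambda a, b: a + b)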

11. Use treeReduce instead of reduce when possible
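A sketch of the difference (it assumes a numeric RDD 'nums'; depth=2 is treeReduce's default):

# reduce sends every partition's partial result straight to the driver in one step;
# treeReduce first combines them in a multi-level tree on the executors,
# which helps when there are many partitions or the partial results are large.
total = nums.treeReduce(lambda a, b: a + b, depth=2)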

12. Use Kryo serializer

In Spark applications, when an RDD is shuffled or cached, the data has to be serialized before it is stored, so besides IO, data serialization can also become an application bottleneck. It is recommended to use the Kryo serialization library to keep serialization efficient.

sc_conf = SparkConf()
sc_conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Reference Articles

Chapter 4 of Learning Spark

Chapter 8 of Learning Spark

Top 5 Mistakes when writing Spark applications

Databricks Spark Knowledge Base

Sparkling: Speculative Partition of Data for Spark Applications - Peilong Li

Fighting the skew in Spark

Tuning and Debugging Apache Spark

Tuning Spark

Avoid GroupByKey
