1. Optimization? Why? How? When? What?
"Spark applications also need to be optimized. "Many people may have this question," not already have code generators, executive optimizer, pipeline or something. ”。 Yes, Spark does have some powerful built-in tools to make your code faster when it executes. But if everything depends on the tools, framework to do, I think that can only illustrate two questions: you are only aware of the framework, but not the reason why; it seems you are only divert, without you, others can easily write such a spark application, so you are replaceable;
When optimizing a Spark application, it is usually enough to start from the following points:
2. repartition and coalesce
Original
Spark provides the `repartition()` function, which shuffles the data across the network to create a new set of partitions. Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of `repartition()` called `coalesce()` that allows avoiding data movement, but only if you are decreasing the number of RDD partitions. To know whether you can safely call coalesce(), you can check the size of the RDD using `rdd.partitions.size()` in Java/Scala and `rdd.getNumPartitions()` in Python and make sure that you are coalescing it to fewer partitions than it currently has.
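For illustration, here is a minimal PySpark sketch of the difference; the partition counts are arbitrary example values, not recommendations:

from pyspark import SparkContext

sc = SparkContext(appName="partition-demo")
rdd = sc.parallelize(range(1000), 100)   # start with 100 partitions
print(rdd.getNumPartitions())            # 100

smaller = rdd.coalesce(10)      # decreasing partitions: coalesce avoids a full shuffle
larger = rdd.repartition(200)   # increasing partitions: repartition triggers a full shuffle
print(smaller.getNumPartitions(), larger.getNumPartitions())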
Summary: when you want to reduce the number of partitions of an RDD and the target number is smaller than the current one, use coalesce rather than repartition. For more optimization details on partitioning, refer to Chapter 4 of Learning Spark.
3. Passing functions to Spark
In Python, we have three options for passing functions into Spark.
Lambda expressions
word = rdd.filter(lambda s: "error" in s)
Top-level functions
import my_personal_lib
word = rdd.filter(my_personal_lib.containsError)
Locally defined functions
def containsError(s):
    return "error" in s
word = rdd.filter(containsError)
One issue to watch out for when passing functions is inadvertently serializing the object containing the function. When you pass a function that is a member of an object, or that contains references to fields of an object (e.g., self.field), Spark sends the entire object to the worker nodes, which can be much larger than the bit of information you actually need. Sometimes this can also cause your program to fail, if your class contains objects that Python can't figure out how to pickle.
### wrong way
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def isMatch(self, s):
        return self.query in s
    def getMatchesFunctionReference(self, rdd):
        # Problem: references all of "self" in "self.isMatch"
        return rdd.filter(self.isMatch)
    def getMatchesMemberReference(self, rdd):
        # Problem: references all of "self" in "self.query"
        return rdd.filter(lambda x: self.query in x)
### the right way
class WordFunctions(object):
    ...
    def getMatchesNoReference(self, rdd):
        # Safe: extract only the field we need into a local variable
        query = self.query
        return rdd.filter(lambda x: query in x)
4. Worker resource allocation: CPU, memory, executors
This topic is quite deep, and it differs across deployment modes [standalone, YARN, Mesos]. The one guiding principle is not to assume that all of a machine's resources are available exclusively to Spark: you also have to account for the machine's own processes, the processes Spark depends on, network conditions, the nature of the tasks [compute-intensive, IO-intensive, long-lived tasks], and so on.
Here I can only recommend some videos, slides, and blog posts; each situation needs its own analysis. I will share real resource-tuning cases later when I run into them in practice.
Top 5 Mistakes when writing Spark applications
5. Shuffle block size limitation
No Spark shuffle block can be greater than 2 GB.
Spark uses a data structure called ByteBuffer as a buffer for shuffle data, and a ByteBuffer can hold at most 2 GB, so the shuffle can go wrong once a single shuffle block exceeds 2 GB. The most common factors that affect the size of the shuffle data are the following:
The number of partitions: the more partitions there are, the less data each partition holds, and the less likely shuffle blocks are to become too large;
Uneven data distribution: typically after a groupByKey, some keys hold far more data than others, so the partitions containing those keys become very large, which may later produce shuffle blocks greater than 2 GB;
The general solution to this kind of problem is to increase the number of partitions. Top 5 Mistakes when writing Spark applications suggests targeting roughly 128 MB of data per partition, but this is only a reference value; each case still needs its own analysis, so treat it as a principle rather than a hard specification. The partition count can be raised in several ways: pass a larger partition number to sc.textFile, set spark.sql.shuffle.partitions, or use rdd.repartition / rdd.coalesce.
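As a rough sketch of these knobs (the numbers and the input path below are made-up examples, not recommendations):

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.sql.shuffle.partitions", "400")  # shuffle partitions for Spark SQL
sc = SparkContext(conf=conf)

# Ask for more partitions when reading the input (the path is a placeholder)
lines = sc.textFile("hdfs:///data/events.log", minPartitions=400)

# Or adjust an existing RDD
more = lines.repartition(800)   # full shuffle, can increase the partition count
fewer = more.coalesce(200)      # no full shuffle, only decreases the partition count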
TIPS:
Spark uses different data structures to record shuffle metadata depending on whether the number of partitions is below or above 2000; above 2000 it switches to a more efficient [compressed] data structure. So if your partition count is not quite 2000 but very close to it, you can safely set it to something above 2000.
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > 2000) {
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
6. Level of parallelism: partitions
Let's take a look at the performance metrics of the tasks that run within a stage; some of them are described below:
Scheduler Delay: the time Spark takes to assign the task
Executor Computing Time: the time the executor spends executing the task
Getting Result Time: the time spent fetching the task's execution result
Result Serialization Time: the time spent serializing the task's execution result
Task Deserialization Time: the time spent deserializing the task
Shuffle Write Time: the time spent writing shuffle data
Shuffle Read Time: the time spent reading shuffle data
The level of parallelism, in most cases, simply means the number of partitions, and changing the number of partitions changes the metrics above. When tuning, we often watch how these metrics change. As the number of partitions changes, they roughly behave as follows:
Too few partitions [easy to introduce data skew problems]
Scheduler Delay: no significant change
Executor Computing Time: unstable, sometimes large and sometimes small, but relatively large on average
Getting Result Time: unstable, sometimes large and sometimes small, but relatively large on average
Result Serialization Time: unstable, sometimes large and sometimes small, but relatively large on average
Task Deserialization Time: unstable, sometimes large and sometimes small, but relatively large on average
Shuffle Write Time: unstable, sometimes large and sometimes small, but relatively large on average
Shuffle Read Time: unstable, sometimes large and sometimes small, but relatively large on average
Too many partitions
Scheduler Delay: no significant change
Executor Computing Time: relatively stable, relatively small on average
Getting Result Time: relatively stable, relatively small on average
Result Serialization Time: relatively stable, relatively small on average
Task Deserialization Time: relatively stable, relatively small on average
Shuffle Write Time: relatively stable, relatively small on average
Shuffle Read Time: relatively stable, relatively small on average
So how do you set the number of partitions? Again, there is no exact formula or specification; you usually arrive at a reasonably good value after a few attempts. The goal is to avoid data skew as much as possible and to keep the execution time of each task within a narrow range.
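As a quick sanity check on a candidate partition count, one can count the records in each partition; a minimal sketch, assuming an existing RDD named rdd:

per_partition = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()  # one count per partition
print(per_partition)  # very uneven counts suggest skew or a poor choice of partition count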
7. Data Skew
Most of the time, we want the benefit of distributed computing to look like the following:
However, sometimes the effect is instead like the following, which is called data skew: the data is not evenly distributed across the cluster, so the execution time of the whole job depends on the time needed to process the largest chunk of data. Data skew is a big problem in many distributed systems. For example, in a distributed cache with 10 machines, if 50% of the data lands on one of them, then when that machine goes down the whole cached dataset on it is lost, and the cache hit rate drops by at least [certainly more than] 50%. This is also why many distributed caches introduce consistent hashing and virtual nodes (vnodes).
Consistent hash diagram:
Back to the point: how do we solve the data skew problem in Spark? First, be clear about where the problem comes from: typically we have (key, value) data whose keys are unevenly distributed. A common remedy in this scenario is to salt the keys. For example, suppose there are two keys (key1, key2), where key1 maps to a very large dataset and key2 to a relatively small one. We can expand the keys into several keys (key1-1, key1-2, ..., key1-n, key2-1), ensuring that the data under key1-* comes from the original key1 dataset and the data under key2-* comes from the original key2 dataset. After this we have m+n keys, each with a relatively small dataset, so the degree of parallelism increases, each parallel task processes a dataset of similar size, and overall parallel processing speeds up considerably. This approach is mentioned in both of the following two shares, and a minimal sketch of the salting idea is given after them:
Top 5 Mistakes when writing Spark applications
Sparkling: Speculative Partition of Data for Spark Applications - Peilong Li
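As an illustration only, here is a minimal sketch of the salting idea for a sum-style aggregation over an RDD of (string key, number) pairs named rdd; it salts every key uniformly rather than only the hot keys, which is a simplification of the approach described above:

import random

N = 10  # expansion factor for the salt; illustrative value

# Stage 1: append a random salt to each key and aggregate on the salted keys
salted = rdd.map(lambda kv: ("%s-%d" % (kv[0], random.randint(0, N - 1)), kv[1]))
partial = salted.reduceByKey(lambda a, b: a + b)

# Stage 2: strip the salt and combine the partial results per original key
result = (partial
          .map(lambda kv: (kv[0].rsplit("-", 1)[0], kv[1]))
          .reduceByKey(lambda a, b: a + b))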
8. Avoid Cartesian operations
The rdd.cartesian operation is time-consuming, especially when the dataset is large: the Cartesian product grows quadratically with the input size, so it is expensive in both time and space.
>>> rdd = sc.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]
9. Avoid shuffle when possible
By default, a shuffle in Spark writes the previous stage's data to disk, and the next stage then reads that data back from disk. The disk IO involved can have a big impact on performance, especially when the data is large.
Use reduceByKey instead of groupByKey when possible
Use treeReduce instead of reduce when possible
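A small sketch of these two points, assuming an RDD of (key, number) pairs named pairs and an RDD of numbers named nums:

# groupByKey ships every value across the network before summing
sums_slow = pairs.groupByKey().mapValues(lambda vals: sum(vals))

# reduceByKey combines values within each partition first, shuffling far less data
sums_fast = pairs.reduceByKey(lambda a, b: a + b)

# treeReduce aggregates in several levels instead of pulling all partial results to the driver at once
total = nums.treeReduce(lambda a, b: a + b)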
Use Kryo Serializer
In Spark applications, data needs to be serialized when RDDs are shuffled or cached, and besides IO, serialization itself can become an application bottleneck. Using the Kryo serialization library is recommended to keep serialization efficient.
from pyspark import SparkConf, SparkContext

sc_conf = SparkConf()
sc_conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Reference Articles
Chapter 4 of Learning Spark
Chapter 8 of Learning Spark
Top 5 Mistakes when writing Spark applications
Databricks Spark Knowledge Base
Sparkling: Speculative Partition of Data for Spark Applications - Peilong Li
Fighting the skew in Spark
Tuning and Debugging Apache Spark
Tuning Spark: Avoid GroupByKey