Spark's solutions to OOM problems and an optimization summary

Source: Internet
Author: User
Tags: shuffle

OOM problems in Spark arise in the following two scenarios:
    • Memory overflow during map execution
    • Memory overflow after a shuffle
Memory overflow during map execution covers all map-type operations, including flatMap, filter, mapPartitions, and so on. Memory overflow after a shuffle covers shuffle operations such as join, reduceByKey, and repartition. Having summarized my understanding of the Spark memory model, I lay out below the solutions and performance optimizations for the various OOM scenarios. If you spot any mistakes, please point them out in the comments.
Spark memory model: The memory in a Spark executor is divided into three blocks: execution memory, storage memory, and other memory.
    • Execution memory is where computation runs. The documentation says that join and aggregate execute in this part of memory, and that shuffle data is buffered here first, spilling to disk only when full, which reduces IO. In practice the map process also executes in this memory.
    • Storage memory is where broadcast, cache, and persist data is stored.
    • Other memory is the memory the program reserves for itself during execution.
Execution and storage are the large memory regions in a Spark executor; other memory uses far less, so I will not discuss it. In versions before spark-1.6.0, the memory allocated to execution and storage was fixed, configured by spark.shuffle.memoryFraction (execution memory as a fraction of total executor memory, default 0.2) and spark.storage.memoryFraction (storage memory as a fraction of executor memory, default 0.6). Because these two regions were isolated from each other before 1.6.0, executor memory utilization was low, and users had to tune both parameters to their application to make good use of Spark memory.

In spark-1.6.0 and above, execution memory and storage memory can borrow from each other, which raises memory utilization in Spark and also reduces the incidence of OOM. Spark-1.6.0 also added off-heap memory, further improving Spark's memory usage: off-heap memory lives outside the JVM heap and is not garbage collected, which lowers the frequency of full GC, so long-lived, large objects in a Spark program can be stored off-heap. There are two ways to use off-heap memory. One is to pass StorageLevel.OFF_HEAP when calling persist on an RDD, which must be used together with Tachyon. The other is to set the Spark configuration spark.memory.offHeap.enabled to true; this is not supported in version 1.6.0, but the parameter is available in later versions.

OOM problems usually occur in execution memory, because when storage memory fills up, old data in memory is simply evicted; for performance reasons it never triggers OOM.
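As an illustration of the second approach, a minimal sketch (the app name and size are placeholders; spark.memory.offHeap.size is the companion parameter in later versions, which must be set to a positive value alongside the enable flag):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: spark.memory.offHeap.enabled takes effect in versions after
    // 1.6.0, as noted above; spark.memory.offHeap.size must also be positive
    // for any off-heap allocation to happen.
    val conf = new SparkConf()
      .setAppName("offheap-demo")                   // hypothetical app name
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "2g")       // illustrative size
    val sc = new SparkContext(conf)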
Memory overflow workarounds:
1. The map process produces a large number of objects, causing memory overflow: This overflow is caused by a single map producing a large number of objects, for example rdd.map(x => for (i <- 1 to 10000) yield i.toString), where every element of the RDD produces 10,000 objects. This is sure to cause a memory overflow. For this kind of problem, without increasing memory, you can reduce the size of each task so that the objects each task produces fit in executor memory. Concretely, call repartition before the map operation that produces the large number of objects, splitting the data into smaller chunks before the map, for example rdd.repartition(10000).map(x => for (i <- 1 to 10000) yield i.toString). Note that you cannot use rdd.coalesce for this: that method can only decrease the partition count, it cannot increase it, because it has no shuffle process.
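A minimal sketch of the fix, assuming rdd is an existing RDD[String]:

    // Each element yields 10,000 strings inside a single task; with large
    // partitions this easily overflows executor memory:
    val heavy = rdd.map(x => for (i <- 1 to 10000) yield i.toString)

    // Repartitioning first makes each task's partition smaller, so fewer
    // objects are alive within a task at any one time:
    val light = rdd.repartition(10000)
                   .map(x => for (i <- 1 to 10000) yield i.toString)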
2. Data imbalance causes memory overflow: Besides possibly causing memory overflow, data imbalance (skew) can also cause performance problems. The workaround is similar to the above: call repartition to repartition the data. I will not belabor this further.
3. A coalesce call causes memory overflow: This is a problem I ran into recently. Because HDFS does not handle small files well, if the files produced by a Spark computation are too small, we call coalesce to merge files before writing to HDFS. But this can cause a problem. Suppose there are 100 files before the coalesce, which also means 100 tasks are available; now call coalesce(10), and only 10 files are produced. Because coalesce is not a shuffle operation, it does not, as I had originally assumed, execute 100 tasks and then merge their results into 10 files; only 10 tasks execute from the start. Where the original 100 files were each processed separately, now each task reads 10 files at once, using 10 times the memory, which led to OOM. The solution is to make the program execute 100 tasks and then merge the results into 10 files, as intended. This can be done with repartition, i.e. repartition(10): because repartition has a shuffle process, the work before and after the shuffle forms two stages, one with 100 partitions and one with 10, so execution proceeds as intended.
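A sketch of the two variants (the paths are hypothetical):

    val data = sc.textFile("hdfs:///input")    // suppose this yields 100 partitions

    // coalesce(10) merges without a shuffle: only 10 tasks run from the start,
    // each reading ~10 files' worth of data, roughly 10x the memory per task.
    data.coalesce(10).saveAsTextFile("hdfs:///out-coalesce")

    // repartition(10) == coalesce(10, shuffle = true): the 100-partition stage
    // runs first, then a 10-partition stage writes the merged output.
    data.repartition(10).saveAsTextFile("hdfs:///out-repartition")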
4. Memory overflow after a shuffle: A shuffle memory overflow can generally be described as a single post-shuffle partition being too large. Operations such as join and reduceByKey in Spark involve a shuffle process, and a shuffle requires a partitioner. For most shuffle operations in Spark the default partitioner is HashPartitioner, whose default number of partitions is the maximum partition count among the parent RDDs, controlled by spark.default.parallelism (in Spark SQL, by spark.sql.shuffle.partitions). The spark.default.parallelism parameter only takes effect for HashPartitioner, so with a different partitioner, or one you implement yourself, you cannot use spark.default.parallelism to control the shuffle concurrency. If the shuffle memory overflow is caused by another partitioner, you need to increase the number of partitions in the partitioner code itself.
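A sketch of both ways to raise the post-shuffle partition count (the numbers are illustrative; conf is the application's SparkConf set before the context is created, and pairs is assumed to be an RDD[(String, Int)]):

    import org.apache.spark.HashPartitioner

    // Globally, for HashPartitioner-based RDD shuffles
    // (Spark SQL uses spark.sql.shuffle.partitions instead):
    conf.set("spark.default.parallelism", "500")

    // Per operation, by passing an explicit partitioner with more partitions,
    // so each post-shuffle partition is smaller:
    val counts = pairs.reduceByKey(new HashPartitioner(500), _ + _)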
5. Memory overflow due to uneven resource allocation in standalone mode: In standalone mode, if the --total-executor-cores and --executor-memory parameters are configured but --executor-cores is not, each executor may get the same memory but a different number of cores. In the executors with more cores, several tasks run at the same time in the same memory, which easily leads to memory overflow. The fix is to also configure the --executor-cores or spark.executor.cores parameter, so that executor resources are distributed evenly.
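The equivalent settings expressed through SparkConf (the master URL and values are placeholders):

    val conf = new SparkConf()
      .setMaster("spark://master:7077")      // hypothetical standalone master
      .set("spark.cores.max", "40")          // == --total-executor-cores
      .set("spark.executor.cores", "4")      // == --executor-cores; evens out cores
      .set("spark.executor.memory", "8g")    // == --executor-memory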
6. Sharing a common object in an RDD can reduce OOM: This case is quite special, so I record it here. Code like rdd.flatMap(x => for (i <- 1 to n) yield ("key", "value")) (with n large; the bound is omitted in my copy of the original) causes OOM, but the otherwise identical rdd.flatMap(x => for (i <- 1 to n) yield "key" + "value") has no OOM problem. The reason is that each ("key", "value") produces a new tuple object, whereas "key" + "value", no matter how many times it occurs, is only one object, pointing into the constant pool. The specific test is as follows:
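The original test output is not preserved in this copy; a minimal sketch of such a test:

    val t1 = ("key", "value")
    val t2 = ("key", "value")
    val s1 = "key" + "value"
    val s2 = "key" + "value"

    // eq compares references. Each tuple literal allocates a new Tuple2,
    // while "key" + "value" is constant-folded to the interned literal
    // "keyvalue", so every occurrence points at the same object.
    println(t1 eq t2)   // false: two distinct objects
    println(s1 eq s2)   // true: one shared object in the string constant pool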
This example shows that the two occurrences of ("key", "value") exist at different locations in memory, i.e. two copies are kept, whereas "key" + "value", although it also occurs twice, is a single copy at a single address; this relies on the JVM constant pool. It follows that if an RDD, or an array, contains a large amount of duplicate data, converting the duplicates into shared string constants can effectively reduce memory usage.
Optimization: This part records, as of the spark-1.6.1 version, the parameter configurations and code optimization techniques that I think have optimization value. In the parameter optimization section, where I consider the default value already the best, the parameter is not recorded.

Code optimization tips:
1. Use mapPartitions to replace most map operations, or chains of map operations: First, a few words on the difference between RDD and DataFrame. The RDD emphasizes immutability: every RDD is immutable, and calling a map-type operation on an RDD produces a new RDD. The consequence is that if a large number of map-type operations are called on an RDD, each map operation produces one or more intermediate RDD objects. This does not necessarily cause memory overflow by itself, but it produces a large amount of intermediate data, which increases GC pressure. Moreover, the RDD API only divides stages when an action is invoked; it does not optimize within each stage. For example, on a numeric RDD, rdd.map(_ + 1).map(_ + 1) is equivalent to rdd.map(_ + 2), but the RDD API does not perform this optimization internally. DataFrame is different: because type information is available and programs can be expressed in SQL terms, there is not just an interpreter but also a SQL optimizer, and DataFrame is no exception; it has an optimizer, Catalyst (for details see the Reference articles). Some of the RDD drawbacks mentioned above can be mitigated with mapPartitions, which can replace rdd.map, rdd.filter, and rdd.flatMap at the same time. So in a long chain of operations, many operations can be performed together inside one mapPartitions, avoiding the creation of many intermediate RDD objects; in addition, inside mapPartitions a mutable variable can be reused within a partition, which also avoids frequently creating new objects. The drawback of mapPartitions is that code readability is sacrificed.
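A minimal sketch, assuming rdd is an RDD[Int]:

    // A chain of element-wise operations, each creating an intermediate RDD:
    val out1 = rdd.map(_ + 1).filter(_ > 10).map(_ * 2)

    // The same work fused into a single mapPartitions pass; the iterator is
    // consumed lazily, and per-partition state (buffers, parsers) can be reused:
    val out2 = rdd.mapPartitions { iter =>
      iter.map(_ + 1).filter(_ > 10).map(_ * 2)
    }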
2. Broadcast join vs. normal join: In big-data distributed systems, moving large amounts of data has an enormous impact on performance. Based on this idea, when performing a join on two RDDs, if one of them is relatively small, the small RDD can be collected and then set as a broadcast variable, so that the other RDD can perform the join with a map operation. This effectively eliminates data movement for the larger RDD.
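A minimal sketch, with bigRdd: RDD[(K, V)] and smallRdd: RDD[(K, W)] as hypothetical inputs:

    // Collect the small side to the driver and broadcast it to all executors:
    val small = sc.broadcast(smallRdd.collectAsMap())

    // Inner join on the map side -- bigRdd is never shuffled:
    val joined = bigRdd.flatMap { case (k, v) =>
      small.value.get(k).map(w => (k, (v, w)))
    }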
3. Filter first in join: This is predicate pushdown. Obviously, filtering before the join reduces the amount of shuffled data. Worth mentioning here: the Spark SQL optimizer already performs this optimization, so it does not need to be applied explicitly there, but hand-written RDD computations need to pay attention to it.
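A sketch for hand-written RDD code (left and right are hypothetical pair RDDs of types RDD[(K, V)] and RDD[(K, Int)]):

    // Filtering after the join shuffles everything first:
    val slow = left.join(right).filter { case (_, (_, w)) => w > 0 }

    // Filtering before the join shuffles only the surviving rows:
    val fast = left.join(right.filter { case (_, w) => w > 0 })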
4. partitionBy optimization: This is covered in detail in another article, "Spark partitioner usage tips", and is not repeated here.
5. Use of combineByKey: This operation also exists in map-reduce. As an example, rdd.groupByKey().mapValues(_.sum) is less efficient than rdd.reduceByKey(_ + _); see the following two images (taken from the Internet; they will be removed upon request).
[Two figures omitted in this copy: the upper shows a shuffle with map-side combine reducing the shuffled data; the lower shows all (key, value) pairs being shuffled without combining.]
The difference between the upper and lower pictures is that in the upper one the combineByKey process reduces the amount of shuffled data, while in the lower one it does not. combineByKey is part of the key-value RDD API and can be used directly.
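A minimal sketch, assuming pairs is an RDD[(String, Int)]:

    // groupByKey shuffles every (key, value) pair, then sums on the reduce side:
    val slow = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey (built on combineByKey) combines within each map-side
    // partition first, so far less data crosses the shuffle:
    val fast = pairs.reduceByKey(_ + _)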
6. Use rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) instead of rdd.cache(): rdd.cache() and rdd.persist(StorageLevel.MEMORY_ONLY) are equivalent. When memory is insufficient, rdd.cache() data is dropped and recomputed on reuse, whereas rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) stores the data on disk when memory is insufficient, avoiding recomputation at the cost of a little IO time.
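In code (a sketch, for some existing rdd):

    import org.apache.spark.storage.StorageLevel

    // rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY).
    // MEMORY_AND_DISK_SER spills serialized blocks to disk when memory runs
    // out, trading some IO for avoided recomputation:
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)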
7. When Spark uses HBase, build Spark and HBase in the same cluster: When using Spark together with HBase, they are best built on the same cluster, or Spark's cluster nodes should at least cover all HBase nodes. Data in HBase is stored in HFiles, and a single HFile is usually large. When Spark reads HBase data, an RDD partition corresponds not to one HFile but to one region, so a single RDD partition is usually large; if the clusters are not co-located, data movement takes a lot of time.
Parameter optimization section:
8. spark.driver.memory (default 1g): This parameter sets the driver's memory. In a Spark program, the SparkContext and DAGScheduler run on the driver side, and stage splitting of the RDD graph also runs on the driver. If the user's program has too many steps and produces too many stages, this bookkeeping consumes driver memory, and the driver memory then needs to be increased.
9. spark.rdd.compress (default false): This parameter can be set to true when memory is tight but persisted data still needs good performance; it compresses the in-memory RDD data when using persist(StorageLevel.MEMORY_ONLY_SER). This reduces memory consumption, at the cost of CPU time for decompression when the data is used.
10. spark.serializer (default org.apache.spark.serializer.JavaSerializer): The recommended setting is org.apache.spark.serializer.KryoSerializer, because KryoSerializer is faster than JavaSerializer. However, some objects may fail to serialize; in that case the failing classes must be registered with the Kryo serializer explicitly, either by configuring the spark.kryo.registrator parameter or with code like the following:

    val conf = new SparkConf().setMaster(...).setAppName(...)
    conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
    val sc = new SparkContext(conf)
11. spark.memory.storageFraction (default 0.5): This parameter sets storage memory as a fraction of the unified region in executor memory, i.e. storage / (storage + execution). Although in Spark-1.6.0+ storage memory and execution memory can borrow from each other, borrowing and returning memory also costs performance, so if you know in advance that storage will need more or less, you can adjust this parameter.
12. spark.locality.wait (default 3s): Spark has four localization execution levels, PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY. When a task is to be executed, Spark first waits spark.locality.wait for a process-local slot; if none becomes available, the task's locality level drops to node and Spark waits spark.locality.wait again, and so on down to ANY. Processing data locally is very important for the performance of a distributed system. If a single RDD partition is large and takes a long time to process, spark.locality.wait should be increased appropriately so that tasks have more time to wait for local data.
13. spark.speculation (default false): In a large cluster the performance of individual nodes varies. This parameter indicates whether nodes with idle resources will speculatively execute tasks that are still running and have run for too long, to avoid a single slow node stalling the whole job. This parameter is best set to true. Related settings can be tuned via the parameters starting with spark.speculation; the Reference articles describe them in detail.
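For reference, the parameters from items 8 to 13 collected into one SparkConf sketch (values are illustrative, not recommendations):

    val conf = new SparkConf()
      .set("spark.driver.memory", "4g")    // item 8; must be set before the driver
                                           // JVM starts, e.g. via spark-defaults.conf
      .set("spark.rdd.compress", "true")   // item 9; pairs with MEMORY_ONLY_SER
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // item 10
      .set("spark.memory.storageFraction", "0.5")   // item 11
      .set("spark.locality.wait", "6s")             // item 12
      .set("spark.speculation", "true")             // item 13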
I will add new content here as I encounter it.
Reference:
1. http://www.jianshu.com/p/c0181667daa0
2. http://www.csdn.net/article/2015-06-18/2824958
3. https://chenzhongpu.gitbooks.io/bigdatanotes/content/SparkSQLOptimizer/index.html
4. http://book.51cto.com/art/201409/453045.htm
When reprinting, please keep the article complete and cite the source link: http://blog.csdn.net/yhb315279058