Spark Research Notes (6): A Practical Spark Programming FAQ


This article collects some typical problems I have run into while using Spark and how I solved them, in the hope that it helps readers who hit the same issues.

1. Spark environment and configuration

Q: In the Spark client configuration file spark-defaults.conf, how should spark.executor.memory and spark.cores.max be set properly?
A: Before configuring these, you need a basic understanding of the cores and memory available on each node of the Spark cluster. For example, in a Spark cluster of 100 machines where every node has core=32 and memory=128GB, you should keep spark.cores.max and spark.executor.memory "in harmony" when submitting an application: estimate the amount of data and computation the application involves, then make full use of each node's cores and memory without letting the two settings fall badly out of balance.
For example, an application configured with cores.max=1000 and executor.memory=512m is clearly unreasonable. Since each node has only 32 cores, 1000 cores means the application occupies roughly 32 machines of the cluster, yet each executor (one per node here) requests only 512MB of memory. The application therefore ties up all the cores of 32 machines while consuming only 0.5GB * 32 = 16GB of memory, which means the remaining (128GB - 0.5GB) * 32 = 4080GB of memory is wasted.
In short, a reasonable configuration should occupy as few cores and as little memory as possible while still meeting the business requirement for job speed. In practice, memory can be set close to the per-machine maximum, and the number of cores set according to the actual amount of computation, so that the task occupies as few nodes as possible.
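As a rough illustration (the application name and the 64-core / 16g figures below are hypothetical, not recommendations), the two settings can also be supplied programmatically when constructing the SparkContext:

# A minimal sketch; the values are illustrative, not recommendations.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("sizing-example")            # hypothetical application name
        .set("spark.cores.max", "64")            # total cores requested across the cluster
        .set("spark.executor.memory", "16g"))    # memory requested per executor

sc = SparkContext(conf=conf)

The point here is only where such values go; the right numbers depend on the data volume and computation, as discussed above.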

Q: How do the relative locations of the data source and the Spark cluster affect Spark performance? Consider this scenario: the data source is on an HDFS cluster in the south, while the Spark cluster sits in a machine room in the north.
A: When the data-source cluster is far from the Spark compute cluster, reading the data incurs very large network overhead and has a significant impact on job speed. Therefore, when building a cluster, keep the Spark cluster as close to the data source as possible.

Q: After submitting a task, the job fails with "java.lang.OutOfMemoryError: Java heap space". How should this be handled?
A: This error has many possible causes, for example a global variable in the code being too large (such as a big dictionary loaded into memory) or the result of rdd.collect() being too large. Spark's scheduling logic evaluates transformations lazily: the whole chain of transformations is only computed when an action is encountered, and unless the action writes its output to disk, the result is returned to the driver program, which typically runs on the machine where the Spark client is located. The default heap space of the driver JVM is only a few hundred MB, so when the data set returned by the action is too large to hold, an OOM is reported.
There are two ways to solve the problem. One is to change the code logic, for example by splitting a large variable into several smaller ones processed over multiple steps (say, splitting a large dict or set into N small ones, destroying the current variable and constructing the next one after each step); this is a typical trade of time for space, but sometimes the business logic cannot be split this way. The other is to configure the relevant parameters appropriately when submitting the Spark application, for example adding the configuration item spark.driver.memory 12g to spark-defaults.conf; the StackOverflow post in the references discusses the specific configuration ideas.
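As a rough illustration of the first idea, here is a hedged PySpark sketch that avoids pulling the whole result set into the driver at once (the paths and the processing step are hypothetical):

# A minimal sketch, assuming the full result set need not live in driver memory at once.
from pyspark import SparkContext

sc = SparkContext(appName="avoid-driver-oom")          # hypothetical application name

results = sc.textFile("hdfs:///path/to/input") \
            .map(lambda line: line.strip())             # placeholder processing

# Instead of results.collect(), which materialises everything in the driver heap:
results.saveAsTextFile("hdfs:///path/to/output")        # write out directly from the executors

# Or, if the driver really must see every record, stream them partition by partition:
for record in results.toLocalIterator():
    pass                                                # handle one record at a time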

Q: When submitting a task, the Spark client reports the error "No space left on device". How can this be resolved?
A: During the run of a task, the Spark client may write temporary files on the local machine. By default they are written to /tmp, which can easily fill up and cause this error.
The workaround is to explicitly specify the temporary-file directory in the Spark client's spark-defaults.conf:
spark.local.dir /home/slvher/tools/spark-scratch

Q: After the Spark client submits a task, the Spark job log is printed to the console by default, mixed in with the user's own debug output, which is inconvenient for debugging. How can Spark's internal log be printed to a separate file?
A: Spark prints its logs with log4j, and log4j's print behaviour can be controlled by creating log4j.properties in the conf directory (it is recommended to copy conf/log4j.properties.template to log4j.properties and edit it). The full configuration options are covered in the log4j documentation and are not repeated here.
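Purely as a rough sketch (the log path and levels below are illustrative; the template shipped with Spark remains the authoritative starting point), a log4j.properties that redirects Spark's internal log to a file might look like this:

# Minimal illustrative sketch of conf/log4j.properties.
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/home/user/spark-logs/spark.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n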

2. Spark application programming

Q: Consider this scenario: the application to be submitted contains multiple Python files; one is the main entry point and the others are custom module files with dependencies between them. Submitting the task through spark-submit (with the files uploaded via the --py-files parameter) fails with "ImportError: no module named xxx". How can this be solved?
A: The .py files uploaded via --py-files are only uploaded; the modules defined in them are not added to the module search path of the Python interpreter on the cluster nodes by default. The Spark document Submitting Applications actually describes how to submit a task in this scenario:
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
So the correct way to submit is: put all the files except the main entry script into a directory, create an empty file named __init__.py in that directory so it becomes a package, pack the directory into a zip archive, and upload that zip via --py-files. Because the Python interpreter can import from zip archives by default, and because the uploaded zip contains a package with __init__.py, the interpreter on the cluster nodes automatically adds it to the module search path, which solves the import error.
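A hedged sketch of what that looks like in practice (the names mylib, utils and main.py are made up for illustration):

# Hypothetical project layout (all names are illustrative):
#   mylib/__init__.py   <- empty file that turns the directory into a package
#   mylib/utils.py      <- custom module code
#   main.py             <- the main entry script
#
# Package and submit, for example:
#   zip -r mylib.zip mylib
#   spark-submit --py-files mylib.zip main.py
#
# Inside main.py (and in functions shipped to the executors) the import then resolves,
# because the zip archive is placed on the interpreter's search path on every node:
from mylib import utils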

Q: How can I see debug information printed with print from inside a function passed to Spark?
A: Debug output from functions passed to Spark's operation APIs is not visible on the local driver program side, because those functions are executed on the cluster nodes, so the print output goes to the node machines the job was assigned to. You need to find the submitted application in the Spark master's web UI, then open the stderr of the nodes where the application ran to see it.

Q: What is the difference between map and flatMap in the PySpark API?
A: In terms of behaviour, both accept a user-defined function f and call it on every element of the RDD. The key difference is the kind of function accepted: map takes a function that returns an ordinary value, whereas flatMap takes a function that must return an iterable value, i.e. a value that can be iterated over, and flatMap then flattens that iterable (it iterates over the value and splices the elements into the resulting list). The following demo illustrates this.
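Here is a minimal sketch of such a demo (the function and variable names are illustrative and match the explanation below; the expected output is noted in the comments):

# A minimal sketch contrasting flatMap and map in PySpark.
from pyspark import SparkContext

sc = SparkContext(appName="flatmap-vs-map-demo")   # hypothetical application name

def test_flatmap_v1(x):
    # returns a generator object, which is iterable
    for i in range(x):
        yield i

def test_flatmap_v2(x):
    # returns a set, which is also iterable
    return {x, x * 10}

rdd = sc.parallelize([1, 2, 3])

print(rdd.flatMap(test_flatmap_v1).collect())   # [0, 0, 1, 0, 1, 2]
print(rdd.flatMap(test_flatmap_v2).collect())   # e.g. [1, 10, 2, 20, 3, 30] (order within each set is not guaranteed)
print(rdd.map(test_flatmap_v2).collect())       # three sets: {1, 10}, {2, 20}, {3, 30}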

A brief explanation of the behaviour of each call:
a. rdd.flatMap with test_flatmap_v1: the function returns a generator object, which is iterable, so flatMap flattens the return value, and every element produced by the generator becomes an element of the final flattened result.
b. rdd.flatMap with test_flatmap_v2: the function returns a set; since a set is itself iterable, flatMap flattens it, and every element of the set becomes an element of the final flattened result.
Special note: if the returned value does not support iteration (an int, for example), flatMap will raise an error. If you are interested, you can try it yourself.
c. rdd.map: map does not require the return value of its function to be iterable, nor does it flatten the value; it simply takes the returned value itself as an element of the final result, so its output is straightforward.

Q: What should be considered when caching (persisting) an RDD?
A: For the principles behind persisting RDDs in Spark and the related performance optimisation, see the dedicated material on persisting RDDs; note, however, that not every persist/cache call is positively correlated with Spark performance. Before persisting, it is best to follow these guidelines (a short sketch follows the list):
a) Only consider caching an RDD that will be used multiple times.
b) When an RDD does need caching, cache the RDD "closest" to the actual computation rather than the raw data that was read in.
c) When a cached RDD is no longer needed, call unpersist as soon as possible to release the cluster resources (mainly memory) it occupies.
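A minimal sketch that follows these guidelines, assuming an RDD reused by two separate actions (the input path and parsing step are illustrative):

# A minimal sketch; the input path and parsing logic are illustrative.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="cache-demo")              # hypothetical application name

raw = sc.textFile("hdfs:///path/to/input")           # raw input: usually not worth caching
parsed = raw.map(lambda line: line.split("\t"))      # cache the RDD "closest" to the computation
parsed.persist(StorageLevel.MEMORY_ONLY)

count_a = parsed.filter(lambda fields: fields[0] == "a").count()   # first use of the cached RDD
count_b = parsed.filter(lambda fields: fields[0] == "b").count()   # second use hits the cache

parsed.unpersist()   # release the cluster memory once the cached RDD is no longer needed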

Q: How can functions passed to Spark transformations access shared variables?
A: According to the official Programming Guide (see the references), when a function passed as a parameter to a Spark operation (such as map or reduce) executes on a remote node, a separate copy of each variable used inside the function is shipped to that node for the function to access. If those variables are modified on the node, the modifications are not propagated back to the Spark driver program, so the business code should treat such variables as read-only: maintaining general read-write shared variables across tasks would make Spark inefficient.
For example, the following code shows how a global variable can be shared (read-only) by a function passed to a Spark operation:
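A minimal sketch of the idea, assuming a small read-only dictionary defined in the driver script (the names and values are illustrative):

# A minimal sketch; the dictionary contents and names are illustrative.
from pyspark import SparkContext

sc = SparkContext(appName="global-variable-demo")        # hypothetical application name

COUNTRY_NAMES = {"cn": "China", "us": "United States"}   # global variable, treated as read-only

def lookup(code):
    # each task works on its own copy of COUNTRY_NAMES; changes made here would NOT
    # propagate back to the driver, so the function only reads it
    return COUNTRY_NAMES.get(code, "unknown")

print(sc.parallelize(["cn", "us", "jp"]).map(lookup).collect())   # ['China', 'United States', 'unknown']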


Q: Besides sharing variables through global variables, what other shared variables does Spark support?
A: Spark also supports two other kinds of shared variables: broadcast variables and accumulators. A broadcast variable lets the developer keep a read-only copy of a variable cached on every node of the Spark cluster; in essence it is still a global variable, except that it is explicitly distributed to the cluster nodes by the developer, instead of Spark automatically copying it along with the functions called by each task. Accumulators, as the name implies, only support an add operation; for the exact syntax see the section on accumulators in the Spark Programming Guide.
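A minimal sketch of both kinds of shared variable (the lookup table and the counter are hypothetical):

# A minimal sketch of a broadcast variable and an accumulator; names and values are illustrative.
from pyspark import SparkContext

sc = SparkContext(appName="shared-variables-demo")       # hypothetical application name

# Broadcast: a read-only copy is cached on every node, shipped once by the developer.
lookup_bc = sc.broadcast({"cn": "China", "us": "United States"})

# Accumulator: only supports add; tasks update it, the driver reads the final value.
missing = sc.accumulator(0)

def resolve(code):
    table = lookup_bc.value
    if code not in table:
        missing.add(1)
    return table.get(code, "unknown")

print(sc.parallelize(["cn", "us", "jp"]).map(resolve).collect())
print("codes not found: %d" % missing.value)             # 1, counted for "jp"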

Q: What is the relationship between broadcast variables and ordinary global variables? When should each be used?
A: A broadcast variable is, in effect, a global variable that can be shared by functions executing on distributed nodes. An ordinary global variable is copied to the cluster nodes by the Spark scheduler as it dispatches tasks, which means that if several tasks need to access the same global variable, every task execution involves its own copy of the variable. A broadcast variable, by contrast, is actively distributed to the cluster nodes by the developer and stays cached there until the user explicitly calls unpersist or the whole Spark job ends.
PS: In fact, even calling unpersist does not release the resource immediately; it merely tells the Spark scheduler that the resource may be freed, and when it is actually released is decided by the scheduler, see SPARK-4030.
Conclusion: if a shared variable is used by a task only once, sharing it as an ordinary global variable is fine; if it is accessed by multiple successively executed tasks, broadcasting it saves the repeated copy overhead.
One more reminder: if you share a variable via broadcast, you should actively call unpersist to release the cluster resources once you are sure the variable no longer needs to be shared.
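Continuing the hedged sketch from the previous answer, releasing the broadcast variable would look like this:

# Release the cached copies of the broadcast variable once it is no longer needed.
# As noted above, this is only a hint; the actual release time is decided by Spark (see SPARK-4030).
lookup_bc.unpersist()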

3. Other notes
Q: Are there any other considerations?
A: The questions above are the ones I have encountered most often and the more important points for using a Spark cluster efficiently when actually writing complex Spark applications. I also highly recommend the article Notes on Writing Complex Spark Applications (a Google Doc, which may require a proxy to access from mainland China).

Resources
1. StackOverflow: Spark java.lang.OutOfMemoryError: Java heap space
2. Spark doc: Submitting Applications
3. Spark Programming Guide: Shared Variables
4. Spark issues: [SPARK-4030] "destroy" method in broadcast should be public
5. [Google Doc] Notes on Writing Complex Spark Applications

========================= EOF =======================

