Spark is especially well suited to running multiple operations over the same data set, using storage levels such as MEMORY_ONLY and MEMORY_AND_DISK. MEMORY_ONLY is highly efficient but consumes a lot of memory and is therefore costly; MEMORY_AND_DISK automatically spills to disk once memory is exhausted, which solves the out-of-memory problem but adds the cost of moving data between memory and disk. Common Spark tuning tools include nmon, JMeter, and JProfiler. A short persistence sketch is shown below; after it comes a worked Spark tuning example.
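As a hedged illustration of the two storage levels mentioned above, the minimal Scala sketch below persists data with MEMORY_ONLY and MEMORY_AND_DISK; the input path and application name are placeholders, not details from the original case.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-level-sketch"))

    // Hypothetical input path; replace with a real data set.
    val lines = sc.textFile("hdfs:///data/customer_info")

    // MEMORY_ONLY: fastest for repeated access, but partitions that do not
    // fit in memory are recomputed on every use.
    val inMemory = lines.persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: spills partitions to local disk once memory is full,
    // trading extra I/O for not having to recompute them.
    val memAndDisk = lines.map(_.toLowerCase).persist(StorageLevel.MEMORY_AND_DISK)

    println(inMemory.count())
    println(memAndDisk.count())
    sc.stop()
  }
}
```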
1. Scenario: precision customer-group targeting
Optimize queries against a customer information table on Spark: a large wide table with more than 1,800 columns, of which only 20 are effectively used.
2. Optimization result: the query time was reduced from 40.232 s to 2.7 s.
3. Optimization process analysis
Step 1: First, we observed a large amount of I/O wait (iowait) on the disks. By checking the relevant log files, we found the size of one block and calculated the total size of the data file; it could not be held entirely in memory, so compression was used as the optimization. The data is dominated by 0s and 1s, so the gzip algorithm compresses it well; after compression the file is 1.9 GB. In this step, the query time dropped from 40.232 s to 20.12 s.
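A minimal sketch of this compression step, assuming the raw file is rewritten once as gzip-compressed text; the paths are hypothetical, and the GzipCodec call is a standard Spark API rather than the exact command from the original tuning session.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec

object CompressRawData {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compress-raw-data"))

    // Hypothetical location of the wide customer table.
    val raw = sc.textFile("hdfs:///data/customer_info/raw")

    // Rewrite the data once with gzip; rows dominated by 0s and 1s compress
    // very well, so later jobs read far fewer bytes from disk.
    raw.saveAsTextFile("hdfs:///data/customer_info/gzipped", classOf[GzipCodec])

    sc.stop()
  }
}
```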
Step 2: The large wide table has more than 1,800 columns, but only 20 are effectively used, so the table was stored as RCFile so that only the valid columns are loaded. In this step, the query time dropped from 20 s to 12 s.
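A hedged sketch of the column-pruning idea. The original work used RCFile; the snippet below expresses the same idea through Spark SQL with Hive support, using hypothetical table and column names: copy only the useful columns into a columnar table so the other 1,800-odd columns are never read.

```scala
import org.apache.spark.sql.SparkSession

object ColumnPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("column-pruning-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical: keep only the columns that queries actually use, stored
    // in a columnar format (RCFile) so unused columns are skipped on read.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS customer_info_narrow
        |STORED AS RCFILE
        |AS SELECT cust_id, age, city, balance  -- illustrative column names
        |FROM customer_info_wide""".stripMargin)

    spark.sql("SELECT city, COUNT(*) FROM customer_info_narrow GROUP BY city").show()
    spark.stop()
  }
}
```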
Step 3: JProfiler was used to analyze why the CPU load was so high, and the serialization mechanism turned out to be the culprit. Spark ships with two serialization frameworks: Java and Kryo. Kryo is a fast and efficient Java object-graph serialization framework, featuring high performance and ease of use. After switching to Kryo, the query time dropped from 12 s to 7 s.
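A minimal sketch of switching Spark's serializer to Kryo; the registered record type is hypothetical, not one from the original job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type used in caching and shuffles.
case class CustomerRow(id: Long, features: Array[Byte])

object KryoSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      // Replace the default Java serializer with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids writing full class names into the stream.
      .registerKryoClasses(Array(classOf[CustomerRow]))

    val sc = new SparkContext(conf)
    val rows = sc.parallelize(1L to 1000L).map(i => CustomerRow(i, Array.fill(8)(0: Byte)))
    println(rows.count())
    sc.stop()
  }
}
```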
Step 4: Further analysis showed that the load across CPU cores was uneven, memory was not full, and system resources were not being fully used. How should they be sized? (1) The number of tasks Spark creates corresponds to the number of RDD partitions. (2) For a Hadoop RDD, the number of partitions is determined by the number of HDFS blocks. Memory: total system memory = memory per worker × number of workers = SPARK_WORKER_MEMORY × SPARK_WORKER_INSTANCES.
CPU: total number of concurrent tasks = number of workers × cores per worker = SPARK_WORKER_INSTANCES × SPARK_WORKER_CORES. Based on this calculation of task concurrency and memory allocation, the tuned parameters were:
SPARK_WORKER_INSTANCES = 4
SPARK_WORKER_CORES = 3
SPARK_WORKER_MEMORY = 6g
On a machine with a 12-core CPU and 24 GB of memory, tuning these parameters reduced the query time from 7 s to 5 s.
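As a hedged worked example of the arithmetic above: 4 workers × 3 cores = 12 cores and 4 workers × 6 GB = 24 GB, matching the 12-core / 24 GB machine. The Scala sketch below derives a matching level of parallelism; the rule of thumb of roughly two tasks per core is an assumption, not a figure from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    // Values mirroring the tuned worker settings.
    val workerInstances = 4
    val coresPerWorker  = 3
    val memPerWorkerGb  = 6

    val totalCores = workerInstances * coresPerWorker // 12 cores in total
    val totalMemGb = workerInstances * memPerWorkerGb // 24 GB in total

    val conf = new SparkConf()
      .setAppName("parallelism-sketch")
      // Rule of thumb: about two tasks per core keeps every core busy.
      .set("spark.default.parallelism", (totalCores * 2).toString)

    val sc = new SparkContext(conf)
    println(s"total cores = $totalCores, total memory = ${totalMemGb}g")

    // Repartition the input so the task count matches the available cores.
    val data = sc.textFile("hdfs:///data/customer_info/gzipped").repartition(totalCores * 2)
    println(data.count())
    sc.stop()
  }
}
```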
Step 5: Full GC was clearly occurring on the Shark server side. Setting export SHARK_MASTER_MEM=2g reduced the query time from 6 s to 3 s in this step.
Step 6: A CPU bottleneck appeared when the two tables were joined. Analysis showed that the daily table was gzip-compressed, so each query paid the decompression cost. Optimization: make the daily table an in-memory table without gzip compression. The query time dropped from 3 s to 2 s.
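A hedged sketch of this last step. The original pinned the uncompressed daily table in memory (under Shark); the snippet below expresses the same idea with Spark SQL's CACHE TABLE, using hypothetical table and column names.

```scala
import org.apache.spark.sql.SparkSession

object CacheDailyTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-daily-table")
      .enableHiveSupport()
      .getOrCreate()

    // Keep the daily table uncompressed in memory so the join does not pay
    // gzip decompression cost on every query.
    spark.sql("CACHE TABLE daily_table")

    // Hypothetical join between the customer table and the daily table.
    spark.sql(
      """SELECT c.cust_id, d.event_count
        |FROM customer_info_narrow c
        |JOIN daily_table d ON c.cust_id = d.cust_id""".stripMargin).show()

    spark.stop()
  }
}
```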
4. Summary
Optimization is a process of gradual refinement. Looking back, the optimization process was driven mainly by the following considerations: (1) memory; (2) CPU; (3) disk I/O; (4) network I/O; (5) the serialization mechanism. Take these factors as the main line and explore their related aspects boldly.