Spark Performance Tuning Practices

Source: Internet
Author: User
Tags: spark, rdd

Spark is especially well suited to workloads that perform multiple operations on the same data, using persistence levels such as MEMORY_ONLY and MEMORY_AND_DISK. MEMORY_ONLY is highly efficient but memory-hungry and therefore costly; MEMORY_AND_DISK spills to disk once memory is exhausted, which solves the problem of insufficient memory but introduces the overhead of data swapping. Common Spark tuning tools include nmon, JMeter, and JProfiler. The following is a worked example of Spark tuning.

1. Scenario: precise customer targeting

Optimize queries against a customer information table on Spark. The large wide table has more than 1,800 columns, of which only 20 are effectively used.

2. Optimization result: query time reduced from 40.232 s to 2.7 s.

3. Optimization process analysis

Step 1: We first observed heavy iowait on the disks. By checking the relevant log files, we found the block size and calculated the total size of the data file, which was too large to fit entirely in memory, so we optimized with compression. The data file is characterized by long runs of 0s and 1s, so we compressed it with the gzip algorithm; the size after compression was 1.9 GB. This step reduced the query time from 40.232 s to 20.12 s.
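The effect of gzip on data dominated by long runs of 0s and 1s can be sketched in plain Python. The data below is synthetic, not the article's table, but it shows why such files compress so well:

```python
import gzip

# Synthetic stand-in for a file dominated by runs of 0s and 1s
raw = (b"0" * 50 + b"1" * 50) * 10_000   # 1,000,000 bytes
compressed = gzip.compress(raw)

ratio = len(raw) / len(compressed)
print(len(raw), len(compressed), round(ratio, 1))
```

Highly repetitive content like this typically shrinks by two orders of magnitude or more, which is why the 0/1-heavy table compressed so dramatically.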

Step 2: The large wide table has more than 1,800 columns, but only 20 of them are effectively used. We therefore switched to RCFile, a columnar format that loads only the required columns. This step reduced the query time from 20 s to 12 s.
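Assuming roughly uniform column widths (an assumption, since the article gives no per-column sizes), a back-of-the-envelope estimate shows how much scanning a columnar format avoids here:

```python
# Rough estimate of the scan reduction from columnar pruning
# (assumes all 1,800 columns are about the same width)
total_columns = 1800
effective_columns = 20
table_size_gb = 1.9   # compressed size from Step 1

scanned_gb = table_size_gb * effective_columns / total_columns
print(f"columns read: {effective_columns}/{total_columns}, "
      f"~{scanned_gb:.3f} GB scanned instead of {table_size_gb} GB")
```

Reading roughly 1% of the columns means reading roughly 1% of the bytes, which is where the speedup in this step comes from.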

Step 3: We used JProfiler to analyze why the CPU load was so high and found that the serialization mechanism was the problem. Spark ships with two serialization frameworks: Java serialization and Kryo. Kryo is a fast and efficient Java object graph serialization framework, noted for its performance, efficiency, and ease of use. After switching to Kryo, the query time dropped from 12 s to 7 s.
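In current Spark versions, the switch to Kryo is made through configuration, for example in spark-defaults.conf. The property name below is Spark's standard one; the article predates this exact syntax, so treat it as illustrative:

```
spark.serializer                  org.apache.spark.serializer.KryoSerializer
# Optionally require explicit class registration for more compact output
spark.kryo.registrationRequired   false
```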

Step 4: Further analysis showed uneven load across the CPU cores and memory that never filled up: system resources were not being fully utilized. How can they be used fully? (1) The number of tasks Spark creates corresponds to the number of RDD partitions. (2) For a Hadoop RDD, the number of partitions is determined by the number of HDFS blocks. Memory: total system memory = worker memory size × number of workers = SPARK_WORKER_MEMORY × SPARK_WORKER_INSTANCES.

CPU: total number of cores in the system = number of workers × cores per worker = SPARK_WORKER_INSTANCES × SPARK_WORKER_CORES. Based on this calculation of task concurrency and memory allocation, the tuning parameters were:

SPARK_WORKER_INSTANCES = 4

SPARK_WORKER_CORES = 3

SPARK_WORKER_MEMORY = 6g

With a 12-core CPU and 24 GB of memory, tuning these parameters reduced the query time from 7 s to 5 s.
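The parameter choice can be checked against the node's resources with the two formulas above:

```python
# Verify that the worker settings saturate a 12-core / 24 GB node
SPARK_WORKER_INSTANCES = 4
SPARK_WORKER_CORES = 3
SPARK_WORKER_MEMORY_GB = 6

total_cores = SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES          # cores
total_memory_gb = SPARK_WORKER_INSTANCES * SPARK_WORKER_MEMORY_GB  # GB

print(total_cores, total_memory_gb)
```

Four workers with 3 cores and 6 GB each consume exactly 12 cores and 24 GB, so neither CPU nor memory is left idle.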

Step 5: Full GC was clearly occurring on the Shark server side. The fix was to enlarge the master's heap:

export SHARK_MASTER_MEM=2g; this step reduced the query time from 6 s to 3 s.

Step 6: A CPU bottleneck appeared when joining two tables. Analysis showed the cause was that the daily table was gzip-compressed. Optimization: load the daily table as an in-memory table without gzip compression. The query time dropped from 3 s to 2 s.

4. Summary

Optimization is a process of gradual refinement. Looking back, the optimization process was driven mainly by the following considerations: (1) memory; (2) CPU; (3) disk IO; (4) network IO; (5) the serialization mechanism. Take these factors as the main line of investigation, and experiment boldly around them.
