Spark is especially well suited to running multiple operations over the same data set, using storage levels such as MEMORY_ONLY and MEMORY_AND_DISK. MEMORY_ONLY is highly efficient but consumes a lot of memory and is therefore costly; MEMORY_AND_DISK automatically spills to disk once memory is exhausted, which solves the out-of-memory problem but adds the cost of moving data between memory and disk. Common Spark tuning tools include nmon, JMeter, and JProfiler. A short persistence sketch is shown below; after it comes a worked Spark tuning example.
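As a hedged illustration of the two storage levels mentioned above, the minimal Scala sketch below persists data with MEMORY_ONLY and MEMORY_AND_DISK; the input path and application name are placeholders, not details from the original case.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-level-sketch"))

    // Hypothetical input path; replace with a real data set.
    val lines = sc.textFile("hdfs:///data/customer_info")

    // MEMORY_ONLY: fastest for repeated access, but partitions that do not
    // fit in memory are recomputed on every use.
    val inMemory = lines.persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: spills partitions to local disk once memory is full,
    // trading extra I/O for not having to recompute them.
    val memAndDisk = lines.map(_.toLowerCase).persist(StorageLevel.MEMORY_AND_DISK)

    println(inMemory.count())
    println(memAndDisk.count())
    sc.stop()
  }
}
```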
1. Scenario: precision customer-group targeting
Optimize queries against a customer information table on Spark: a large wide table with more than 1,800 columns, of which only 20 are effectively used.
2. Optimization result: the query time was reduced from 40.232 s to 2.7 s.
3. Optimization process analysis
Step 1: First, we observed a large amount of I/O wait (iowait) on the disks. By checking the relevant log files, we found the size of one block and calculated the total size of the data file; it could not be held entirely in memory, so compression was used as the optimization. The data is dominated by 0s and 1s, so the gzip algorithm compresses it well; after compression the file is 1.9 GB. In this step, the query time dropped from 40.232 s to 20.12 s.
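A minimal sketch of this compression step, assuming the raw file is rewritten once as gzip-compressed text; the paths are hypothetical, and the GzipCodec call is a standard Spark API rather than the exact command from the original tuning session.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec

object CompressRawData {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compress-raw-data"))

    // Hypothetical location of the wide customer table.
    val raw = sc.textFile("hdfs:///data/customer_info/raw")

    // Rewrite the data once with gzip; rows dominated by 0s and 1s compress
    // very well, so later jobs read far fewer bytes from disk.
    raw.saveAsTextFile("hdfs:///data/customer_info/gzipped", classOf[GzipCodec])

    sc.stop()
  }
}
```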
Step 2: The large wide table has more than 1,800 columns, but only 20 are effectively used, so the table was stored as RCFile so that only the valid columns are loaded. In this step, the query time dropped from 20 s to 12 s.
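A hedged sketch of the column-pruning idea. The original work used RCFile; the snippet below expresses the same idea through Spark SQL with Hive support, using hypothetical table and column names: copy only the useful columns into a columnar table so the other 1,800-odd columns are never read.

```scala
import org.apache.spark.sql.SparkSession

object ColumnPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("column-pruning-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical: keep only the columns that queries actually use, stored
    // in a columnar format (RCFile) so unused columns are skipped on read.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS customer_info_narrow
        |STORED AS RCFILE
        |AS SELECT cust_id, age, city, balance  -- illustrative column names
        |FROM customer_info_wide""".stripMargin)

    spark.sql("SELECT city, COUNT(*) FROM customer_info_narrow GROUP BY city").show()
    spark.stop()
  }
}
```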
Step 3: JProfiler was used to analyze why the CPU load was so high, and the serialization mechanism turned out to be the culprit. Spark ships with two serialization frameworks: Java and Kryo. Kryo is a fast and efficient Java object-graph serialization framework, featuring high performance and ease of use. After switching to Kryo, the query time dropped from 12 s to 7 s.
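A minimal sketch of switching Spark's serializer to Kryo; the registered record type is hypothetical, not one from the original job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type used in caching and shuffles.
case class CustomerRow(id: Long, features: Array[Byte])

object KryoSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      // Replace the default Java serializer with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids writing full class names into the stream.
      .registerKryoClasses(Array(classOf[CustomerRow]))

    val sc = new SparkContext(conf)
    val rows = sc.parallelize(1L to 1000L).map(i => CustomerRow(i, Array.fill(8)(0: Byte)))
    println(rows.count())
    sc.stop()
  }
}
```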
Step 4: Further analysis showed that the load across CPU cores was uneven, memory was not full, and system resources were not being fully used. How should they be sized? (1) The number of tasks Spark creates corresponds to the number of RDD partitions. (2) For a Hadoop RDD, the number of partitions is determined by the number of HDFS blocks. Memory: total system memory = memory per worker × number of workers = SPARK_WORKER_MEMORY × SPARK_WORKER_INSTANCES.
CPU: total number of concurrent tasks = number of workers × cores per worker = SPARK_WORKER_INSTANCES × SPARK_WORKER_CORES. Based on this calculation of task concurrency and memory allocation, the tuned parameters were:
SPARK_WORKER_INSTANCES = 4
SPARK_WORKER_CORES = 3
SPARK_WORKER_MEMORY = 6g
On a machine with a 12-core CPU and 24 GB of memory, tuning these parameters reduced the query time from 7 s to 5 s.
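As a hedged worked example of the arithmetic above: 4 workers × 3 cores = 12 cores and 4 workers × 6 GB = 24 GB, matching the 12-core / 24 GB machine. The Scala sketch below derives a matching level of parallelism; the rule of thumb of roughly two tasks per core is an assumption, not a figure from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    // Values mirroring the tuned worker settings.
    val workerInstances = 4
    val coresPerWorker  = 3
    val memPerWorkerGb  = 6

    val totalCores = workerInstances * coresPerWorker // 12 cores in total
    val totalMemGb = workerInstances * memPerWorkerGb // 24 GB in total

    val conf = new SparkConf()
      .setAppName("parallelism-sketch")
      // Rule of thumb: about two tasks per core keeps every core busy.
      .set("spark.default.parallelism", (totalCores * 2).toString)

    val sc = new SparkContext(conf)
    println(s"total cores = $totalCores, total memory = ${totalMemGb}g")

    // Repartition the input so the task count matches the available cores.
    val data = sc.textFile("hdfs:///data/customer_info/gzipped").repartition(totalCores * 2)
    println(data.count())
    sc.stop()
  }
}
```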
Step 5: Full GC was clearly occurring on the Shark server side. Setting export SHARK_MASTER_MEM=2g reduced the query time from 6 s to 3 s in this step.
Step 6: A CPU bottleneck appeared when the two tables were joined. Analysis showed that the daily table was gzip-compressed, so each query paid the decompression cost. Optimization: make the daily table an in-memory table without gzip compression. The query time dropped from 3 s to 2 s.
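A hedged sketch of this last step. The original pinned the uncompressed daily table in memory (under Shark); the snippet below expresses the same idea with Spark SQL's CACHE TABLE, using hypothetical table and column names.

```scala
import org.apache.spark.sql.SparkSession

object CacheDailyTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-daily-table")
      .enableHiveSupport()
      .getOrCreate()

    // Keep the daily table uncompressed in memory so the join does not pay
    // gzip decompression cost on every query.
    spark.sql("CACHE TABLE daily_table")

    // Hypothetical join between the customer table and the daily table.
    spark.sql(
      """SELECT c.cust_id, d.event_count
        |FROM customer_info_narrow c
        |JOIN daily_table d ON c.cust_id = d.cust_id""".stripMargin).show()

    spark.stop()
  }
}
```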
4. Summary
Optimization is a process of gradual refinement. Looking back, the optimization process was driven mainly by the following considerations: (1) memory; (2) CPU; (3) disk I/O; (4) network I/O; (5) the serialization mechanism. Take these factors as the main line and explore their related aspects boldly.