Hive parameter-level optimization, part one: controlling the number of maps

1. Determining the number of maps

Generally, a job generates one or more map tasks from its input files;

The main factors deciding the number of maps are: the total number of input files, the input file sizes, and the block size configured in the cluster (dfs.block.size; you can view it by running the set dfs.block.size command in Hive; this parameter cannot be customized there);

Principle for splitting files into blocks: if a file is larger than the block size (128 MB), the file is split; if it is smaller, the whole file is treated as a single block.

 

Example 1:

Suppose there is a file a in the input directory with a size of 780 MB. Hadoop splits file a into 7 blocks (6 blocks of 128 MB and 1 block of 12 MB, with a block size of 128 MB), resulting in 7 maps;

 

Example 2:

Assume the input directory contains three files a, b, and c, of 10 MB, 20 MB, and 130 MB respectively. Hadoop splits them into 4 blocks (10 MB, 20 MB, 128 MB, and 2 MB), resulting in 4 maps;
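The splitting rule behind both examples can be sketched as a small calculation. This is a simplified model (it ignores the split-slop factor Hadoop actually applies near block boundaries), assuming a 128 MB block size:

```python
import math

BLOCK_SIZE = 128  # MB; the assumed dfs.block.size

def num_splits(file_sizes_mb):
    """Count input splits: a file larger than the block size is cut into
    block-sized pieces; a smaller file occupies one split by itself."""
    splits = 0
    for size in file_sizes_mb:
        splits += max(1, math.ceil(size / BLOCK_SIZE))
    return splits

# Example 1: one 780 MB file -> 6 full 128 MB blocks + one 12 MB block
print(num_splits([780]))          # 7
# Example 2: 10 MB, 20 MB, 130 MB -> 1 + 1 + 2 (128 MB + 2 MB) blocks
print(num_splits([10, 20, 130]))  # 4
```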

 

There are two ways to control the number of maps: reducing the map count and increasing the map count.

To reduce the number of maps, you can merge small files at the source;

To increase the number of maps, you can control the number of reducers of the previous job (an SQL statement that joins multiple tables is split into multiple MapReduce jobs), because the number of reduce output files of the previous job determines the number of maps of the next job;
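As a hedged sketch of the small-file-merging side, the Hive merge settings discussed later in this article can be combined with a rewrite of the source table. The table names src_small_files and src_merged below are hypothetical:

```sql
-- Sketch: reduce the map count of downstream jobs by pre-merging
-- a table made of many small files into fewer, larger files.
set hive.merge.mapfiles=true;            -- merge small files after map-only jobs
set hive.merge.mapredfiles=true;         -- merge small files after map-reduce jobs
set hive.merge.size.per.task=256000000;  -- target size (~256 MB) of merged files

-- Rewriting the table lets the merge settings take effect on its files:
create table src_merged as select * from src_small_files;
```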

 

2. Problems caused by too many or too few maps

Too many maps:

1) The map stage outputs many small files, and the next stage has to merge them before use; if many small files are handed to reduce, a large number of reduce tasks are needed;

2) The overhead of initializing and creating each map task is large;

3) If a task has many small files (much smaller than the 128 MB block size), each small file is also treated as a block and handled by its own map task. Since the startup and initialization time of a map task is much longer than its logic processing time, this wastes a great deal of resources. In addition, the number of maps a job can execute is also limited.

 

Too few maps cause the following problems:

1) File processing or query concurrency is low, and job execution takes too long;

2) When there are many jobs, the cluster is easily blocked;

3) Speculative execution is triggered frequently;

 

Is it always best to ensure that each map processes a file block of nearly 128 MB?

Not necessarily. For example, a file slightly smaller than 128 MB would normally be handled by a single map, but if it has only one or two small fields yet tens of millions of records, and the map's processing logic is complex, completing it with a single map task is certainly time-consuming.

 

3. Setting the map count: mapred.map.tasks

When the input files are large, the task logic is complex, and map execution is very slow, you can consider increasing the number of maps to reduce the amount of data each map processes, thereby improving task execution efficiency;

The default value of mapred.map.tasks in Hive is 2;

 

Case environment:

hive.merge.mapredfiles=true (default false; can be configured in hive-site.xml; merges small files after the entire job finishes)

hive.merge.mapfiles=true (merges small files after map-only jobs)

hive.merge.size.per.task=256000000

mapred.map.tasks=2 (the default number of maps in Hive)

By default, small-file merging is enabled, and the combination of dfs.block.size and hive.merge.size.per.task makes the vast majority of merged files about 256 MB in size;

 

Case 1:

Now assume there are three files, each 300 MB in size, and the number of maps is not set;

The entire job gets 6 maps: 3 of them each process 256 MB of data, and 3 of them each process 44 MB of data;

Bucket effect: the execution time of the job's map stage is determined not by the fastest map but by the slowest one. Therefore, although the three maps that process only 44 MB finish quickly, they still have to wait for the other three maps that process 256 MB. Clearly, the three 256 MB maps slow down the entire job.

 

Case 2:

If we set mapred.map.tasks to 6, let's see what changes:

goalsize = min(900 MB / 6, 256 MB) = 150 MB

The entire job is again allocated 6 maps, but each map processes 150 MB of data, which is very even. No single map holds the job back, resources are allocated rationally, and the execution time is about 59% (150/256) of Case 1.
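Case 2's arithmetic can be checked directly. This is a simplified sketch of the split-size formula (constants taken from the case description, not a full reimplementation of Hadoop's computeSplitSize):

```python
# Inputs from the case: three 300 MB files, mapred.map.tasks=6,
# and a 256 MB per-split ceiling.
total_mb = 900
requested_maps = 6
max_split_mb = 256

# goalsize = min(total input size / requested maps, split ceiling)
goal_size = min(total_mb / requested_maps, max_split_mb)
print(goal_size)                           # 150.0 MB per map
print(round(goal_size / max_split_mb, 2))  # 0.59, i.e. ~59% of Case 1's slowest map
```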

 

Case 3:

select data_desc, count(1), count(distinct id), sum(case when …), sum(case when ...),  sum(…) from a group by data_desc

 

If table a has only one file of 120 MB, but it contains tens of millions of records, completing the task with a single map is time-consuming. In this case, we should consider splitting the file into multiple files so that multiple map tasks can process it.

set mapred.reduce.tasks=10;
create table a_1 as select * from a distribute by rand(123);

 

This way, the records of table a are randomly distributed into table a_1, which contains 10 files. Replacing table a with a_1 in the preceding SQL then uses 10 map tasks, each processing just over 12 MB of data (millions of records), which is much more efficient.
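As a sketch, the earlier aggregation query can then be pointed at the redistributed table; the elided sum(case when …) expressions are kept as placeholders exactly as in the original query:

```sql
-- Same aggregation as before, now reading a_1 (10 files -> 10 map tasks).
select data_desc, count(1), count(distinct id),
       sum(case when …), sum(case when ...), sum(…)
from a_1
group by data_desc;
```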

 

4. Summary of map count

These two approaches may seem contradictory: one merges small files, while the other splits a large file into smaller ones. This is exactly the point to focus on. According to the actual situation, control the map count by following two principles:

1) Use a suitable number of maps for large data volumes;

2) Have each map task process an appropriate amount of data;
