Hive optimization: increase the number of maps executed, reduce execution time

In Hive, the numbers of map and reduce tasks that are launched are controlled by the system. The number of maps is generally determined by the number of input files and their sizes: if you have many files, each file needs its own map to process it, and if a single file is very large, say n times the HDFS block size, it is split into n pieces and n maps are started. The number of reduces depends on the keys your query produces: once the map phase generates many keys, more reduces are used. It is best to avoid globally aggregating constructs such as ORDER BY, because an operation like that is usually handled by a single reduce.
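As a minimal sketch of an alternative to the global sort (the sales table and its columns are made-up names, not from the original), Hive's DISTRIBUTE BY plus SORT BY can often stand in for ORDER BY when a total order across all rows is not strictly required:

    -- ORDER BY funnels every row through a single reducer for the global sort:
    SELECT * FROM sales ORDER BY amount DESC;

    -- DISTRIBUTE BY + SORT BY sorts in parallel across reducers,
    -- but each reducer's output is only sorted within itself:
    SELECT *
    FROM sales
    DISTRIBUTE BY region
    SORT BY amount DESC;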

How many maps and reduces you get makes a very noticeable difference: the gap between 3 maps and 30 maps is roughly a factor of 10 in running time, which is why we do our best to increase the number of maps. There are usually two ways to achieve this. One is to inflate the file until its size is n times the HDFS block size; the other is to split a large file into many small files, each of which starts its own map when processed. Both are sketched after this paragraph.

The former is usually written like this: CREATE TABLE cajeep_test2 AS SELECT a.*, dummy_string AS dummy_string FROM tdl_en_pp_node_stream_tmp2 a; — the dummy_string garbage column pads the rows and increases the size of the data file, so that the file is split across multiple blocks and processed by multiple maps.

The latter requires SET mapred.reduce.tasks = 30; SET hive.merge.mapredfiles = false;. With these two settings, 30 reduces are started whenever a reduce phase runs, and the files the reduces produce are not merged afterwards. Since each reduce writes one file, the settings alone are not enough: the SQL must actually go through a reduce phase, and not a global one (ORDER BY and the like). Unless the query contains something like a GROUP BY that guarantees more than one reduce, the usual fix is to append DISTRIBUTE BY rand(123). That forces a reduce phase, the reduce phase automatically uses the 30 reduces, and 30 files are produced.
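A sketch of the padding approach follows. The original only shows the dummy_string column being selected; the repeat() call and the padding width here are assumptions about how such a garbage column might be produced:

    -- Pad every row with a throwaway column so the output file grows to
    -- several multiples of the HDFS block size; a later scan of
    -- cajeep_test2 then gets one map per block.
    CREATE TABLE cajeep_test2 AS
    SELECT a.*,
           repeat('x', 200) AS dummy_string   -- assumed garbage padding, never read
    FROM tdl_en_pp_node_stream_tmp2 a;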
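And a sketch of the splitting approach; the two settings and DISTRIBUTE BY rand(123) come from the text above, while the output table name is illustrative:

    -- Force 30 reduces and keep their 30 output files unmerged.
    SET mapred.reduce.tasks = 30;
    SET hive.merge.mapredfiles = false;

    -- DISTRIBUTE BY rand(123) adds a non-global reduce phase that spreads
    -- the rows over the 30 reduces, so 30 files are written.
    CREATE TABLE tdl_en_pp_node_stream_split AS   -- illustrative name
    SELECT *
    FROM tdl_en_pp_node_stream_tmp2
    DISTRIBUTE BY rand(123);

Any later query that scans tdl_en_pp_node_stream_split then starts one map per file, i.e. up to 30 maps.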

For the details, see this article: http://hugh-wangp.iteye.com/blog/1579192 — it explains the technique in more depth, and the implementation here basically follows it.

With the ability to control the number of maps and reduces at will, the quality of the Hive code is much better: each script's runtime has been shortened from 70 minutes to 20 minutes, and each SQL statement no longer runs with a pathetic 2 maps, but with 20.
