Many documents say that the number of mappers cannot be controlled directly, because by default it is determined by the size and number of inputs. How many mappers does a given input end up occupying by default? If you feed in a large number of files and each file is smaller than the HDFS block size, the number of mappers started equals the number of files (each file occupies at least one block), and it is easy to start so many mappers that the job exceeds system limits and crashes. This logic is correct, but it only describes the default behavior; with some custom settings, the number of mappers can in fact be controlled.
In Hadoop, setting the number of map tasks is not as straightforward as setting the number of reduce tasks: you cannot tell Hadoop directly and precisely how many map tasks to start.
You may object: doesn't the API provide org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n)? Isn't that exactly for setting the number of map tasks? The API does exist, but its documentation explains: "Note: This is only a hint to the framework." That is, the value is only a hint to the Hadoop framework and is not decisive; even if you set it, you will not necessarily get the number of map tasks you asked for.
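As a minimal illustration (MyJob is a hypothetical driver class and the numbers are arbitrary), the asymmetry between the two settings looks like this in the old mapred API:

```java
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver fragment; MyJob and the values are placeholders.
public class MyJob {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MyJob.class);
        conf.setNumReduceTasks(10);  // honored exactly: the job runs 10 reduce tasks
        conf.setNumMapTasks(100);    // only a hint: the real map count comes from InputFormat.getSplits()
    }
}
```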
1. InputFormat Introduction
Before trying to set the number of map tasks, it is important to understand the basics of map-reduce input.
This interface (org.apache.hadoop.mapred.InputFormat) describes the input specification (input-specification) of a map-reduce job. It divides all the input files into logical InputSplits, each InputSplit is assigned to a separate mapper, and it also provides a concrete RecordReader implementation that extracts input records from a logical InputSplit and passes them to the mapper for processing.
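For reference, a simplified rendering of the old-API interface (types from org.apache.hadoop.mapred; signatures roughly as in Hadoop 0.20.x):

```java
// Simplified view of org.apache.hadoop.mapred.InputFormat (old API)
public interface InputFormat<K, V> {
  // Logically split the job's input; each InputSplit is handed to one mapper.
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

  // Supply a RecordReader that turns one InputSplit into (key, value) records for the mapper.
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}
```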
InputFormat has many concrete implementations, for example FileInputFormat (the base abstract class for file-based input), DBInputFormat (for database-based input, reading from a table that can be queried with SQL), KeyValueTextInputFormat (a special FileInputFormat for plain text files, where files are split into lines at carriage returns or carriage-return/line-feed pairs and each line is split into key and value by key.value.separator.in.input.line), CombineFileInputFormat, CompositeInputFormat, DelegatingInputFormat, and so on. FileInputFormat and its subclasses are what is used in most scenarios.
From this brief introduction we know that the InputFormat determines the InputSplits, and each InputSplit is assigned to one mapper, so the InputFormat ultimately determines the exact number of map tasks.
2. Factors affecting the number of maps in FileInputFormat
In daily use, FileInputFormat is the most commonly used InputFormat, and it has many concrete implementations. The factors affecting the number of maps discussed below apply only to FileInputFormat and its subclasses; for other InputFormats, consult their getSplits(JobConf job, int numSplits) implementations.
See the following code snippet (excerpted from org.apache.hadoop.mapred.FileInputFormat.getSplits, hadoop-0.20.205.0 source code):
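Lightly abridged and with comments added, the relevant lines are roughly as follows:

```java
// Abridged from FileInputFormat.getSplits(JobConf job, int numSplits), Hadoop 0.20.x
long totalSize = 0;                     // total size of all input files
for (FileStatus file : files) {
  if (file.isDir()) {
    throw new IOException("Not a file: " + file.getPath());
  }
  totalSize += file.getLen();
}

long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong("mapred.min.split.size", 1), minSplitSize);

// ... then, for each splitable input file:
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(goalSize, minSize, blockSize);

// where computeSplitSize is:
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
```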
totalSize: the total size of all the input for the whole map-reduce job.
numSplits: comes from job.getNumMapTasks(), i.e. the value set at job submission with org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n); it is the hint to the M-R framework about the desired number of maps.
goalSize: the total input size divided by the number of map tasks, i.e. how much data each mapper is expected to process. It is only an expectation; the amount actually processed is decided by computeSplitSize below.
minSplitSize: defaults to 1 and can be overridden by a subclass through protected void setMinSplitSize(long minSplitSize). It is 1 in all but special cases.
minSize: the larger of mapred.min.split.size (default 1) and minSplitSize.
blockSize: the HDFS block size, 64 MB by default; large HDFS clusters are often configured with 128 MB.
splitSize: the final size of each split; the number of maps is then basically totalSize / splitSize.
Now look at the logic of computeSplitSize: first take the smaller of goalSize (the amount of data each mapper is expected to process) and the HDFS block size, then take the larger of that and minSize (mapred.min.split.size); in other words, splitSize = max(minSize, min(goalSize, blockSize)).
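A quick worked example with hypothetical numbers (10 GB of input, a hint of 50 maps, 64 MB blocks) shows why the hint alone often has no effect:

```java
public class SplitSizeExample {
    public static void main(String[] args) {
        // Hypothetical numbers, only to trace splitSize = max(minSize, min(goalSize, blockSize)).
        long totalSize = 10L * 1024 * 1024 * 1024;  // 10 GB of total input (assumed)
        int  numSplits = 50;                        // hint passed via setNumMapTasks (assumed)
        long goalSize  = totalSize / numSplits;     // ~205 MB expected per mapper
        long minSize   = 1;                         // default mapred.min.split.size
        long blockSize = 64L * 1024 * 1024;         // 64 MB HDFS block

        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize)); // = 64 MB
        System.out.println("maps ~= " + totalSize / splitSize);            // 160, not the hinted 50
    }
}
```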
3. How to adjust the number of maps
With the analysis in section 2, it is now easy to adjust the number of maps.
3.1 Reduce the number of mappers created when a map-reduce job starts
When processing large volumes of big data, a common situation is that the job starts too many mappers, exceeding system limits and causing Hadoop to throw an exception and abort. The way to avoid this is to reduce the number of mappers. Specifically:
3.1.1 The input files are large (not small files)
In this case the number of mappers can be reduced by increasing the amount of input each mapper receives, that is, by increasing minSize or increasing blockSize. Increasing blockSize usually does not work out, because blockSize is fixed when Hadoop formats HDFS (namenode -format reads dfs.block.size at format time); changing blockSize would require reformatting HDFS, which of course loses the existing data. So in practice the only option is to increase minSize, i.e. raise the value of mapred.min.split.size (see the example below).
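For instance (the 256 MB figure is only illustrative), raising mapred.min.split.size above the block size packs several blocks into one split:

```java
import org.apache.hadoop.mapred.JobConf;

public class FewerMappers {
    public static void main(String[] args) {
        JobConf conf = new JobConf(FewerMappers.class);
        // Illustrative value: make every split at least 256 MB (four 64 MB blocks per mapper),
        // so roughly a quarter as many map tasks are started.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
    }
}
```

The same property can also be passed on the command line as -D mapred.min.split.size=268435456 when the driver goes through ToolRunner/GenericOptionsParser.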
3.1.2 The input consists of a large number of small files
Small files are files whose individual size is smaller than blockSize. In this situation increasing mapred.min.split.size does not help; you need to use CombineFileInputFormat, which is derived from FileInputFormat, to merge multiple input paths into a single InputSplit for one mapper to process, thereby reducing the number of mappers. Details will be expanded in a later update; a rough sketch follows below.
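As a rough sketch of that approach, assuming a newer Hadoop release where the mapreduce API ships the concrete CombineTextInputFormat (in the old mapred API, CombineFileInputFormat is abstract and you must supply your own RecordReader):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFiles {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Pack many small files into each split, up to an (illustrative) 128 MB per split.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```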
3.2 Increase the number of mappers created when a map-reduce job starts
Increasing the number of mappers is achieved by reducing the amount of input each mapper receives, i.e. by decreasing blockSize or decreasing mapred.min.split.size. Note that, by the formula above, the split size only falls below the block size if goalSize does, so in practice this also means passing a large enough hint through setNumMapTasks (see the sketch below).
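A minimal sketch, assuming the numbers are only illustrative and mapred.min.split.size has not been raised elsewhere:

```java
import org.apache.hadoop.mapred.JobConf;

public class MoreMappers {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MoreMappers.class);
        // Ask for many splits so that goalSize = totalSize / numSplits drops below the block size;
        // with minSize left at its default of 1, splitSize becomes goalSize and more maps are created.
        conf.setNumMapTasks(1000);                 // illustrative hint
        conf.setLong("mapred.min.split.size", 1);  // keep the lower bound at its default
    }
}
```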
Original link: http://blog.csdn.net/yishao_20140413/article/details/24932655