InputFormat analyzed: three problems of data partitioning, split scheduling, and data reading

When executing a job, Hadoop divides the input data into n splits and then launches n corresponding map tasks to process them separately.
How is the data divided? How are splits scheduled (that is, how is it decided which TaskTracker machine the map task for a split should run on)? How is the divided data read? These are the three questions discussed in this article.

Let's start with a classic MapReduce workflow chart:



1. The user runs a mapred program;
2. This run produces a job, so the JobClient applies to the JobTracker for a job ID to identify the job;
3. The JobClient submits the resources required by the job to a directory named after the job ID on HDFS; these resources include the jar package, configuration files, the computed InputSplits, and so on;
4. The JobClient submits the job to the JobTracker;
5. The JobTracker initializes the job;
6. The JobTracker obtains the job's split information and other data from HDFS;
7. The JobTracker assigns tasks to TaskTrackers;
8. The TaskTracker obtains the job's resources from HDFS;
9. The TaskTracker starts a new JVM;
10. The TaskTracker executes the map or reduce task in the new JVM;
......
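Steps 1 through 4 all happen on the client side. As a rough reference point, a minimal driver written against the old mapred API looks like the sketch below (the class name, job name, and path arguments are illustrative, not taken from the original article):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class MyJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyJob.class);
            conf.setJobName("my-job");

            // The InputFormat chosen here is what getSplits() will later be called on.
            conf.setInputFormat(TextInputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // With no mapper/reducer set, the identity classes are used, so the output
            // types simply match what TextInputFormat produces.
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            // JobClient computes the splits, uploads the job resources to HDFS and
            // submits the job to the JobTracker (steps 2-4 above).
            JobClient.runJob(conf);
        }
    }

Running this with an input path and an output path triggers exactly the submission sequence above; the rest of this article is about what happens to the input data from step 3 onward.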
To answer the three questions raised earlier, several points in this process need to be expanded on.

The first question is how the data is divided.
In step 3, the resources that the JobClient submits to HDFS include the InputSplits, which are the result of data partitioning. In other words, data partitioning is done on the JobClient: the JobClient uses the specified InputFormat to divide the input data into a number of splits.

InputFormat is an interface. The user needs to specify an InputFormat implementation when starting a MapReduce job. The interface contains only two methods:
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
getSplits() is the partitioning function used here. The job parameter is the job's configuration, from which the input file paths specified by the user at job startup can be obtained. The numSplits parameter is a suggested number of splits; whether this hint is honored is up to the specific InputFormat implementation.
The returned InputSplit array describes all the splits; each InputSplit describes one split.

InputSplit is also an interface; which concrete implementation is returned is determined by the specific InputFormat. InputSplit likewise has only two interface methods:
long getLength() throws IOException;
String[] getLocations() throws IOException;
This interface only describes how long the split is and where the split is stored (that is, on which machines the split's data resides on HDFS; since data may be replicated, it can exist on more than one machine). There is no other direct description of the split. For example: which file does the split correspond to? What are its start and end positions in the file? Such important characteristics are not described here.
Why? Because the MapReduce framework does not need to care about the descriptive details of a split. The framework only cares about the split's length (mainly for statistics) and the split's locations (mainly for split scheduling, discussed in detail later).
The really important descriptive information in a split is something only the InputFormat cares about. When a split needs to be read, its InputSplit is passed to the InputFormat's second method, getRecordReader(), where it is used to initialize a RecordReader that parses the input data. In other words, the important information describing a split is hidden, and only the specific InputFormat knows it. The only requirement is that getSplits() and getRecordReader() work with the same InputSplit implementation. This gives InputFormat implementations a great deal of flexibility.
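As a minimal sketch of this flexibility (the class, its fields, and the "table row range" idea are invented for illustration and are not part of Hadoop), a custom InputSplit can carry arbitrary hidden fields, as long as it serializes them itself and exposes only a length and locations to the framework:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputSplit;

    // A toy InputSplit: the "important" fields (a table name and a row range) are
    // invisible to the framework, which only ever calls getLength()/getLocations().
    public class TableRangeSplit implements InputSplit {
        String table;                 // hypothetical fields, known only to our InputFormat
        long firstRow, lastRow;
        String[] hosts;

        public TableRangeSplit() {}   // no-arg constructor needed for deserialization
        public TableRangeSplit(String table, long firstRow, long lastRow, String[] hosts) {
            this.table = table; this.firstRow = firstRow; this.lastRow = lastRow; this.hosts = hosts;
        }

        public long getLength() { return lastRow - firstRow + 1; }   // statistics only
        public String[] getLocations() { return hosts; }             // scheduling only

        // InputSplit extends Writable: the split is serialized by the JobClient and read
        // back inside the task, so all hidden descriptive fields must be written here.
        public void write(DataOutput out) throws IOException {
            Text.writeString(out, table);
            out.writeLong(firstRow);
            out.writeLong(lastRow);
        }
        public void readFields(DataInput in) throws IOException {
            table = Text.readString(in);
            firstRow = in.readLong();
            lastRow = in.readLong();
            hosts = new String[0];    // locations are not needed on the task side
        }
    }

    // Inside the matching InputFormat, getRecordReader() simply casts the split back:
    //     TableRangeSplit s = (TableRangeSplit) split;
    // and uses s.table, s.firstRow and s.lastRow to build its RecordReader.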

The most common FileInputFormat (which implements InputFormat) uses FileSplit (which implements InputSplit) to describe a split. FileSplit contains the following information:
private Path file;        // the file the split belongs to
private long start;       // start position of the split within the file
private long length;      // length of the split; getLength() returns it
private String[] hosts;   // host names where the split's data resides; getLocations() returns them
The companion RecordReader then gets this information from the FileSplit and parses the file FileSplit.file, reading the content between FileSplit.start and FileSplit.start + FileSplit.length.
As for the specific partitioning strategy, FileInputFormat by default generates one FileSplit for each HDFS block of the file. Naturally, FileSplit.start is the offset of the corresponding block within the file, FileSplit.length is the block's length, and FileSplit.hosts are the block's locations.
However, you can set the "mapred.min.split.size" parameter so that the split size becomes larger than a block, in which case FileInputFormat packs several contiguous blocks into one split, and a block may also be cut across two splits (with the constraint that a split never spans files). The start and length of each split are fully determined by this division. The locations of a split, however, must be derived from all the blocks it covers, trying as far as possible to find a location they share. These blocks may well have no common location, in which case a location closest to the blocks is chosen as the split's location.
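For reference, the split size used by the old FileInputFormat comes out of a simple formula; the sketch below reproduces that logic (the class wrapper and the sample numbers are only for illustration):

    // Sketch of how FileInputFormat-style code derives the split size (the method name
    // mirrors FileInputFormat.computeSplitSize; the class is just a container).
    public class SplitSizeSketch {
        // minSize comes from "mapred.min.split.size"; blockSize is the file's HDFS block
        // size; goalSize is totalInputSize / numSplits (numSplits being the hint passed
        // to getSplits()).
        static long computeSplitSize(long goalSize, long minSize, long blockSize) {
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L << 20;   // 128 MB HDFS block
            long minSize   = 256L << 20;   // mapred.min.split.size = 256 MB
            long goalSize  = 1024L << 20;  // e.g. 1 GB of input with numSplits = 1
            // Prints 256: two contiguous 128 MB blocks end up in one split.
            System.out.println(computeSplitSize(goalSize, minSize, blockSize) >> 20);
        }
    }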

There is also CombineFileInputFormat (which implements InputFormat), which can pack several splits into one to avoid launching too many map tasks (since the number of splits determines the number of maps). Setting the "mapred.min.split.size" parameter lets FileInputFormat achieve something similar, but a FileSplit must cover contiguous blocks, which in most cases will not share a common location.
CombineFileInputFormat uses CombineFileSplit (which implements InputSplit) to describe a split. The CombineFileSplit members are as follows:
private Path[] paths;          // the file each sub-split corresponds to
private long[] startoffset;    // the start position of each sub-split within its file
private long[] lengths;        // the length of each sub-split
private String[] locations;    // host names for the split as a whole; getLocations() returns them
private long totLength;        // the sum of all sub-split lengths; getLength() returns it
The first three arrays must be of equal length and correspond element by element, together describing each sub-split. Note that locations does not describe each sub-split individually but the split as a whole. This is because when CombineFileInputFormat packs a group of sub-splits, it takes their locations into account and tries to pack sub-splits that sit at the same (or nearby) location into one CombineFileSplit; the resulting locations are naturally an integration of all the sub-splits' locations.
Similarly, the RecordReader used here gets its information from the CombineFileSplit and, for each index i, parses the file CombineFileSplit.paths[i], reading the content between CombineFileSplit.startoffset[i] and CombineFileSplit.startoffset[i] + CombineFileSplit.lengths[i].
As for the partitioning strategy, CombineFileInputFormat first splits the input files into several sub-splits, each corresponding to one HDFS block of a file. Then, following the "mapred.max.split.size" configuration, sub-splits that share a common location are packed together, as long as their total length does not exceed that value, producing one CombineFileSplit. In the end some sub-splits may be left over that do not satisfy the common-location condition; when packing those, a location closest to them is chosen as the split's location.
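The packing idea can be illustrated with a deliberately simplified sketch (plain Java, not Hadoop source; it assumes one location per block and ignores the rack-level fallback), where maxSplitSize plays the role of "mapred.max.split.size":

    import java.util.*;

    // Illustrative sketch: pack block-sized sub-splits that share a host into combined
    // splits whose total length stays within maxSplitSize.
    public class CombinePackingSketch {
        static class SubSplit {
            final String host;     // simplification: one location per block
            final long length;
            SubSplit(String host, long length) { this.host = host; this.length = length; }
        }

        static List<List<SubSplit>> pack(List<SubSplit> blocks, long maxSplitSize) {
            // group sub-splits by the host that holds them
            Map<String, List<SubSplit>> byHost = new HashMap<>();
            for (SubSplit b : blocks) {
                byHost.computeIfAbsent(b.host, h -> new ArrayList<>()).add(b);
            }
            List<List<SubSplit>> combined = new ArrayList<>();
            for (List<SubSplit> sameHost : byHost.values()) {
                List<SubSplit> current = new ArrayList<>();
                long size = 0;
                for (SubSplit b : sameHost) {
                    if (size + b.length > maxSplitSize && !current.isEmpty()) {
                        combined.add(current);      // close the current combined split
                        current = new ArrayList<>();
                        size = 0;
                    }
                    current.add(b);
                    size += b.length;
                }
                if (!current.isEmpty()) combined.add(current);
            }
            return combined;
        }
    }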

Sometimes an input file cannot be divided (for example, a tar.gz file: cutting it apart would make it impossible to decompress), and this also has to be considered when designing an InputFormat. You can override FileInputFormat's isSplitable() method to declare such a file non-splittable (see the sketch below), or simply implement your own InputFormat from scratch.
Because the InputSplit interface is very flexible, you can also design quite exotic partitioning schemes.
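A minimal sketch of the isSplitable() route, assuming the old mapred API (TextInputFormat already refuses to split files compressed with non-splittable codecs such as gzip; the subclass below only illustrates where the override goes):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Treat .gz files as non-splittable so each one becomes exactly one split.
    public class WholeGzipTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return !file.getName().endsWith(".gz");
        }
    }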

Next comes the question of how splits are scheduled.
While discussing how the input data is divided, we kept mentioning locations: the InputSplit interface has getLocations(), an InputFormat implementation has to care about block locations when generating InputSplits, and when more than one block is put into a single InputSplit it also has to merge their locations.
So what exactly is this location information for? It mainly serves as a reference for split scheduling.

Let's briefly look at how the JobTracker assigns the map task for a split to a TaskTracker. In step 6 of the earlier flowchart, the JobTracker obtains the job's split information from HDFS and from it generates a series of map and reduce tasks to be processed. The JobTracker does not proactively carve out a subset of tasks for each TaskTracker; instead it puts all tasks directly into the job's pending-task list.
Each TaskTracker periodically sends a heartbeat to the JobTracker; besides signaling liveness, it reports the remaining map and reduce slots it can currently take on (the total slots are configured by "mapred.tasktracker.map.tasks.maximum" and "mapred.tasktracker.reduce.tasks.maximum"). If the TaskTracker has spare slots and there are pending tasks, the JobTracker assigns a task to the TaskTracker in the heartbeat response (map tasks are assigned with priority).
When assigning a map task, the split's location information comes into play. Based on the TaskTracker's address, the JobTracker selects a map task whose split has the closest location (note that a split can have multiple locations). In this way, the locations of the blocks in the input file, after a series of integrations and hand-offs (by the InputFormat), ultimately influence map task assignment. The result is that map tasks tend to process data stored locally, which keeps things efficient.
Of course, location is only one of the factors the JobTracker considers when assigning a map task. Before selecting a task, the JobTracker has to select a job (there may be several jobs waiting to be processed), which depends on the scheduling policy of the configured TaskScheduler. The JobTracker then prioritizes tasks that need to be retried after a failure, and tries not to assign a retried task to a machine on which it has already failed.
The JobTracker does not consider location when assigning reduce tasks, because in most cases a reduce processes the output of all maps, which is scattered across every corner of the Hadoop cluster, so considering location would be pointless.
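To make the locality preference concrete, here is a deliberately simplified illustration of the selection idea (plain Java, not JobTracker source; the real scheduler also weighs job selection, failed-task retries, and rack-locality):

    import java.util.List;

    // Given the split locations of the pending map tasks, prefer a task whose split
    // lives on the requesting TaskTracker's host.
    public class LocalityPickSketch {
        static int pickMapTask(List<String[]> pendingSplitLocations, String trackerHost) {
            for (int i = 0; i < pendingSplitLocations.size(); i++) {
                for (String host : pendingSplitLocations.get(i)) {
                    if (host.equals(trackerHost)) {
                        return i;                  // data-local task found
                    }
                }
            }
            // No data-local task: fall back to any pending task (the real JobTracker
            // would try rack-local candidates before giving up on locality entirely).
            return pendingSplitLocations.isEmpty() ? -1 : 0;
        }
    }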

The last question is how the divided data is read.
In step 10 of the earlier flowchart, the TaskTracker starts a new JVM to execute the map program. When the map runs, InputFormat.getRecordReader() is called, and the returned RecordReader object is used to read each record in the split (getRecordReader() initializes the RecordReader with the InputSplit).
At first glance, the RecordReader seems to use the split's location information to decide where the data should be read from. But that is not the case. As mentioned earlier, a split's locations may well have been integrated by the InputFormat and need not be the real locations of the blocks (and even genuine block locations are not guaranteed to still hold, since blocks may have been moved between the time the InputSplit was generated on the JobClient and now).
Frankly, a split's location is merely where the InputFormat expects the split to be processed; it can have no relationship at all to the actual block locations. An InputFormat could even define a split's location as "the location farthest from every block the split contains", although most of the time we naturally want the map program to read its input data locally.
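For reference, the read path on the map side can be reproduced with the public API; the sketch below opens the first split of an input path with TextInputFormat and iterates over its records the way a map task would (the input path argument is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Standalone sketch: open one split with TextInputFormat's RecordReader and
    // iterate over its records, the same way a map task consumes its split.
    public class ReadSplitSketch {
        public static void main(String[] args) throws IOException {
            JobConf job = new JobConf();
            TextInputFormat inputFormat = new TextInputFormat();
            inputFormat.configure(job);

            // Normally the framework deserializes the InputSplit that getSplits()
            // produced; here we simply take the first split computed for the path.
            FileInputFormat.setInputPaths(job, args[0]);
            InputSplit split = inputFormat.getSplits(job, 1)[0];

            RecordReader<LongWritable, Text> reader =
                    inputFormat.getRecordReader(split, job, Reporter.NULL);
            LongWritable key = reader.createKey();
            Text value = reader.createValue();
            try {
                while (reader.next(key, value)) {   // one record per iteration
                    System.out.println(key + "\t" + value);
                }
            } finally {
                reader.close();
            }
        }
    }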

So the RecordReader does not care about the split's locations; it just opens the split's path(s). As mentioned earlier, a RecordReader is created and returned by a specific InputFormat, and it must be paired with the InputSplit that the InputFormat uses. For example, the RecordReader paired with FileSplit reads the corresponding interval of FileSplit.file, while the one paired with CombineFileSplit reads the corresponding interval of each file in CombineFileSplit.paths.

The open operation that the RecordReader performs on a path goes through the DFSClient, which asks the HDFS NameNode for the block information of the file in the requested interval: which blocks there are and, for each block, its locations in order. To read or write a block, the DFSClient always uses the first location returned by the NameNode, falling back to the subsequent locations in turn only if the read or write fails.
When the NameNode handles the open request (getBlockLocations), after obtaining a block's locations it sorts them by distance from the DFSClient's address, closest first. Since the DFSClient always picks the front of the list, the RecordReader ends up preferring to read local data (when a local copy exists).
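The same block-location information is visible through the public FileSystem API; the sketch below (the path argument is illustrative) prints the hosts holding each block of a file, which is exactly the raw material FileInputFormat turns into FileSplit.hosts:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Print which hosts hold each block of a file.
    public class BlockLocationsSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path(args[0]);                    // e.g. an HDFS file path
            FileStatus status = fs.getFileStatus(path);
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println(block.getOffset() + "+" + block.getLength()
                        + " -> " + String.join(",", block.getHosts()));
            }
        }
    }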

However, regardless of whether the block is local, the DFSClient establishes a connection to a DataNode and then requests the data. It will not read the on-disk files directly just because the block is local, because those files are managed by the DataNode: the DataNode has to map the block to the corresponding physical file, and it also has to coordinate concurrent reads and writes to that file. So the difference between local and non-local reads lies only in the network transfer: the former merely takes a round trip through the local network protocol stack, while the latter involves real network communication. If the block is not far away and the network is reasonably smooth, a non-local read does not add much overhead.

So Hadoop first pursues data-local maps, that is, input data stored on the machine itself. Failing that, it settles for rack-local, that is, input data stored on another machine in the same rack, where the network overhead generally has little performance impact. If neither condition can be met, the network transfer can be expensive, and Hadoop tries hard to avoid it. This is why, as mentioned earlier, when the blocks belonging to one split have no common location, a nearest location is computed instead.

