Hadoop beginners often ask two questions: 1. If an HDFS block is 64 MB by default, can a single row of a line-oriented text file be split across two blocks? 2. When a file is divided into InputSplits for reading, can a single row be split across two InputSplits? If so, one InputSplit contains an incomplete row; will the mapper processing that InputSplit produce incorrect results?
To answer these questions, we first need to clarify two concepts: block and InputSplit.
1. A block is the unit in which HDFS stores a file (64 MB by default);
2. An InputSplit is the input unit that MapReduce uses to process a file. It is purely a logical concept: an InputSplit does not actually cut the file; it only records the location of the data to be processed (the file's path and hosts) and the extent to process (determined by start and length).
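To make the distinction concrete, here is a minimal sketch of the metadata an InputSplit carries. This is a hypothetical illustration class, not Hadoop's actual FileSplit; the point is that it holds only coordinates into the file, never file contents:

```java
// Hypothetical sketch of the metadata an InputSplit records.
// It describes WHERE to read, not WHAT was read.
public class SimpleFileSplit {
    private final String path;    // file the split belongs to
    private final long start;     // byte offset where the split begins
    private final long length;    // number of bytes the split covers
    private final String[] hosts; // DataNodes holding the underlying data

    public SimpleFileSplit(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    public String getPath()    { return path; }
    public long getStart()     { return start; }
    public long getLength()    { return length; }
    public String[] getHosts() { return hosts; }
}
```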
Therefore, a row of line-oriented text may indeed be split across different blocks, and even across different DataNodes. And by analyzing the getSplits method of FileInputFormat, we can see that a row may likewise be split across different InputSplits:
```java
public List<InputSplit> getSplits(JobContext job) throws IOException {
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0) {
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                                         blkLocations[blkIndex].getHosts()));
                    bytesRemaining -= splitSize;
                }

                if (bytesRemaining != 0) {
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                                         blkLocations[blkLocations.length - 1].getHosts()));
                }
            } else { // not splitable
                splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts()));
            }
        } else {
            // Create empty hosts array for zero length files
            splits.add(makeSplit(path, 0, length, new String[0]));
        }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
}
```
The code above shows that splitting a file is actually quite simple: get the file's path and block information on HDFS, then split the file according to splitSize, where splitSize = computeSplitSize(blockSize, minSize, maxSize). blockSize, minSize, and maxSize are all configurable; by default, splitSize equals the default blockSize (64 MB).
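The clamping performed by computeSplitSize can be sketched as a standalone re-implementation (for illustration only): the block size is bounded below by minSize and above by maxSize.

```java
// Standalone sketch of the split-size computation: clamp blockSize
// into the [minSize, maxSize] range.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB default block size
        // With default minSize (1) and an unbounded maxSize,
        // splitSize equals blockSize.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}
```

This is why raising minSize above the block size produces fewer, larger splits, while lowering maxSize below the block size produces more, smaller splits.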
FileInputFormat splits the file strictly by offset, so a long row may well be split across different InputSplits. However, this does not affect the map phase: the RecordReader associated with FileInputFormat is designed to be robust enough that when a row spans an InputSplit boundary, it keeps reading into the next InputSplit until the record is complete. For line-oriented text, Hadoop uses TextInputFormat by default, and TextInputFormat is associated with LineRecordReader. Let's look at how the nextKeyValue method of LineRecordReader reads the file:
```java
while (getFilePosition() <= end) {
    newSize = in.readLine(value, maxLineLength,
                          Math.max(maxBytesToConsume(pos), maxLineLength));
    if (newSize == 0) {
        break;
    }
```
It reads the file through the readLine method of LineReader (in is a LineReader instance):
```java
public int readLine(Text str, int maxLineLength,
                    int maxBytesToConsume) throws IOException {
    if (this.recordDelimiterBytes != null) {
        return readCustomLine(str, maxLineLength, maxBytesToConsume);
    } else {
        return readDefaultLine(str, maxLineLength, maxBytesToConsume);
    }
}

/**
 * Read a line terminated by one of CR, LF, or CRLF.
 */
private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
        throws IOException {
    str.clear();
    int txtLength = 0;          // tracks str.getLength(), as an optimization
    int newlineLength = 0;      // length of terminating newline
    boolean prevCharCR = false; // true if prev char was CR
    long bytesConsumed = 0;
    do {
        int startPosn = bufferPosn; // starting from where we left off the last time
        if (bufferPosn >= bufferLength) {
            startPosn = bufferPosn = 0;
            if (prevCharCR)
                ++bytesConsumed; // account for CR from previous read
            bufferLength = in.read(buffer);
            if (bufferLength <= 0)
                break; // EOF
        }
        for (; bufferPosn < bufferLength; ++bufferPosn) { // search for newline
            if (buffer[bufferPosn] == LF) {
                newlineLength = (prevCharCR) ? 2 : 1;
                ++bufferPosn; // at next invocation proceed from following byte
                break;
            }
            if (prevCharCR) { // CR + notLF, we are at notLF
                newlineLength = 1;
                break;
            }
            prevCharCR = (buffer[bufferPosn] == CR);
        }
        int readLength = bufferPosn - startPosn;
        if (prevCharCR && newlineLength == 0)
            --readLength; // CR at the end of the buffer
        bytesConsumed += readLength;
        int appendLength = readLength - newlineLength;
        if (appendLength > maxLineLength - txtLength) {
            appendLength = maxLineLength - txtLength;
        }
        if (appendLength > 0) {
            str.append(buffer, startPosn, appendLength);
            txtLength += appendLength;
        }
    } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume); // ①

    if (bytesConsumed > (long) Integer.MAX_VALUE)
        throw new IOException("Too many bytes before newline: " + bytesConsumed);
    return (int) bytesConsumed;
}
```
Let's analyze the readDefaultLine method. The do-while loop body reads from the file and scans the bytes just read, terminating the loop when it finds the default line delimiter. As mentioned above, for rows that span InputSplits, LineRecordReader automatically reads across the InputSplit boundary. This is reflected in the loop's termination condition at ① above:
```java
} while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);
```
newlineLength == 0 means that no line delimiter was found in the bytes read during this iteration. Because maxBytesToConsume defaults to Integer.MAX_VALUE, the loop keeps reading as long as no delimiter is encountered, until the bytes consumed exceed maxBytesToConsume. This is what allows a row of records to be read across InputSplits, but it also raises two further questions:
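The keep-reading-until-newline behavior can be illustrated with a simplified, self-contained sketch (a hypothetical helper, not Hadoop's LineReader): the reader repeatedly refills a small buffer, and each refill stands in for crossing a buffer (or split) boundary, appending to the current line until a newline appears.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical, simplified version of the readDefaultLine loop: keep
// refilling a small buffer and appending to the current line until a
// newline is found, however many refills that takes.
public class LineAcrossBuffers {
    static String readLine(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        StringBuilder line = new StringBuilder();
        int newlineLength = 0;
        try {
            do {
                int bufferLength = in.read(buffer); // refill: "cross a boundary"
                if (bufferLength <= 0) break;       // EOF
                for (int i = 0; i < bufferLength && newlineLength == 0; i++) {
                    if (buffer[i] == '\n') {
                        newlineLength = 1;          // found the terminator
                    } else {
                        line.append((char) buffer[i]);
                    }
                }
            } while (newlineLength == 0);           // same condition shape as ①
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // The line is 11 bytes long but the buffer holds only 4: the loop
        // refills three times before it finds the newline.
        InputStream in = new ByteArrayInputStream("hello world\nrest".getBytes());
        System.out.println(readLine(in, 4)); // prints "hello world"
    }
}
```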
1. Since LineReader's readLine method does not handle the end of an InputSplit, could reading an InputSplit continue without bound?
2. If a record L spans two InputSplits A and B, and L is read completely while processing A, then when B is read, how is the leading part of L that lies in B skipped?
Hadoop solves both problems in LineRecordReader's nextKeyValue method:
```java
public boolean nextKeyValue() throws IOException {
    if (key == null) {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
        value = new Text();
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end) { // ②
        newSize = in.readLine(value, maxLineLength,
                              Math.max(maxBytesToConsume(pos), maxLineLength));
        if (newSize == 0) {
            break;
        }
        pos += newSize;
        inputByteCounter.increment(newSize);
        if (newSize < maxLineLength) {
            break;
        }
        // line too long. try again
        LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
    }
    if (newSize == 0) {
        key = null;
        value = null;
        return false;
    } else {
        return true;
    }
}
```
The while condition at ② enforces the InputSplit's read boundary: the reader stops once its position passes end, reading at most one extra line beyond the boundary. So when a record spans InputSplits, it simply gets read once, across the boundary, by the split where it starts.
Finally, let's look at the initialize method of LineRecordReader:
```java
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
    start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
```
If this is not the first InputSplit, LineRecordReader automatically discards everything up to and including the first line delimiter, so no record is ever read twice.
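Putting the two rules together (skip the partial first line when start != 0; read one extra line past the split's end), here is a hypothetical standalone sketch showing that every line is emitted by exactly one split, even when the split boundary falls mid-line:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of LineRecordReader's behavior over an in-memory file.
public class SplitLineDemo {
    // Lines a reader for the split [start, start + length) would emit.
    static List<String> linesForSplit(byte[] data, int start, int length) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Rule 1: discard up to and including the first newline; the
            // previous split's reader has already emitted that line.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // move past the newline
        }
        int end = start + length;
        // Rule 2: "pos <= end" lets one extra line beyond the split end be read.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // consume the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbb\ncccc\n".getBytes();
        // The split boundary at byte 7 falls in the middle of "bbbb".
        System.out.println(linesForSplit(data, 0, 7)); // [aaaa, bbbb]
        System.out.println(linesForSplit(data, 7, 8)); // [cccc]
    }
}
```

The first split emits "bbbb" in full because it reads past its end; the second split skips the tail of "bbbb" because it discards its partial first line. Together, the two rules partition the lines cleanly between the splits.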