Hadoop beginners often ask two questions: 1. If an HDFS block is 64 MB by default, can a single row of a line-oriented text file be split across two blocks? 2. When a file is divided into InputSplits for reading, can a single row be split across two InputSplits? If so, one InputSplit contains an incomplete row; will the mapper processing that InputSplit produce incorrect results?
To answer these questions, we first need to clarify two concepts: block and InputSplit.
1. A block is the unit in which HDFS stores a file (64 MB by default);
2. An InputSplit is the input unit that MapReduce uses to process a file. It is purely a logical concept: an InputSplit does not actually cut the file; it only records the location of the data to be processed (the file's path and hosts) and the extent to process (determined by start and length).
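To make the distinction concrete, here is a minimal sketch of the metadata an InputSplit carries. This is a hypothetical illustration class, not Hadoop's actual FileSplit; the point is that it holds only coordinates into the file, never file contents:

```java
// Hypothetical sketch of the metadata an InputSplit records.
// It describes WHERE to read, not WHAT was read.
public class SimpleFileSplit {
    private final String path;    // file the split belongs to
    private final long start;     // byte offset where the split begins
    private final long length;    // number of bytes the split covers
    private final String[] hosts; // DataNodes holding the underlying data

    public SimpleFileSplit(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    public String getPath()    { return path; }
    public long getStart()     { return start; }
    public long getLength()    { return length; }
    public String[] getHosts() { return hosts; }
}
```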
Therefore, a row of line-oriented text may indeed be split across different blocks, and even across different DataNodes. And by analyzing the getSplits method of FileInputFormat, we can see that a row may likewise be split across different InputSplits:
```java
public List<InputSplit> getSplits(JobContext job) throws IOException {
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0) {
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                                         blkLocations[blkIndex].getHosts()));
                    bytesRemaining -= splitSize;
                }

                if (bytesRemaining != 0) {
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                                         blkLocations[blkLocations.length - 1].getHosts()));
                }
            } else { // not splitable
                splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts()));
            }
        } else {
            // Create empty hosts array for zero length files
            splits.add(makeSplit(path, 0, length, new String[0]));
        }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
}
```
The code above shows that splitting a file is actually quite simple: get the file's path and block information on HDFS, then split the file according to splitSize, where splitSize = computeSplitSize(blockSize, minSize, maxSize). blockSize, minSize, and maxSize are all configurable; by default, splitSize equals the default blockSize (64 MB).
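The clamping performed by computeSplitSize can be sketched as a standalone re-implementation (for illustration only): the block size is bounded below by minSize and above by maxSize.

```java
// Standalone sketch of the split-size computation: clamp blockSize
// into the [minSize, maxSize] range.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB default block size
        // With default minSize (1) and an unbounded maxSize,
        // splitSize equals blockSize.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}
```

This is why raising minSize above the block size produces fewer, larger splits, while lowering maxSize below the block size produces more, smaller splits.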
FileInputFormat splits the file strictly by offset, so a long row may well be split across different InputSplits. However, this does not affect the map phase: the RecordReader associated with FileInputFormat is designed to be robust enough that when a row spans an InputSplit boundary, it keeps reading into the next InputSplit until the record is complete. For line-oriented text, Hadoop uses TextInputFormat by default, and TextInputFormat is associated with LineRecordReader. Let's look at how the nextKeyValue method of LineRecordReader reads the file:
```java
while (getFilePosition() <= end) {
    newSize = in.readLine(value, maxLineLength,
                          Math.max(maxBytesToConsume(pos), maxLineLength));
    if (newSize == 0) {
        break;
    }
```
It reads the file through the readLine method of LineReader (in is a LineReader instance):
```java
public int readLine(Text str, int maxLineLength,
                    int maxBytesToConsume) throws IOException {
    if (this.recordDelimiterBytes != null) {
        return readCustomLine(str, maxLineLength, maxBytesToConsume);
    } else {
        return readDefaultLine(str, maxLineLength, maxBytesToConsume);
    }
}

/**
 * Read a line terminated by one of CR, LF, or CRLF.
 */
private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
        throws IOException {
    str.clear();
    int txtLength = 0;          // tracks str.getLength(), as an optimization
    int newlineLength = 0;      // length of terminating newline
    boolean prevCharCR = false; // true if prev char was CR
    long bytesConsumed = 0;
    do {
        int startPosn = bufferPosn; // starting from where we left off the last time
        if (bufferPosn >= bufferLength) {
            startPosn = bufferPosn = 0;
            if (prevCharCR)
                ++bytesConsumed; // account for CR from previous read
            bufferLength = in.read(buffer);
            if (bufferLength <= 0)
                break; // EOF
        }
        for (; bufferPosn < bufferLength; ++bufferPosn) { // search for newline
            if (buffer[bufferPosn] == LF) {
                newlineLength = (prevCharCR) ? 2 : 1;
                ++bufferPosn; // at next invocation proceed from following byte
                break;
            }
            if (prevCharCR) { // CR + notLF, we are at notLF
                newlineLength = 1;
                break;
            }
            prevCharCR = (buffer[bufferPosn] == CR);
        }
        int readLength = bufferPosn - startPosn;
        if (prevCharCR && newlineLength == 0)
            --readLength; // CR at the end of the buffer
        bytesConsumed += readLength;
        int appendLength = readLength - newlineLength;
        if (appendLength > maxLineLength - txtLength) {
            appendLength = maxLineLength - txtLength;
        }
        if (appendLength > 0) {
            str.append(buffer, startPosn, appendLength);
            txtLength += appendLength;
        }
    } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume); // ①

    if (bytesConsumed > (long) Integer.MAX_VALUE)
        throw new IOException("Too many bytes before newline: " + bytesConsumed);
    return (int) bytesConsumed;
}
```
Let's analyze the readDefaultLine method. The do-while loop body reads from the file and scans the bytes just read, terminating the loop when it finds the default line delimiter. As mentioned above, for rows that span InputSplits, LineRecordReader automatically reads across the InputSplit boundary. This is reflected in the loop's termination condition at ① above:
```java
} while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);
```
newlineLength == 0 means that no line delimiter was found in the bytes read during this iteration. Because maxBytesToConsume defaults to Integer.MAX_VALUE, the loop keeps reading as long as no delimiter is encountered, until the bytes consumed exceed maxBytesToConsume. This is what allows a row of records to be read across InputSplits, but it also raises two further questions:
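The keep-reading-until-newline behavior can be illustrated with a simplified, self-contained sketch (a hypothetical helper, not Hadoop's LineReader): the reader repeatedly refills a small buffer, and each refill stands in for crossing a buffer (or split) boundary, appending to the current line until a newline appears.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical, simplified version of the readDefaultLine loop: keep
// refilling a small buffer and appending to the current line until a
// newline is found, however many refills that takes.
public class LineAcrossBuffers {
    static String readLine(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        StringBuilder line = new StringBuilder();
        int newlineLength = 0;
        try {
            do {
                int bufferLength = in.read(buffer); // refill: "cross a boundary"
                if (bufferLength <= 0) break;       // EOF
                for (int i = 0; i < bufferLength && newlineLength == 0; i++) {
                    if (buffer[i] == '\n') {
                        newlineLength = 1;          // found the terminator
                    } else {
                        line.append((char) buffer[i]);
                    }
                }
            } while (newlineLength == 0);           // same condition shape as ①
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // The line is 11 bytes long but the buffer holds only 4: the loop
        // refills three times before it finds the newline.
        InputStream in = new ByteArrayInputStream("hello world\nrest".getBytes());
        System.out.println(readLine(in, 4)); // prints "hello world"
    }
}
```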
1. Since LineReader's readLine method does not handle the end of an InputSplit, could reading an InputSplit continue without bound?
2. If a record L spans two InputSplits A and B, and L is read completely while processing A, then when B is read, how is the leading part of L that lies in B skipped?
Hadoop solves both problems in LineRecordReader's nextKeyValue method:
```java
public boolean nextKeyValue() throws IOException {
    if (key == null) {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
        value = new Text();
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end) { // ②
        newSize = in.readLine(value, maxLineLength,
                              Math.max(maxBytesToConsume(pos), maxLineLength));
        if (newSize == 0) {
            break;
        }
        pos += newSize;
        inputByteCounter.increment(newSize);
        if (newSize < maxLineLength) {
            break;
        }
        // line too long. try again
        LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
    }
    if (newSize == 0) {
        key = null;
        value = null;
        return false;
    } else {
        return true;
    }
}
```
The while condition at ② enforces the InputSplit's read boundary: the reader stops once its position passes end, reading at most one extra line beyond the boundary. So when a record spans InputSplits, it simply gets read once, across the boundary, by the split where it starts.
Finally, let's look at the initialize method of LineRecordReader:
```java
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
    start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
```
If this is not the first InputSplit, LineRecordReader automatically discards everything up to and including the first line delimiter, so no record is ever read twice.
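Putting the two rules together (skip the partial first line when start != 0; read one extra line past the split's end), here is a hypothetical standalone sketch showing that every line is emitted by exactly one split, even when the split boundary falls mid-line:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of LineRecordReader's behavior over an in-memory file.
public class SplitLineDemo {
    // Lines a reader for the split [start, start + length) would emit.
    static List<String> linesForSplit(byte[] data, int start, int length) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Rule 1: discard up to and including the first newline; the
            // previous split's reader has already emitted that line.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // move past the newline
        }
        int end = start + length;
        // Rule 2: "pos <= end" lets one extra line beyond the split end be read.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // consume the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbb\ncccc\n".getBytes();
        // The split boundary at byte 7 falls in the middle of "bbbb".
        System.out.println(linesForSplit(data, 0, 7)); // [aaaa, bbbb]
        System.out.println(linesForSplit(data, 7, 8)); // [cccc]
    }
}
```

The first split emits "bbbb" in full because it reads past its end; the second split skips the tail of "bbbb" because it discards its partial first line. Together, the two rules partition the lines cleanly between the splits.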