Hadoop source code analysis: how does TextInputFormat handle lines that cross splits?

Source: Internet
Author: User

We know that, before handing data to the map, Hadoop uses an InputFormat to pre-process the input data:

  • Split the input data into a group of splits; each split is dispatched to one mapper for processing.
  • For each split, create a RecordReader that reads the data in the split and organizes it into <key, value> records to be passed to the map function.
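These two responsibilities map directly onto the old-API InputFormat contract. As a point of reference, here is an outline of that interface (signatures as in org.apache.hadoop.mapred.InputFormat from Hadoop 1.x; the explanatory comments are ours):

import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Annotated outline of the old-API contract, for illustration only.
public interface InputFormat<K, V> {

  // Responsibility 1: cut the input into splits; each split goes to one mapper.
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

  // Responsibility 2: create the RecordReader that turns a split's bytes
  // into <key, value> records for the map function.
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                     Reporter reporter) throws IOException;
}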

The most common InputFormat is TextInputFormat. Within a split, it reads the data line by line, using the byte offset at which each line starts in the file as the key and the line's content as the value; LineRecordReader is the RecordReader it creates for the map function. Regarding this logic, a question comes up almost as soon as one first encounters Hadoop: if a line is cut into two splits (in a large file this is almost certain to happen somewhere), how does TextInputFormat handle it? A line cut into two splits is a kind of damage to the data and may affect the correctness of the analysis (word count is an obvious example). To answer this, we need to understand TextInputFormat's detailed working method from the source code. The notes below are organized accordingly (this article refers to the source code of Hadoop 1.1.2):
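To make the stakes concrete before diving in, consider a minimal word-count mapper in the old API (our own sketch of the textbook example, not code quoted from Hadoop). With TextInputFormat, each map call receives one line: the key is the line's byte offset in the file and the value is the line's text. If a line ever arrived cut in half, its words would be miscounted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Classic word count: the key is the byte offset of the line in the file,
// the value is the line itself.  If a line reached this mapper cut in half,
// the counts would be wrong; that is why the question above matters.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        out.collect(word, ONE);
      }
    }
  }
}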

1. LineRecordReader creates an org.apache.hadoop.util.LineReader instance and relies on that LineReader's readLine method to read one line of a record (see org.apache.hadoop.mapred.LineRecordReader.next(LongWritable, Text), line 176), so the key logic lives in this readLine method. Below is the source code of this method, with the author's additional comments (translated from the Chinese original) included. The main logic of this method can be summarized as follows:

  • Data is always read from the buffer; once the buffer is exhausted, the next batch of data is loaded into it first.
  • The code searches the buffer for an "end of line" and copies the data from the start of the line up to that point into str (which becomes the final value). If no "end of line" is found, it keeps loading new data into the buffer and continues searching.
  • The key point: the data loaded into the buffer is read directly from the file, with no regard for whether the split boundary has been crossed; reading simply continues until the end of the current line.
/**
 * Read one line from the InputStream into the given Text.  A line
 * can be terminated by one of the following: '\n' (LF), '\r' (CR),
 * or '\r\n' (CR+LF).  EOF also terminates an otherwise unterminated
 * line.
 *
 * @param str the object to store the given line (without newline)
 * @param maxLineLength the maximum number of bytes to store into str;
 *  the rest of the line is silently discarded.
 * @param maxBytesToConsume the maximum number of bytes to consume
 *  in this call.  This is only a hint, because if the line crosses
 *  this threshold, we allow it to happen.  It can overshoot
 *  potentially by as much as one buffer length.
 *
 * @return the number of bytes read including the (longest) newline
 * found.
 *
 * @throws IOException if the underlying stream throws
 */
public int readLine(Text str, int maxLineLength,
                    int maxBytesToConsume) throws IOException {
  /* We're reading data from in, but the head of the stream may be
   * already buffered in buffer, so we have several cases:
   * 1. No newline characters are in the buffer, so we need to copy
   *    everything and read another buffer from the stream.
   * 2. An unambiguously terminated line is in buffer, so we just
   *    copy to str.
   * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
   *    in CR.  In this case we copy everything up to CR to str, but
   *    we also need to see what follows CR: if it's LF, then we
   *    need consume LF as well, so next call to readLine will read
   *    from after that.
   * We use a flag prevCharCR to signal if previous character was CR
   * and, if it happens to be at the end of the buffer, delay
   * consuming it until we have a chance to look at the char that
   * follows.
   */
  str.clear();
  int txtLength = 0;          // tracks str.getLength(), as an optimization
  int newlineLength = 0;      // length of terminating newline
  boolean prevCharCR = false; // true if prev char was CR
  long bytesConsumed = 0;
  do {
    int startPosn = bufferPosn; // starting from where we left off the last time
    // If the buffer has been fully consumed, load the next batch of data.
    if (bufferPosn >= bufferLength) {
      startPosn = bufferPosn = 0;
      if (prevCharCR)
        ++bytesConsumed;      // account for CR from previous read
      bufferLength = in.read(buffer);
      if (bufferLength <= 0)
        break;                // EOF
    }
    // Note: the logic here is a bit tricky, because different operating
    // systems define the "line terminator" differently:
    //   Unix:    '\n'   (LF)
    //   Mac:     '\r'   (CR)
    //   Windows: '\r\n' (CR+LF)
    // To determine the end of a line exactly, the judgment logic is:
    // 1. If the current character is LF, it must be the end of the line,
    //    but we also need to look at the previous character: if that was
    //    CR, this is a Windows file and the "line terminator length"
    //    (the variable newlineLength, an unfortunate name) is 2;
    //    otherwise it is a Unix file and the terminator length is 1.
    // 2. If the current character is not LF but the previous character was
    //    CR, then the previous character was the end of the line: a Mac file.
    // 3. If the current character is CR, the terminator length depends on
    //    whether the next character is LF, so we just record
    //    prevCharCR = true for reference when the next character is read.
    for (; bufferPosn < bufferLength; ++bufferPosn) { // search for newline
      if (buffer[bufferPosn] == LF) {
        newlineLength = (prevCharCR) ? 2 : 1;
        ++bufferPosn;         // at next invocation proceed from following byte
        break;
      }
      if (prevCharCR) {       // CR + notLF, we are at notLF
        newlineLength = 1;
        break;
      }
      prevCharCR = (buffer[bufferPosn] == CR);
    }
    int readLength = bufferPosn - startPosn;
    if (prevCharCR && newlineLength == 0)
      --readLength;           // CR at the end of the buffer
    bytesConsumed += readLength;
    int appendLength = readLength - newlineLength;
    if (appendLength > maxLineLength - txtLength) {
      appendLength = maxLineLength - txtLength;
    }
    if (appendLength > 0) {
      str.append(buffer, startPosn, appendLength);
      txtLength += appendLength;
    }
    // newlineLength == 0 means the end of the line has not been reached yet,
    // so the loop keeps reading from the file input stream.  This is the
    // crucial point: in is created in the LineRecordReader constructor
    // (org.apache.hadoop.mapred.LineRecordReader.LineRecordReader(Configuration,
    // FileSplit), line 86: FSDataInputStream fileIn = fs.open(split.getPath());),
    // that is, it reads from the file the FileSplit belongs to, not within
    // the limits of the FileSplit.  When LineRecordReader fetches a "line",
    // it therefore always reads the complete line and is not constrained by
    // the FileSplit, so there is no "broken line" problem!
  } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

  if (bytesConsumed > (long) Integer.MAX_VALUE)
    throw new IOException("Too many bytes before newline: " + bytesConsumed);
  return (int) bytesConsumed;
}
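To see the terminator handling in isolation, here is a stripped-down re-implementation of the loop above using only the JDK (a sketch for illustration, not the Hadoop class itself; the name MiniLineReader and the deliberately tiny buffer are ours, and maxLineLength/maxBytesToConsume are omitted for brevity):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Minimal model of LineReader.readLine: same CR/LF/CRLF logic, tiny buffer
// so that refills (and the CR-at-end-of-buffer case) actually happen.
public class MiniLineReader {
  private static final byte CR = '\r', LF = '\n';

  private final InputStream in;
  private final byte[] buffer = new byte[4];   // tiny on purpose, to force refills
  private int bufferLength = 0, bufferPosn = 0;

  MiniLineReader(InputStream in) { this.in = in; }

  // Returns the next line (without its terminator), or null at EOF.
  String readLine() throws IOException {
    StringBuilder str = new StringBuilder();
    int newlineLength = 0;
    boolean prevCharCR = false;
    long bytesConsumed = 0;
    do {
      int startPosn = bufferPosn;
      if (bufferPosn >= bufferLength) {        // buffer exhausted: refill
        startPosn = bufferPosn = 0;
        if (prevCharCR) ++bytesConsumed;       // account for held-back CR
        bufferLength = in.read(buffer);
        if (bufferLength <= 0) break;          // EOF
      }
      for (; bufferPosn < bufferLength; ++bufferPosn) {
        if (buffer[bufferPosn] == LF) { newlineLength = prevCharCR ? 2 : 1; ++bufferPosn; break; }
        if (prevCharCR) { newlineLength = 1; break; }  // lone CR ended the line
        prevCharCR = (buffer[bufferPosn] == CR);
      }
      int readLength = bufferPosn - startPosn;
      if (prevCharCR && newlineLength == 0) --readLength; // hold back trailing CR
      bytesConsumed += readLength;
      int appendLength = readLength - newlineLength;
      if (appendLength > 0)
        str.append(new String(buffer, startPosn, appendLength));
    } while (newlineLength == 0);              // keep reading until a terminator
    return (bytesConsumed == 0 && str.length() == 0) ? null : str.toString();
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "unix\nmac\rwin\r\nlast".getBytes();
    MiniLineReader r = new MiniLineReader(new ByteArrayInputStream(data));
    for (String line; (line = r.readLine()) != null; )
      System.out.println("[" + line + "]");    // [unix] [mac] [win] [last]
  }
}

The sample input exercises all three terminator styles plus an unterminated final line, and each comes back as exactly one record.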

2. Given the behavior of readLine described above, when a line crosses a split boundary, reading simply continues into the next split until the end of that line. So how does the LineRecordReader of the next split know that its first line has already been read by the previous split's LineRecordReader, so that the line is neither missed nor read twice? LineRecordReader uses a simple and clever rule: since it cannot tell whether the line at the start of a split is an independent line or the tail of a line cut by the boundary, it skips the first line of every split (except, of course, the first split of the file) and starts reading from the second line, while at the end of a split it always reads one extra line. The data thus joins up seamlessly and the broken-line problem never causes trouble. The source code is as follows:

In the constructor org.apache.hadoop.mapred.LineRecordReader.LineRecordReader(Configuration, FileSplit), lines 108 to 113 determine the start position. Note: the first line is thrown away!

    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }

Correspondingly, the condition LineRecordReader uses to determine whether there is another line to read, in org.apache.hadoop.mapred.LineRecordReader.next(LongWritable, Text), lines 170 to 173, is that the current position is smaller than or equal to the end of the split. So when the current position sits exactly at the end of the split, the while body still executes once more, and that extra read is precisely the first line of the next split!

    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end) {
      ...
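Put together, the two rules ("skip the first line unless you own offset 0" and "keep reading while position <= end") guarantee that every line is read exactly once. The following toy model (our own illustration; the names and structure are not Hadoop's) splits a small byte array at two different offsets and applies both rules:

import java.util.ArrayList;
import java.util.List;

// Toy model of the LineRecordReader convention: each "reader" for a split
// (a) skips its first line unless it owns offset 0, and (b) keeps reading
// while its current position is <= its split end.
public class SplitRuleDemo {

  // Reads lines from data[start..end), applying the two rules above.
  static List<String> readSplit(byte[] data, int start, int end) {
    List<String> lines = new ArrayList<>();
    int pos = start;
    if (start != 0) {                         // rule (a): discard the partial first line
      while (pos < data.length && data[pos] != '\n') pos++;
      pos++;                                  // position just after the '\n'
    }
    while (pos <= end && pos < data.length) { // rule (b): "<=" reads one extra line
      int lineStart = pos;
      while (pos < data.length && data[pos] != '\n') pos++;
      lines.add(new String(data, lineStart, pos - lineStart));
      pos++;                                  // step over the '\n'
    }
    return lines;
  }

  public static void main(String[] args) {
    byte[] file = "alpha\nbravo\ncharlie\ndelta\n".getBytes();
    for (int cut : new int[] {8, 12}) {       // 8 cuts mid-"bravo"; 12 lands on a boundary
      System.out.println("cut at " + cut
          + " -> " + readSplit(file, 0, cut)
          + " + " + readSplit(file, cut, file.length));
    }
    // cut at 8  -> [alpha, bravo] + [charlie, delta]
    // cut at 12 -> [alpha, bravo, charlie] + [delta]
  }
}

The cut at offset 8 lands in the middle of "bravo", so split 1 finishes that line and split 2 discards the leftover fragment. The cut at offset 12 lands exactly on a line boundary, in which case the previous split's "<=" comparison claims "charlie" and the next split's skip rule gives it up. Either way, no line is lost or duplicated.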

Summary:

This completes the logic for reading cross-split lines. From a broader perspective, this is a general problem in the data-splitting stage of MapReduce: however we cut big data into small pieces for reading, we face the same continuity question, including when we implement our own InputFormat. What we should take away is that the most direct practical function of a split is to hand a small portion of big data to a mapper, but this is only a "logical", macroscopic division; microscopically, at the head and tail of a split, crossing the split boundary and stitching data together to preserve record continuity is entirely justified and reasonable.
