The number of splits and the reader's reading principle in Hadoop

Source: Internet
Author: User

First, a simple diagram of Hadoop's execution flow.

Taking word count as the example, we set the minimum and maximum split sizes in WCApp (the split-count calculation rule in the source code was covered in the previous post), setting the maximum split size to 13, that is, 13 bytes.
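As a sketch (the post's own driver code is not shown, so the surrounding setup here is hypothetical; the two FileInputFormat setters and the configuration keys in the comments are real Hadoop API), the split sizes can be forced in the job driver like this:

```java
// Hypothetical driver fragment: force tiny splits for the experiment by
// overriding the job's minimum/maximum split sizes.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount");
FileInputFormat.setMinInputSplitSize(job, 1L);   // mapreduce.input.fileinputformat.split.minsize
FileInputFormat.setMaxInputSplitSize(job, 13L);  // mapreduce.input.fileinputformat.split.maxsize = 13 bytes
```

This is a configuration fragment only; it assumes the usual Hadoop driver imports (org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job, org.apache.hadoop.mapreduce.lib.input.FileInputFormat).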

The data to be counted

This raises a question. We set the split size very small, so the first split reads: Hello World T. A split does not hold a whole line, and the data is cut off mid-line. How, then, does the reader read it?

As before, we set a breakpoint in the writeSplits() method of the JobSubmitter class and run the program in debug mode.

Step over to the line maps = this.writeNewSplits(job, jobSubmitDir); then step into it.

Then step into input.getSplits(job).

We can see minSize = 1 and maxSize = 13, which is the maximum split size we just set. (How these methods obtain their values was covered in the previous article.)
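For reference, the rule Hadoop applies with these values — restated here as a self-contained sketch of FileInputFormat's computeSplitSize — is:

```java
public class SplitSizeRule {
    // FileInputFormat computes each split's size as:
    //   splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}
```

With minSize = 1, maxSize = 13, and a 128 MB block, the split size comes out as 13 bytes: our tiny maxSize wins over the block size, which is exactly why such small splits appear in the debugger.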

There is a lot of source code here that determines whether the input can be cut into splits (some compressed formats are not splittable). We will not go into that; experiment with it yourself if you are interested.

Set a breakpoint directly at the bottom of the method.

You can see that there are four splits. We have three lines of data, and each line is larger than the split size, so four splits is plausible. Next, look at the contents of each split.

You can see splits 0 through 3, four in total. The first split's offset is 0 and its length is 13 bytes, and so on; only the last split holds just 10 bytes of data.
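The layout seen in the debugger can be reconstructed with a little arithmetic. This is a simplified sketch: the 49-byte file length is inferred from the post's figures (3 × 13 + 10), and Hadoop's SPLIT_SLOP slack factor of 1.1 is ignored, which does not change the result here.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLayout {
    // Carve a file of the given length into splits of at most splitSize
    // bytes; returns {offset, length} pairs. With fileLen = 49 and
    // splitSize = 13 this yields splits at offsets 0, 13, 26 and 39,
    // the last one only 10 bytes long -- matching the debugger view.
    static List<long[]> computeSplits(long fileLen, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long off = 0; off < fileLen; off += splitSize) {
            splits.add(new long[]{off, Math.min(splitSize, fileLen - off)});
        }
        return splits;
    }
}
```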

One split corresponds to one map task, so the data handed to the first split's map task should be Hello World T.

So we set a breakpoint in our own WCMapper and, when the program stops there, check whether the value really is Hello World T.

We can see that the value is a complete line, not Hello World T.

That is actually reasonable: if reading were done strictly by the split size, a map could receive incomplete data, and the computed result could be wrong.

Here a very important class comes in: the RecordReader. It sits between the InputFormat and the map, reading the data that a split defines. When the reader starts on a split, it first checks whether its start position is the beginning of the file. If it is not, the reader reads forward to the next line break, discards that partial line, and begins delivering data from the following line. That is why the content of our second split is a complete line, rather than starting from the leftover "om" of the first line.
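A minimal model of that rule (the sample data and all names here are hypothetical, not the post's actual input; the real logic lives in Hadoop's LineRecordReader):

```java
import java.util.ArrayList;
import java.util.List;

public class LineReaderSketch {
    // Hypothetical 3-line input, read with a 13-byte split size.
    static final byte[] DATA = "hello tom\nhello jerry\nhello world\n".getBytes();

    // Simplified LineRecordReader rule: a reader whose split does not
    // start at byte 0 discards everything up to and including the first
    // '\n'; it then emits whole lines as long as the line starts at or
    // before the split's end, even if the line runs past that end.
    static List<String> readSplit(int start, int length) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        int end = start + length;
        if (start != 0) {                                  // skip the broken first line
            while (pos < DATA.length && DATA[pos++] != '\n') { }
        }
        while (pos <= end && pos < DATA.length) {
            int lineStart = pos;
            while (pos < DATA.length && DATA[pos++] != '\n') { }
            lines.add(new String(DATA, lineStart, pos - lineStart).trim());
        }
        return lines;
    }
}
```

With this data, split 0 emits two whole lines (the second line starts inside the split, so it is read to completion past the boundary), split 1 skips the half line it begins in and emits the third line, and the final split comes up empty — the same pattern we observe in WCMapper.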

So we have four splits, but only the first three contain data; the last split is empty. Because we have only three lines of data, by the reader's reading rule there is almost nothing left to read by the time it reaches the last split.

At this point we may ask: between the split and the reader, who has the final say? Or does each simply have its own job?

Suppose the split size is 128 MB and we have read n lines, with the nth line only half read. If the rest of that line were simply dropped, data would be lost. One might suggest that the next split read and re-parse it, but under concurrency the two splits are very likely processed on different nodes — how would the two halves be joined? This is where the reader is needed: even when the split is already full, it still finishes reading the current line. (The reader delivers data to the mapper one line at a time.)

Splits define the broad strokes; the reader handles the details, so that no data is lost or garbled.

  
