First, let's sketch a simple diagram of the Hadoop execution flow.
Here I take word count as the example and set the minimum and maximum split sizes in WCApp (the driver from the previous post, which covered the split-count calculation rules in the source code), setting the maximum split size to 13, that is, 13 bytes.
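For reference, a minimal sketch of how those bounds might be set in the WCApp driver. The WCApp class and the job wiring are assumed from the earlier word-count post; setMinInputSplitSize and setMaxInputSplitSize are the standard FileInputFormat API. This is a configuration fragment, not a complete runnable job:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WCApp {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("WordCount");
        // Bound the split size: minimum 1 byte, maximum 13 bytes,
        // so getSplits() will cut the input into 13-byte splits.
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 13L);
        // ... set mapper, reducer, input/output paths,
        // then job.waitForCompletion(true)
    }
}
```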
The data to be counted
Here's a question. We set the split size very small, so the first split reads only Hello World T: a split does not hold a whole row, and the data is cut mid-line. How does the reader handle reading it?
Here we again set a breakpoint in the writeSplits method of the JobSubmitter class and run the program in debug mode.
Step over to the line maps = this.writeNewSplits(job, jobSubmitDir);, then step into it.
From there, step into input.getSplits(job).
We can see minSize = 1 and maxSize = 13, the maximum split size we just set. (How these methods obtain these values was discussed in the previous article.)
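Under the hood, FileInputFormat chooses the split size as max(minSize, min(maxSize, blockSize)). A minimal re-implementation of that formula, assuming the default 128 MB HDFS block size, shows why our maxSize of 13 wins here:

```java
public class SplitSizeDemo {
    // Simplified version of FileInputFormat.computeSplitSize:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed default HDFS block size
        // With minSize = 1 and maxSize = 13, the split size becomes 13 bytes.
        System.out.println(computeSplitSize(blockSize, 1L, 13L)); // prints 13
    }
}
```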
There is a lot of source code here that decides whether the input can be split at all (some compressed formats are not splittable); I won't go through it, but interested readers can experiment with it themselves.
Set a breakpoint directly at the bottom of the method.
You can see there are four splits. We have three rows of data, each row larger than the split size, and yet four splits were produced. Now look at the contents of each split.
You can see splits 0 through 3, four in total: the first split has file offset 0 and 13 bytes, and so on; only the last split holds just 10 bytes of data.
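The four splits follow from FileInputFormat's splitting loop, which keeps carving off splitSize-byte splits while the remainder is more than SPLIT_SLOP (1.1) times the split size. A simplified re-implementation, assuming the file is 49 bytes (13 + 13 + 13 + 10), reproduces the layout seen in the debugger:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCountDemo {
    static final double SPLIT_SLOP = 1.1; // same slack factor FileInputFormat uses

    // Returns the length of each split for a file of the given length.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long bytesRemaining = fileLength;
        // Carve full-size splits while the remainder is more than
        // 10% larger than one split.
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            lengths.add(bytesRemaining); // the short tail split
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Assumed 49-byte file with 13-byte splits:
        // 49/13 = 3.77 > 1.1, 36/13 = 2.77 > 1.1, 23/13 = 1.77 > 1.1,
        // 10/13 = 0.77 <= 1.1, so the last 10 bytes become the fourth split.
        System.out.println(splitLengths(49, 13)); // prints [13, 13, 13, 10]
    }
}
```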
One split corresponds to one map task, so the data cut out for the first split's map task should be Hello World T.
So we set a breakpoint in our own WCMapper and let the program run to it, to see whether the value really is Hello World T.
We can see that the value is a complete line, not Hello World T.
That is as it should be: if reading followed the split boundary exactly, we could read incomplete data, and the computed result would be wrong.
A very important class here is the RecordReader. It sits between the InputFormat and the map, acting as a reader: you can define the splits however you like, but it is the reader that actually reads the data. It first checks whether its starting position is the beginning of the file; if it is not, it reads forward past the next line break and starts from the following line. That is why the content of our second split is a complete line, rather than starting from the leftover om of the first line.
So although we have four splits, only the first three contain data; the last split is empty, because we have only three rows of data, and by the reader's reading rule there is nothing left to read by the time the last split is reached.
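The reading rule just described can be simulated directly. The sketch below is not Hadoop's LineRecordReader, just a simplified model of its two rules: skip the first (possibly partial) line unless the split starts at offset 0, and finish any line that begins inside the split even if it runs past the end. The three input lines are hypothetical, chosen only so that the first 13 bytes are Hello World T:

```java
import java.util.ArrayList;
import java.util.List;

public class ReaderDemo {
    // Returns the lines a split's reader would emit under the simplified rules.
    static List<String> readSplit(String data, long start, long end) {
        List<String> lines = new ArrayList<>();
        int pos = (int) start;
        if (start != 0) {
            // Not at the start of the file: skip past the next newline;
            // the previous split's reader owns that partial line.
            int nl = data.indexOf('\n', pos);
            pos = (nl == -1) ? data.length() : nl + 1;
        }
        // Emit lines as long as the line *starts* before the split end,
        // even if the line itself runs past the end.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl == -1) ? data.length() : nl;
            lines.add(data.substring(pos, lineEnd));
            pos = (nl == -1) ? data.length() : nl + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical 49-byte input: three lines, each longer than 13 bytes.
        String data = "Hello World Tom\nHello World Bob\nHello World Kate\n";
        long[][] splits = {{0, 13}, {13, 26}, {26, 39}, {39, 49}};
        for (int i = 0; i < splits.length; i++) {
            System.out.println("split " + i + " -> "
                    + readSplit(data, splits[i][0], splits[i][1]));
        }
        // Each of the first three splits yields exactly one complete line;
        // the fourth split yields nothing, matching what we saw in the mapper.
    }
}
```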
Here one might ask: between the split and the reader, who has the final say? Or do their functions overlap?
Suppose the split size is 128 MB and we read n rows of data, with only half of row n read when the split ends. If the rest were not read, that data would be lost. One might suggest that the next split read the remainder and re-assemble the line, but under concurrency the splits will very likely run on different nodes, so how would the two halves be joined? This is exactly why we need the reader: even when the split is already full, it still finishes reading the current line. (The reader feeds data to the mapper one line at a time.)
Splits define the broad strokes; the reader handles the details, making sure no data is lost or garbled.
That, in short, is the principle behind the split count and the reader's reading behavior in Hadoop.