First, let's sketch a simple diagram of the Hadoop execution flow.
Here I take word count as the example and set the minimum and maximum split sizes in WCApp (the driver from the previous post, which covered the split-count calculation rules in the source code), setting the maximum split size to 13, that is, 13 bytes.
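For reference, a minimal sketch of how those bounds might be set in the WCApp driver. The WCApp class and the job wiring are assumed from the earlier word-count post; setMinInputSplitSize and setMaxInputSplitSize are the standard FileInputFormat API. This is a configuration fragment, not a complete runnable job:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WCApp {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("WordCount");
        // Bound the split size: minimum 1 byte, maximum 13 bytes,
        // so getSplits() will cut the input into 13-byte splits.
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 13L);
        // ... set mapper, reducer, input/output paths,
        // then job.waitForCompletion(true)
    }
}
```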
The data to be counted
Here's a question. We set the split size very small, so the first split reads only Hello World T: a split does not hold a whole row, and the data is cut mid-line. How does the reader handle reading it?
Here we again set a breakpoint in the writeSplits method of the JobSubmitter class and run the program in debug mode.
Step over to the line maps = this.writeNewSplits(job, jobSubmitDir);, then step into it.
From there, step into input.getSplits(job).
We can see minSize = 1 and maxSize = 13, the maximum split size we just set. (How these methods obtain these values was discussed in the previous article.)
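Under the hood, FileInputFormat chooses the split size as max(minSize, min(maxSize, blockSize)). A minimal re-implementation of that formula, assuming the default 128 MB HDFS block size, shows why our maxSize of 13 wins here:

```java
public class SplitSizeDemo {
    // Simplified version of FileInputFormat.computeSplitSize:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed default HDFS block size
        // With minSize = 1 and maxSize = 13, the split size becomes 13 bytes.
        System.out.println(computeSplitSize(blockSize, 1L, 13L)); // prints 13
    }
}
```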
There is a lot of source code here that decides whether the input can be split at all (some compressed formats are not splittable); I won't go through it, but interested readers can experiment with it themselves.
Set a breakpoint directly at the bottom of the method.
You can see there are four splits. We have three rows of data, each row larger than the split size, and yet four splits were produced. Now look at the contents of each split.
You can see splits 0 through 3, four in total: the first split has file offset 0 and 13 bytes, and so on; only the last split holds just 10 bytes of data.
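The four splits follow from FileInputFormat's splitting loop, which keeps carving off splitSize-byte splits while the remainder is more than SPLIT_SLOP (1.1) times the split size. A simplified re-implementation, assuming the file is 49 bytes (13 + 13 + 13 + 10), reproduces the layout seen in the debugger:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCountDemo {
    static final double SPLIT_SLOP = 1.1; // same slack factor FileInputFormat uses

    // Returns the length of each split for a file of the given length.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long bytesRemaining = fileLength;
        // Carve full-size splits while the remainder is more than
        // 10% larger than one split.
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            lengths.add(bytesRemaining); // the short tail split
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Assumed 49-byte file with 13-byte splits:
        // 49/13 = 3.77 > 1.1, 36/13 = 2.77 > 1.1, 23/13 = 1.77 > 1.1,
        // 10/13 = 0.77 <= 1.1, so the last 10 bytes become the fourth split.
        System.out.println(splitLengths(49, 13)); // prints [13, 13, 13, 10]
    }
}
```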
One split corresponds to one map task, so the data cut out for the first split's map task should be Hello World T.
So we set a breakpoint in our own WCMapper and let the program run to it, to see whether the value really is Hello World T.
We can see that the value is a complete line, not Hello World T.
That is as it should be: if reading followed the split boundary exactly, we could read incomplete data, and the computed result would be wrong.
A very important class here is the RecordReader. It sits between the InputFormat and the map, acting as a reader: you can define the splits however you like, but it is the reader that actually reads the data. It first checks whether its starting position is the beginning of the file; if it is not, it reads forward past the next line break and starts from the following line. That is why the content of our second split is a complete line, rather than starting from the leftover om of the first line.
So although we have four splits, only the first three contain data; the last split is empty, because we have only three rows of data, and by the reader's reading rule there is nothing left to read by the time the last split is reached.
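The reading rule just described can be simulated directly. The sketch below is not Hadoop's LineRecordReader, just a simplified model of its two rules: skip the first (possibly partial) line unless the split starts at offset 0, and finish any line that begins inside the split even if it runs past the end. The three input lines are hypothetical, chosen only so that the first 13 bytes are Hello World T:

```java
import java.util.ArrayList;
import java.util.List;

public class ReaderDemo {
    // Returns the lines a split's reader would emit under the simplified rules.
    static List<String> readSplit(String data, long start, long end) {
        List<String> lines = new ArrayList<>();
        int pos = (int) start;
        if (start != 0) {
            // Not at the start of the file: skip past the next newline;
            // the previous split's reader owns that partial line.
            int nl = data.indexOf('\n', pos);
            pos = (nl == -1) ? data.length() : nl + 1;
        }
        // Emit lines as long as the line *starts* before the split end,
        // even if the line itself runs past the end.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl == -1) ? data.length() : nl;
            lines.add(data.substring(pos, lineEnd));
            pos = (nl == -1) ? data.length() : nl + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical 49-byte input: three lines, each longer than 13 bytes.
        String data = "Hello World Tom\nHello World Bob\nHello World Kate\n";
        long[][] splits = {{0, 13}, {13, 26}, {26, 39}, {39, 49}};
        for (int i = 0; i < splits.length; i++) {
            System.out.println("split " + i + " -> "
                    + readSplit(data, splits[i][0], splits[i][1]));
        }
        // Each of the first three splits yields exactly one complete line;
        // the fourth split yields nothing, matching what we saw in the mapper.
    }
}
```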
Here one might ask: between the split and the reader, who has the final say? Or do their functions overlap?
Suppose the split size is 128 MB and we read n rows of data, with only half of row n read when the split ends. If the rest were not read, that data would be lost. One might suggest that the next split read the remainder and re-assemble the line, but under concurrency the splits will very likely run on different nodes, so how would the two halves be joined? This is exactly why we need the reader: even when the split is already full, it still finishes reading the current line. (The reader feeds data to the mapper one line at a time.)
Splits define the broad strokes; the reader handles the details, making sure no data is lost or garbled.
That, in short, is the principle behind the split count and the reader's reading behavior in Hadoop.