The execution steps of the MR programming model:
1. Prepare the input data for map processing
2. Mapper processing
3. Shuffle
4. Reduce processing
5. Output the result
1. Input: the input text is read through InputFormat -> FileInputFormat -> TextInputFormat. The getSplits method returns the array of Splits, and the getRecordReader method then reads each Split, passing every line to a map call.
2. Shuffle: the map output on each node goes through the Partitioner, which assigns each record by key to a reducer; the record is then either sent to another node or processed further on the same node.
3. Sort: the records are sorted by key.
4. Reduce: the sorted results are processed by reduce.
5. Output: the results are written to the local file system or to Hadoop (HDFS) through OutputFormat -> FileOutputFormat -> TextOutputFormat.
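The steps above can be imitated in memory with plain Java. The sketch below is not Hadoop code: the class MiniMapReduce and its methods are made up for illustration, and word counting is only an assumed example job. The one piece borrowed from Hadoop is the partition formula, which matches the default HashPartitioner.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    // Same formula as Hadoop's default HashPartitioner:
    // mask the sign bit of the hash, then take the modulus.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Map step: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce step: all values of one key in, their sum out.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map -> partition -> sort -> reduce and returns one result map per reducer.
    static List<TreeMap<String, Integer>> run(List<String> lines, int numReducers) {
        // Routing by partition() plays the role of the shuffle;
        // each TreeMap keeps its keys sorted, as the sort step would.
        List<TreeMap<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            shuffled.add(new TreeMap<>());
        }
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                int p = partition(kv.getKey(), numReducers);
                shuffled.get(p)
                        .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        // Each reducer processes only the keys of its own partition.
        List<TreeMap<String, Integer>> results = new ArrayList<>();
        for (TreeMap<String, List<Integer>> part : shuffled) {
            TreeMap<String, Integer> reduced = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : part.entrySet()) {
                reduced.put(e.getKey(), reduce(e.getValue()));
            }
            results.add(reduced);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        List<TreeMap<String, Integer>> results = run(lines, 2);
        for (int i = 0; i < results.size(); i++) {
            System.out.println("reducer " + i + " -> " + results.get(i));
        }
    }
}
```

Because a key always hashes to the same partition, every occurrence of a word ends up at the same reducer, which is what makes the per-key sum correct.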
Split: the chunk of data processed by one map task, and the smallest unit of computation in MR. By default it corresponds one-to-one with a Block in HDFS (the smallest storage unit in HDFS, 128 MB by default); the size can also be set manually (not recommended).
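InputFormat: splits the input data into Splits through the method InputSplit[] getSplits(JobConf var1, int var2), where the JobConf is the job configuration and the int is a hint for the desired number of splits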
TextInputFormat: used to process data in text format
OutputFormat: writes the output of the job
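To make the relation between Blocks and Splits concrete, here is a minimal sketch of how a split size could be derived and applied. The class SplitSketch and its methods are hypothetical. The clamp formula follows what FileInputFormat does in the newer Hadoop API, as far as I recall it, and the real implementation also lets the last split grow slightly, which is left out here.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Clamp the block size between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Cuts a file of fileLength bytes into {offset, length} ranges of at most splitSize bytes.
    // Simplification: no special handling of a small trailing split.
    static List<long[]> getSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;
        long fileLength = 3 * blockSize; // a file that occupies three full blocks

        // With permissive limits, the split size equals the block size.
        long defaultSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println(getSplits(fileLength, defaultSize).size() + " splits by default");

        // Capping the split size at half a block doubles the number of splits.
        long halfSize = computeSplitSize(blockSize, 1L, blockSize / 2);
        System.out.println(getSplits(fileLength, halfSize).size() + " splits with a 64 MB cap");
    }
}
```

With a file of three 128 MB blocks, the permissive limits give three splits (one per Block), while the 64 MB cap gives six, so n Blocks can map to 2n Splits when the size is set by hand.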
The diagram above shows:
In general, one Split corresponds to one Block, but in the diagram above the split size has been set manually.
A file is divided into n Blocks, which correspond to 2n Splits. After InputFormat processing, each Split is processed by one Mapper. After the Shuffle step groups and sorts the map output, the data is handed to multiple Reducers, and each Reducer generates one output file.
MapReduce 1.x architecture: one JobTracker + multiple TaskTrackers
JobTracker: responsible for resource management and job scheduling
TaskTracker: regularly reports the health, resources, and job status of its node to the JobTracker, and receives commands from the JobTracker, such as starting or killing tasks.
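The report-and-command exchange between the two roles can be pictured roughly as follows. This is a toy model and not the Hadoop 1.x implementation: the classes Heartbeat and JobTracker, their fields, and the "LAUNCH" command string are all assumptions made for illustration, and only starting tasks is modelled, not killing them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeartbeatSketch {
    // Hypothetical status report a TaskTracker sends on each heartbeat.
    static class Heartbeat {
        final String trackerName;
        final boolean healthy;
        final int freeSlots;

        Heartbeat(String trackerName, boolean healthy, int freeSlots) {
            this.trackerName = trackerName;
            this.healthy = healthy;
            this.freeSlots = freeSlots;
        }
    }

    // Hypothetical JobTracker: keeps a queue of pending tasks and answers heartbeats.
    static class JobTracker {
        private final Deque<String> pendingTasks = new ArrayDeque<>();
        private final Map<String, Heartbeat> lastReport = new HashMap<>();

        void submit(String taskId) {
            pendingTasks.add(taskId);
        }

        // Records the node's report and replies with commands for that node.
        List<String> heartbeat(Heartbeat hb) {
            lastReport.put(hb.trackerName, hb); // resource management: remember node state
            List<String> commands = new ArrayList<>();
            if (!hb.healthy) {
                return commands; // unhealthy nodes get no new work
            }
            // Job scheduling: hand out one pending task per free slot.
            for (int i = 0; i < hb.freeSlots && !pendingTasks.isEmpty(); i++) {
                commands.add("LAUNCH " + pendingTasks.poll());
            }
            return commands;
        }
    }

    public static void main(String[] args) {
        JobTracker jt = new JobTracker();
        jt.submit("task-1");
        jt.submit("task-2");
        jt.submit("task-3");
        System.out.println(jt.heartbeat(new Heartbeat("tracker-a", true, 2)));
        System.out.println(jt.heartbeat(new Heartbeat("tracker-b", false, 4)));
        System.out.println(jt.heartbeat(new Heartbeat("tracker-b", true, 4)));
    }
}
```

The point of the design is that the TaskTracker initiates every exchange: the JobTracker never pushes work, it only answers a report with commands.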