Open-Source Framework for Distributed Computing: Introduction to Hadoop in Practice (III)

Source: Internet
Author: User

Hadoop Basic Process

The flow chart is too large for a single picture, so it is split into two parts. The steps below follow the flow chart through the execution of a specific job.

In a distributed environment, the client creates a job and submits it.

InputFormat performs the preprocessing before the map phase and is mainly responsible for the following work:

Verifying that the job input conforms to the input definition in the JobConf. This definition is already known when the map is implemented and the conf is built, and the types involved can be any subclass of Writable.

Dividing the input files into logical InputSplits. As mentioned earlier, the distributed file system has a block size limit, so large files are divided into multiple blocks.

Providing a RecordReader that turns each InputSplit into a set of records and passes them to map. (An InputSplit is only the first step of logical segmentation. How the data is actually sliced, based on the information in the file, is still up to the RecordReader. The simplest default, for example, is to split on line breaks.)
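The two-stage idea above (size-based logical splits first, then record boundaries found by a reader) can be sketched in plain Python. This is an illustrative simulation, not Hadoop's API: compute_splits and read_records are hypothetical names, and the rule that a line belongs to the split in which it starts is an assumption chosen so that every line is read exactly once.

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that cover file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


def read_records(data, offset, length):
    """Yield (byte_offset, line) records for one split of `data` (bytes).

    A line belongs to the split in which its first byte lies, so a
    reader may run past the end of its split to finish its last line.
    """
    end = offset + length
    pos = offset
    if offset > 0:
        # Find the first line that starts at or after `offset`.
        newline = data.find(b"\n", offset - 1)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1
```

Note that the split boundaries fall in the middle of lines here, yet no line is lost or duplicated, because the reader rather than the split decides where a record begins and ends.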

The records produced by the RecordReader are the input to map. Map executes the user-defined map logic and writes the output key/value pairs to temporary intermediate files.
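As a minimal sketch of the map step, the following pure-Python simulation assumes a word-count job, which the text does not specify. map_record and run_map are hypothetical names, and the intermediate output is kept in a list rather than in temporary files.

```python
def map_record(key, value):
    """Word-count style map: emit (word, 1) for each word in the line.

    `key` stands for the record's position in the input and is unused.
    """
    for word in value.split():
        yield word.lower(), 1


def run_map(records):
    """Apply map_record to every (key, value) record and collect the output."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_record(key, value))
    return intermediate
```

For example, run_map([(0, "Hello hadoop")]) produces one (word, 1) pair per word, with the words lowercased.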

Combiner is optional. Its main role is to reduce the amount of data transferred to reduce: after each map finishes its analysis, a reduce is first run locally on that map's output.
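The effect of a combiner can be illustrated with a small pure-Python sketch. It assumes the reduce operation is a sum, which is safe to apply locally because addition is associative and commutative. combine is a hypothetical name, not a Hadoop call.

```python
from collections import defaultdict


def combine(map_output):
    """Locally pre-aggregate one map task's (key, count) pairs.

    Fewer pairs leave the map side, so less data is transferred
    to the reduce side.
    """
    totals = defaultdict(int)
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())
```

A combiner of this kind only preserves the final result when applying the reduce logic to partial groups gives the same answer as applying it once to the whole group. An average, for instance, could not be combined this way without also carrying the counts.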

Partitioner is optional. Its main role, when there are multiple reduces, is to specify which reduce handles a given map result. Each reduce produces its own separate output file. (A use scenario is described in the following code example.)
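A minimal sketch of hash-based partitioning, in plain Python rather than Hadoop's API. partition and route are hypothetical names, and the use of a CRC32 checksum is an assumption made here to get a hash that is stable across runs. The underlying idea, hash of the key modulo the number of reduces, is the usual default strategy.

```python
import zlib


def partition(key, num_reducers):
    """Map a string key to a reducer index in [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Group (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the index depends only on the key, all values for the same key end up at the same reduce. A custom partitioner would replace the body of partition with application-specific routing logic.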

Reduce executes the specific business logic and passes the processing results to the OutputFormat.
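The reduce step can be sketched in plain Python as follows. The business logic is assumed to be a sum, as in a word count. reduce_values and run_reduce are hypothetical names, and the sort-then-group stage stands in for the grouping that the framework performs before reduce is called.

```python
from itertools import groupby
from operator import itemgetter


def reduce_values(key, values):
    """Business logic for one key: here, the sum of its values."""
    return key, sum(values)


def run_reduce(pairs):
    """Sort pairs by key, group them, and call reduce_values once per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        reduce_values(key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]
```

Reduce sees each key once, together with all of the values emitted for it.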

OutputFormat is responsible for checking whether the output directory already exists and whether the output result types match those configured in the config, and then for writing out the results aggregated by reduce.
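The output stage can be sketched in plain Python as a directory check followed by one file per reduce. This is a local-filesystem simulation, not Hadoop's API. prepare_output_dir and write_part are hypothetical names, and the sketch assumes the job should refuse to start when the output directory is already present, so that earlier results are never overwritten. The part-00000 file naming and tab-separated lines are modeled on Hadoop's text output conventions.

```python
import os


def prepare_output_dir(output_dir):
    """Fail early if the output directory exists, otherwise create it."""
    if os.path.exists(output_dir):
        raise FileExistsError(f"output directory already exists: {output_dir}")
    os.makedirs(output_dir)


def write_part(output_dir, reducer_index, results):
    """Write one reducer's (key, value) results as tab-separated lines."""
    path = os.path.join(output_dir, f"part-{reducer_index:05d}")
    with open(path, "w", encoding="utf-8") as handle:
        for key, value in results:
            handle.write(f"{key}\t{value}\n")
    return path
```

The check runs once per job, before any task starts, while write_part runs once per reduce.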
