The basic process of Hadoop and the development of a simple application
Basic process:
The overall flowchart is too large to show in one piece, so it is split into two parts. Following the flowchart, a specific job is executed as follows:
1. In the distributed environment, the client creates the job and submits it.
2. The InputFormat pre-processes the input before map runs, and is mainly responsible for the following work:
a) Verify that the input format conforms to the input definition in the JobConf; this is already determined when the map is implemented and the configuration is built. If nothing more specific is defined, the input types can be any subclass of Writable.
b) Split the input file into logical InputSplits. This is where the block size of the distributed file system mentioned earlier comes in: a large file is divided into multiple blocks, and hence into multiple splits.
c) A RecordReader turns each InputSplit into a set of records and feeds them to map. (The InputSplit is only the first, logical step of splitting; how the data inside a split is actually turned into records is up to the RecordReader. The simplest default is to split on line breaks.)
3. The records produced by the RecordReader are used as the input to map. Map executes the user-defined map logic and writes the resulting key/value pairs to temporary intermediate files.
4. Combiner (optional). Its main purpose is to reduce the amount of data transferred during the reduce phase by running a local, map-side reduce after each map task finishes its analysis.
5. Partitioner (optional). Its main purpose is to decide which reduce task handles a given map output; when there are multiple reduce tasks, each reduce has its own separate output file. (A usage scenario is described in the code example below.)
6. Reduce executes the specific business logic and hands the processed results to the OutputFormat.
7. The OutputFormat checks whether the output directory already exists and whether the output types match those configured in the job, and then writes out the results aggregated by reduce. (How these components are wired together into a job is sketched right after this list.)
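To make the seven steps concrete, here is a minimal sketch, not taken from the article's code, of how these components are wired together with the old org.apache.hadoop.mapred API. It uses Hadoop's built-in IdentityMapper, IdentityReducer and HashPartitioner purely as stand-ins; in the real example below they would be replaced by LogAnalysiser's own classes. The class name JobWiring and the command-line arguments are assumptions made for this illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class JobWiring {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JobWiring.class);
        conf.setJobName("wiring-sketch");

        // Step 2: the InputFormat validates the input and builds InputSplits;
        // TextInputFormat's RecordReader then splits each block on line breaks.
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));

        // Step 3: the map implementation (identity stand-in here).
        conf.setMapperClass(IdentityMapper.class);

        // Step 4 (optional): a local, map-side reduce to cut shuffle traffic.
        conf.setCombinerClass(IdentityReducer.class);

        // Step 5 (optional): decides which reduce task receives each key;
        // with two reduce tasks there will be two output files.
        conf.setPartitionerClass(HashPartitioner.class);
        conf.setNumReduceTasks(2);

        // Step 6: the reduce implementation (identity stand-in here).
        conf.setReducerClass(IdentityReducer.class);

        // Step 7: the OutputFormat checks the output directory and the
        // configured output types, then writes the reduced results.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Step 1: the client submits the job to the cluster and waits for it.
        JobClient.runJob(conf);
    }
}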
Code example:
Business Scenario Description:
The input and output paths can be configured (these are local file system paths, not HDFS paths). Based on an access log, the job analyses, for each API of an application, the total number of accesses and the total traffic, and writes the two statistics to two separate files.
This is just for testing purposes, so the code is not split into many classes; everything is merged into one class to illustrate the problem.
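Before looking at the real code, here is a hedged sketch, not the article's implementation, of how the "two output files" requirement maps onto a Partitioner: with two reduce tasks, one partition collects the access counts and the other collects the traffic totals. The TwoFilePartitioner name and the "count:" / "traffic:" key prefixes are assumptions made purely for illustration.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class TwoFilePartitioner implements Partitioner<Text, LongWritable> {

    public int getPartition(Text key, LongWritable value, int numPartitions) {
        // With two reduce tasks, partition 0 collects the access counts and
        // partition 1 collects the traffic totals (key prefixes assumed).
        if (numPartitions < 2) {
            return 0; // a single reduce: everything goes into one output file
        }
        return key.toString().startsWith("count:") ? 0 : 1;
    }

    public void configure(JobConf job) {
        // no configuration needed for this sketch
    }
}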
Figure 4: Test code class diagram
LogAnalysiser is the main class; it is responsible for creating and submitting the job and for printing part of the output. Its inner classes correspond to the roles described in the process above. Take a look at the code snippets for a few of these classes and methods:
LogAnalysiser::MapClass
public static class MapClass extends MapReduceBase
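The snippet is truncated at this point in the source. As an illustration only, here is a minimal sketch, not the article's full implementation, of what such a MapClass could look like with the old mapred API. The assumed log line layout (API name followed by a byte count) and the "count:" / "traffic:" key prefixes are hypothetical and simply match the partitioner sketch shown earlier.

// Requires java.io.IOException, org.apache.hadoop.io.* and
// org.apache.hadoop.mapred.* to be imported in the enclosing LogAnalysiser file.
public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        // Assumed log line layout: "<api> <bytes> ...".
        String[] fields = value.toString().split(" ");
        if (fields.length < 2) {
            return; // skip malformed lines
        }
        String api = fields[0];
        long bytes = Long.parseLong(fields[1]);
        // Emit one record for the access count and one for the traffic so the
        // partitioner can route them to different reduce tasks.
        output.collect(new Text("count:" + api), new LongWritable(1));
        output.collect(new Text("traffic:" + api), new LongWritable(bytes));
    }
}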