Hadoop MapReduce: A Pure Technical Walkthrough

This article focuses on the entire Hadoop MapReduce process: no storytelling, no filler, just a description of each stage. I took notes while digging through a lot of material online and added some opinions of my own; not everything is guaranteed to be correct, so judge for yourself. I hope it is of some help to students who want to learn about Hadoop MapReduce.





One. Goals

Using the map/reduce algorithm:

1) Provides distributed processing


a) Data is always available when needed


b) Applications do not care how many computers provide the service


2) Provides high reliability


a) Applications do not care about temporary or permanent failures of machines or the network





Two. What does the application do?


1) Defines Mapper and Reducer classes, plus a "startup" program that ties the job together


2) Mapper


a) Input is a (key1, value1) pair


b) Output is a (key2, value2) pair


3) Reducer


a) Input is a key2 and the set of value2s emitted for it


b) Output is a (key3, value3) pair


4) Startup program


a) Creates a JobConf to define the job


b) Submits the JobConf to the JobTracker and waits for execution (a sketch of the Mapper and Reducer classes follows below)
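
As a concrete illustration, here is a minimal WordCount-style Mapper and Reducer written against Hadoop's classic org.apache.hadoop.mapred API, the API whose class names (JobConf, JobTracker, and so on) this article uses throughout. Treat it as a sketch rather than production code; the class names WordCountMap and WordCountReduce are mine.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper: (key1, value1) = (byte offset, line of text) in,
    //         (key2, value2) = (word, 1) out.
    public class WordCountMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          output.collect(word, ONE);  // emit (word, 1)
        }
      }
    }

    // Reducer: receives a key2 plus all of its value2s,
    // emits (key3, value3) = (word, total count).
    class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

The matching startup program appears in section 11.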





Three. Application Data Flow Diagram


See Figure 1 in the appendix (not reproduced here).


Four. Input and output formatting

The application also chooses input and output formats, which define how persistent data is read and written. Both are interfaces that applications can implement themselves.


1) Input format


a) Splits the input data to determine the input to each map task


b) Defines a RecordReader that reads (key, value) pairs and passes them to the map task


2) Output format


a) Given a (key, value) pair and a filename, writes the output of the reduce task to persistent storage (both interfaces are sketched below)
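
For reference, this is roughly what the two interfaces look like in the classic mapred API; a simplified restatement, so consult your Hadoop version for the exact signatures:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.Progressable;

    public interface InputFormat<K, V> {
      // Splits the input so each map task gets one InputSplit.
      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

      // Supplies the RecordReader that feeds (key, value) pairs to the map task.
      RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                         Reporter reporter) throws IOException;
    }

    interface OutputFormat<K, V> {
      // Returns the RecordWriter that persists reduce output under `name`.
      RecordWriter<K, V> getRecordWriter(FileSystem fs, JobConf job,
                                         String name, Progressable progress)
          throws IOException;

      // Sanity-checks the output specification (e.g., output dir does not exist).
      void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException;
    }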





Five. Output Sort


Applications can control the sort order and the partitioning of output through the OutputKeyComparator and the Partitioner.


1) OutputKeyComparator


a) Defines how serialized key values are compared


b) A default OutputKeyComparator is provided, but an application can define its own


i. The default comparison is key1.compareTo(key2)


2) Partitioner


a) Given a map output key and the number of reduces, selects a reduce


b) The default is HashPartitioner, which uses modular arithmetic to spread the work (see the sketch below)


i. key.hashCode() % numReduces
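
A minimal custom Partitioner in the classic API might look like the following; an illustrative sketch whose logic mirrors HashPartitioner, with the bit mask guarding against negative hash codes:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class WordPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) {}  // no per-job setup needed

      // Pick a reduce for this key: mask off the sign bit so the modulo
      // result is never negative, then take hashCode % numReduces.
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

It is registered with conf.setPartitionerClass(WordPartitioner.class); a custom key comparator is registered with conf.setOutputKeyComparatorClass(...).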





Six. Combiners


A combiner is a job optimization that combines the multiple values for a key into a single value before the reduce.


1) Usually the combiner is the same class as the reducer and runs over the map output, just before that output goes to the reducer (registering it is shown in the snippet after this example). This follows the principle that moving computation is cheaper than moving data.


2) For example, the WordCount mapper generates (word, count) pairs, and the combiner and reducer both sum the counts for each word.


a) Input: "Hi Kevin Bye Kevin"


b) Map output: ("Hi", 1), ("Kevin", 1), ("Bye", 1), ("Kevin", 1)


c) Combine output: ("Kevin", 2), ("Bye", 1), ("Hi", 1)
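
Wiring a combiner into a job is one line in the driver. A sketch, assuming the WordCount and WordCountReduce class names from the section-two example:

    import org.apache.hadoop.mapred.JobConf;

    public class CombinerSetup {
      static JobConf configure() {
        JobConf conf = new JobConf(WordCount.class);
        // Run the reducer as a combiner over each map's output, before the
        // shuffle, so (word, 1) pairs collapse to (word, n) early.
        conf.setCombinerClass(WordCountReduce.class);
        conf.setReducerClass(WordCountReduce.class);
        return conf;
      }
    }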





Seven. Process Communication


1) Uses a common RPC implementation


a) Easy to change/extend


b) Protocols are defined as Java interfaces


c) The server implements the interface as a service object


d) Client proxy objects are generated automatically (the whole pattern is sketched below)


2) All calls are initiated by the client


a) This prevents call cycles, and therefore deadlock


3) Error handling


a) Covers timeouts and communication problems


b) Errors are signaled to the client via IOException


c) Errors are never signaled to the server
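
The pattern looks roughly like this with org.apache.hadoop.ipc.RPC from older Hadoop releases; the exact getServer/getProxy signatures vary by version, so treat this as a sketch (PingProtocol and PingServer are invented names):

    import java.io.IOException;
    import java.net.InetSocketAddress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.ipc.RPC;
    import org.apache.hadoop.ipc.Server;
    import org.apache.hadoop.ipc.VersionedProtocol;

    // b) The protocol is a plain Java interface.
    interface PingProtocol extends VersionedProtocol {
      long VERSION = 1L;
      String ping(String message) throws IOException;
    }

    // c) The server implements the interface as a service object.
    public class PingServer implements PingProtocol {
      public String ping(String message) { return "pong: " + message; }

      public long getProtocolVersion(String protocol, long clientVersion) {
        return VERSION;
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Server server = RPC.getServer(new PingServer(), "0.0.0.0", 9000, conf);
        server.start();

        // d) Clients talk through an automatically generated proxy; errors
        //    (timeouts, connection failures) surface here as IOExceptions.
        PingProtocol proxy = (PingProtocol) RPC.getProxy(
            PingProtocol.class, PingProtocol.VERSION,
            new InetSocketAddress("localhost", 9000), conf);
        System.out.println(proxy.ping("hello"));
      }
    }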





Eight. Map/reduce Process


1) Launching application


a) User application code


b) Submits a fully specified map/reduce job


2) JobTracker


a) Handles all jobs


b) Makes all task scheduling decisions


3) TaskTracker


a) Manages all tasks on its node


4) Task


a) Runs a single map or reduce fragment of a given job


b) Obtained from the TaskTracker




Nine. Process Interaction Diagram


See Figure 2 in the appendix (not reproduced here).





10. Job Control Process


1) The application initiator creates and submits the job


2) The JobTracker initializes the job, creates the FileSplits, and adds the tasks to its queue


3) Every 10 seconds, and once its current task completes, a TaskTracker requests a new map or reduce task


4) TaskTrackers report task status to the JobTracker every 10 seconds


5) When the job completes, the JobTracker tells the TaskTrackers to delete its temporary files


6) The application initiator is notified that the job has completed and stops waiting





11. Application Initiator


1) Application code creates a JobConf and sets its parameters


a) Defines the Mapper and Reducer classes


b) Defines the InputFormat and OutputFormat classes


c) Defines the Combiner class if necessary


2) Writes the JobConf and the application jar to DFS and submits the job to the JobTracker


3) Can either exit immediately or wait for the job to complete or fail (see the driver sketch below)
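
A typical initiator for the WordCountMap/WordCountReduce classes sketched in section two; a classic-API sketch, with paths taken from the command line:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setMapperClass(WordCountMap.class);       // 1a) Mapper
        conf.setCombinerClass(WordCountReduce.class);  // 1c) optional Combiner
        conf.setReducerClass(WordCountReduce.class);   // 1a) Reducer

        conf.setInputFormat(TextInputFormat.class);    // 1b) InputFormat
        conf.setOutputFormat(TextOutputFormat.class);  // 1b) OutputFormat
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // 2)/3) Ship the jar and JobConf to DFS, submit, and block until done.
        // Use JobClient.submitJob(conf) instead to return immediately.
        JobClient.runJob(conf);
      }
    }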





12. JobTracker


1) Processes the JobConf and creates an InputFormat instance, then calls its getSplits method to generate the inputs for the map tasks


2) Creates a JobInProgress object, TaskInProgress ("TIP") objects, and Task objects


a) JobInProgress holds the state of a job


b) A TaskInProgress is the state of one fragment of work


c) A Task is one attempt at executing a TIP


d) As TaskTrackers request work, Tasks are assigned to them for execution





13. TaskTracker


1) For every task


a) Creates a TaskRunner


b) Copies job.jar and job.xml from DFS


c) Localizes the JobConf for this task


d) Calls Task.prepare()


e) Launches the task in a new JVM as TaskTracker.Child


f) Captures the task's output and logs it at INFO level


g) Processes task status updates and sends them to the JobTracker every 10 seconds


h) If the job is killed, kills the task


i) If the task dies or completes, tells the JobTracker





14. TaskTracker for Reduces


1) For reduces, Task.prepare() fetches all of the map outputs relevant to this reduce


2) Files are fetched over HTTP from the other TaskTrackers' embedded Jetty servers


3) Files are fetched on parallel threads, but only one fetch per host at a time


4) When a fetch fails, a backoff scheme is used to avoid overloading an already-struggling TaskTracker


5) Fetching accounts for the first 33% of the reduce's overall progress





15. Map Tasks


1) Uses the InputFormat object to create a RecordReader for the FileSplit


2) Loops through the keys and values of the FileSplit and feeds them to the mapper (see the sketch after this list)


3) If there is no combiner, the output is written directly to the SequenceFile assigned to each reduce, partitioned by key


4) If there is a combiner, the framework buffers 100,000 keys and values, then sorts, combines, and writes them to the SequenceFile assigned to each reduce
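
Steps 1 and 2 amount to a loop like the following: a simplified sketch of roughly what the framework's MapRunner does, not its actual code:

    import java.io.IOException;

    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MapLoop {
      // Drive one map task: read every (key, value) in the split
      // and hand it to the user's mapper.
      static <K1, V1, K2, V2> void run(RecordReader<K1, V1> input,
                                       Mapper<K1, V1, K2, V2> mapper,
                                       OutputCollector<K2, V2> output,
                                       Reporter reporter) throws IOException {
        K1 key = input.createKey();      // reusable key/value objects
        V1 value = input.createValue();
        while (input.next(key, value)) { // false at the end of the FileSplit
          mapper.map(key, value, output, reporter);
        }
        input.close();
      }
    }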





16. Reduce Tasks: Sort


1) Sort


a) Takes the 33% to 66% band of the reduce's progress


b) Basic processing


i. Reads 100 MB (configuration parameter io.sort.mb) of keys and values to sort


ii. Sorts in memory


iii. Writes to disk


c) Merging


i. Reads 10 files (configuration parameter io.sort.factor) and merges them into a single file


ii. Repeats as many passes as needed (100 files take two levels, 1,000 files take three, and so on); both tuning knobs appear in the snippet below
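
Both knobs are ordinary JobConf settings. The property names below are the classic pre-2.x ones; later Hadoop releases renamed them:

    import org.apache.hadoop.mapred.JobConf;

    public class SortTuning {
      static void tune(JobConf conf) {
        conf.setInt("io.sort.mb", 100);     // in-memory sort buffer size, in MB
        conf.setInt("io.sort.factor", 10);  // number of files merged per pass
      }
    }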





17. Reduce Tasks: Reduce


1) Takes the 66% to 100% band of the reduce's progress


2) Uses a SequenceFile.Reader to read the sorted input and passes each key, together with its associated values, to the reducer (see the sketch after this list)


3) The output key/value pairs are written through the OutputFormat object, usually to a file in DFS


4) The reduce output is not re-sorted; each reduce's output file covers one partition of the map output keys
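
Conceptually the reduce phase is a loop like this. An illustrative sketch, not the framework's actual code: the GroupedInput interface is invented here to stand in for the sorted, grouped input stream:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ReduceLoop {
      // One grouped record: a key2 plus all of its value2s.
      interface GroupedInput<K2, V2> {
        boolean next() throws IOException;  // advance to the next key group
        K2 key();
        Iterator<V2> values();
      }

      static <K2, V2, K3, V3> void run(GroupedInput<K2, V2> input,
                                       Reducer<K2, V2, K3, V3> reducer,
                                       OutputCollector<K3, V3> output,
                                       Reporter reporter) throws IOException {
        while (input.next()) {  // keys arrive in sorted order
          reducer.reduce(input.key(), input.values(), output, reporter);
        }
      }
    }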





Finished. The best way to understand all of this is to work through a Hadoop example point by point. If anything here is wrong, corrections are welcome!

