Hadoop self-study note (3) MapReduce Introduction

1. MapReduce Architecture

MapReduce is a programming framework. Most MapReduce jobs can be written with Pig or Hive instead, but we still need to understand how MapReduce works, because it is the core of Hadoop, and that understanding prepares you to optimize jobs or write them yourself.

The Job Client represents the user, who submits jobs to the cluster.

The Job Tracker and the Task Trackers also form a master-slave structure.

Workflow (MapReduce Pipeline)

The Job Client submits the MapReduce program (for example, the binaries packaged in a jar), the location of the required data, and the output location to the Job Tracker. The Job Tracker first asks the Name Node which blocks hold the required data, then picks a Task Tracker close to that data (it may be on the same node, on the same rack, or on a different rack) and sends the task to it. The Task Tracker is what actually executes the task; it has a number of Task Slots in which tasks run. If execution fails, the Task Tracker reports this to the Job Tracker, and the Job Tracker reassigns the task to another Task Tracker. While a task is running, the Task Tracker continuously reports progress to the Job Tracker. When the Task Tracker finishes, it reports back to the Job Tracker, which updates the task status to successful.

Note: When you submit a MapReduce job, the job is not only handed to the Job Tracker but also copied to a public location in HDFS (the shared location in the figure). Because this makes it easy to distribute code and commands, each Task Tracker can fetch the code without difficulty.

These seven steps are shown in the figure.
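To make the Job Client's side concrete, here is a minimal sketch of a driver that configures and submits a word-count job through the Hadoop MapReduce Java API. The class names WordCountDriver, WordCountMapper, and WordCountReducer are hypothetical placeholders (the Mapper and Reducer are sketched in the next section), and constructor details can differ slightly between Hadoop versions.

// Minimal Job Client sketch: configure a word-count job and submit it to the Job Tracker.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");        // Hadoop 1.x style Job construction
        job.setJarByClass(WordCountDriver.class);     // tells Hadoop which jar to ship to the cluster
        job.setMapperClass(WordCountMapper.class);    // Map stage
        job.setReducerClass(WordCountReducer.class);  // Reduce stage
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/words.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/results
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait for it
    }
}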

2. MapReduce Internals

Split stage: based on the Input Format, the input data is divided into small splits. This stage runs at the same time as the Map stage, and the splits are handed to different Mappers.

Input Format: determines how the data is divided among the Mappers, for example log files, database records, binary data, and so on.

Map stage: the splits handed over from the Split stage are converted into key-value pairs. How the conversion is done depends on the user's code.

Shuffle & Sort stage: the data produced by the Map stage is sorted and sent to the Reducers.

Reduce stage: the incoming (key, value) data from the Map stage is aggregated according to the user's code.

Output Format: after the Reduce stage finishes, the result is written in this format to the output directory in HDFS.
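As a concrete illustration of the Map and Reduce stages, here is a minimal word-count sketch in Java. It is a simplified version of the classic example; the classes bundled in the hadoop-examples jar used later may differ in detail.

// Map stage: turn each line of the input split into (word, 1) pairs.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);               // emit (word, 1)
        }
    }
}

// Reduce stage: after Shuffle & Sort, all counts for one word arrive together.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();                     // add up the 1s emitted by the Mappers
        }
        context.write(word, new IntWritable(sum));  // emit (word, total count)
    }
}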

Imperative programming paradigm: a program is described as a series of steps that change its state, that is, procedural programming. It pays more attention to objects and state.

Functional programming paradigm: programming based on functions, where a series of computations is treated as mathematical functions. Hadoop follows this paradigm: there are inputs and outputs, but no objects and no mutable state.

For the sake of optimization, Hadoop also adds further interfaces, such as the Combine stage shown in the figure. Its main task is to perform a small, local Reduce computation before the data is handed to the Shuffle/Sort stage. This saves a lot of bandwidth (do you still remember that the job code is placed in a public location?).
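For word count, the Reducer can also serve as the Combiner, because summing partial counts locally and then summing them again globally gives the same result. Continuing the hypothetical driver sketch above, a single extra line is enough:

// Run a local "mini reduce" on each Mapper's output before Shuffle & Sort.
job.setCombinerClass(WordCountReducer.class);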

The process above may not seem intuitive, but it is the hardest part of Hadoop to understand. Once you understand this pipeline, the later content becomes much easier to follow.

3. MapReduce Example

The following example shows how Hadoop performs the preceding tasks on actual machines.

Hyper-V is installed on Windows and hosts four Hadoop nodes, each running in an Ubuntu environment.

As shown above, there is one Name Node and three Data Nodes.

First, connect to the Name Node and to one Data Node. In the Name Node's Ubuntu system, open a terminal and enter jps to see what is running in the JVM.

Run the same command on the Data Node machine. You can see that DataNode, Jps, and TaskTracker are running.

First, go to the Data Node machine and create a file named words.txt under the root directory. Its content is the words to be analyzed.


Step 2: Put the words.txt file into HDFS.

First, view the files in HDFS:

hadoop/bin/hadoop fs -ls

Then create a folder:

hadoop/bin/hadoop fs -mkdir /data

We can use a browser to check the file system in HDFS.

Enter hnname:50070 in the browser to open the Web UI.


You can see the newly created data folder in Live Nodes. Then execute:

hadoop/bin/hadoop fs -copyFromLocal words.txt /data

words.txt is then copied into the /data folder, which you can verify through the Web UI.
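The same HDFS operations can also be done programmatically. Below is a minimal sketch using the Hadoop FileSystem Java API, assuming words.txt sits in the current working directory and the Hadoop configuration files for this cluster are on the classpath.

// Sketch: create /data in HDFS and copy the local words.txt into it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyWordsToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);               // handle to HDFS
        fs.mkdirs(new Path("/data"));                       // same as: hadoop fs -mkdir /data
        fs.copyFromLocalFile(new Path("words.txt"),         // same as: hadoop fs -copyFromLocal words.txt /data
                new Path("/data/words.txt"));
        fs.close();
    }
}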

Step 3: Run the MapReduce task. This task counts word frequencies and has already been written in a ready-made jar package, hadoop-examples-1.2.0.jar, inside the hadoop directory. This file contains many ready-to-use MapReduce tasks.

Run the following command:

hadoop/bin/hadoop jar hadoop/hadoop-examples-1.2.0.jar wordcount /data/words.txt /data/results

The command first specifies the jar package, then the program name wordcount, then the input data /data/words.txt, and finally the output directory /data/results, which will be created.


After the execution is complete, you can view the execution result through the Web UI.

Unfortunately I couldn't upload too many original images, so I had to delete a few ...
