[Hadoop in Action] Chapter 1: Introduction to Hadoop


    • The basics of writing scalable, distributed, data-intensive programs

    • Understanding Hadoop and MapReduce

    • Writing and running a basic MapReduce program

1. What is Hadoop

Hadoop is an open-source framework for writing and running distributed applications to handle large-scale data.

What makes Hadoop distinctive are the following points:

    1. Accessible--Hadoop runs on large clusters of commodity machines or on cloud computing services;

    2. Robust--because Hadoop is intended to run on commodity hardware, its architecture assumes that hardware will fail frequently and is built to tolerate such failures;

    3. Scalable--Hadoop scales linearly to handle larger datasets simply by adding nodes to the cluster;

    4. Simple--Hadoop lets users quickly write efficient parallel code.

2. Understanding Distributed Systems and Hadoop


Understand the contrast between scaling out with distributed systems and scaling up with large single-machine servers, taking into account the price/performance of existing I/O technologies.

Understand the difference between Hadoop and other distributed architectures (for example, SETI@home):

Hadoop's design philosophy is that code migrates to the data, while SETI@home's design philosophy is that the data migrates to the code.

The program to be run is several orders of magnitude smaller than the data it processes, so it is easier to move the executable code to the machines where the data resides than to move the data across the network; Hadoop therefore moves the code and leaves the data in place.

3. Comparing SQL Databases and Hadoop

SQL (Structured Query Language) is designed for structured data, whereas many of Hadoop's earliest applications targeted unstructured data such as text. Let's compare Hadoop with a typical SQL database in more detail from several specific angles:

    1. Scale out instead of scale up--scaling up a commercial relational database to handle more data quickly becomes prohibitively expensive, whereas Hadoop scales out across commodity machines;

    2. Key/value pairs instead of relational tables--Hadoop uses key/value pairs as its basic data unit, which are flexible enough to handle less-structured data types (see the sketch after this list);

    3. Functional programming (MapReduce) instead of declarative queries (SQL)--in MapReduce you specify the actual data processing steps yourself, much like writing the execution plan a SQL engine would otherwise generate;

    4. Offline batch processing instead of online processing--Hadoop is designed for offline processing and analysis of large-scale data, not for the online transaction processing pattern of randomly reading and writing a few records at a time.
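
As a small illustration of point 2, here is a hedged sketch of what a single record looks like in Hadoop. Text and IntWritable are the actual Hadoop classes used in the code listing later in this chapter; the class and variable names below are only illustrative. Instead of a row in a relational table, a record is simply a key paired with a value:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class KeyValueSketch {
  public static void main(String[] args) {
    // A record is just a key/value pair; there is no schema beyond the two types.
    Text word = new Text("hadoop");          // the key: a loosely structured piece of data
    IntWritable count = new IntWritable(1);  // the value associated with that key
    System.out.println(word + "\t" + count); // a MapReduce job processes streams of such pairs
  }
}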

4. Understanding MapReduce

MapReduce is a data processing model whose greatest advantage is that it scales easily to process data across many compute nodes.

In the MapReduce model, the data processing primitives are called mappers and reducers.

Decomposing a data processing application into mappers and reducers is sometimes cumbersome, but once an application is written in MapReduce form, it can be scaled to run on hundreds, thousands, or even tens of thousands of machines in a cluster simply by changing the configuration.

[Scaling a simple program]

Processing a small number of documents: for each document, tokenization extracts the words one by one; for each word, add 1 to the corresponding entry in the multiset wordCount; finally, a display() function prints out all the entries in wordCount.
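
For concreteness, here is a minimal single-machine sketch of that pseudocode in Java (the class, method, and variable names are illustrative, not taken from the book's listings):

import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Count words across a handful of documents on one machine.
public class LocalWordCount {
  public static void main(String[] args) {
    String[] documents = { "hadoop in action", "hadoop runs on clusters" };
    Map<String, Integer> wordCount = new HashMap<String, Integer>();  // the "multiset"
    for (String doc : documents) {
      StringTokenizer tokens = new StringTokenizer(doc);              // word segmentation
      while (tokens.hasMoreTokens()) {
        String word = tokens.nextToken();
        Integer n = wordCount.get(word);
        wordCount.put(word, n == null ? 1 : n + 1);                   // add 1 for this word
      }
    }
    for (Map.Entry<String, Integer> entry : wordCount.entrySet()) {   // the display() step
      System.out.println(entry.getKey() + "\t" + entry.getValue());
    }
  }
}

This works for a small document set; the next paragraph describes what changes once the documents no longer fit on one machine.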


Processing a large number of documents: the work is distributed across multiple machines, each machine processes a different subset of the documents, and when all machines have finished, a second processing phase merges their results.

Several details can keep this approach from working as expected. For example, reading too many documents at once can saturate the bandwidth of the central storage server, and the merged wordCount entries can exceed a single computer's memory capacity. In addition, if only one computer handles the wordCount merge in the second phase, it becomes a bottleneck. The merge therefore also has to be distributed: wordCount must be partitioned in some way after the first phase and split across multiple computers, so that each computer in the second stage processes only one partition and can run independently.

To make it work on a distributed computer cluster, you need to add the following features:

    • Store files on many computers (first stage)

    • Write a disk-based hash table so that processing is not limited by memory capacity

    • Partition the intermediate data (i.e., wordCount) produced by the first stage

    • Shuffle these partitions to the appropriate computers in the second stage

The execution of a MapReduce program is divided into two main phases, mapping and reducing. Each phase is defined by a data processing function, called the mapper and the reducer, respectively. In the mapping phase, MapReduce takes the input data and feeds each data unit to the mapper; in the reducing phase, the reducer processes all of the outputs from the mapper and produces the final result. In short, the mapper filters and transforms the input so that the reducer can complete the aggregation.

In addition, to extend the word-counting program into a distributed one, we also have to write the partitioning and shuffling functions.
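
As a rough sketch of what the partitioning function does (this mirrors the idea behind Hadoop's default hash partitioning; the method name below is illustrative), each word is hashed so that the same word always lands on the same second-stage machine:

// Decide which second-stage machine (partition) should receive a given word.
// Because the same word always hashes to the same partition, all of its
// intermediate counts end up on one machine and can be summed there.
public static int partitionFor(String word, int numPartitions) {
  return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;  // the mask keeps the result non-negative
}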

Writing an application in the MapReduce framework is a matter of customizing the mapper and the reducer. The complete flow of data is as follows (a traced word-count example appears after the list):

    1. The input to the application must be organized as a list of key/value pairs <k1,v1>;

    2. The list of key/value pairs is split, and each individual pair <k1,v1> is processed by calling the mapper's map function;

    3. The output of all the mappers is aggregated into one huge list of <k2,v2> pairs;

    4. Each reducer processes one of the aggregated groups <k2,list(v2)> and outputs <k3,v3>.
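
Traced through the word-count example (the sample input is illustrative), the flow above looks roughly like this:

  input split:        <0, "hello world hello">               // <k1,v1>: byte offset, line of text
  after map:          <"hello",1> <"world",1> <"hello",1>    // <k2,v2> pairs
  after aggregation:  <"hello",[1,1]>  <"world",[1]>         // <k2,list(v2)>
  after reduce:       <"hello",2>  <"world",1>               // <k3,v3>: word, total count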

5. Using Hadoop to Count Words--Running the First Program

    • Linux operating system

    • JDK 1.6 or above

    • A working Hadoop installation

Usage: hadoop [--config confdir] COMMAND

where COMMAND is one of the following:

  namenode -format                          format the DFS filesystem
  secondarynamenode                         run the DFS secondary namenode
  namenode                                  run the DFS namenode
  datanode                                  run a DFS datanode
  dfsadmin                                  run a DFS admin client
  fsck                                      run a DFS filesystem checking utility
  fs                                        run a generic filesystem user client
  balancer                                  run a cluster balancing utility
  jobtracker                                run the MapReduce jobtracker node
  pipes                                     run a Pipes job
  tasktracker                               run a MapReduce tasktracker node
  job                                       manipulate MapReduce jobs
  version                                   print the version
  jar <jar>                                 run a jar file
  distcp <srcurl> <desturl>                 copy files or directories recursively
  archive -archiveName NAME <src>* <dest>   create a Hadoop archive
  daemonlog                                 get/set the log level for each daemon
  CLASSNAME                                 run the class named CLASSNAME

Most commands print help information when invoked without parameters.

The command for running the word-count example program has the following form:

hadoop jar hadoop-*-examples.jar wordcount [-m <maps>] [-r <reduces>] <input> <output>

The commands for compiling and packaging the modified word-count program have the following form:

javac -classpath hadoop-*-core.jar -d playground/classes playground/src/WordCount.java

jar -cvf playground/wordcount.jar -C playground/classes/ .

The command for running the modified word-count program has the following form:

hadoop jar playground/wordcount.jar org.apache.hadoop.examples.WordCount input output

Code listing: WordCount.java

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());  // (1) use whitespace for word segmentation
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());                                  // (2) put the token into the Text object
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);                                   // (3) output the count for each token
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
        new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


At position (1), WordCount uses Java's StringTokenizer in its default configuration, which tokenizes only on whitespace. To ignore standard punctuation during tokenization, add the punctuation characters to the StringTokenizer delimiter list:

StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.:;?![]'");

Because you want the word count to ignore case, convert each word to lowercase before putting it into the Text object:

word.set(itr.nextToken().toLowerCase());

To show only words that appear more than 4 times:

if (sum > 4) context.write(key, result);

6. Hadoop history

Founder: Doug Cutting

Around 2004--Google published two papers describing the Google File System (GFS) and the MapReduce framework.

January 2006--Yahoo hired Doug Cutting to work with a dedicated team on improving Hadoop as an open source project.


This article is from the "Data Craftsman" blog, please be sure to keep this source http://artist.blog.51cto.com/4061938/1716297
