Getting started with Hadoop WordCount Program

Source: Internet
Author: User
Tags: hadoop fs


This article mainly introduces the working principle of MapReduce and explains the WordCount program in detail.

1. MapReduce Working Principle

The book Hadoop in Action gives a good description of the MapReduce computing model, which we quote here directly:

In Hadoop, there are two machine roles for executing MapReduce tasks: JobTracker and TaskTracker. JobTracker is used for scheduling and TaskTracker is used for execution. A Hadoop cluster has only one JobTracker.

In distributed computing, the MapReduce framework handles the hard problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, and abstracts the processing into two functions: map and reduce. map is responsible for splitting a task into multiple sub-tasks, and reduce is responsible for aggregating the results of those sub-tasks after the split.

In Hadoop, each MapReduce task is initialized as a Job, and each Job can be divided into two stages: the map stage and the reduce stage. These two stages are represented by two functions, the map function and the reduce function. The map function receives an input in the form of <key, value> and generates an intermediate output, also in the form of <key, value>. The reduce function receives an input in the form of <key, (list of values)> and processes that set of values. Each reduce call generates 0 or 1 output, again in the form of <key, value>.

From the above explanation, we can see that MapReduce distributes operations on a large-scale dataset across the nodes managed by the master node, and the final result is obtained by merging the intermediate results from each node. Datasets (or tasks) suitable for MapReduce must have the following property: the dataset can be divided into many small datasets, and each small dataset can be processed fully in parallel. The entire process takes input and produces output in the form of <key, value>.

Next, we describe the internal running process of MapReduce, a good illustration of how it works, with a simple example: the data flow of WordCount.

Step 1: The input files file1 and file2 are first split by TextInputFormat into two InputSplits, which are fed to two map tasks. The input to each map is in the form of <key, value>; note that the key is the byte offset of the current line within the file and the value is the content of that line;

Step 2: Each map splits the content of each line into words and emits each word in the form of <word, 1>; note that the value for every word occurrence is 1;

Step 3: The map output is fed to the reduce stage. The TaskTracker receives input in the form of <word, {1, 1, …}>, and reduce sums the values to compute the frequency, producing output in the form of <word, sum>.
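To make this concrete, here is roughly what the data looks like at each stage for the file1/file2 example used later in this article (file1 contains "Hello World", file2 contains "Hello Hadoop"):

map input (file1):  <0, "Hello World">            map input (file2):  <0, "Hello Hadoop">
map output (file1): <Hello, 1>, <World, 1>        map output (file2): <Hello, 1>, <Hadoop, 1>
reduce input:  <Hadoop, {1}>, <Hello, {1, 1}>, <World, {1}>
reduce output: <Hadoop, 1>, <Hello, 2>, <World, 1>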

In the above process, the initial input files and the final output are stored on HDFS, but the intermediate map output is written only to the local disk, not to HDFS. This is because the map output can be deleted once the Job completes, so there is no need to store it on HDFS. Although storing data on HDFS is safer, the extra network transfer would reduce the execution efficiency of the MapReduce task, so map output files are written to local disk. If a map task crashes before it has had time to transfer its data to reduce, the JobTracker only needs to pick another machine to re-execute that task. (The JobTracker provides this function: it schedules a task to a TaskTracker, the TaskTracker reports back while executing, and the JobTracker records the task's progress. If a task on one TaskTracker fails, the JobTracker assigns the task to another TaskTracker until it completes.)

2. Detailed description of the WordCount Program

The WordCount program is the entry-level program for learning Hadoop, so we explain it in detail. Running it involves the following steps: upload a local text file to HDFS, let the WordCount program perform the MapReduce computation, and write the result back to HDFS.

Step 1: Log in to the CentOS system (a previous blog post introduced how to set up Hadoop on CentOS 6.0) and create a local folder named file. In that folder, create two new text files, file1 and file2; the content of file1 is Hello World and the content of file2 is Hello Hadoop.

Step 2: Create an input folder on HDFS and upload the files from the local file folder to the cluster's input directory;

Step 3: Run the WordCount program on the cluster, using input as the input directory and output as the output directory;

The above process can be completed in the CentOS terminal:

mkdir ~/file: create a folder named file locally
cd ~/file

echo "Hello World" > file1.txt
echo "Hello Hadoop" > file2.txt: the echo command outputs its argument, and "> fileN.txt" redirects that output into the file, which is how the text is stored in file1 and file2.

hadoop fs -mkdir input: create the input directory on HDFS

hadoop fs -put ~/file/file*.txt input: upload the files from the local file folder to HDFS


Run the WordCount program:
hadoop jar /usr/local/hadoop/hadoop-0.20.2/hadoop-0.20.2-examples.jar wordcount input output
"hadoop jar": run a jar file with the hadoop command;
"/usr/local/hadoop/hadoop-0.20.2/hadoop-0.20.2-examples.jar": path of the jar package that contains WordCount;
"wordcount": name of the main program class;
"input output": the input and output folders.

hadoop fs -cat output/part-r-00000: view the content of the output file
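With the two input files above, the output should look roughly like this (by default the word and its count are separated by a tab):

Hadoop  1
Hello   2
World   1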

Careful readers will have noticed that file operations in the Hadoop framework take the form hadoop fs -*. The common hadoop fs commands are described below:

1. hadoop fs -fs [local | <file system URI>]: declares the file system hadoop uses. If no file system is declared, the configuration is searched in the following order: hadoop-default.xml inside the hadoop jar -> hadoop-default.xml under $HADOOP_CONF_DIR -> hadoop-site.xml under $HADOOP_CONF_DIR. Using local means using the local file system as hadoop's DFS; if a URI is passed as the parameter, that specific file system is used as the DFS.
2. hadoop fs -ls <path>: equivalent to ls on the local system; lists the contents of the specified directory and supports pattern matching. Output format: filename (full path) <r n> size, where n is the number of replicas and size is the file size in bytes.
3. hadoop fs -lsr <path>: recursively lists files matching the pattern; similar to ls, but recurses into all subdirectories.
4. hadoop fs -du <path>: lists the total space used by files matching the pattern (in bytes); equivalent to du -sb <path>/* for directories and du -b <path> for files in unix. Output format: name (full path) size (in bytes).
5. hadoop fs -dus <path>: like -du with the same output format, but equivalent to unix du -sb.
6. hadoop fs -mv <src> <dst>: moves files matching the pattern to the target location. When src contains multiple files, dst must be a directory.
7. hadoop fs -cp <src> <dst>: copies files to the target location. When src contains multiple files, dst must be a directory.
8. hadoop fs -rm [-skipTrash] <src>: deletes the specified files matching the pattern; equivalent to rm <src> in unix.
9. hadoop fs -rmr [-skipTrash] <src>: recursively deletes all files and directories; equivalent to rm -rf <src> in unix.
10. hadoop fs -rmi [-skipTrash] <src>: equivalent to unix rm -rfi <src>.
11. hadoop fs -put <localsrc> ... <dst>: copies files from the local file system to DFS.
12. hadoop fs -copyFromLocal <localsrc> ... <dst>: equivalent to -put.
13. hadoop fs -moveFromLocal <localsrc> ... <dst>: equivalent to -put, except that the source file is deleted after it is copied.
14. hadoop fs -get [-ignoreCrc] [-crc] <src> <localdst>: copies files matching the pattern from DFS to the local file system. If multiple files match, localdst must be a directory.
15. hadoop fs -getmerge <src> <localdst>: copies multiple files from DFS and merges and sorts them into one file on the local file system.
16. hadoop fs -cat <src>: displays the file content.
17. hadoop fs -copyToLocal [-ignoreCrc] [-crc] <src> <localdst>: equivalent to -get.
18. hadoop fs -mkdir <path>: creates a directory at the specified location.
19. hadoop fs -setrep [-R] [-w] <rep> <path/file>: sets the replication level of a file. The -R flag controls whether subdirectories and files are set recursively.
20. hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...: modifies file permissions; -R means recursive. MODE is of the form a+r, g-w, +rwx, and so on; OCTALMODE is of the form 755.
21. hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH...: modifies the owner and group of files; -R means recursive.
22. hadoop fs -chgrp [-R] GROUP PATH...: equivalent to -chown ... :GROUP ....
23. hadoop fs -count [-q] <path>: counts the number of files and the space used. The output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME; if -q is added, QUOTA, REMAINING_QUOTA, SPACE_QUOTA, and REMAINING_SPACE_QUOTA are also listed.
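For example, continuing with the input and output directories from this walkthrough, a few typical invocations look like this (result.txt is just a hypothetical local file name):

hadoop fs -ls input
hadoop fs -du output
hadoop fs -get output/part-r-00000 ~/file/result.txt
hadoop fs -rmr output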

The program and detailed notes are as follows:

package test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        // Mapper<Object, Text, Text, IntWritable> here uses generics; it is really
        // Mapper<input_key_type, input_value_type, output_key_type, output_value_type>,
        // which specifies the data types used in map.
        // Except for Object, these types are not from the JDK. They are Hadoop's wrappers
        // around the JDK types: Text corresponds to the JDK String, and IntWritable to the
        // JDK int. The main reason for this is to let Hadoop serialize the data.
        private final static IntWritable one = new IntWritable(1);
        // Declare an IntWritable variable for counting; each time a key appears, it is given the value 1.
        private Text word = new Text(); // Holds the key of the map output; Text type.

        // This is the map function. It also uses generics, matching the Mapper abstract class above.
        // The Object key and Text value here correspond to the Object and Text above and should be
        // the same; otherwise an error is reported in most cases.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            // Hadoop reads the value line by line, and the key is the offset of that line.
            // Because we want to count each word, with whitespace as the default delimiter,
            // we use StringTokenizer to help split the string; String.split could also be used.
            while (itr.hasMoreTokens()) { // iterate over the words in the line
                word.set(itr.nextToken()); // when a word appears, set it as the key, with value 1
                context.write(word, one);  // emit the key/value pair
                // The above is the splitting (map) step
            }
        }
    }

    public static class IntSumReducer // static class for reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        // As in Map, this sets the input/output key and value types.
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                // After the map output is shuffled, we get something like
                // {key, values} = {"hello", {1, 1, ...}} here.
                sum += val.get(); // extract each value and add it to the total count, sum
            }
            result.set(sum); // take the sum of the values and set it as the result
            context.write(key, result);
            // The key here is the key output by map; it does not change.
            // The result does change: it was a set of numbers before, and the computed
            // total is now written out as the key/value pair.
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // obtain the system configuration
        if (args.length != 2) { // check that both the input path and the output path are given
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2); // exit if there are not exactly two arguments
        }

        // From Hadoop's point of view, running this program is a Job, so initialize the Job.
        Job job = new Job(conf, "Word Count");

        // Configure the job name; this program runs the WordCount.class bytecode file.
        job.setJarByClass(WordCount.class);

        // Configure the classes used by the job:
        // TokenizerMapper handles the Map phase and IntSumReducer handles the Reduce phase.
        // Use the map function of the TokenizerMapper class in this job.
        job.setMapperClass(TokenizerMapper.class);

        // Use the reduce function of the IntSumReducer class in this job.
        job.setReducerClass(IntSumReducer.class);

        // The key type of the reduce output is Text.
        job.setOutputKeyClass(Text.class);

        // The value type of the reduce output is IntWritable.
        job.setOutputValueClass(IntWritable.class);

        // The input and output paths of the task are given by the command-line arguments
        // and are set with FileInputFormat and FileOutputFormat respectively.

        // Set the path of the files whose words are to be counted.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Set the output path for the word-count result.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // This actually submits the job to Hadoop for execution.
        // After the job parameters are set, job.waitForCompletion() runs the job;
        // main exits with 0 if the job succeeds and 1 otherwise.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
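If you want to compile and package this class yourself instead of using the bundled examples jar, a minimal sketch under the Hadoop 0.20.2 layout used in this article looks roughly like the following (wordcount_classes and wordcount.jar are just example names):

mkdir wordcount_classes
javac -classpath /usr/local/hadoop/hadoop-0.20.2/hadoop-0.20.2-core.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar test.WordCount input output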

