(ii) Hadoop Example: Running the WordCount Example

First, Requirements Description

Word count is one of the simplest programs, and also one that best embodies the idea of MapReduce; it is known as the MapReduce version of "Hello World". The complete source code of the program can be found in the "src/examples" directory of the Hadoop installation package. The program's main function is to count the number of occurrences of each word in a set of text files.

Second, Environment

    1. VMware® Workstation 10.04
    2. Ubuntu 14.04 32-bit
    3. Java JDK 1.6.0
    4. Hadoop 1.2.1

Third, Running the WordCount Example

1. Create the local sample data files:

Go to "Home"-"Hadoop"-"hadoop-1.2.1" and create a folder named "file" to store the local raw data.

Then create two files in this directory, named "MyTest1.txt" and "MyTest2.txt" (or any file names you like).

Enter the example sentences into each of the two files:
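For example (hypothetical sample content; any short English sentences will do):

MyTest1.txt:

Hello World

MyTest2.txt:

Hello Hadoop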

2. Create an input directory on HDFS

Open a terminal and enter the following command:

bin/hadoop fs -mkdir hdfsInput

When you execute this command, you may see an error complaining that the name node is in safe mode. If so, use

bin/hadoop dfsadmin -safemode leave

to exit safe mode.

When the distributed file system is in safe mode, the contents of the file system cannot be modified or deleted until safe mode ends. Safe mode exists so that, when the system starts, the validity of the data blocks on each DataNode can be checked, and data blocks can be copied or deleted as the replication policy requires. Safe mode can also be entered at runtime with a command.
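For reference, besides leave, the dfsadmin -safemode option also accepts get (to query the current state), enter, and wait:

bin/hadoop dfsadmin -safemode get

bin/hadoop dfsadmin -safemode enter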

In other words, we have created an input directory, hdfsInput, remotely in HDFS; we now need to upload the files to be processed into this directory before running the job.

3. Upload the local files into the hdfsInput directory on the cluster

In the terminal, enter the following commands in turn:

cd hadoop-1.2.1

bin/hadoop fs -put file/MyTest*.txt hdfsInput
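You can verify that the files arrived with:

bin/hadoop fs -ls hdfsInput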

4. Run the example:

Enter the following command at the terminal:

bin/hadoop jar hadoop-examples-1.2.1.jar wordcount hdfsInput hdfsOutput

Note that the examples jar here is for version 1.2.1, which may not match every machine; in that case, use the * wildcard in place of the version number:

bin/hadoop jar hadoop-examples-*.jar wordcount hdfsInput hdfsOutput

The following results should appear:

The hadoop command launches a JVM to run the MapReduce program, automatically picks up the Hadoop configuration, and adds the Hadoop libraries (and their dependencies) to the classpath. What appears above is the job's run log. From it you can see that this job was assigned an ID, job_201202292213_0002, and that there were two input files ("Total input paths to process: 2"); you can also read the input and output records (record counts and byte counts) of the map phase, as well as the input and output records of the reduce phase.

To view the contents of the hdfsOutput directory on HDFS:

Enter the following command at the terminal:

bin/hadoop fs -ls hdfsOutput

You will see that three files were generated; our results are in "part-r-00000".

Use the following command to view the contents of the result output file:

bin/hadoop fs -cat hdfsOutput/part-r-00000
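With the hypothetical input files above, the result file would contain each word and its count separated by a tab, with the keys in sorted order:

Hadoop	1
Hello	2
World	1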

In the WordCount example, a space is used as the delimiter for a word.

The logs in the output directory and the files in the input directory persist unless you delete them; if the results look inconsistent across runs, take this factor into account.

Fourth, the WordCount Processing Flow

The detailed implementation steps for WordCount are as follows:

1) The input files are split into splits. Because the test files are small, each file forms one split; each split is then divided line by line into <key, value> pairs, as shown in Figure 4-1. This step is performed automatically by the MapReduce framework. The offset (that is, the key) counts the carriage-return/line-feed characters, so it differs between Windows and Linux environments.

Figure 4-1 Segmentation Process
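With the hypothetical one-line input files above, each split yields a single pair whose key is the line's byte offset within the file:

<0, "Hello World">
<0, "Hello Hadoop">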

2) The split <key, value> pairs are handed to the user-defined map method, which processes them and produces new <key, value> pairs, as shown in Figure 4-2.

Figure 4-2 Executing the map method
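As a concrete reference, the map step is implemented along the following lines (a sketch modeled on the WordCount source under src/examples, shown here as a standalone class; the stock example nests it inside a single WordCount class, and your version may differ in detail):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map step: emit a <word, 1> pair for every token of every input line.
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // StringTokenizer splits on whitespace, which is why a space
            // acts as the word delimiter in this example.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // e.g. <"Hello", 1>
            }
        }
    }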

3) After the map method's <key, value> output is obtained, Mapper sorts the pairs by key and runs the combine process, which accumulates the values that share the same key, giving Mapper's final output, as shown in Figure 4-3.

Figure 4-3 Map-side sorting and the combine process
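In the stock example, the combine step simply reuses the reducer class, registered in the driver with job.setCombinerClass(IntSumReducer.class) (see the driver sketch after step 4), so partial sums for repeated words are already computed on the map side.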

4) Reducer first sorts the data received from Mapper, then hands it to the user-defined reduce method for processing, producing new <key, value> pairs as the final output of WordCount, as shown in Figure 4-4.

Figure 4-4 Reduce-side sorting and output of results
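Correspondingly, the reduce step and a driver that wires everything together look roughly like this (again a sketch modeled on the stock example; the driver is shown as a hypothetical standalone WordCountDriver class rather than the example's combined WordCount class, and TokenizerMapper refers to the mapper sketch above):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Reduce step: sum the counts collected for each word.
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();   // the 1s from the mapper, or partial sums from the combiner
            }
            result.set(sum);
            context.write(key, result);   // e.g. <"Hello", 2>
        }
    }

    // Driver: wires the mapper, combiner, and reducer together.
    class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");      // Hadoop 1.x constructor
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // combine reuses the reducer
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. hdfsInput
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. hdfsOutput
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }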

The content and pictures in this article are from:

http://www.cnblogs.com/nod0620/archive/2012/06/20/2549395.html

http://www.cnblogs.com/xia520pi/archive/2012/05/16/2504205.html

http://www.cnblogs.com/madyina/p/3708153.html
