Hadoop Learning (6): Understanding the MapReduce Process in Depth through the WordCount Example (1)


It took an entire afternoon (more than six hours) to put this summary together, but doing so deepened my understanding of the topic. It will be worth coming back to later.

After installing Hadoop, the usual way to verify the installation is to run the WordCount program that ships with it: create a folder from the terminal, write a line into each of two files, and then run Hadoop's built-in WordCount example, which outputs the number of occurrences of each distinct word in those lines. That check, however, is not the main point of this post. The goal here is to use the simple WordCount program to understand Hadoop's internal mechanics and, in particular, the detailed MapReduce process. In Thinking in BigDate (8) Big Data Hadoop core architecture HDFS + MapReduce + Hbase + Hive internal mechanism details, we took a rough look at the Hadoop cluster architecture and got a preliminary understanding of MapReduce. Here we will use the WordCount program to examine the MapReduce process in depth.

Testing the WordCount program from the command line:

WordCount counts the number of occurrences of each word in a text.

1. Create the WordCount sample files
zhangzhen@ubuntu:~/software$ mkdir input
zhangzhen@ubuntu:~/software$ cd input/
zhangzhen@ubuntu:~/software/input$ echo "I am zhangzhen" > test1.txt
zhangzhen@ubuntu:~/software/input$ echo "You are not zhangzhen" > test2.txt
zhangzhen@ubuntu:~/software/input$ cd ../hadoop-1.2.1/
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ cd bin
zhangzhen@ubuntu:~/software/hadoop-1.2.1/bin$ ls
hadoop             slaves.sh                  start-mapred.sh           stop-mapred.sh
hadoop-config.sh   start-all.sh               stop-all.sh               task-controller
hadoop-daemon.sh   start-balancer.sh          stop-balancer.sh
hadoop-daemons.sh  start-dfs.sh               stop-dfs.sh
rcc                start-jobhistoryserver.sh  stop-jobhistoryserver.sh
zhangzhen@ubuntu:~/software/hadoop-1.2.1/bin$ jps          (make sure Hadoop is up)
7101 SecondaryNameNode
7193 JobTracker
7397 TaskTracker
9573 Jps
6871 DataNode
6667 NameNode
zhangzhen@ubuntu:~/software/hadoop-1.2.1/bin$ cd ..
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ ls
bin          data                       hadoop-minicluster-1.2.1.jar  libexec
build.xml    docs                       hadoop-test-1.2.1.jar         LICENSE.txt   src
c++          hadoop-ant-1.2.1.jar       hadoop-tools-1.2.1.jar        logs          webapps
CHANGES.txt  hadoop-client-1.2.1.jar    ivy                           NOTICE.txt
conf         hadoop-core-1.2.1.jar      ivy.xml                       README.txt
contrib      hadoop-examples-1.2.1.jar  lib                           sbin
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ bin/hadoop dfs -put ../input in    # upload the files into the in directory in HDFS (strictly speaking this description is not exact; see the note below)
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ bin/hadoop dfs -ls .in/*
ls: Cannot access .in/*: No such file or directory.
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ bin/hadoop dfs -ls ./in/*
-rw-r--r--   1 zhangzhen supergroup         15   /user/zhangzhen/in/test1.txt
-rw-r--r--   1 zhangzhen supergroup         22   /user/zhangzhen/in/test2.txt

Note: HDFS has no notion of a current working directory, so files uploaded to HDFS cannot be browsed with cd and ls in the shell. Instead, the hadoop dfs commands shown above and below are used to view files in HDFS.
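For example, the uploaded files can also be listed by their full HDFS path (a small sketch based on the listing above, where uploads land under the user's HDFS home directory /user/zhangzhen):

zhangzhen@ubuntu:~/software/hadoop-1.2.1$ bin/hadoop dfs -ls /user/zhangzhen/in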

The location of hadoop-examples-1.2.1.jar differs between Hadoop versions. In Hadoop 1.2.1 it sits directly in the Hadoop installation directory; here we copy hadoop-examples-1.2.1.jar into the bin directory before running it.

Run: execute hadoop-examples-1.2.1.jar against the files in the in directory and write the results to the put directory.

zhangzhen@ubuntu:~/software$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount in put

View the output result:

zhangzhen@ubuntu:~/software/hadoop-1.2.1$ bin/hadoop dfs -ls
Found 2 items
drwxr-xr-x   - zhangzhen supergroup          0 2014-03-22  /user/zhangzhen/in
drwxr-xr-x   - zhangzhen supergroup          0 2014-03-22  /user/zhangzhen/put
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ bin/hadoop dfs -ls ./put
Found 3 items
-rw-r--r--   1 zhangzhen supergroup          0 2014-03-22  /user/zhangzhen/put/_SUCCESS
drwxr-xr-x   - zhangzhen supergroup          0 2014-03-22  /user/zhangzhen/put/_logs          <- a directory
-rw-r--r--   1 zhangzhen supergroup         39 2014-03-22  /user/zhangzhen/put/part-r-00000   <- this is the result file
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ bin/hadoop dfs -cat ./put/*
I       1
You     1
am      1
are     1
not     1
zhangzhen       2
cat: File does not exist: /user/zhangzhen/put/_logs
zhangzhen@ubuntu:~/software/hadoop-1.2.1$

The results above confirm that the Hadoop installation is working. Running hadoop-examples-1.2.1.jar simply means a Java program has been compiled and packaged into a jar, which is then executed directly; this is also how you will normally run your own Java MapReduce programs in the future: compile, package, upload, and run. Alternatively, Eclipse can connect to Hadoop so programs can be tested interactively. Each approach has its advantages and neither is described in detail here.
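As a rough sketch of that compile-package-run workflow (the wordcount_classes directory, the wordcount.jar name, and the out output path below are illustrative, not taken from the session above; the output directory must not already exist in HDFS, and depending on your setup additional jars from lib/ may be needed on the classpath):

zhangzhen@ubuntu:~/software/hadoop-1.2.1$ mkdir wordcount_classes
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ javac -classpath hadoop-core-1.2.1.jar -d wordcount_classes WordCount.java
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ jar -cvf wordcount.jar -C wordcount_classes/ .
zhangzhen@ubuntu:~/software/hadoop-1.2.1$ bin/hadoop jar wordcount.jar WordCount in out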

Having run the program, let us look at its source. The WordCount.java source file can be found in the Hadoop installation directory:

zhangzhen@ubuntu:~/software/hadoop-1.2.1/src/examples/org/apache/hadoop/examples$ pwd
/home/zhangzhen/software/hadoop-1.2.1/src/examples/org/apache/hadoop/examples
zhangzhen@ubuntu:~/software/hadoop-1.2.1/src/examples/org/apache/hadoop/examples$

The source code below was copied into an Eclipse project and used, unmodified, to test the actual data and obtain the results above. (The comments explain the code they accompany.)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        // The type parameters declare the data types used by the map. Text is roughly the
        // equivalent of the JDK String, and IntWritable of the JDK int; Hadoop uses its own
        // types mainly so that it can serialize and sort the data.

        private final static IntWritable one = new IntWritable(1);
        // An IntWritable used for counting: every time a key (word) appears, it is given the value 1.
        private Text word = new Text();
        // Holds the key (of type Text) emitted by the map.

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // This is the map function corresponding to the Mapper abstract class. The types of
            // key and value here should match the type parameters declared above (Object, Text);
            // otherwise errors are reported in most cases.
            StringTokenizer itr = new StringTokenizer(value.toString());
            // Hadoop reads the input line by line, and the key is the offset of the line. Since we
            // need to count each word and whitespace is the default separator, StringTokenizer can
            // be used to split the string (String.split would also work).
            while (itr.hasMoreTokens()) {    // iterate over the words in the line
                word.set(itr.nextToken());   // each word that appears becomes a key, with value 1
                context.write(word, one);    // emit the key/value pair
            }
            // The above is how the map scatters the input into key/value pairs.
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // The reduce class. As in the map, the type parameters declare the input/output key and value types.
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                // The scattered map output arrives here grouped by key, e.g. {key, values} = {"hello", [1, 1, ...]}.
                sum += val.get();   // extract the values one by one and add them up to get the total count
            }
            result.set(sum);              // store the sum as the value for this key
            context.write(key, result);   // the key is unchanged from the map output; the value is now the total count
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // obtain the system/job configuration
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            // Check that the command line supplies exactly two arguments: the input path and the output path.
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);   // exit if there are not two arguments
        }
        Job job = new Job(conf, "word count");
        // A MapReduce program runs on Hadoop as a Job, so a Job is initialized here.
        job.setJarByClass(WordCount.class);          // the class whose bytecode this job executes
        job.setMapperClass(TokenizerMapper.class);   // this job uses the map function of TokenizerMapper
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);    // this job uses the reduce function of IntSumReducer
        job.setOutputKeyClass(Text.class);           // the output key type of the reduce is Text
        job.setOutputValueClass(IntWritable.class);  // the output value type of the reduce is IntWritable
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // the input path for the word count
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // the output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
        // Submit the job to Hadoop and wait: main exits with 0 if the job completes successfully, 1 otherwise.
    }
    // Reference: http://hi.baidu.com/erliang20088/item/ce550f2f088ff1ce0e37f930
}

The secrets hidden in the WordCount program

1. Specific process:

1) Split the input files into splits. Because the test files are small, each file forms a single split, and each split is read line by line to produce <key, value> pairs, where the key is the byte offset of the line within the file and the value is the content of the line. This step is completed automatically by the MapReduce framework. The offset (that is, the key) also counts the newline characters of preceding lines, and so depends on the line endings of the environment (Linux here). A worked example with the two test files follows this list.


2) The <key, value> pairs from each split are handed to the user-defined map method, which processes them and produces new <key, value> pairs.


3) After the <key, value> pairs emitted by the map method are collected, the Mapper sorts them by key and runs the Combine step, which adds up the values of identical keys to produce the Mapper's final output.
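To make this concrete, here is a sketch of the data flow for the two test files above (each file holds one line; the file sizes of 15 and 22 bytes were shown in the HDFS listing earlier):

Split/read (one split per file, key = byte offset of the line):
  test1.txt: <0, "I am zhangzhen">
  test2.txt: <0, "You are not zhangzhen">

Map output (one <word, 1> pair per word):
  from test1.txt: <I, 1>, <am, 1>, <zhangzhen, 1>
  from test2.txt: <You, 1>, <are, 1>, <not, 1>, <zhangzhen, 1>

Sort/Combine per Mapper (sorted by key; since no word repeats within a single split, the combiner leaves these counts unchanged here):
  from test1.txt: <I, 1>, <am, 1>, <zhangzhen, 1>
  from test2.txt: <You, 1>, <are, 1>, <not, 1>, <zhangzhen, 1>

Only in the Reduce stage are the two <zhangzhen, 1> pairs from the different Mappers added together to give <zhangzhen, 2>, matching the part-r-00000 output shown earlier.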



2. Overall Map Task process:

It can be summarized into five steps:

1) Read: the Map Task parses key/value pairs out of its input InputSplit using the user-supplied RecordReader.

2) Map: in this phase, each parsed key/value pair is handed to the user-written map() function, which generates a series of new key/value pairs.

3) Collect: inside the user-written map() function, when a result is ready it is usually emitted by calling OutputCollector.collect(). Inside that call, the generated key/value pair is partitioned (via the Partitioner) and written into a ring memory buffer. (A sketch of this partitioning step appears after this list.)

4) Spill (overflow write): when the ring buffer fills up to its threshold, MapReduce spills the data to the local disk, generating a temporary file. Before the data is written to disk, it is first sorted locally and, if necessary, combined and compressed.

5) Combine: after all of the data has been processed, the Map Task merges all of its temporary spill files once, so that in the end it produces only one data file.
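To make the partitioning in the Collect step concrete, here is a minimal sketch of a partitioner that mirrors Hadoop's default hash partitioning (this WordPartitioner class is illustrative and is not part of the WordCount source; it assumes the same Text/IntWritable map output types used above):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which partition (and therefore which Reduce Task) each map output
// key/value pair belongs to; the partition number is attached to the record
// before it is written into the ring buffer and later spilled to disk.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same idea as the default hash partitioning: hash the key, clear the
        // sign bit, and take the remainder modulo the number of reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Such a class would be registered on the job with job.setPartitionerClass(WordPartitioner.class). With a single reduce task, as in the run above (which produced only part-r-00000), every key lands in partition 0.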

3. Overall Reduce process:

It can be summarized into five steps:

1) Shuffle: also called the Copy stage. The Reduce Task remotely copies its share of the output of every Map Task; any piece of data that exceeds a size threshold is written to disk, otherwise it is kept in memory.

2) Merge: while data is being copied remotely, the Reduce Task starts two background threads that merge the files held in memory and on disk, to prevent excessive memory usage and to avoid accumulating too many files on disk.

3) Sort: according to MapReduce semantics, the input to the user-written reduce() function is a set of data grouped by key. To bring data with the same key together, Hadoop uses a sort-based strategy. Since each Map Task has already partially sorted its own output, the Reduce Task only needs to merge-sort all of the copied data.

4) Reduce: in this phase, the Reduce Task hands each group of data to the user-written reduce() function. (A worked example follows this list.)

5) Write: the reduce() function writes the computed results to HDFS.
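Continuing the worked example from the two test files (a sketch assuming the single Reduce Task used in the run above), the grouped input to reduce() and its output look roughly like this:

reduce() input (grouped and sorted by key after Shuffle/Merge/Sort):
  <I, [1]>, <You, [1]>, <am, [1]>, <are, [1]>, <not, [1]>, <zhangzhen, [1, 1]>

reduce() output (the sum of each value list), written to HDFS as part-r-00000:
  <I, 1>, <You, 1>, <am, 1>, <are, 1>, <not, 1>, <zhangzhen, 2>

This matches the bin/hadoop dfs -cat ./put/* output shown earlier.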

Through the WordCount example (and several other blog posts on it), we have summarized the overall Map and Reduce flow. Together with Thinking in BigDate (8) Big Data Hadoop core architecture HDFS + MapReduce + Hbase + Hive internal mechanism details, this roughly retraces how a file's data is processed end to end. Many details have still not been explained, such as the Spill, Combine, and Shuffle processes; Shuffle in particular is the core of MapReduce. Next we will look at the MapReduce process in more depth, so that when operating a Hadoop cluster we can tune the system and, eventually, even modify the Hadoop source code.


 
