First Steps with MapReduce: Word Frequency Statistics (Word Count)
Original post. If you need to reprint it, please indicate the source. Address: http://www.cnblogs.com/crawl/p/7687120.html
----------------------------------------------------------------------------------------------------------------------------------------------------------
A large number of code examples are provided in these notes. Note that most of them are code I typed in and tested myself. If you find any shortcomings, please point them out ~
All the comments in this blog represent only the blogger's own opinions. If you have questions, or need the tools and materials used in this series, please contact qingqing_crawl@163.com
-----------------------------------------------------------------------------------------------------------------------------------------------------------
I can't let this month go by without writing a blog post. Recently I have been developing an online service hall for my school, while also attending classes and handling other tasks, so I have been very busy and under a lot of pressure; I finally squeezed out some time on the last day of the month. In fact, this post has been sitting in my draft box for a while. LZ originally wanted to write carefully about the pseudo-distributed deployment and installation of Hadoop first, and then introduce some HDFS content, before moving on to MapReduce. Since there is no time for that, let's get started with MapReduce today.
I. MapReduce Overview
1. MapReduce is a distributed computing model proposed by Google. It was originally used mainly in the search field to solve the problem of computing over massive amounts of data.
2. MapReduce consists of two phases: Map and Reduce. Users only need to implement the map() and reduce() functions to achieve distributed computing.
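As a concrete illustration (using a made-up one-line input, not data from this post), the key-value pairs flow through the two phases roughly like this:

input line:     hello world hello
map output:     <hello, 1>  <world, 1>  <hello, 1>
after grouping: <hello, {1, 1}>  <world, {1}>
reduce output:  <hello, 2>  <world, 1>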
II. Specific implementation
1. First, let's take a look at the package structure of this application in Eclipse.
2. Create the map task processing class: WCMapper
import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * 1. Of the four generic parameters of the Mapper class, the first two specify the mapper's input data types
 *    and the last two specify the mapper's output data types:
 *    KEYIN is the input key type, VALUEIN is the input value type,
 *    KEYOUT is the output key type, VALUEOUT is the output value type.
 * 2. Both map and reduce input and output are encapsulated as key-value pairs.
 * 3. By default, in the input data the framework passes to our mapper, the key is the starting offset of the
 *    current line in the text being processed (a Long), and the value is the content of that line (a String).
 * 4. The last two generic parameters must be chosen according to the actual requirements.
 * 5. To make serialization more efficient during network transmission, Hadoop wraps Java's Long as LongWritable
 *    and String as Text.
 */
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Override the map() method of Mapper. The MapReduce framework calls this method once for every line it reads.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Write the specific business logic here. The data to process has already been passed in by the framework:
        // the key parameter is the starting offset of this line, and value is the text content of this line.

        // 1. Convert the Text content of the line to a String
        String line = value.toString();
        // 2. Use StringUtils to split the String on spaces, returning a String[]
        String[] words = StringUtils.split(line, " ");
        // 3. Loop over the String[], call context.write(), and output key-value pairs
        //    key: the word   value: 1
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
3. Create the reduce task processing class: WCReducer
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * 1. Of the four generic parameters of the Reducer class, the first two (the input types) correspond to the
 *    Mapper's output types. The output types are chosen according to the actual requirements.
 */
public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    // After the framework finishes the map phase, it caches all the kv pairs, groups them by key,
    // and passes each group (<key, {values}>, e.g. <"hello", {1, 1, 1, ...}>) to this method,
    // calling it once per group.
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1. Define a counter
        long count = 0;
        // 2. Traverse the values list and accumulate the sum
        for (LongWritable value : values) {
            // Use LongWritable's get() method to convert the LongWritable to a long
            count += value.get();
        }
        // 3. Output the count for this word
        context.write(key, new LongWritable(count));
    }
}
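One optional tweak that is not part of LZ's original code: because this reduce logic is a simple sum, the same WCReducer class can also be registered as a combiner in the job driver (created in the next step). The combiner pre-aggregates counts on the map side and reduces the amount of data shuffled across the network. A single extra line in the driver is enough:

// Optional: also use the reducer as a combiner (valid here because summing is associative and commutative)
wcJob.setCombinerClass(WCReducer.class);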
4. Create a class that describes a specific job: WCRunner (LZ is not writing this in the standard way here)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * This class describes a specific job, for example:
 * 1. Which class the job should use as the mapper and which as the reducer
 * 2. The path of the data the job should process
 * 3. The path where the job should store its output
 */
public class WCRunner {

    public static void main(String[] args) throws Exception {
        // 1. Get a Job object: use the static Job.getInstance() method and pass in a Configuration object
        Configuration conf = new Configuration();
        Job wcJob = Job.getInstance(conf);

        // 2. Set the jar containing the classes used by this job: call setJarByClass() with the current class
        wcJob.setJarByClass(WCRunner.class);

        // 3. Set the mapper and reducer classes used by this job
        wcJob.setMapperClass(WCMapper.class);
        wcJob.setReducerClass(WCReducer.class);

        // 4. Specify the kv types of the reducer's output.
        //    Note: if the mapper and reducer output the same kv types, these two lines alone are enough.
        wcJob.setOutputKeyClass(Text.class);
        wcJob.setOutputValueClass(LongWritable.class);

        // 5. Specify the kv types of the mapper's output
        wcJob.setMapOutputKeyClass(Text.class);
        wcJob.setMapOutputValueClass(LongWritable.class);

        // 6. Specify where the original input data is stored: use FileInputFormat.setInputPaths()
        FileInputFormat.setInputPaths(wcJob, new Path("/wc/srcdata/"));

        // 7. Specify where to store the processing results: use FileOutputFormat.setOutputPath()
        FileOutputFormat.setOutputPath(wcJob, new Path("/wc/output/"));

        // 8. Submit the job to the cluster. Passing true prints the job's progress while it runs.
        wcJob.waitForCompletion(true);
    }
}
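For reference, the "standard way" LZ alludes to usually means implementing the Tool interface and launching the job through ToolRunner, so that generic Hadoop options and the input/output paths come from the command line instead of being hard-coded. The following is only a minimal sketch of that style; the class name WCRunnerTool and the argument handling are assumptions, not the code used in this post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Minimal sketch of the more conventional driver style; class name and paths are illustrative assumptions.
public class WCRunnerTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration that ToolRunner has already populated with generic options
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WCRunnerTool.class);

        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input and output paths come from the command line instead of being hard-coded
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WCRunnerTool(), args));
    }
}

Such a driver would then be launched with the paths as arguments, for example: hadoop jar wc.jar com.software.hadoop.mr.wordcount.WCRunnerTool /wc/srcdata /wc/output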
5. Export the project as a jar file
Steps: right-click the project ---> Export ---> Java ---> JAR file ---> specify the export path (I specified e:\wc.jar) ---> Finish
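If you prefer not to use the Eclipse export wizard, a plain jar command can produce an equivalent archive; this is just an optional alternative and assumes the compiled .class files sit under a bin/ directory:

jar cvf wc.jar -C bin .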
6. Upload the exported jar package to Linux
LZ uses the following method: press the Alt+P shortcut in the SecureCRT client to open the file-transfer terminal, then enter put e:\wc.jar to upload the file.
7. Create an initial test file: words.log
Command: vi words.log (enter some test data yourself).
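For example, words.log could contain a couple of space-separated lines like these (made-up sample data, referred to again in the output example later):

hello world
hello hadoop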
8. Create a directory in HDFS to store the initial test file words.log: we specified /wc/srcdata/ in WCRunner.
Command:
[hadoop@crawl ~]$ hadoop fs -mkdir /wc
[hadoop@crawl ~]$ hadoop fs -mkdir /wc/srcdata
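Depending on the Hadoop version, the two commands can usually be collapsed into one with the -p flag, which creates parent directories as needed (an optional shortcut, not what LZ typed):

[hadoop@crawl ~]$ hadoop fs -mkdir -p /wc/srcdata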
9. Upload the initial test file words.log to the corresponding HDFS directory.
Command: [hadoop@crawl ~]$ hadoop fs -put words.log /wc/srcdata
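You can optionally confirm that the file landed where expected before running the job:

[hadoop@crawl ~]$ hadoop fs -ls /wc/srcdata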
10. Run the jar file
Command: hadoop jar wc.jar com.software.hadoop.mr.wordcount.WCRunner
This command appends the fully qualified class name of WCRunner to hadoop jar wc.jar; the program entry point is the main method of WCRunner. After running this command, you can see the output log information:
Then go to the output path configured earlier (/wc/output/, which LZ set above) to see the result of the MapReduce run.
Run the command hadoop fs -ls /wc/output/ to view the contents of the /wc/output/ path; the listing typically shows a _SUCCESS marker file and the result file part-r-00000.
The result data is in the second file. Enter the command hadoop fs -cat /wc/output/part-r-00000 to view it:
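Each line of the result file holds a word and its count, separated by a tab. For the made-up words.log shown earlier, the output would look roughly like this (illustrative, not LZ's actual result):

hadoop	1
hello	2
world	1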
At this point our small application is complete. Isn't it interesting? LZ did run into a small mishap during the implementation:
LZ found that the error was caused by mismatched JDK versions. After unifying the JDK versions, the problem disappeared.