Computing models from WordCount to MapReduce


Overview

Although people now talk about a "big memory" era, the growth of memory still cannot keep pace with the growth of data. So we try to reduce the amount of data instead. The "reduction" here is not really shrinking the data, but spreading it out: storing it separately and computing on it separately. That is the core of distributed computation with MapReduce.

Copyright notice

Copyright belongs to the author.
For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
Author: Coding-naga
Published: May 10, 2016
Link: http://blog.csdn.net/lemon_tree12138/article/details/51367732
Source: CSDN

Directory

    • Overview
    • Copyright notice
    • Directory
    • About MapReduce
    • MapReduce principle
    • WordCount Program
      • Demand analysis
      • Logical implementation
        • Mapper
        • Reducer
        • Client
      • Run locally
      • Distributed operation
        • Packaged
        • Uploading source data
        • Distributed operation
        • Results window
    • Ref

About MapReduce

To understand MapReduce, first understand what carries MapReduce out. In Hadoop, the machines that run MapReduce tasks take on one of two roles: JobTracker or TaskTracker. The JobTracker manages and schedules the work, and the TaskTrackers execute it. A Hadoop cluster has only one JobTracker (although in Hadoop 2.x a cluster is no longer limited to a single one).

MapReduce principle

The essence of the MapReduce model lies in its algorithmic idea: divide and conquer. You can see this splitting process in one of my earlier posts, "Big data algorithms: sorting 500 million items of data"; if you are interested, have a look, since the sorting approach collected there is also based on divide and conquer.
Back to the point. In the MapReduce model, divide and conquer shows up vividly. When processing a large amount of data (say 1 TB; and don't object that you don't have that much data, because big companies do), relying on the hardware of a single machine is simply not realistic. Its memory is nowhere near large enough, and if we process on disk instead, the flood of IO operations becomes a fatal weakness. Google's clever engineers always manage to surprise us mere mortals: their idea was to spread the data across many machines, perform some preliminary computation on each of them, then run a series of aggregation steps, and finally gather the result onto the master/NameNode machine.
Of course, we cannot scatter our data across N arbitrary machines. We first have to establish reliable connections between these machines, which is what forms a computer cluster. Our data can then be distributed to the individual nodes of the cluster. In Hadoop this is done with the -put command, which also appears in the operations below.
Once the data has been uploaded to Hadoop's HDFS file system, the Mapper of the MapReduce model can read it into memory as key/value pairs; with the default input format, the key is a line's byte offset in the file and the value is the text of that line.

After the Mapper has processed the data, every word has been emitted as a (word, 1) key/value pair.

Well, at this point the Map phase is over. Next comes the Reduce phase: pairs with the same key are grouped together and handed to the Reducer, which adds up the 1s for each word to get its total count.
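To make this concrete, take a hypothetical input of just two lines, "hello world" and "hello hadoop". Roughly, the data moves through the model like this:

Map input:     (0, "hello world"), (12, "hello hadoop")
Map output:    (hello, 1), (world, 1), (hello, 1), (hadoop, 1)
After shuffle: (hadoop, [1]), (hello, [1, 1]), (world, [1])
Reduce output: (hadoop, 1), (hello, 2), (world, 1)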

You may also notice a combine step in the middle. This step is optional, and in some cases it must be left out. The Combiner deserves a discussion of its own (for WordCount it simply pre-aggregates the map output on each node, turning (hello, 1), (hello, 1) into (hello, 2) before the shuffle), but it is not the subject of this article, so I will not go into detail here.
With that, the whole MapReduce flow is covered. Let's look at the concrete implementation and the test results.

WordCount Program

The MapReduce computation model for WordCount can be viewed in my online drawing tool: https://www.processon.com/view/572bf161e4b0739b929916ea

Demand analysis
    1. There are many files.
    2. Each file contains a large number of words.
    3. We need to count how often each word occurs across all the files (see the small example below).
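A tiny hypothetical example (the same two lines used earlier): if file1.txt contains "hello world" and file2.txt contains "hello hadoop", the expected result is

hadoop  1
hello   2
world   1

With the default TextOutputFormat, each word and its count are written as one tab-separated line in the job's output directory.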
Logical implementation

Mapper
public static class CoreMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private static Text label = new Text();

    @Override
    protected void map(Object key, Text value,
            Mapper<Object, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            label.set(tokenizer.nextToken());
            context.write(label, one);
        }
    }
}
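These classes rely on the standard Hadoop types. A minimal set of imports (assuming the Mapper, the Reducer, and the client below all sit in one source file, which the post does not show) would be roughly:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;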
Reducer
public static class CoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        if (null == values) {
            return;
        }
        int sum = 0;
        for (IntWritable intWritable : values) {
            sum += intWritable.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}
Client
public class ComputerClient extends Configuration implements Tool {

    public static void main(String[] args) {
        ComputerClient client = new ComputerClient();
        args = new String[] { AppConstant.INPUT, AppConstant.OUTPUT };
        try {
            ToolRunner.run(client, args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Configuration getConf() {
        return this;
    }

    @Override
    public void setConf(Configuration arg0) {
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "ComputerClient-job");
        job.setJarByClass(CoreComputer.class);
        job.setMapperClass(CoreComputer.CoreMapper.class);
        job.setCombinerClass(CoreComputer.CoreReducer.class);
        job.setReducerClass(CoreComputer.CoreReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
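Two things in this driver are not shown in the post: the AppConstant class holding the input/output paths, and the outer class CoreComputer in which CoreMapper and CoreReducer are evidently nested (implied by job.setMapperClass(CoreComputer.CoreMapper.class)). A minimal sketch of the constants, with purely hypothetical path values, might be:

// Hypothetical helper class; the real path values are whatever the author used.
public class AppConstant {
    public static final String INPUT = "/wordcount/input";
    public static final String OUTPUT = "/wordcount/output";
}

Note that main() overwrites whatever arguments were passed on the command line with these constants, so to honor paths supplied to the hadoop jar command later in this article, that line would have to be removed. Also, job.setCombinerClass(CoreComputer.CoreReducer.class) reuses the Reducer as the Combiner; that is safe for WordCount because summing partial counts gives the same answer however the additions are grouped, and it is exactly the kind of reuse that must be avoided for jobs whose reduce logic is not associative and commutative.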
Run locally

There is not much to say about running locally: either configure the run-time arguments in Eclipse or specify the input/output paths directly in the code, and then Run As a Hadoop program.
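For a quick local test, one option (purely illustrative paths) is to swap the AppConstant line in main() for hard-coded arguments, keeping in mind that the output directory must not already exist:

args = new String[] { "data/input", "data/output" };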

Distributed operation

Running MapReduce in distributed mode involves several main steps:
1. Packaging
2. Uploading the source data
3. Distributed operation

Packaged

For packaging, you can use the command line or Eclipse's built-in Export. Exporting a runnable Java jar from Eclipse works the same way as for any Java project, so there is not much to say here. Suppose the jar we produce is named Job.jar.
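If you go the command-line route, a plain jar command over the compiled class files is enough; assuming the classes were compiled into a bin directory (the directory name here is only an example):

jar -cvf Job.jar -C bin .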

Uploading source data

Uploading the source data means copying the local data up to the HDFS file system.
Before uploading, we need to create the target directory on HDFS, and then use the following command to upload the data.

hadoop fs -put <local_path> <hdfs_input_path>

If you have not created the directory beforehand, the upload will not go as expected because the target directory cannot be found.
Once the data is uploaded, it is distributed across the DataNodes of the cluster rather than sitting only on your local machine.
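A concrete pair of commands might look like this (the HDFS path is only an example):

hadoop fs -mkdir /wordcount/input
hadoop fs -put ./words.txt /wordcount/input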

Distributed operation

Once everything above is ready, you can run our Hadoop program with the following Hadoop command.

hadoop jar Job.jar <hdfs_input_path> <hdfs_output_path>
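Continuing the example paths used above, that would be:

hadoop jar Job.jar /wordcount/input /wordcount/output

(If the jar's manifest does not name a main class, put the driver's fully qualified class name right after Job.jar.) When the job finishes, the counts can be read back with hadoop fs -cat /wordcount/output/part-r-00000, since a new-API Reducer writes its results to part-r-* files.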
Results window

Open your browser and visit the JobTracker's web interface (in Hadoop 1.x it typically listens on port 50030 of the JobTracker host).
While the program is running, the page shows the job's progress as it changes.

When the program has finished executing, the same page shows the completed job.

Ref
    • "Hadoop Combat"

