Overview
Although people say we are in the era of big memory, memory still cannot keep up with the pace at which data grows. So we try to reduce the amount of data each machine has to handle. This reduction is not really shrinking the data, but rather dispersing it: storing it separately and computing on it separately. This is the core of distributed MapReduce.
Copyright notice
Copyright belongs to the author.
Commercial reprint please contact the author for authorization, non-commercial reprint please specify the source.
Author: Coding-naga
Published: May 10, 2016
This article link: http://blog.csdn.net/lemon_tree12138/article/details/51367732
Source: CSDN
Directory
- Overview
- Copyright notice
- Directory
- About MapReduce
- MapReduce principle
- WordCount Program
- Requirements analysis
- Logical implementation
- Run locally
- Distributed operation
- Packaging
- Uploading source data
- Distributed operation
- Results window
- Ref
About MapReduce
To understand MapReduce, first understand what actually carries out MapReduce tasks. In Hadoop, the machines that run a MapReduce job play one of two roles: JobTracker or TaskTracker. The JobTracker manages and schedules the work, while TaskTrackers carry it out. There is only one JobTracker in a Hadoop cluster (although in Hadoop 2.x this is no longer necessarily the case).
MapReduce principle
The essence of the MapReduce model lies in its algorithmic idea: divide and conquer. An example of this splitting process can be found in one of my previous blogs, Big Data Algorithms: Sorting 500 Million Pieces of Data; the sorting approach collected there is likewise based on the idea of division.
To get back to the point: in the MapReduce model, the concept of divide and conquer shows up vividly. When processing a large amount of data (say, 1 TB; and don't object that you don't have that much data, big companies do), relying on a single machine's hardware is out of the question. First, memory is not that large, and if we process on disk instead, the excessive IO is undoubtedly a dead end. Google's engineers solved this by spreading the data across many machines, performing preliminary computation on each of them, then running a series of summary steps, and finally gathering the results on the master/NameNode machine.
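The split-compute-merge idea can be sketched without Hadoop at all. Below is a minimal plain-Java illustration (the data, partition count, and class name are made up for this example, not part of the article's code): the data set is split into slices, each slice is summed by its own worker, and the partial results are merged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndConquerSum {
    public static long sum(long[] data, int partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<Long>> partials = new ArrayList<>();
        int chunk = (data.length + partitions - 1) / partitions;
        for (int p = 0; p < partitions; p++) {
            final int start = p * chunk;
            final int end = Math.min(start + chunk, data.length);
            // "Map" step: each worker computes over its own slice of the data.
            partials.add(pool.submit(() -> {
                long s = 0;
                for (int i = start; i < end; i++) s += data[i];
                return s;
            }));
        }
        // "Reduce" step: merge the partial results into the final answer.
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        long[] data = new long[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data, 4)); // 1 + 2 + ... + 1000 = 500500
    }
}
```

In real MapReduce the "workers" are TaskTracker machines rather than threads, but the shape of the computation is the same.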
Of course, we cannot just scatter our data over N arbitrary machines. We first have to build reliable connections between those machines, which forms a computing cluster. Our data can then be distributed to the individual computers in the cluster. In Hadoop, this operation is done with the -put command, which also appears in the operations below.
Once the data has been uploaded to Hadoop's HDFS file system, the Mapper in the MapReduce model can read it into memory as key/value pairs. After Mapper processing, the data becomes a set of intermediate (word, 1) key/value pairs. At this point the Map phase is over; next comes the Reduce phase.
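The Map step just described can be sketched in plain Java. This is only an illustration of what one mapper emits, not the Hadoop API; the pair representation and class name are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapStepSketch {
    // Emulates the map step: the input line is tokenized into words,
    // and a (word, 1) pair is emitted for every token.
    public static List<String[]> map(String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new String[] { tokenizer.nextToken(), "1" });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : map("hello world hello")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
        // emits: (hello, 1), (world, 1), (hello, 1)
    }
}
```

Note that the mapper does no counting itself; every occurrence of a word simply produces its own (word, 1) record.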
You may notice there is also a combine step. This step is optional, and in some cases it must be omitted. We may discuss the combiner separately later; since it is not the topic of this article, we will not go into detail here.
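For intuition only, the combiner's effect can be sketched in plain Java (an illustration, not the Hadoop API; the pair representation and class name are made up): it pre-aggregates one mapper's output locally, so fewer records have to be shuffled across the network to the reducers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CombinerSketch {
    // Emulates a combiner: locally sums the (word, 1) pairs produced by a
    // single mapper before they are sent over the network.
    public static Map<String, Integer> combine(String[][] mapperOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String[] pair : mapperOutput) {
            combined.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[][] mapperOutput = {
            { "hello", "1" }, { "world", "1" }, { "hello", "1" }
        };
        System.out.println(combine(mapperOutput)); // {hello=2, world=1}
    }
}
```

This local pre-aggregation is only safe when the reduce operation is associative and commutative, which is why word count can reuse its Reducer as the combiner (as the client code below does), while jobs such as averaging cannot.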
With that, the whole MapReduce flow is complete. Now let's look at the concrete implementation and test results.
WordCount Program
The MapReduce computation model for WordCount can be viewed in my online diagramming tool: https://www.processon.com/view/572bf161e4b0739b929916ea
Requirements analysis
- There are many files
- Each file contains many words
- Count the frequency of each word
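On a single machine, this requirement is an ordinary word-frequency count; a minimal baseline (the class name is made up) looks like the sketch below. MapReduce distributes exactly this computation across the cluster.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountBaseline {
    // Single-machine baseline: count how often each word occurs
    // across all input lines.
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    freq.merge(word, 1, Integer::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] lines = { "hello world", "hello hadoop" };
        System.out.println(count(lines)); // {hadoop=1, hello=2, world=1}
    }
}
```

The MapReduce version below splits this loop in two: the Mapper handles the tokenizing, and the Reducer handles the summing.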
Logical implementation
Mapper
```java
public static class CoreMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private static Text label = new Text();

    @Override
    protected void map(Object key, Text value,
            Mapper<Object, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            label.set(tokenizer.nextToken());
            context.write(label, one);
        }
    }
}
```
Reducer
```java
public static class CoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        if (null == values) {
            return;
        }
        int sum = 0;
        for (IntWritable intWritable : values) {
            sum += intWritable.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}
```
Client
```java
public class ComputerClient extends Configuration implements Tool {

    public static void main(String[] args) {
        ComputerClient client = new ComputerClient();
        args = new String[] { AppConstant.INPUT, AppConstant.OUTPUT };
        try {
            ToolRunner.run(client, args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Configuration getConf() {
        return this;
    }

    @Override
    public void setConf(Configuration arg0) {
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "ComputerClient-job");
        job.setJarByClass(CoreComputer.class);
        job.setMapperClass(CoreComputer.CoreMapper.class);
        job.setCombinerClass(CoreComputer.CoreReducer.class);
        job.setReducerClass(CoreComputer.CoreReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
```
Run locally
There is not much to say about running locally: either configure the run-time arguments in Eclipse or specify the input/output paths directly in the code, then Run As a Hadoop program.
Distributed operation
In the process of distributed running MapReduce, there are several main steps:
1. Packaging
2. Uploading the source data
3. Distributed operation
Packaging
For packaging, you can use the command line, or you can use Eclipse's built-in Export. Packaging and exporting a MapReduce job jar is the same as exporting any Java jar in Eclipse, so there is not much to say here. Suppose the resulting jar is named: Job.jar
Uploading source data
Uploading the source data refers to uploading the local data to the HDFS file system.
Before uploading the source data, we need to create the target path on HDFS, and then upload the data with the following commands.
hadoop fs -mkdir <hdfs_input_path>
hadoop fs -put <local_path> <hdfs_input_path>
If you have not created the directory beforehand, the upload will fail unexpectedly because the directory cannot be found.
Once the data has been uploaded, it is distributed across the DataNodes of your cluster, not just stored on your local machine.
Distributed operation
When everything above is ready, you can run our Hadoop program with the following command.
hadoop jar Job.jar <hdfs_input_path> <hdfs_output_path>
Results window
Open the job page in your browser. While the program executes, you can watch its progress change; when execution completes, the web page shows the finished job.
Ref
Computing models from WordCount to MapReduce