Overview
Although people say we are in the era of big memory, memory still cannot keep up with the pace at which data grows. So we try to reduce the amount of data each machine has to handle. This reduction is not really shrinking the data, but rather dispersing it: storing it separately and computing on it separately. This is the core of distributed MapReduce.
Copyright notice
Copyright belongs to the author.
Commercial reprint please contact the author for authorization, non-commercial reprint please specify the source.
Author: Coding-naga
Published: May 10, 2016
This article link: http://blog.csdn.net/lemon_tree12138/article/details/51367732
Source: CSDN
Directory
- Overview
- Copyright notice
- Directory
- About MapReduce
- MapReduce principle
- WordCount Program
- Requirements analysis
- Logical implementation
- Run locally
- Distributed operation
- Packaging
- Uploading source data
- Distributed operation
- Results window
- Ref
About MapReduce
To understand MapReduce, first understand what actually carries out MapReduce tasks. In Hadoop, the machines that run a MapReduce job play one of two roles: JobTracker or TaskTracker. The JobTracker manages and schedules the work, while TaskTrackers carry it out. There is only one JobTracker in a Hadoop cluster (although in Hadoop 2.x this is no longer necessarily the case).
MapReduce principle
The essence of the MapReduce model lies in its algorithmic idea: divide and conquer. An example of this splitting process can be found in one of my previous blogs, Big Data Algorithms: Sorting 500 Million Pieces of Data; the sorting approach collected there is likewise based on the idea of division.
To get back to the point: in the MapReduce model, the concept of divide and conquer shows up vividly. When processing a large amount of data (say, 1 TB; and don't object that you don't have that much data, big companies do), relying on a single machine's hardware is out of the question. First, memory is not that large, and if we process on disk instead, the excessive IO is undoubtedly a dead end. Google's engineers solved this by spreading the data across many machines, performing preliminary computation on each of them, then running a series of summary steps, and finally gathering the results on the master/NameNode machine.
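The split-compute-merge idea can be sketched without Hadoop at all. Below is a minimal plain-Java illustration (the data, partition count, and class name are made up for this example, not part of the article's code): the data set is split into slices, each slice is summed by its own worker, and the partial results are merged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndConquerSum {
    public static long sum(long[] data, int partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<Long>> partials = new ArrayList<>();
        int chunk = (data.length + partitions - 1) / partitions;
        for (int p = 0; p < partitions; p++) {
            final int start = p * chunk;
            final int end = Math.min(start + chunk, data.length);
            // "Map" step: each worker computes over its own slice of the data.
            partials.add(pool.submit(() -> {
                long s = 0;
                for (int i = start; i < end; i++) s += data[i];
                return s;
            }));
        }
        // "Reduce" step: merge the partial results into the final answer.
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        long[] data = new long[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data, 4)); // 1 + 2 + ... + 1000 = 500500
    }
}
```

In real MapReduce the "workers" are TaskTracker machines rather than threads, but the shape of the computation is the same.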
Of course, we cannot just scatter our data over N arbitrary machines. We first have to build reliable connections between those machines, which forms a computing cluster. Our data can then be distributed to the individual computers in the cluster. In Hadoop, this operation is done with the -put command, which also appears in the operations below.
Once the data has been uploaded to Hadoop's HDFS file system, the Mapper in the MapReduce model can read it into memory as key/value pairs. After Mapper processing, the data becomes a set of intermediate (word, 1) key/value pairs. At this point the Map phase is over; next comes the Reduce phase.
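The Map step just described can be sketched in plain Java. This is only an illustration of what one mapper emits, not the Hadoop API; the pair representation and class name are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapStepSketch {
    // Emulates the map step: the input line is tokenized into words,
    // and a (word, 1) pair is emitted for every token.
    public static List<String[]> map(String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new String[] { tokenizer.nextToken(), "1" });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : map("hello world hello")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
        // emits: (hello, 1), (world, 1), (hello, 1)
    }
}
```

Note that the mapper does no counting itself; every occurrence of a word simply produces its own (word, 1) record.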
You may notice there is also a combine step. This step is optional, and in some cases it must be omitted. We may discuss the combiner separately later; since it is not the topic of this article, we will not go into detail here.
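For intuition only, the combiner's effect can be sketched in plain Java (an illustration, not the Hadoop API; the pair representation and class name are made up): it pre-aggregates one mapper's output locally, so fewer records have to be shuffled across the network to the reducers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CombinerSketch {
    // Emulates a combiner: locally sums the (word, 1) pairs produced by a
    // single mapper before they are sent over the network.
    public static Map<String, Integer> combine(String[][] mapperOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String[] pair : mapperOutput) {
            combined.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[][] mapperOutput = {
            { "hello", "1" }, { "world", "1" }, { "hello", "1" }
        };
        System.out.println(combine(mapperOutput)); // {hello=2, world=1}
    }
}
```

This local pre-aggregation is only safe when the reduce operation is associative and commutative, which is why word count can reuse its Reducer as the combiner (as the client code below does), while jobs such as averaging cannot.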
With that, the whole MapReduce flow is complete. Now let's look at the concrete implementation and test results.
WordCount Program
The MapReduce computation model for WordCount can be viewed in my online diagramming tool: https://www.processon.com/view/572bf161e4b0739b929916ea
Requirements analysis
- There are many files
- Each file contains many words
- Count the frequency of each word
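On a single machine, this requirement is an ordinary word-frequency count; a minimal baseline (the class name is made up) looks like the sketch below. MapReduce distributes exactly this computation across the cluster.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountBaseline {
    // Single-machine baseline: count how often each word occurs
    // across all input lines.
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    freq.merge(word, 1, Integer::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] lines = { "hello world", "hello hadoop" };
        System.out.println(count(lines)); // {hadoop=1, hello=2, world=1}
    }
}
```

The MapReduce version below splits this loop in two: the Mapper handles the tokenizing, and the Reducer handles the summing.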
Logical implementation
Mapper
```java
public static class CoreMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private static Text label = new Text();

    @Override
    protected void map(Object key, Text value,
            Mapper<Object, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            label.set(tokenizer.nextToken());
            context.write(label, one);
        }
    }
}
```
Reducer
```java
public static class CoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        if (null == values) {
            return;
        }
        int sum = 0;
        for (IntWritable intWritable : values) {
            sum += intWritable.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}
```
Client
```java
public class ComputerClient extends Configuration implements Tool {

    public static void main(String[] args) {
        ComputerClient client = new ComputerClient();
        args = new String[] { AppConstant.INPUT, AppConstant.OUTPUT };
        try {
            ToolRunner.run(client, args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Configuration getConf() {
        return this;
    }

    @Override
    public void setConf(Configuration arg0) {
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "ComputerClient-job");
        job.setJarByClass(CoreComputer.class);
        job.setMapperClass(CoreComputer.CoreMapper.class);
        job.setCombinerClass(CoreComputer.CoreReducer.class);
        job.setReducerClass(CoreComputer.CoreReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
```
Run locally
There is not much to say about running locally: either configure the run-time arguments in Eclipse or specify the input/output paths directly in the code, then Run As a Hadoop program.
Distributed operation
In the process of distributed running MapReduce, there are several main steps:
1. Packaging
2. Uploading the source data
3. Distributed operation
Packaging
For packaging, you can use the command line, or you can use Eclipse's built-in Export. Packaging and exporting a MapReduce job jar is the same as exporting any Java jar in Eclipse, so there is not much to say here. Suppose the resulting jar is named: Job.jar
Uploading source data
Uploading the source data refers to uploading the local data to the HDFS file system.
Before uploading the source data, we need to create the target path on HDFS, and then upload the data with the following commands.
hadoop fs -mkdir <hdfs_input_path>
hadoop fs -put <local_path> <hdfs_input_path>
If you have not created the directory beforehand, the upload will fail unexpectedly because the directory cannot be found.
Once the data has been uploaded, it is distributed across the DataNodes of your cluster, not just stored on your local machine.
Distributed operation
When everything above is ready, you can run our Hadoop program with the following command.
hadoop jar Job.jar <hdfs_input_path> <hdfs_output_path>
Results window
Open the job page in your browser. While the program executes, you can watch its progress change; when execution completes, the web page shows the finished job.
Ref
Computing models from WordCount to MapReduce