A brief introduction to MapReduce and HDFS
What is Hadoop?
Google proposed the programming model MapReduce and the distributed file system Google File System (GFS) for its own business needs, and published the relevant papers (available on Google Research's web site: GFS, MapReduce). Doug Cutting and Mike Cafarella read these two papers while developing the search engine Nutch, implemented the MapReduce framework and the HDFS file system named after them, and together these became Hadoop.
MapReduce's data flow is shown in the following figure: the raw input is processed by the mapper, then partitioned and sorted, and finally arrives at the reducer, which outputs the final result.
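That flow can be sketched in plain Python with a toy word count (the word-count logic itself is an illustrative assumption, not from the text): map emits key-value pairs, sorting by key plays the role of partition and sort, and reduce folds each key's group.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # map: emit a (word, 1) pair for every word in the line
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # reduce: fold all counts for one key into a single result
    return (key, sum(values))

def run_job(lines):
    # the sort by key stands in for Hadoop's partition-and-sort step:
    # it makes all pairs with the same key adjacent
    pairs = sorted((kv for line in lines for kv in mapper(line)),
                   key=itemgetter(0))
    return [reducer(key, (v for _, v in group))
            for key, group in groupby(pairs, itemgetter(0))]

print(run_job(["a b a", "b c"]))  # [('a', 2), ('b', 2), ('c', 1)]
```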
Picture from Hadoop: The Definitive Guide
How Hadoop Streaming works
Hadoop itself is developed in Java, and programs for it normally need to be written in Java as well, but with Hadoop Streaming we can write programs in any language and let Hadoop run them.
The relevant source code for Hadoop Streaming can be viewed in Hadoop's GitHub repo. Simply put, you hand the mapper and reducer written in another language to a pre-written Java program (Hadoop's own *-streaming.jar). That Java program is responsible for creating the MapReduce job; it starts a separate process to run the mapper, passes the input to it through stdin, and hands the data the mapper writes to stdout back to Hadoop for partitioning and sorting. It then starts another process to run the reducer, again exchanging data through stdin/stdout, and collects the final result. Therefore, in the program written in the other language, we only need to receive data from stdin and write the processed data to stdout; Hadoop Streaming's Java wrapper handles the tedious steps in between and runs the program in a distributed fashion.
Picture from Hadoop: The Definitive Guide
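For reference, submitting such a job typically looks like the sketch below. The jar location varies between Hadoop versions and installations, and the input/output paths and script names here are placeholders, not values from the text:

```shell
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /path/to/input \
    -output /path/to/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py    # -file ships the scripts to the cluster nodes
```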
In principle, any language that can handle stdio can be used to write the mapper and reducer. You can also specify Linux programs (such as awk, grep, cat) as the mapper or reducer, or write Java classes in a certain format. The mapper and reducer therefore do not even need to be the same kind of program.
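As a toy illustration of that point (the sample data and pattern are invented here), a plain shell pipeline with grep as the "mapper" and wc -l as the "reducer" counts the lines containing foo:

```shell
printf 'foo bar\nbaz\nfoo\n' |
  grep foo |   # "mapper": emit only the matching lines
  sort |       # stand-in for Hadoop's partition-and-sort step
  wc -l        # "reducer": count what arrives
```

This prints 2, since two of the three input lines contain foo.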
Advantages and disadvantages of Hadoop Streaming

Advantages:

- You can write MapReduce programs in your favorite language (in other words, you don't have to write Java XD).
- You don't need to import a whole pile of libraries like in a Java MapReduce program, or do a bunch of configuration in code; much of that is abstracted away to stdio, so the amount of code is significantly reduced.
- With no library dependencies, debugging is convenient, and you can first debug locally with pipes, separate from Hadoop.

Disadvantages:

- You can only control the MapReduce framework through command-line parameters; unlike a Java program, you cannot use the API in code, so your control is weaker and some things are simply out of reach.
- Because there is an extra layer of processing in the middle, efficiency is comparatively lower.
So Hadoop Streaming is better suited to simple tasks, such as scripts of one or two hundred lines written in Python. If the project is complex, or needs more detailed optimization, using Streaming will easily run into constraints.
Writing a simple Hadoop Streaming program in Python
Here are two examples:

- Michael Noll's word count program
- The routines in Hadoop: The Definitive Guide
There are a few things to note when using Python to write Hadoop Streaming programs:
- Use iterators wherever possible, to avoid buffering large amounts of stdin input in memory; otherwise performance will suffer severely.
- Streaming will not split the key and value for you before passing them in; what is passed in is just a string, and you need to call split() manually in your code.
- Each line of data from stdin appears to end with '\n', so to be safe you generally need rstrip() to remove it.
- When you want a list of key-value pairs rather than processing them one by one, you can use groupby together with itemgetter to group pairs with the same key, giving an effect like a Java reduce: a text-type key with an iterable as its values. Note that itemgetter is more efficient than a lambda expression, so if your requirements are not very complex, prefer itemgetter.
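The groupby-with-itemgetter pattern from the last point can be sketched as follows (the sample pairs are invented; on Hadoop they would come from stdin, already sorted by key):

```python
from itertools import groupby
from operator import itemgetter

# reducer input is sorted by key, so equal keys sit on adjacent lines
pairs = [('a', '1'), ('a', '3'), ('b', '2')]

for key, group in groupby(pairs, itemgetter(0)):
    # group yields the (key, value) pairs sharing this key, similar to
    # the Iterable a Java reducer receives; values are still strings
    total = sum(int(value) for _, value in group)
    print('%s\t%d' % (key, total))
```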
The basic template I use for Hadoop Streaming programs is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Some description here..."""

import sys
from operator import itemgetter
from itertools import groupby

def read_input(file):
    """Read input and split."""
    for line in file:
        yield line.rstrip().split('\t')

def main():
    data = read_input(sys.stdin)
    for key, kviter in groupby(data, itemgetter(0)):
        # some code here..
        pass

if __name__ == "__main__":
    main()
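As a concrete, hypothetical instance of the template, here is a word-count reducer; the demo at the bottom feeds it fake, pre-sorted mapper output through a StringIO instead of sys.stdin so it can run standalone:

```python
import io
import sys
from itertools import groupby
from operator import itemgetter

def read_input(file):
    # split each tab-separated line into [key, value]
    for line in file:
        yield line.rstrip().split('\t')

def reduce_counts(infile, out=sys.stdout):
    # input arrives sorted by key, so groupby sees each word's
    # counts as one contiguous group
    for word, group in groupby(read_input(infile), itemgetter(0)):
        total = sum(int(count) for _, count in group)
        out.write('%s\t%d\n' % (word, total))

# local demo with fake mapper output;
# on Hadoop this would be reduce_counts(sys.stdin)
reduce_counts(io.StringIO('hadoop\t1\nhadoop\t1\nstreaming\t1\n'))
```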
If the input or output format differs from the default, the adjustments are mainly made in read_input().
Local Debugging
The basic pattern for locally debugging a Python program written for Hadoop Streaming is:
$ cat <input path> | python <path to mapper script> | sort -t $'\t' -k1,1 | python <path to reducer script> > <output path>
Or, if you don't want the extra cat, you can use < redirection:
$ python <path to mapper script> < <input path> | sort -t $'\t' -k1,1 | python <path to reducer script> > <output path>
Here are a few points to note:
By default, Hadoop splits the key and value on a tab character, takes the first field as the key, and sorts by key, so here we use

sort -t $'\t' -k1,1

to simulate that. If you have other requirements, you can adjust the command-line arguments when handing the job to Hadoop Streaming, and for local debugging you adjust accordingly, mainly through the parameters of sort. Therefore, to become proficient at local debugging, it is recommended to master the use of the sort command.
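As a quick sanity check of that command (the sample lines are invented), sorting tab-separated pairs on the first field only:

```shell
printf 'b\t2\na\t1\nc\t3\n' | sort -t $'\t' -k1,1
# a	1
# b	2
# c	3
```

-t $'\t' sets tab as the field separator and -k1,1 restricts the sort key to the first field, matching what Hadoop does between the mapper and the reducer.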
If you add a shebang to your Python script and grant it execute permission, you can also use
./mapper.py
to replace
python mapper.py
Original link: http://www.cnblogs.com/joyeecheung/p/3757915.html