Using Python with Hadoop Streaming for Distributed Programming (I): Principles, Sample Programs, and Local Debugging


Introduction to MapReduce and HDFS
What is Hadoop?

Google proposed the MapReduce programming model and the Google File System (GFS) for its own business needs, and published the corresponding papers (available on Google Research's web site: GFS, MapReduce). While developing the Nutch search engine, Doug Cutting and Mike Cafarella implemented these two papers as MapReduce and the similarly named HDFS, which together became Hadoop.

MapReduce's data flow is shown in the following figure: the raw input is processed by the mappers, then partitioned and sorted, and finally arrives at the reducers, which output the final result.

Picture from Hadoop: The Definitive Guide

The principle of Hadoop streaming
Hadoop itself is developed in Java, and programs for it normally have to be written in Java as well, but with Hadoop Streaming we can write programs in any language and still have Hadoop run them.

The relevant source code for Hadoop Streaming can be viewed in Hadoop's GitHub repo. In simple terms, you pass a mapper and reducer written in another language to a pre-written Java program (Hadoop's own *-streaming.jar). This Java program is responsible for creating the MapReduce job and starting a separate process to run the mapper; it passes the input to the mapper through stdin, hands the data the mapper writes to stdout back to Hadoop for partitioning and sorting, then starts another process for the reducer, which again communicates through stdin/stdout to produce the final result. Therefore, all we need to do in the other language is receive data on stdin and write the processed data to stdout; through this Java wrapper, Hadoop Streaming takes care of the tedious steps in between and runs the program in a distributed fashion.

Picture from Hadoop: The Definitive Guide

In principle, any language that can handle stdio can be used to write the mapper and reducer. You can also specify Linux programs (such as awk, grep, or cat) as the mapper or reducer, or Java classes written in a certain format. The mapper and reducer therefore do not even need to be the same kind of program.
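For concreteness, a minimal word-count mapper can illustrate this stdin/stdout contract. This is a sketch of my own, not one of the article's examples; the function names and word-count logic are illustrative assumptions:

```python
#!/usr/bin/env python
import sys

def map_words(stream):
    # Yield a (word, 1) pair for every whitespace-separated word.
    for line in stream:
        for word in line.strip().split():
            yield word, 1

def main():
    # Hadoop Streaming separates key and value with a tab by default.
    for word, count in map_words(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (word, count))

if __name__ == "__main__":
    main()
```

Run on its own, a pipeline like `echo "a b a" | ./mapper.py` would print each word paired with a count of 1, one pair per line, which is exactly the form Hadoop then partitions and sorts.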

Advantages and disadvantages of Hadoop streaming

Advantages

You can write MapReduce programs in your favorite language (in other words, you don't have to write Java XD)
You don't need to import a bunch of libraries the way a Java MapReduce program does, or do lots of configuration in your code; much of it is abstracted away behind stdio, and the amount of code shrinks noticeably.
Because there are no library dependencies, debugging is convenient: you can detach from Hadoop and first simulate the job locally with a pipe.

Disadvantages

You can only control the MapReduce framework through command-line arguments; unlike a Java program, you cannot use the API in your code, so your control is weaker and some things are simply out of reach.
Because an extra layer of processing sits in the middle, efficiency is comparatively lower.
So Hadoop Streaming is better suited to simple tasks, such as one- or two-hundred-line Python scripts. If the project is more complex, or needs more detailed optimization, Streaming will easily show its limitations.

Writing a simple Hadoop Streaming program in Python

Two examples are provided here:

Michael Noll's word count program
The routines from Hadoop: The Definitive Guide
There are a few things to note when writing Hadoop Streaming programs in Python:

Use iterators wherever you can, to avoid reading all of the stdin input into memory, which would severely degrade performance.

Streaming won't split key and value for you; it only passes in a plain string, so you need to call split() manually in your code.

Each line of data arriving from stdin ends with \n; to be safe, you should generally use rstrip() to remove it.

When you want a list of values for each key instead of processing one key-value pair at a time, you can use groupby together with itemgetter to group consecutive pairs that share the same key. This gives an effect similar to Java, where reduce directly receives a Text key and an Iterable as the value. Note that itemgetter is more efficient than a lambda expression, so if your requirements are not very complex, prefer itemgetter.
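As a small standalone illustration of the groupby-plus-itemgetter idiom described above (the sample data here is made up for demonstration):

```python
from itertools import groupby
from operator import itemgetter

# Pairs as a reducer would see them: already sorted by key, which is
# exactly the contiguity that groupby relies on to form groups.
pairs = [("apple", 1), ("apple", 2), ("banana", 1)]

for key, group in groupby(pairs, itemgetter(0)):
    # group iterates over the (key, value) pairs sharing this key,
    # much like the Iterable a Java reducer receives as its value.
    total = sum(value for _, value in group)
    print(key, total)  # prints "apple 3" then "banana 1"
```

Because groupby only merges adjacent equal keys, this idiom depends on the sort step Hadoop performs between map and reduce.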

The basic template I use for Hadoop Streaming programs is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Some description here...
"""

import sys
from operator import itemgetter
from itertools import groupby

def read_input(file):
    """Read input and split on the tab separator."""
    for line in file:
        yield line.rstrip().split('\t')

def main():
    data = read_input(sys.stdin)
    for key, kviter in groupby(data, itemgetter(0)):
        # Some code here.
        pass

if __name__ == "__main__":
    main()

If the input or output format differs from the default, the adjustments mainly go in read_input().
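As a sketch of how the template's placeholder might be filled in, here is a word-count reducer built on it. This completion is my own illustration, and it assumes the mapper emits word<TAB>count lines:

```python
#!/usr/bin/env python
import sys
from operator import itemgetter
from itertools import groupby

def read_input(file):
    # Split each tab-delimited line into [key, value].
    for line in file:
        yield line.rstrip().split('\t')

def main():
    data = read_input(sys.stdin)
    # Pairs arrive sorted by key, so groupby yields one group per word.
    for key, kviter in groupby(data, itemgetter(0)):
        try:
            total = sum(int(count) for _, count in kviter)
            print("%s\t%d" % (key, total))
        except ValueError:
            pass  # skip lines whose count was not a number

if __name__ == "__main__":
    main()
```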

Local debugging

The basic pattern for locally debugging a Python Hadoop Streaming program is:

$ cat <input path> | python <path to mapper script> | sort -t $'\t' -k1,1 | python <path to reducer script> > <output path>

Or, if you don't want the extra cat, you can use < redirection:

$ python <path to mapper script> < <input path> | sort -t $'\t' -k1,1 | python <path to reducer script> > <output path>

Here are a few points to note:

Hadoop splits key and value on a tab by default, takes the first field as the key, and sorts by that key, so here we use

sort -t $'\t' -k1,1
to simulate it. If you have other requirements, you can adjust the command-line arguments passed to Hadoop Streaming; for local debugging, you mainly adjust the sort parameters accordingly. To be proficient at local debugging, it is therefore worth mastering the sort command.

If you have added a shebang to your Python script and given it execute permission, you can also use

./mapper.py

To replace

python mapper.py
