Writing distributed programs with Python + Hadoop


What is Hadoop?

Google proposed a programming model, MapReduce, and a distributed file system, the Google File System, for its own business needs, and published the relevant papers (available on Google Research's web site: GFS, MapReduce). Doug Cutting and Mike Cafarella read these two papers while developing the search engine Nutch, implemented MapReduce and a distributed file system of their own (what became HDFS), and together these grew into Hadoop.

MapReduce's data flow is shown in the figure below: the original input is processed by the mapper, then partitioned and sorted, and arrives at the reducer, which outputs the final result.


Picture from Hadoop: The Definitive Guide

The principle of Hadoop Streaming

Hadoop itself is developed in Java, and programs for it are normally written in Java, but with Hadoop Streaming we can write programs in any language and let Hadoop run them.

The source code for Hadoop Streaming can be viewed in Hadoop's GitHub repo. In simple terms, a pre-written Java program (Hadoop's *-streaming.jar) is responsible for creating the MR job: we pass it the mapper and reducer written in other languages as parameters. It opens a separate process to run the mapper and feeds the input to it through stdin; the mapper's output on stdout goes back to Hadoop for partition and sort; then another process is opened to run the reducer, again exchanging data through stdin/stdout to produce the final result. Therefore, a program written in another language only needs to receive data on stdin and write its processed output to stdout; through this Java wrapper, Hadoop Streaming takes care of the tedious steps in between and runs the program distributedly.


Picture from Hadoop: The Definitive Guide

In principle, any language that can handle stdio can be used to write the mapper and reducer. You can also specify Linux programs (such as awk, grep, cat) as the mapper or reducer, or Java classes written in a certain format. The mapper and reducer therefore don't even need to be the same kind of program.

Advantages and disadvantages of Hadoop Streaming

Advantages:

You can write MapReduce programs in your favorite language (in other words, you don't have to write Java XD)

You don't need to import a pile of libraries or do lots of configuration in your code the way a Java MR program does; much of that is abstracted away behind stdio, so the amount of code drops significantly

Because there are no library dependencies, debugging is convenient: you can detach from Hadoop entirely and simulate the job locally with pipes

Disadvantages:

You can control the MapReduce framework only through command-line arguments; unlike Java programs, which can call the API from code, control is weaker and some things are simply out of reach

Because an extra layer of processing sits in the middle, efficiency is somewhat lower

So Hadoop Streaming is better suited to simple tasks, such as Python scripts of one or two hundred lines. If the project is more complex, or needs finer-grained optimization, Streaming becomes awkward in places.

Writing a simple Hadoop Streaming program in Python

Here are two examples:

Michael Noll's word count program

The example from Hadoop: The Definitive Guide
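For reference, the word-count logic those two examples implement boils down to something like the following (a sketch in the same spirit, not their exact code):

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Sum the counts per word; assumes pairs arrive sorted by key,
    exactly as they do after Hadoop's partition-and-sort phase."""
    for word, group in groupby(pairs, itemgetter(0)):
        yield word, sum(n for _, n in group)

# In a real Streaming job each half runs as its own script, reading
# sys.stdin and printing tab-separated lines to sys.stdout.
```

Locally, the shuffle between the two halves can be imitated with a plain sorted() over the mapper's output.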

There are a few things to note about using Python to write Hadoop streaming programs:

If you can use an iterator, use an iterator: avoid storing the stdin input in memory, as that can severely degrade performance

Streaming won't split the key and value for you; it passes in only a raw string, so you need to call split() manually in your code

Each line of data read from stdin ends with a \n; to be safe, you generally need rstrip() to remove it

When you want a list of values for each key instead of processing key-value pairs one at a time, you can use groupby with itemgetter to group the pairs that share the same key. This gives an effect similar to Java, where reduce directly receives a Text-type key and an iterable of values. Note that itemgetter is more efficient than a lambda expression, so if the requirement is not very complex, prefer itemgetter.
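A quick illustration of that groupby + itemgetter pattern (the data here is made up, standing in for the already-sorted stdin stream):

```python
from itertools import groupby
from operator import itemgetter

# Pairs as they arrive after Hadoop's sort phase: already ordered by key.
data = [('apple', '1'), ('apple', '3'), ('pear', '2')]

for key, kviter in groupby(data, itemgetter(0)):
    values = [v for _, v in kviter]  # the "iterable of values" view
    print(key, values)
# prints:
# apple ['1', '3']
# pear ['2']
```

Note that groupby only merges adjacent equal keys, which is exactly why it pairs so well with Hadoop's sorted output.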

The basic template I write for a Hadoop Streaming program is:

#!/usr/bin/env python
"""Some description here ..."""

import sys
from operator import itemgetter
from itertools import groupby

def read_input(file):
    """Read input and split."""
    for line in file:
        yield line.rstrip().split('\t')

def main():
    data = read_input(sys.stdin)
    for key, kviter in groupby(data, itemgetter(0)):
        # some code here..
        pass

if __name__ == "__main__":
    main()


If the input or output format differs from the default, it is mainly read_input() that needs adjusting.
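For instance, if the input happened to be comma-separated rather than tab-separated, a hypothetical variant of read_input() could be:

```python
def read_input(file, sep=','):
    """Like the template's read_input(), but with a configurable separator."""
    for line in file:
        yield line.rstrip('\n').split(sep)
```

The rest of the template stays unchanged, since main() only sees the already-split fields.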

Local debugging

The basic pattern for locally debugging a Python program for Hadoop Streaming is:

$ cat <input file> | python mapper.py | sort -t $'\t' -k1,1 | python reducer.py > <output file>

Or, if you don't want the extra cat, you can use < redirection:

$ python mapper.py < <input file> | sort -t $'\t' -k1,1 | python reducer.py > <output file>

Here are a few points to note:

Hadoop by default splits key and value on a tab, takes the first field split out as the key, and sorts by key, so here we use

sort -t $'\t' -k1,1

to simulate that. If you have other requirements, you can adjust the command-line arguments when the job is handed to Hadoop Streaming; for local debugging, you mainly adjust sort's parameters accordingly. To be proficient at local debugging, it is therefore recommended to master the use of the sort command.
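If you prefer to keep the shuffle simulation inside Python rather than relying on sort, its effect can be approximated like this (a sketch; GNU sort breaks ties on the whole line unless given --stable, whereas Python's sort is always stable):

```python
def simulate_shuffle(lines):
    """Approximate `sort -t $'\t' -k1,1`: a stable sort on the first
    tab-separated field only, leaving each line's value part untouched."""
    return sorted(lines, key=lambda line: line.split('\t', 1)[0])
```

This is handy for unit-testing a reducer without shelling out to the sort command at all.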

If you have added a shebang to your Python script and given it execution permission, you can also use

./mapper.py

to replace

python mapper.py

Original link: http://www.cnblogs.com/joyeecheung/p/3757915.html
