Distributed Programming with Python + Hadoop Streaming (i): Introduction to Principles, Sample Programs, and Local Debugging

About MapReduce and HDFS
What is Hadoop?

To meet its business needs, Google proposed the MapReduce programming model and a distributed file system, and published the relevant papers (available on Google Research's website: GFS, MapReduce). While developing the search engine Nutch, Doug Cutting and Mike Cafarella wrote their own implementations of the two papers, namely MapReduce and HDFS, which together became Hadoop.

In the MapReduce data flow, input is first processed by the mapper, then partitioned and sorted, and finally reaches the reducer, which writes out the final result.

Figure from Hadoop: The Definitive Guide
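
As a concrete illustration (a hypothetical word-count run, not taken from the original article), the data at each stage might look like this:

mapper output:          "hello\t1", "world\t1", "hello\t1"
after partition + sort: "hello\t1", "hello\t1", "world\t1"
reducer output:         "hello\t2", "world\t1"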

Hadoop Streaming principle
Hadoop itself is developed in Java, and programs for it normally have to be written in Java as well, but with Hadoop Streaming we can write programs in any language and let Hadoop run them.

The relevant source code for Hadoop Streaming can be viewed in Hadoop's GitHub repo. Simply put, the mapper and reducer written in another language are passed as parameters to a pre-written Java program (Hadoop ships with a *streaming*.jar). That Java program is responsible for creating the MapReduce job and starting a separate process to run the mapper: it feeds the input to the mapper via stdin, takes whatever the mapper writes to stdout, and hands that data back to Hadoop for partitioning and sorting; it then starts the reducer process in the same way, again communicating over stdin/stdout, to obtain the final result. Therefore, all we need to do in the other language is write a program that reads data from stdin and writes the processed data to stdout; Hadoop Streaming's Java wrapper takes care of the tedious steps and runs the program as a distributed job.
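
A typical invocation looks something like this (a sketch; the streaming jar's path and name vary across Hadoop versions and installations, and the HDFS paths here are made up):

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py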

Figure from Hadoop: The Definitive Guide

In principle, any language that can handle stdio can be used to write the mapper and reducer. You can also use Linux utilities (such as awk, grep, cat) or Java classes written to a certain interface as the mapper or reducer. Therefore, the mapper and reducer do not even have to be the same kind of program.
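
For example, a job could use a Unix utility as the mapper and a Python script as the reducer (an illustrative command with hypothetical paths):

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/me/input -output /user/me/output \
    -mapper /bin/cat \
    -reducer reducer.py -file reducer.py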

Advantages and disadvantages of Hadoop Streaming

Advantages

You can write your MapReduce program in your favorite language (in other words, you don't have to write Java XD)
There's no need to import a pile of libraries or put lots of configuration in the code the way a Java MapReduce program does; much of that is abstracted away into stdio, so there is noticeably less code.
Because there are no library dependencies, debugging is convenient: you can detach from Hadoop entirely and simulate the job locally with a pipeline.

Disadvantages

The MapReduce framework can only be controlled through command-line arguments; unlike a Java program, which can use the API directly in code, your control is weaker and some things are simply out of reach.
Because there is an extra layer of stdio processing in the middle, efficiency is somewhat lower.
Hadoop Streaming is therefore a great fit for simple tasks, such as a one- or two-hundred-line Python script. If the project is complex, or needs more detailed optimization, Streaming easily becomes a constraint in places.

Writing a simple Hadoop Streaming program in Python

Two examples are provided here:

Michael Noll's Word Count program
The example routine from Hadoop: The Definitive Guide
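
For reference, here is a minimal word-count mapper/reducer pair in the spirit of Michael Noll's example (a sketch reconstructed from memory, not his exact code):

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts for each word (input arrives sorted by key)
import sys
from itertools import groupby
from operator import itemgetter

def read_input(file):
    # each line looks like "word\t1"
    for line in file:
        yield line.rstrip('\n').split('\t', 1)

for word, group in groupby(read_input(sys.stdin), itemgetter(0)):
    total = sum(int(count) for _, count in group)
    print('%s\t%d' % (word, total))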
There are a few things to note when writing Hadoop Streaming programs in Python:

If you can use an iterator, use one, to avoid buffering all of stdin in memory; otherwise performance will degrade severely.

Streaming will not split the key and value for you before passing them in; it passes in a single line of text, and you have to call split() manually in your code.

Every line of data read from stdin has a trailing \n, so you generally need rstrip() to remove it.

When you want to work with a list of k-v pairs rather than one key-value pair at a time, you can use groupby together with itemgetter to group the k-v pairs that share the same key, giving an effect like the Java API, where reduce directly receives a key plus an iterable of its values. Note that itemgetter is more efficient than a lambda expression, so if your requirements are not complicated, prefer itemgetter.
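
A tiny standalone demonstration of this grouping pattern (illustrative data, runnable without Hadoop):

from itertools import groupby
from operator import itemgetter

# k-v pairs as a reducer would see them: already sorted by key
pairs = [('hello', '1'), ('hello', '1'), ('world', '1')]

for key, group in groupby(pairs, itemgetter(0)):
    # 'group' is an iterator over all the pairs sharing this key
    print(key, sum(int(v) for _, v in group))
# prints: hello 2
#         world 1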

My basic template for writing Hadoop Streaming programs is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Some description here..."""
import sys
from operator import itemgetter
from itertools import groupby

def read_input(file):
    """Read input and split."""
    for line in file:
        yield line.rstrip().split('\t')

def main():
    data = read_input(sys.stdin)
    for key, kviter in groupby(data, itemgetter(0)):
        # some code here..
        pass

if __name__ == "__main__":
    main()

If the input or output format differs from the default, it is mainly read_input() that needs adjusting.
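
For example, if the input were comma-separated instead of tab-separated (a hypothetical format, just to illustrate the kind of change), only read_input() changes:

def read_input(file):
    """Read comma-separated input instead of the default tab."""
    for line in file:
        yield line.rstrip('\n').split(',')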

Local debugging

The basic pattern for locally debugging a Python Hadoop Streaming program is:

$ cat <input file> | python <mapper script> | sort -t $'\t' -k1,1 | python <reducer script> > <output file>

Or, if you don't want a redundant cat, you can use:

$ python <mapper script> < <input file> | sort -t $'\t' -k1,1 | python <reducer script> > <output file>

Here are a few things to note:

Hadoop by default splits key and value with a tab, takes the part before the first tab as the key, and sorts by key, so

sort -t $'\t' -k1,1

is used to simulate this. If you have other requirements, you can adjust the command-line parameters when submitting to Hadoop Streaming, and adjust the local debugging pipeline accordingly, which mainly means adjusting the sort parameters. To become proficient at local debugging, it is therefore recommended to first master the sort command.
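
For instance, if the job were configured to treat the first two tab-separated fields as the key (via the Streaming property -D stream.num.map.output.key.fields=2), the local simulation would sort on both fields (an illustrative adjustment):

sort -t $'\t' -k1,2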

If you add a shebang to the Python scripts and give them execute permission, you can also use something like

./mapper.py

in place of

python mapper.py