Using Python + Hadoop Streaming for distributed programming (I): principles, sample programs, and local debugging

Hadoop is an open-source distributed parallel programming framework that implements the MapReduce computing model. With Hadoop, programmers can easily write distributed parallel programs, run them on clusters of computers, and process massive amounts of data.

Introduction to MapReduce and HDFS

What is Hadoop?

Google proposed the MapReduce programming model and the Google File System distributed file system for its own business needs, and published the relevant papers (available on the Google Research website: GFS, MapReduce). While developing the search engine Nutch, Doug Cutting and Mike Cafarella implemented the ideas from these two papers themselves; their implementations, MapReduce and HDFS, together make up Hadoop.

The MapReduce data flow works as follows: the raw data is processed by the mapper, then partitioned and sorted on its way to the reducer, which outputs the final result.

Picture from Hadoop: The Definitive Guide

Principles of Hadoop Streaming
Hadoop itself is developed in Java, and MapReduce programs are normally written in Java as well. However, with Hadoop Streaming, we can write programs in any language for Hadoop to run.

The source code of Hadoop Streaming can be viewed in the Hadoop GitHub repo. Simply put, the mapper and reducer written in another language are passed as parameters to a pre-built Java program (the hadoop-*-streaming.jar that ships with Hadoop). This Java program is responsible for creating the MapReduce job, starting a separate process to run the mapper, and passing the input to it through stdin. After the mapper has processed the data, its output on stdout is handed back to Hadoop; after partitioning and sorting, another process is started to run the reducer, which likewise communicates through stdin/stdout, and the final result is obtained the same way. Therefore, all a program written in another language has to do is read data from stdin and write the processed data to stdout; Hadoop Streaming's Java wrapper takes care of the complicated plumbing and runs the program in a distributed fashion.

Picture from Hadoop: The Definitive Guide

In principle, any language capable of handling stdio can be used to write the mapper and reducer. You can also specify Linux programs (such as awk, grep, and cat) as the mapper or reducer, or write a Java class in a certain format. The mapper and reducer therefore do not have to be written in the same language.
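For reference, submitting such a job typically looks like the following; the jar path varies by Hadoop version and installation, and the HDFS paths here are made up for illustration:

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

The -file options ship the local scripts to the cluster nodes so each task can execute them.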

Advantages and disadvantages of Hadoop Streaming

Advantages

You can use your preferred language to write MapReduce programs (in other words, you do not need to write Java XD)
You do not need to import a large number of libraries and write a lot of configuration in the code, as a Java MapReduce program does; much of this is abstracted away behind stdio, so the amount of code drops significantly.
With no library dependencies, debugging is convenient: the MapReduce flow can be simulated locally with pipes, without Hadoop at all.

Disadvantages

The MapReduce framework can only be controlled through command-line parameters; unlike a Java program, which can use the API in code, the degree of control is weaker.
Because an extra layer of processes sits in between, it is somewhat slower.
Hadoop Streaming is therefore suited to simple tasks, such as a Python script of only one or two hundred lines. If the project is complex, or needs more fine-grained optimization, Streaming becomes hard to work with.

Writing a simple Hadoop Streaming program in Python

Two examples are referenced here; a minimal sketch of the first follows the list:

Michael Noll's word count program
The example from Hadoop: The Definitive Guide
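To give a flavor of the first example, here is a minimal word-count pair in the spirit of Michael Noll's tutorial (a sketch, not his exact code). The mapper emits one "word<tab>1" line per word; the reducer, whose input arrives sorted by key, sums the counts for each word:

#!/usr/bin/env python
# mapper.py: emit one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

#!/usr/bin/env python
# reducer.py: input is sorted by key, so groupby collects
# all the counts that belong to one word
import sys
from itertools import groupby
from operator import itemgetter

def read_input(file):
    for line in file:
        yield line.rstrip().split('\t', 1)

def main():
    for word, group in groupby(read_input(sys.stdin), itemgetter(0)):
        total = sum(int(count) for _, count in group)
        print("%s\t%d" % (word, total))

if __name__ == "__main__":
    main()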
Note the following points when writing a Hadoop Streaming program in Python:

Use iterators wherever possible, to avoid holding large amounts of stdin input in memory; otherwise performance suffers badly.

Streaming does not split keys and values for you; whole lines are passed in as strings, and you need to call split('\t') yourself in the code.

Each line read from stdin ends with a '\n'; to be safe, you generally need to strip it with rstrip().

When you want a list of key-value pairs rather than processing them one by one, you can use groupby with itemgetter to gather the pairs that share the same key into a group, much like a Java reducer, which directly receives a Text key and an iterable of values (a small illustration follows this list). Note that itemgetter is faster than a lambda expression, so prefer itemgetter when the requirement is not very complex.
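To illustrate (a toy snippet, not taken from either example): once the pairs are sorted by key, groupby with itemgetter(0) yields one (key, iterator) pair per distinct key, analogous to the (key, values) pair a Java reducer receives:

from itertools import groupby
from operator import itemgetter

pairs = [('a', 1), ('a', 2), ('b', 3)]  # must already be sorted by key
for key, group in groupby(pairs, itemgetter(0)):
    print("%s %s" % (key, [v for _, v in group]))
# prints: a [1, 2]
#         b [3]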

The basic template for writing a Hadoop Streaming program is

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Some description here..."""

import sys
from operator import itemgetter
from itertools import groupby

def read_input(file):
    """Read input and split."""
    for line in file:
        yield line.rstrip().split('\t')

def main():
    data = read_input(sys.stdin)
    for key, kviter in groupby(data, itemgetter(0)):
        # some code here..
        pass

if __name__ == "__main__":
    main()

If the input format differs from the default, adjust the parsing in read_input(); a hypothetical variant is sketched below.
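For instance, a variant for comma-separated input (an assumed format, for illustration only) would change nothing but the split:

def read_input(file):
    """Read comma-separated input instead of the default tab."""
    for line in file:
        yield line.rstrip().split(',')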

Local debugging

The basic pattern for locally debugging a Python Hadoop Streaming program is:

$ cat input | python mapper.py | sort -t $'\t' -k1,1 | python reducer.py > output

Or you can use redirection:

$ python mapper.py < input | sort -t $'\t' -k1,1 | python reducer.py > output

Note the following points:

By default, Hadoop splits each line into key and value at the first tab and sorts by the key, which is what

$ sort -t $'\t' -k1,1

simulates. If you have other requirements, pass the corresponding command-line parameters to Hadoop Streaming when the job is submitted, and adjust the local debugging pipeline to match, mainly by changing the sort parameters. To become proficient at local debugging, it is therefore a good idea to learn the usage of the sort command first.
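For example, if the values within each key should additionally arrive in numeric order (a hypothetical requirement; on the cluster this would be configured through Streaming's own command-line options), the local simulation only needs an extra sort key:

$ cat input | python mapper.py | sort -t $'\t' -k1,1 -k2,2n | python reducer.py > output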

If you add a shebang line to the Python script and give it execute permission, you can also use

./mapper.py

instead of

python mapper.py
