[Repost] Writing An Hadoop MapReduce Program In Python

From Michael G. Noll

This article is reposted from http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

In this tutorial, I will describe how to write a simple MapReduce program for Hadoop in the Python programming language.

Contents


  • 1 Motivation
  • 2 What we want to do
  • 3 Prerequisites
  • 4 Python MapReduce Code
    • 4.1 Map: mapper.py
    • 4.2 Reduce: reducer.py
    • 4.3 Test your code (cat data | map | sort | reduce)
  • 5 Running the Python Code on Hadoop
    • 5.1 Download example input data
    • 5.2 Copy local example data to HDFS
    • 5.3 Run the MapReduce job
  • 6 Improved Mapper and Reducer code: using Python iterators and generators
    • 6.1 mapper.py
    • 6.2 reducer.py


Motivation

Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1). However, the documentation and the most prominent Python example on the Hadoop home page could make you think that you must translate your Python code using Jython into a Java jar file. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue of the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop - just have a look at the example in /src/examples/python/WordCount.py and you will see what I mean. I still recommend having at least a look at the Jython approach and maybe even at the new C++ MapReduce API called Pipes - it's really interesting.

That said, the ground is prepared for the purpose of this tutorial: writing a Hadoop MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.


What we want to do

We will write a simple MapReduce program (see also Wikipedia) for Hadoop in Python but without using Jython to translate our code to Java jar files.

Our program will mimic the WordCount example, i.e. it reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.

Note: You can also use programming languages other than Python such as Perl or Ruby with the "technique" described in this tutorial. I wrote some words about what happens behind the scenes. Feel free to correct me if I'm wrong.


Prerequisites

You should have an Hadoop cluster up and running because we will get our hands dirty. If you don't have a cluster yet, the following tutorials might help you build one. They are tailored to Ubuntu Linux but the information also applies to other Linux/Unix variants.

  • Running Hadoop On Ubuntu Linux (Single-Node Cluster)
    How to set up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux
  • Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
    How to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux


Python MapReduce Code

The "trick" behind the following Python code is that we will use HadoopStreaming (see also the wiki entry) for helping us passing data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We will simply use Python's sys.stdin to read input data and print our own output to sys.stdout. That's all we need to do because HadoopStreaming will take care of everything else! Amazing, isn't it? Well, at least I had a "wow" experience...


Map: mapper.py

Save the following code in the file /home/hadoop/mapper.py. It will read data from STDIN (standard input), split it into words and output a list of lines mapping words to their (intermediate) counts to STDOUT (standard output). The Map script will not compute an (intermediate) sum of a word's occurrences. Instead, it will output "<word> 1" immediately - even though the word might occur multiple times in the input - and just let the subsequent Reduce step do the final sum count. Of course, you can change this behavior in your own scripts as you please, but we will keep it like that in this tutorial for didactic reasons :-)

Make sure the file has execution permission (chmod +x /home/hadoop/mapper.py should do the trick) or you will run into problems.

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)


Reduce: reducer.py

Save the following code in the file /home/hadoop/reducer.py. It will read the results of mapper.py from STDIN (standard input), sum the occurrences of each word to a final count, and output the results to STDOUT (standard output).

Make sure the file has execution permission (chmod +x /home/hadoop/reducer.py should do the trick) or you will run into problems.

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)


Test your code (cat data | map | sort | reduce)

I recommend testing your mapper.py and reducer.py scripts manually before using them in a MapReduce job. Otherwise your jobs might complete successfully but produce no job result data at all, or results other than what you expected. If that happens, most likely it was you (or me) who screwed up. Note that the local sort step in the pipeline stands in for the shuffle-and-sort phase that Hadoop performs between the Map and Reduce stages.

Here are some ideas on how to test the functionality of the Map and Reduce scripts.

 # very basic test
hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

 hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reducer.py
bar 1
foo 3
labs 1
quux 2
 # using one of the ebooks as example input
# (see below on where to get the ebooks)
hadoop@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hadoop/mapper.py
The 1
Project 1
Gutenberg 1
EBook 1
of 1
[...]
(you get the idea)


Running the Python Code on Hadoop


Download example input data

We will use three ebooks from Project Gutenberg for this example:

  • The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
  • The Notebooks of Leonardo Da Vinci
  • Ulysses by James Joyce

Download each ebook as a plain text file in us-ascii encoding and store the uncompressed files in a temporary directory of your choice, for example /tmp/gutenberg.

 hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hadoop hadoop 674425 2007-01-22 12:56 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
-rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
hadoop@ubuntu:~$
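
Copy local example data to HDFS

Before we run the actual MapReduce job, we have to copy the files from our local file system to Hadoop's HDFS. A minimal sketch of the commands, assuming the pre-1.0 "bin/hadoop dfs" syntax from the cluster tutorials above and an installation in /usr/local/hadoop (both are assumptions - adjust them to your setup):

 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls

The second command merely lists your HDFS home directory so you can verify that the gutenberg directory arrived.


Run the MapReduce job

Now that everything is prepared, we can run our Python MapReduce job on the Hadoop cluster. As described above, we use HadoopStreaming to pass data between our Map and Reduce code via STDIN and STDOUT. A sketch of the invocation - the exact name and location of the streaming jar differ between Hadoop versions, so treat the jar path below as an assumption:

 hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
     -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py \
     -input gutenberg -output gutenberg-output

When the job has finished, the word counts end up in the HDFS directory gutenberg-output and can be inspected with bin/hadoop dfs -cat gutenberg-output/part-00000.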
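
Improved Mapper and Reducer code: using Python iterators and generators

The mapper and reducer above are intentionally simple. The table of contents promises improved versions based on Python iterators and generators; a minimal sketch of such versions follows, built around itertools.groupby (this sketch is an assumption about how the improved code could look, not a verbatim copy of the original). groupby works here because Hadoop's shuffle phase - or the local sort in our test pipeline - delivers the mapper output to the reducer sorted by word.

mapper.py

#!/usr/bin/env python

import sys

def read_input(file):
    # generator: yield the list of words on each input line
    for line in file:
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # tab-delimited; the trivial word count is 1
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

reducer.py

#!/usr/bin/env python

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    # generator: yield (word, count) pairs from mapper.py's output
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby collects consecutive pairs that share the same word,
    # which is safe because the input arrives sorted by word
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print '%s%s%d' % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this pair
            pass

if __name__ == "__main__":
    main()

As before, make both files executable (chmod +x) and test them locally with the cat data | map | sort | reduce pipeline before submitting a job.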