Implement WordCount with Python on Hadoop

Source: Internet
Author: User
Tags python script hdfs dfs hadoop mapreduce hadoop fs

I. A simple explanation

In this example, we use Python to write a simple MapReduce program, WordCount, that runs on Hadoop: it reads text files and counts how often each word occurs. The input text file input.txt and the Python scripts are placed in the /home/data/python/wordcount directory.

cd /home/data/python/wordcount

vi input.txt

Input:

There is no denying that

Hello python

Hello MapReduce

MapReduce is good

II. Writing the Map code

Here we create a mapper.py script that reads data from standard input (stdin), splits each line into words on whitespace (the default), and writes one word and its count per line to standard output (stdout). The map step does not total the occurrences of each word; it simply outputs "word 1" (separated by a TAB) for every word, which becomes the input to reduce. Make sure the file is executable (chmod +x /home/data/python/wordcount/mapper.py).

cd /home/data/python/wordcount

vi mapper.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

# Input comes from standard input (stdin)
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Split the line on whitespace (the default) into the words list
    words = line.split()
    for word in words:
        # Emit every word as "word<TAB>1", to serve as input to reduce
        print('%s\t%s' % (word, 1))
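To make the mapper's behaviour concrete, here is a minimal sketch that wraps the same strip, split, and emit logic in a function and applies it to one line of the sample input. The function name map_line is hypothetical and is not part of the original script.

```python
def map_line(line):
    """Return the "word<TAB>1" records the mapper would emit for one line."""
    # Same logic as mapper.py: strip, split on whitespace, emit a count of 1
    return ['%s\t%s' % (word, 1) for word in line.strip().split()]


# One line taken from the sample input.txt
records = map_line("MapReduce is good\n")
for record in records:
    print(record)
```

Each word appears on its own line with a count of 1; repeated words are not merged at this stage, since totalling is left to the reduce step.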


III. Writing the Reduce code

Here we create a reducer.py script that reads the output of mapper.py from standard input (stdin), totals the occurrences of each word, and writes the result to standard output (stdout).

Make sure the file is executable (chmod +x /home/data/python/wordcount/reducer.py).

cd /home/data/python/wordcount

vi reducer.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

current_word = None
current_count = 0
word = None

# Read standard input, which is the output of mapper.py
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Parse the mapper.py output, using TAB as the delimiter
    word, count = line.split('\t', 1)
    # Convert count from a string to an integer
    try:
        count = int(count)
    except ValueError:
        # count is not a number, so ignore this line
        continue
    # The mapper.py output must be sorted so that equal words are adjacent;
    # Hadoop sorts it automatically
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write the total for the current word to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Output the total for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
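The reducer depends on equal words arriving next to each other. As an alternative sketch of the same idea (not the script used in this article), the grouping can be expressed with itertools.groupby from the standard library. The function names parse_pairs and reduce_sorted and the sample records are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Yield (word, count) tuples from "word<TAB>count" lines, skipping bad ones."""
    for line in lines:
        parts = line.strip().split('\t', 1)
        if len(parts) != 2:
            continue  # not a "word<TAB>count" record
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue  # count is not a number


def reduce_sorted(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield '%s\t%d' % (word, sum(count for _, count in group))


# Hypothetical sorted mapper output, used only for illustration
sample = ["Hello\t1", "Hello\t1", "MapReduce\t1"]
for result in reduce_sorted(sample):
    print(result)
```

To use this as a streaming reducer, one would pass sys.stdin to reduce_sorted instead of the sample list. groupby only merges adjacent keys, so the sorted-input requirement is the same as for reducer.py.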


IV. Testing the code locally

We can test locally before running on the Hadoop platform, to verify that mapper.py and reducer.py produce correct results. Note: when testing reducer.py, you need to sort the output of mapper.py yourself; in the Hadoop environment, sorting happens automatically.

# Run mapper.py locally:

cd /home/data/python/wordcount/

# Remember to run: chmod +x /home/data/python/wordcount/mapper.py

cat input.txt | ./mapper.py

# Run reducer.py locally:

# Remember to run: chmod +x /home/data/python/wordcount/reducer.py

cat input.txt | ./mapper.py | sort -k1,1 | ./reducer.py
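The same map, sort, and reduce sequence can also be simulated in a single Python process, which is handy for checking the expected counts. The sketch below is an illustration only: run_wordcount and SAMPLE_INPUT are hypothetical names, the sample text is the input.txt shown earlier, and Python's sorted order (uppercase before lowercase) may differ from what the shell's sort produces under your locale.

```python
from itertools import groupby
from operator import itemgetter

# The sample input.txt from this article
SAMPLE_INPUT = """There is no denying that
Hello python
Hello MapReduce
MapReduce is good
"""


def run_wordcount(text):
    """Simulate map -> sort -> reduce in one process; return (word, count) pairs."""
    # Map: emit (word, 1) for every whitespace-separated word
    pairs = [(word, 1) for line in text.splitlines() for word in line.split()]
    # Sort: order by word so equal words are adjacent (stands in for `sort -k1,1`)
    pairs.sort(key=itemgetter(0))
    # Reduce: sum the counts of each run of equal words
    return [(word, sum(c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]


for word, count in run_wordcount(SAMPLE_INPUT):
    print('%s\t%d' % (word, count))
```

For this input, Hello, MapReduce, and is each have a count of 2, and every other word has a count of 1.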

V. Running code on the Hadoop platform

If you have already set up a Hadoop cluster, run the code on Hadoop as follows.

1. Create a directory and upload files

First create a directory on HDFS to hold the text file; I created /wordcount:

hdfs dfs -mkdir /wordcount

# Upload the local file input.txt to /wordcount on HDFS.

hadoop fs -put /home/data/python/wordcount/input.txt /wordcount

hadoop fs -ls /wordcount    # List the contents of the /wordcount directory on HDFS

2. Execute the MapReduce program

To shorten the command for running Hadoop MapReduce, we can store the path of Hadoop's hadoop-streaming-3.0.0.jar in an environment variable by adding the following configuration to the /etc/profile file:

vi /etc/profile

HADOOP_STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.0.0.jar

export HADOOP_STREAM

source /etc/profile    # Reload the configuration

# Run the following command:

hadoop jar $HADOOP_STREAM -file /home/data/python/wordcount/mapper.py -mapper ./mapper.py -file /home/data/python/wordcount/reducer.py -reducer ./reducer.py -input /wordcount -output /output/word1
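Because the streaming command is long, it can help to see its parts separately. The following sketch only assembles the same argument list in Python and prints it; it does not start a job. The function streaming_command is hypothetical, and the jar path is a placeholder standing in for the value of $HADOOP_STREAM on your system.

```python
import shlex


def streaming_command(jar, script_dir, input_dir, output_dir):
    """Build the hadoop streaming argument list shown above (does not run it)."""
    mapper = script_dir + "/mapper.py"
    reducer = script_dir + "/reducer.py"
    return [
        "hadoop", "jar", jar,
        "-file", mapper, "-mapper", "./mapper.py",     # ship and name the mapper
        "-file", reducer, "-reducer", "./reducer.py",  # ship and name the reducer
        "-input", input_dir,                           # HDFS input directory
        "-output", output_dir,                         # HDFS output directory
    ]


cmd = streaming_command(
    "/path/to/hadoop-streaming-3.0.0.jar",  # placeholder for $HADOOP_STREAM
    "/home/data/python/wordcount",
    "/wordcount",
    "/output/word1",
)
print(shlex.join(cmd))
```

To launch the job from Python, the list could be passed to subprocess.run(cmd, check=True). Note that the output directory should not already exist on HDFS when the job starts.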


Then, enter the following command to view the results:

hadoop fs -ls /output/word1

hadoop fs -cat /output/word1/part-00000    # View the results

The results match those of the earlier local test, so congratulations, you're done!
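The comparison with the local test can also be done programmatically. As a small sketch, the two outputs can be parsed into dictionaries so that line order does not matter. The name parse_counts and the two short strings are hypothetical stand-ins for the saved local output and the contents of part-00000.

```python
def parse_counts(text):
    """Turn "word<TAB>count" lines into a {word: count} dict."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, count = line.split('\t', 1)
        counts[word] = int(count)
    return counts


# Hypothetical excerpts of the two outputs, listed in different orders
local_output = "Hello\t2\nis\t2\ngood\t1\n"
hadoop_output = "good\t1\nHello\t2\nis\t2\n"
print(parse_counts(local_output) == parse_counts(hadoop_output))
```

Comparing dictionaries rather than raw text avoids false mismatches when the shell's sort and Hadoop order the keys differently.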
