Implement WordCount with Python on Hadoop
I. A simple explanation
In this example we use Python to write a simple MapReduce program, wordcount, that runs on Hadoop: it reads text files and counts how often each word occurs. Put the input text input.txt and the Python scripts into the /home/data/python/wordcount directory.
cd /home/data/python/wordcount
vi input.txt
Input:
There is no denying that
Hello python
Hello MapReduce
MapReduce is good
II. Writing the map code
Here we create a mapper.py script that reads data from standard input (stdin), splits each line into words on whitespace (the default for split), and writes one "word<TAB>1" pair per word to standard output (stdout). The map step does not count the total occurrences of each word; it simply emits "word 1" pairs that serve as the input to reduce. Make sure the file is executable (chmod +x /home/data/python/wordcount/mapper.py).
cd /home/data/python/wordcount
vi mapper.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

# read from standard input (stdin)
for line in sys.stdin:
    # strip leading and trailing whitespace
    line = line.strip()
    # split the line into words on whitespace (the default delimiter)
    words = line.split()
    for word in words:
        # emit every word as "word<TAB>1", to be used as the input to reduce
        print('%s\t%s' % (word, 1))
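The map step above can also be sketched as a plain Python function, which is convenient for experimenting without a shell pipeline. This is a minimal sketch; the function name map_words is my own and not part of the tutorial's scripts:

```python
def map_words(lines):
    """Yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)


# the same lines mapper.py would read from stdin
sample = ['Hello python', 'Hello MapReduce']
for word, count in map_words(sample):
    print('%s\t%s' % (word, count))
```

Note that, just like mapper.py, this emits duplicate pairs such as ('Hello', 1) twice; summing them is deliberately left to the reduce step.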
III. Writing the reduce code
Here we create a reducer.py script that reads the output of mapper.py from standard input (stdin), counts the total number of occurrences of each word, and writes the totals to standard output (stdout).
Make sure the file is executable (chmod +x /home/data/python/wordcount/reducer.py).
cd /home/data/python/wordcount
vi reducer.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

current_word = None
current_count = 0
word = None
# read standard input, which is the output of mapper.py
for line in sys.stdin:
    # strip leading and trailing whitespace
    line = line.strip()
    # parse the mapper.py output, with TAB as the delimiter
    word, count = line.split('\t', 1)
    # convert count from a string to an int
    try:
        count = int(count)
    except ValueError:
        # if count is not a number, ignore this line
        continue
    # this requires the mapper output to be sorted so that equal words are
    # adjacent; Hadoop sorts automatically between map and reduce
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # emit the count for the previous word to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word
# emit the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
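The grouping logic of the reduce step can likewise be sketched as a plain Python function. Like reducer.py, it assumes its input is already sorted by word, just as Hadoop sorts between map and reduce; the name reduce_counts is my own:

```python
def reduce_counts(pairs):
    """Sum counts for consecutive identical words in sorted (word, count) pairs."""
    current_word = None
    current_count = 0
    for word, count in pairs:
        if word == current_word:
            # same word as the previous pair: accumulate
            current_count += count
        else:
            # new word: emit the finished total for the previous one
            if current_word is not None:
                yield (current_word, current_count)
            current_word = word
            current_count = count
    # emit the total for the last word
    if current_word is not None:
        yield (current_word, current_count)
```

For example, list(reduce_counts(sorted([('Hello', 1), ('python', 1), ('Hello', 1)]))) groups the two 'Hello' pairs into a single total, while the same input without sorting would produce two separate 'Hello' entries.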
IV. Testing the code locally
Before running on the Hadoop platform, we can test locally to verify that mapper.py and reducer.py behave correctly. Note: when testing reducer.py you need to sort the output of mapper.py first; in the Hadoop environment this sort happens automatically between the map and reduce phases.
# run mapper.py locally:
cd /home/data/python/wordcount/
# remember to run: chmod +x /home/data/python/wordcount/mapper.py
cat input.txt | ./mapper.py
# run reducer.py locally
# remember to run: chmod +x /home/data/python/wordcount/reducer.py
cat input.txt | ./mapper.py | sort -k1,1 | ./reducer.py
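The shell pipeline above can also be simulated end to end in pure Python, which is handy for checking the logic before touching the cluster. This is a minimal sketch under the same map/sort/reduce structure; the function name wordcount is my own:

```python
from itertools import groupby


def wordcount(lines):
    """Simulate `mapper | sort | reducer` as one in-memory pipeline."""
    # map: one (word, 1) pair per word
    pairs = [(word, 1) for line in lines for word in line.strip().split()]
    # sort by word, as the shell `sort -k1,1` (and Hadoop) does
    pairs.sort()
    # reduce: sum the counts of consecutive identical words
    return {word: sum(c for _, c in group)
            for word, group in groupby(pairs, key=lambda p: p[0])}


text = [
    'There is no denying that',
    'Hello python',
    'Hello MapReduce',
    'MapReduce is good',
]
for word, count in sorted(wordcount(text).items()):
    print('%s\t%s' % (word, count))
```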
V. Running code on the Hadoop platform
If you have built a Hadoop cluster, you can run the code on Hadoop as follows.
1. Create a directory and upload files
First create a directory on HDFS to store the text file; here I created /wordcount:
hdfs dfs -mkdir /wordcount
# upload the local file input.txt to /wordcount on HDFS
hadoop fs -put /home/data/python/wordcount/input.txt /wordcount
hadoop fs -ls /wordcount  # list the contents of the /wordcount directory on HDFS
2. Execute the MapReduce program
To simplify the command for running a Hadoop MapReduce streaming job, we can expose Hadoop's hadoop-streaming-3.0.0.jar through an environment variable by adding the following configuration to the /etc/profile file:
vi /etc/profile
HADOOP_STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.0.0.jar
export HADOOP_STREAM
source /etc/profile  # reload the configuration
# then execute the following command:
hadoop jar $HADOOP_STREAM -file /home/data/python/wordcount/mapper.py -mapper ./mapper.py -file /home/data/python/wordcount/reducer.py -reducer ./reducer.py -input /wordcount -output /output/word1
Then, enter the following commands to view the results:
hadoop fs -ls /output/word1
hadoop fs -cat /output/word1/part-00000  # view the analysis results
You will find the results consistent with the local test. Congratulations, you're done!