Implement WordCount with Python on Hadoop
I. A simple explanation
In this example we use Python to write a simple MapReduce program, wordcount, that runs on Hadoop: it reads text files and counts how often each word occurs. Put the input text input.txt and the Python scripts into the /home/data/python/wordcount directory.
cd /home/data/python/wordcount
vi input.txt
Input:
There is no denying that
Hello python
Hello MapReduce
MapReduce is good
II. Writing the map code
Here we create a mapper.py script that reads data from standard input (stdin), splits each line into words on whitespace (the default for split), and writes one "word<TAB>1" pair per word to standard output (stdout). The map step does not count the total occurrences of each word; it simply emits "word 1" pairs that serve as the input to reduce. Make sure the file is executable (chmod +x /home/data/python/wordcount/mapper.py).
cd /home/data/python/wordcount
vi mapper.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

# read from standard input (stdin)
for line in sys.stdin:
    # strip leading and trailing whitespace
    line = line.strip()
    # split the line into words on whitespace (the default delimiter)
    words = line.split()
    for word in words:
        # emit every word as "word<TAB>1", to be used as the input to reduce
        print('%s\t%s' % (word, 1))
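The map step above can also be sketched as a plain Python function, which is convenient for experimenting without a shell pipeline. This is a minimal sketch; the function name map_words is my own and not part of the tutorial's scripts:

```python
def map_words(lines):
    """Yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)


# the same lines mapper.py would read from stdin
sample = ['Hello python', 'Hello MapReduce']
for word, count in map_words(sample):
    print('%s\t%s' % (word, count))
```

Note that, just like mapper.py, this emits duplicate pairs such as ('Hello', 1) twice; summing them is deliberately left to the reduce step.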
III. Writing the reduce code
Here we create a reducer.py script that reads the output of mapper.py from standard input (stdin), counts the total number of occurrences of each word, and writes the totals to standard output (stdout).
Make sure the file is executable (chmod +x /home/data/python/wordcount/reducer.py).
cd /home/data/python/wordcount
vi reducer.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

current_word = None
current_count = 0
word = None
# read standard input, which is the output of mapper.py
for line in sys.stdin:
    # strip leading and trailing whitespace
    line = line.strip()
    # parse the mapper.py output, with TAB as the delimiter
    word, count = line.split('\t', 1)
    # convert count from a string to an int
    try:
        count = int(count)
    except ValueError:
        # if count is not a number, ignore this line
        continue
    # this requires the mapper output to be sorted so that equal words are
    # adjacent; Hadoop sorts automatically between map and reduce
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # emit the count for the previous word to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word
# emit the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
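The grouping logic of the reduce step can likewise be sketched as a plain Python function. Like reducer.py, it assumes its input is already sorted by word, just as Hadoop sorts between map and reduce; the name reduce_counts is my own:

```python
def reduce_counts(pairs):
    """Sum counts for consecutive identical words in sorted (word, count) pairs."""
    current_word = None
    current_count = 0
    for word, count in pairs:
        if word == current_word:
            # same word as the previous pair: accumulate
            current_count += count
        else:
            # new word: emit the finished total for the previous one
            if current_word is not None:
                yield (current_word, current_count)
            current_word = word
            current_count = count
    # emit the total for the last word
    if current_word is not None:
        yield (current_word, current_count)
```

For example, list(reduce_counts(sorted([('Hello', 1), ('python', 1), ('Hello', 1)]))) groups the two 'Hello' pairs into a single total, while the same input without sorting would produce two separate 'Hello' entries.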
IV. Testing the code locally
Before running on the Hadoop platform, we can test locally to verify that mapper.py and reducer.py behave correctly. Note: when testing reducer.py you need to sort the output of mapper.py first; in the Hadoop environment this sort happens automatically between the map and reduce phases.
# run mapper.py locally:
cd /home/data/python/wordcount/
# remember to run: chmod +x /home/data/python/wordcount/mapper.py
cat input.txt | ./mapper.py
# run reducer.py locally
# remember to run: chmod +x /home/data/python/wordcount/reducer.py
cat input.txt | ./mapper.py | sort -k1,1 | ./reducer.py
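The shell pipeline above can also be simulated end to end in pure Python, which is handy for checking the logic before touching the cluster. This is a minimal sketch under the same map/sort/reduce structure; the function name wordcount is my own:

```python
from itertools import groupby


def wordcount(lines):
    """Simulate `mapper | sort | reducer` as one in-memory pipeline."""
    # map: one (word, 1) pair per word
    pairs = [(word, 1) for line in lines for word in line.strip().split()]
    # sort by word, as the shell `sort -k1,1` (and Hadoop) does
    pairs.sort()
    # reduce: sum the counts of consecutive identical words
    return {word: sum(c for _, c in group)
            for word, group in groupby(pairs, key=lambda p: p[0])}


text = [
    'There is no denying that',
    'Hello python',
    'Hello MapReduce',
    'MapReduce is good',
]
for word, count in sorted(wordcount(text).items()):
    print('%s\t%s' % (word, count))
```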
V. Running code on the Hadoop platform
If you have built a Hadoop cluster, you can run the code on Hadoop as follows.
1. Create a directory and upload files
First create a directory on HDFS to store the text file; here I created /wordcount:
hdfs dfs -mkdir /wordcount
# upload the local file input.txt to /wordcount on HDFS
hadoop fs -put /home/data/python/wordcount/input.txt /wordcount
hadoop fs -ls /wordcount  # list the contents of the /wordcount directory on HDFS
2. Execute the MapReduce program
To simplify the command for running a Hadoop MapReduce streaming job, we can expose Hadoop's hadoop-streaming-3.0.0.jar through an environment variable by adding the following configuration to the /etc/profile file:
vi /etc/profile
HADOOP_STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.0.0.jar
export HADOOP_STREAM
source /etc/profile  # reload the configuration
# then execute the following command:
hadoop jar $HADOOP_STREAM -file /home/data/python/wordcount/mapper.py -mapper ./mapper.py -file /home/data/python/wordcount/reducer.py -reducer ./reducer.py -input /wordcount -output /output/word1
Then, enter the following commands to view the results:
hadoop fs -ls /output/word1
hadoop fs -cat /output/word1/part-00000  # view the analysis results
You will find the results consistent with the local test. Congratulations, you're done!