Using Python to Write MapReduce Functions -- Taking WordCount as an Example

Source: Internet
Author: User
Tags: stream api

Although the Hadoop framework is written in Java, Hadoop programs are not limited to Java; they can also be written in Python, C++, Ruby, and so on. This example writes a MapReduce program in Python directly, rather than using Jython to convert the Python code into a jar file. The goal of the example is to compute the word frequency of the input files.

Input: text files
Output: text (each line contains a word and its frequency, separated by '\t')

1. Writing MapReduce in Python

The "trick" of writing MapReduce code in Python is to use the Hadoop Streaming API to pass data between the Map function and the Reduce function through STDIN (standard input) and STDOUT (standard output). The only thing we need to do is read the input with Python's sys.stdin and send our output to sys.stdout. Hadoop Streaming takes care of everything else.

1.1 Map stage: mapper.py

Here, let's assume the file is saved as hadoop-0.20.2/test/code/mapper.py:

    #!/usr/bin/env python
    import sys

    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print "%s\t%s" % (word, 1)

The script reads lines from STDIN, splits each line into words, and writes each word and its count to STDOUT. The Map script does not total the words; it simply outputs <word> 1 and leaves the counting to the subsequent Reduce stage.

To make the script executable, add execute permission:

    chmod +x hadoop-0.20.2/test/code/mapper.py

1.2 Reduce stage: reducer.py

Here, let's assume the file is saved as hadoop-0.20.2/test/code/reducer.py:

    #!/usr/bin/env python
    from operator import itemgetter
    import sys

    current_word = None
    current_count = 0
    word = None

    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # if count is not a number, ignore the line
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print "%s\t%s" % (current_word, current_count)
            current_count = count
            current_word = word

    if word == current_word:
        # do not forget the final output
        print "%s\t%s" % (current_word, current_count)

The script reads the (sorted) output of mapper.py as its input, totals the number of times each word appears, and writes the final result to STDOUT.

To make the script executable, add execute permission:

    chmod +x hadoop-0.20.2/test/code/reducer.py

Detail: in split(char, m), the second parameter limits the number of splits. The following example makes its role clear:

    str = 'server=mpilgrim&ip=10.10.10.10&port=8080'
    print str.split('=', 1)[0]   # the 1 means split only once
    print str.split('=', 1)[1]
    print str.split('=')[0]
    print str.split('=')[1]

Output:

    server
    mpilgrim&ip=10.10.10.10&port=8080
    server
    mpilgrim&ip

1.3 Test code (cat data | map | sort | reduce)

We recommend testing mapper.py and reducer.py locally before submitting them as a MapReduce job. Otherwise, the job may execute successfully while the results are not what you want.

Functional testing of mapper.py and reducer.py:

    [rte@hadoop-0.20.2]$ cd test/code
    [rte@code]$ echo "foo foo quux labs foo bar quux" | ./mapper.py
    foo 1
    foo 1
    quux 1
    labs 1
    foo 1
    bar 1
    quux 1
    [rte@code]$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
    bar 1
    foo 3
    labs 1
    quux 2

Detail: what does the -k1,1 parameter of sort mean?

    -k, --key=POS1[,POS2]

The sort key starts at field POS1 and ends at field POS2. When sorting, people often preprocess their data to move the field to be sorted to the front of each line. That is completely unnecessary; the -k parameter is enough.
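The cat data | map | sort | reduce pipeline above can also be simulated in-process, without Hadoop, to see the dataflow at a glance. The sketch below (Python 3 syntax; the sample lines are made up for illustration) mirrors the logic of the mapper and reducer scripts:

```python
def map_phase(lines):
    # Emit (word, 1) for every word, exactly like mapper.py.
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # pairs must arrive sorted by word, as they do after `sort -k1,1`
    # (or after Hadoop's shuffle phase); then one linear pass suffices.
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

lines = ["foo foo quux", "labs foo bar quux"]
shuffled = sorted(map_phase(lines))     # stands in for `sort -k1,1`
result = dict(reduce_phase(shuffled))
print(result)   # {'bar': 1, 'foo': 3, 'labs': 1, 'quux': 2}
```

Sorting between the two phases is what lets the reducer keep only one running counter instead of a table of all words.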
For example, to sort the following data:

    1 4
    2 3
    3 2
    4 1
    5 0

running sort -k 2 produces:

    5 0
    4 1
    3 2
    2 3
    1 4

2. Running the Python code on Hadoop

2.1 Data preparation

Download the Plain Text UTF-8 versions of the three sample files. I put the three files in the hadoop-0.20.2/test/datas/ directory.

2.2 Run

Copy the local data files to the Hadoop Distributed File System (HDFS):

    bin/hadoop dfs -copyFromLocal test/datas hdfs_in

View:

    bin/hadoop dfs -ls

Result:

    drwxr-xr-x - rte supergroup 0 2014-07-05 /user/rte/hdfs_in

View the specific files:

    bin/hadoop dfs -ls /user/rte/hdfs_in

Execute the MapReduce job:

    bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -file test/code/mapper.py -mapper test/code/mapper.py \
        -file test/code/reducer.py -reducer test/code/reducer.py \
        -input /user/rte/hdfs_in/* -output /user/rte/hdfs_out

Check whether the output is in the target directory /user/rte/hdfs_out:

    bin/hadoop dfs -ls /user/rte/hdfs_out

Output:

    Found 2 items
    drwxr-xr-x - rte supergroup      0 /user/rte/hdfs_out2/_logs
    -rw-r--r-- 2 rte supergroup 880829 /user/rte/hdfs_out2/part-00000

View the result:

    bin/hadoop dfs -cat /user/rte/hdfs_out2/part-00000

This achieves the goal, but the code can be improved with Python iterators and generators.

3. Optimizing the Mapper and Reducer with Python iterators and generators

3.1 Iterators and generators in Python

Generators produce values lazily with yield, so the scripts process the input stream line by line with constant memory, which matters for large inputs.

3.2 Optimized mapper.py and reducer.py

mapper.py:

    #!/usr/bin/env python
    import sys

    def read_input(file):
        for line in file:
            yield line.split()

    def main(separator='\t'):
        data = read_input(sys.stdin)
        for words in data:
            for word in words:
                print "%s%s%d" % (word, separator, 1)

    if __name__ == "__main__":
        main()

reducer.py:

    #!/usr/bin/env python
    from operator import itemgetter
    from itertools import groupby
    import sys

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        data = read_mapper_output(sys.stdin, separator=separator)
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for current_word, count in group)
                print "%s%s%d" % (current_word, separator, total_count)
            except ValueError:
                # count was not a number, so silently discard this group
                pass

    if __name__ == "__main__":
        main()
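One pitfall worth noting about the generator-based reducer: itertools.groupby only merges *consecutive* identical keys, which is exactly why it depends on the sort/shuffle step between map and reduce. A small illustration with made-up mapper output (Python 3 syntax):

```python
from itertools import groupby
from operator import itemgetter

pairs = [("foo", 1), ("bar", 1), ("foo", 1)]   # unsorted mapper output

# Without sorting, 'foo' shows up twice because its occurrences
# are not adjacent in the stream.
unsorted_totals = [(word, sum(c for _, c in group))
                   for word, group in groupby(pairs, itemgetter(0))]
print(unsorted_totals)   # [('foo', 1), ('bar', 1), ('foo', 1)]

# After sorting, each key forms a single consecutive run,
# so the totals come out correct.
sorted_totals = [(word, sum(c for _, c in group))
                 for word, group in groupby(sorted(pairs), itemgetter(0))]
print(sorted_totals)     # [('bar', 1), ('foo', 2)]
```

On Hadoop the framework guarantees this ordering per reducer, so groupby is safe there; only in local testing must you remember the sort step yourself.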
