Using Python to Write MapReduce Functions -- Taking WordCount as an Example

Source: Internet
Author: User
Tags: stream api

Although the Hadoop framework is written in Java, Hadoop programs are not limited to Java; they can also be written in Python, C++, Ruby, and so on. This example writes a MapReduce program in Python directly, rather than using Jython to convert the Python code into a jar file. The goal of the example is to compute the word frequency of the input files.

Input: text files
Output: text (each line contains a word and its frequency, separated by '\t')

1. Writing MapReduce in Python

The "trick" of writing MapReduce code in Python is to use the Hadoop Streaming API to pass data between the Map function and the Reduce function through STDIN (standard input) and STDOUT (standard output). The only thing we need to do is read the input with Python's sys.stdin and send our output to sys.stdout. Hadoop Streaming takes care of everything else.

1.1 Map stage: mapper.py

Here, let's assume the file is saved as hadoop-0.20.2/test/code/mapper.py:

    #!/usr/bin/env python
    import sys

    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print "%s\t%s" % (word, 1)

The script reads lines from STDIN, splits each line into words, and writes each word and its count to STDOUT. The Map script does not total the words; it simply outputs <word> 1 and leaves the counting to the subsequent Reduce stage.

To make the script executable, add execute permission:

    chmod +x hadoop-0.20.2/test/code/mapper.py

1.2 Reduce stage: reducer.py

Here, let's assume the file is saved as hadoop-0.20.2/test/code/reducer.py:

    #!/usr/bin/env python
    from operator import itemgetter
    import sys

    current_word = None
    current_count = 0
    word = None

    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # if count is not a number, ignore the line
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print "%s\t%s" % (current_word, current_count)
            current_count = count
            current_word = word

    if word == current_word:
        # do not forget the final output
        print "%s\t%s" % (current_word, current_count)

The script reads the (sorted) output of mapper.py as its input, totals the number of times each word appears, and writes the final result to STDOUT.

To make the script executable, add execute permission:

    chmod +x hadoop-0.20.2/test/code/reducer.py

Detail: in split(char, m), the second parameter limits the number of splits. The following example makes its role clear:

    str = 'server=mpilgrim&ip=10.10.10.10&port=8080'
    print str.split('=', 1)[0]   # the 1 means split only once
    print str.split('=', 1)[1]
    print str.split('=')[0]
    print str.split('=')[1]

Output:

    server
    mpilgrim&ip=10.10.10.10&port=8080
    server
    mpilgrim&ip

1.3 Test code (cat data | map | sort | reduce)

We recommend testing mapper.py and reducer.py locally before submitting them as a MapReduce job. Otherwise, the job may execute successfully while the results are not what you want.

Functional testing of mapper.py and reducer.py:

    [rte@hadoop-0.20.2]$ cd test/code
    [rte@code]$ echo "foo foo quux labs foo bar quux" | ./mapper.py
    foo 1
    foo 1
    quux 1
    labs 1
    foo 1
    bar 1
    quux 1
    [rte@code]$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
    bar 1
    foo 3
    labs 1
    quux 2

Detail: what does the -k1,1 parameter of sort mean?

    -k, --key=POS1[,POS2]

The sort key starts at field POS1 and ends at field POS2. When sorting, people often preprocess their data to move the field to be sorted to the front of each line. That is completely unnecessary; the -k parameter is enough.
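The cat data | map | sort | reduce pipeline above can also be simulated in-process, without Hadoop, to see the dataflow at a glance. The sketch below (Python 3 syntax; the sample lines are made up for illustration) mirrors the logic of the mapper and reducer scripts:

```python
def map_phase(lines):
    # Emit (word, 1) for every word, exactly like mapper.py.
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # pairs must arrive sorted by word, as they do after `sort -k1,1`
    # (or after Hadoop's shuffle phase); then one linear pass suffices.
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

lines = ["foo foo quux", "labs foo bar quux"]
shuffled = sorted(map_phase(lines))     # stands in for `sort -k1,1`
result = dict(reduce_phase(shuffled))
print(result)   # {'bar': 1, 'foo': 3, 'labs': 1, 'quux': 2}
```

Sorting between the two phases is what lets the reducer keep only one running counter instead of a table of all words.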
For example, to sort the following data:

    1 4
    2 3
    3 2
    4 1
    5 0

running sort -k 2 produces:

    5 0
    4 1
    3 2
    2 3
    1 4

2. Running the Python code on Hadoop

2.1 Data preparation

Download the Plain Text UTF-8 versions of the three sample files. I put the three files in the hadoop-0.20.2/test/datas/ directory.

2.2 Run

Copy the local data files to the Hadoop Distributed File System (HDFS):

    bin/hadoop dfs -copyFromLocal test/datas hdfs_in

View:

    bin/hadoop dfs -ls

Result:

    drwxr-xr-x - rte supergroup 0 2014-07-05 /user/rte/hdfs_in

View the specific files:

    bin/hadoop dfs -ls /user/rte/hdfs_in

Execute the MapReduce job:

    bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -file test/code/mapper.py -mapper test/code/mapper.py \
        -file test/code/reducer.py -reducer test/code/reducer.py \
        -input /user/rte/hdfs_in/* -output /user/rte/hdfs_out

Check whether the output is in the target directory /user/rte/hdfs_out:

    bin/hadoop dfs -ls /user/rte/hdfs_out

Output:

    Found 2 items
    drwxr-xr-x - rte supergroup      0 /user/rte/hdfs_out2/_logs
    -rw-r--r-- 2 rte supergroup 880829 /user/rte/hdfs_out2/part-00000

View the result:

    bin/hadoop dfs -cat /user/rte/hdfs_out2/part-00000

This achieves the goal, but the code can be improved with Python iterators and generators.

3. Optimizing the Mapper and Reducer with Python iterators and generators

3.1 Iterators and generators in Python

Generators produce values lazily with yield, so the scripts process the input stream line by line with constant memory, which matters for large inputs.

3.2 Optimized mapper.py and reducer.py

mapper.py:

    #!/usr/bin/env python
    import sys

    def read_input(file):
        for line in file:
            yield line.split()

    def main(separator='\t'):
        data = read_input(sys.stdin)
        for words in data:
            for word in words:
                print "%s%s%d" % (word, separator, 1)

    if __name__ == "__main__":
        main()

reducer.py:

    #!/usr/bin/env python
    from operator import itemgetter
    from itertools import groupby
    import sys

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        data = read_mapper_output(sys.stdin, separator=separator)
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for current_word, count in group)
                print "%s%s%d" % (current_word, separator, total_count)
            except ValueError:
                # count was not a number, so silently discard this group
                pass

    if __name__ == "__main__":
        main()
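One pitfall worth noting about the generator-based reducer: itertools.groupby only merges *consecutive* identical keys, which is exactly why it depends on the sort/shuffle step between map and reduce. A small illustration with made-up mapper output (Python 3 syntax):

```python
from itertools import groupby
from operator import itemgetter

pairs = [("foo", 1), ("bar", 1), ("foo", 1)]   # unsorted mapper output

# Without sorting, 'foo' shows up twice because its occurrences
# are not adjacent in the stream.
unsorted_totals = [(word, sum(c for _, c in group))
                   for word, group in groupby(pairs, itemgetter(0))]
print(unsorted_totals)   # [('foo', 1), ('bar', 1), ('foo', 1)]

# After sorting, each key forms a single consecutive run,
# so the totals come out correct.
sorted_totals = [(word, sum(c for _, c in group))
                 for word, group in groupby(sorted(pairs), itemgetter(0))]
print(sorted_totals)     # [('bar', 1), ('foo', 2)]
```

On Hadoop the framework guarantees this ordering per reducer, so groupby is safe there; only in local testing must you remember the sort step yourself.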
