Python Development MapReduce Series (i) WordCount Demo


MapReduce is the heart of the Hadoop elephant: data processing in Hadoop is built on the MapReduce programming model. A MapReduce job typically splits the input dataset into independent chunks that are processed in a completely parallel manner by map tasks. The framework sorts the output of the map tasks and then feeds the results to the reduce tasks. The job's input and output are usually stored in a file system. Our programming work therefore centers on two stages: the mapper and the reducer.
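To make the flow concrete, here is a minimal in-memory sketch of the map, sort, and reduce steps in plain Python. No Hadoop is involved, and the function names are illustrative only, not part of any Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (word, 1) pair for every word, like a streaming mapper."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sum counts per word; assumes pairs arrive sorted by key,
    which is what the framework's shuffle/sort step guarantees."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["hello world", "hello hadoop"]
sorted_pairs = sorted(map_phase(lines))   # stands in for the shuffle/sort
print(dict(reduce_phase(sorted_pairs)))   # {'hadoop': 1, 'hello': 2, 'world': 1}
```

The sort in the middle is what lets the reducer work with only one group in memory at a time; the real framework does the same thing at scale.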

Below we develop a MapReduce program from scratch and run it on a Hadoop cluster.
Mapper code, map.py:

import sys

# read lines from standard input and emit "word<TAB>1" for every word
for line in sys.stdin:
    word_list = line.strip().split()
    for word in word_list:
        print('\t'.join([word, '1']))

Reducer code, reduce.py:

import sys

cur_word = None
sum = 0

# input arrives sorted by word, so identical words are adjacent;
# accumulate counts until the word changes, then emit the total
for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) < 2:
        continue
    word = ss[0].strip()
    count = ss[1].strip()
    if cur_word is None:
        cur_word = word
    if cur_word != word:
        print('\t'.join([cur_word, str(sum)]))
        cur_word = word
        sum = 0
    sum += int(count)

print('\t'.join([cur_word, str(sum)]))

Resource file src.txt (for testing; when running on a cluster, remember to upload it to HDFS):

Hello        ni hao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni haoao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao
dad would get out his mandolin and play for the family
Dad loved to play the mandolin for his family he knew we enjoyed singing
I had to mature into a man and have children of my own before I realized how much he had sacrificed
I had to,mature into a man and,have children of my own before. I realized how much he had sacrificed

First, debug locally to check that the results are correct. Enter the following command:

cat src.txt | python map.py | sort -k 1 | python reduce.py
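As an independent sanity check on that pipeline, the same counts can be computed in a single process with `collections.Counter`. This is a sketch that assumes plain whitespace tokenization, the same splitting the mapper performs:

```python
import collections

def word_count(path):
    """Count whitespace-separated tokens in a file, mirroring the
    map -> sort -> reduce pipeline in one process."""
    counter = collections.Counter()
    with open(path) as f:
        for line in f:
            counter.update(line.strip().split())
    return counter

# Usage: print the counts in sorted order, like the reducer output.
# for word, n in sorted(word_count("src.txt").items()):
#     print(word, n, sep="\t")
```

If the Counter totals disagree with the pipeline output, the discrepancy is in the mapper or reducer logic rather than in Hadoop itself.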

The command line outputs:

A	2
and	2
and,have	1
ao	1
before	1
before.	1
Children	2
Dad	2
enjoyed	1
family	2
for	2
get	1
had	4
hao
haoao	1
haoni	3
have	1
He	3
Hello	1
his	2
what	2
I	3
into	2
knew	1
loved	1
man	2
mandolin	2
mature	1
much	2
my	2
ni	2
out	1
own	2
play	2
realized	2
sacrificed	2
singing	1
the	2
to	2
to,mature	1
We	1
would	1

Local debugging shows the code is OK. Next, run it on the cluster. For convenience, I wrote a script, run.sh, to save some manual work.

HADOOP_CMD="/home/hadoop/hadoop/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_FILE_PATH="/home/input/src.txt"
OUTPUT_PATH="/home/output"

$HADOOP_CMD fs -rmr $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -reducer "python reduce.py" \
        -file ./map.py \
        -file ./reduce.py

A walkthrough of the script:

HADOOP_CMD: path to the hadoop binary.
STREAM_JAR_PATH: path to the Hadoop Streaming JAR package.
INPUT_FILE_PATH: input path of the resource file on the cluster (HDFS).
OUTPUT_PATH: result output path on the cluster. (Note: this directory must not already exist, so the script deletes it first. On the very first run the directory does not exist yet and the delete command will report an error; you can ignore it, or create the output directory manually.)

The hadoop jar invocation follows a fixed format: it specifies the input and output paths, the mapper and reducer commands, and uses -file to ship our map.py and reduce.py to the cluster, because the other nodes do not have these files.

Enter the following command to count the records output by the reduce phase:

cat src.txt | python map.py | sort -k 1 | python reduce.py | wc -l

Command line output: 43

Open master:50030 in a browser to view the details of the task.

Kind      % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map       100.00%      2           0         0         2          0        0/0
reduce    100.00%      1           0         0         1          0        0/0

Under the Map-Reduce Framework counters, you can see:

Counter                  Map   Reduce   Total
Reduce output records    0     43       43

This demonstrates that the whole process succeeded. The development of our first Hadoop program is complete.
